Architectures and Algorithms for Mitigation of Soft Errors in Nanoscale VLSI Circuits by Bhattacharya, Koustav
University of South Florida 
Scholar Commons 
Graduate Theses and Dissertations Graduate School 
10-22-2009 
Architectures and Algorithms for Mitigation of Soft Errors in 
Nanoscale VLSI Circuits 
Koustav Bhattacharya 
University of South Florida 
Scholar Commons Citation 
Bhattacharya, Koustav, "Architectures and Algorithms for Mitigation of Soft Errors in Nanoscale VLSI 
Circuits" (2009). Graduate Theses and Dissertations. 
https://scholarcommons.usf.edu/etd/1858 
Architectures and Algorithms for Mitigation of Soft Errors in Nanoscale VLSI Circuits
by
Koustav Bhattacharya
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida
Major Professor: Nagarajan Ranganathan, Ph.D.
Srinivas Katkoori, Ph.D.
Hao Zheng, Ph.D.
Sanjukta Bhanja, Ph.D.
Kandethody M. Ramachandran, Ph.D.
Date of Approval:
October 22, 2009
Keywords: Transient Faults; Design Flow; VLSI CAD; Reliable Architecture Design;
Reliability-aware Design Automation
© Copyright 2009, Koustav Bhattacharya
DEDICATION
To my mother
who lives in my heart
ACKNOWLEDGEMENTS
I would like to take this opportunity to thank Professor N. Ranganathan for providing me the
opportunity to work with him. He introduced me to various interesting problems and I am
extremely fortunate to have worked with such a distinguished scholar and eminent researcher.
He has always guided me with valuable suggestions whenever I was in need. His continuous
encouragement and emotional support during my difficult times have been instrumental in
shaping me into a better researcher and, more importantly, a better person. I would also like
to thank Professor Srinivas Katkoori, Professor Hao Zheng, Professor Sanjukta Bhanja and
Professor Kandethody M. Ramachandran for their time in reviewing this manuscript and their valuable
suggestions for improving its quality. I am also thankful to the Semiconductor Research Corporation
(SRC) for supporting this research, in part, by a grant under the contract 2007-HJ-1596 and to the
National Science Foundation (NSF) Computing Research Infrastructure, in part, by a grant under the
contract CNS-0551621.
My peers and friends from the lab, especially Soumyaroop, Ziad, Mahalingam, Himanshu and
Ransford, have made the years spent in pursuing this degree feel shorter. Their constructive ideas
and discussions have been extremely useful in improving the quality of this research. On a personal
front, I would like to thank my dad, who at an early age ignited my mind with the will to succeed.
My late mother was always a constant source of inspiration for my work and instilled in me the
attitude that nothing is impossible with strong determination and willpower.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Contributions and Significance
1.3 Outline of Dissertation
CHAPTER 2 RELATED WORK
2.1 Previous Work
2.2 Dissertation Context in Light of Past Works
CHAPTER 3 ESTIMATION MODELS
3.1 Background
3.2 Metrics to Estimate Soft Error Masking Effects
3.2.1 Estimating Logical Masking Effects
3.2.2 Estimating Electrical Masking Effects
3.2.3 Estimating Timing Window Masking Effects
3.2.4 Estimation of Timing Slack
CHAPTER 4 RADIATION IMMUNITY AT PHYSICAL DESIGN LEVEL
4.1 Glitch Filtering in Interconnects
4.2 Placement for Radiation Immunity using Simulated Annealing
4.3 Fast SER Aware Placement using Quadratic Programming
4.4 Experimental Results
CHAPTER 5 SOFT ERROR MITIGATION AT CIRCUIT LEVEL
5.1 Radiation Induced Glitch Blocker Circuit
5.2 Selective Insertion Algorithm
5.3 Experimental Results
5.4 Comparison with Related Works
CHAPTER 6 LOGIC LEVEL RELIABILITY-CENTRIC GATE SIZING
6.1 Interaction of Various Noise Sources under Process Uncertainty
6.2 Logic Level Modeling of the Design Metrics
6.2.1 First Order Modeling of Glitch Masking Effects
6.2.2 Crosstalk Noise Modeling at the Logic Level
6.2.3 Power and Timing Models
6.3 Gate Sizing Formulation
6.4 Experimental Results
CHAPTER 7 SOFT ERROR TOLERANCE AT ARCHITECTURAL LEVEL
7.1 Characterization of Multi-bit Errors in Conventional Caches
7.1.1 Probabilistic Characterization of Multi-bit Error Rate
7.1.2 Vulnerability of Conventional Cache Organizations
7.2 Redundancy-based Error Protection
7.2.1 Exploiting L1/L2 Redundancy
7.2.2 Fine Grain Dirtiness
7.3 Improving Reliability by Controlling Redundancy
7.3.1 Reliability-centric Replacement Policy
7.3.2 Exploiting Small Data Value Size
7.4 Experimental Setup and Results
7.4.1 Experimental Setup
7.4.2 Simulation Results
7.5 Comparison with Related Works
CHAPTER 8 CONCLUSIONS
REFERENCES
LIST OF PUBLICATIONS
ABOUT THE AUTHOR
LIST OF TABLES
Table 4.1 Simulated Annealing Parameters
Table 4.2 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight Combination 0.99 and 0.01
Table 4.3 SER Reduction for Radiation Immune SA Based Placement with Associated Delay and Area Overhead
Table 4.4 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight Combination 0.9 and 0.1
Table 4.5 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight Combination 0.9999 and 0.0001
Table 5.1 Overhead of Adding Radiation Blocker Circuit for Various Standard Cells
Table 5.2 Experimental Results for ISCAS’85 Benchmark Circuits
Table 6.1 Experimental Results on Benchmark Circuits
Table 7.1 Description of the Schemes Used in Experiments
Table 7.2 Baseline Processor Configuration
Table 7.3 Comparison with Recent Works in Literature
LIST OF FIGURES
Figure 1.1 Impact of SER due to Gate Length Scaling in nm (Source Intel) [8]
Figure 1.2 Design Flow for Soft Error Tolerance in VLSI Systems
Figure 1.3 List of Contributions
Figure 2.1 Taxonomy Diagram: Works Related to Transient Faults on Caches.
Figure 2.2 Soft Error Protection Using Circuit Level Optimizations
Figure 2.3 Taxonomy Diagram of Works in Gate Sizing Based on Optimization Metrics
Figure 3.1 Masking in Logic Circuits
Figure 3.2 Illustrating Computation of Logic Observability: (A) Signal Probabilities of Nets, (B) GEP Values at Internal Gate Inputs, (C) Logical Observability Values at Various Gate Outputs
Figure 3.3 NRC for an Inverter at Varying Capacitive Loads [59]
Figure 4.1 Interconnects Modeled as RC Ladder
Figure 4.2 Effect of Wirelengths on Glitch Reduction
Figure 4.3 Cost Function with Penalty for High Area and Total Wirelength. (Note: All values are in generic units)
Figure 4.4 Placement of c432 Benchmark Using QP (A) w1=0.1 w2=0.9, (B) w1=0.9 w2=0.1
Figure 4.5 Placement of c432 Benchmark Using QP (A) w1=0.01 w2=0.99, (B) w1=0.99 w2=0.01
Figure 4.6 Area Comparison for Different Placement Schemes. (Note: All values are in generic units)
Figure 4.7 Wirelength Comparison for Different Placement Schemes. (Note: All values are in generic units)
Figure 4.8 Placement of c432 Benchmark Using QP (A) w1=0.0001 w2=0.9999, (B) w1=0.9999 w2=0.0001
Figure 4.9 Speedup Comparison of QP Based and SA Based Radiation Immune Placement Schemes
Figure 5.1 Schematic of Radiation Induced Glitch Blocker Circuit
Figure 5.2 Plotting the Voltages across M1
Figure 5.3 (A) Transient Pulses on Inverter Cell for Radiation Strikes of Varying Strength, (B) Corresponding Results on an Inverter Cell Protected with Radiation Blocker Circuit
Figure 5.4 Simulation Flow: SER Reduction by Using Radiation Blocker Circuits
Figure 5.5 Layout of the Radiation Blocker Circuitry
Figure 5.6 Comparison of SER Reduction for Different User Defined Parameters
Figure 5.7 Comparison of Delay Overhead for Different User Defined Parameters
Figure 5.8 Comparison of Area Overhead for Different User Defined Parameters
Figure 6.1 Interaction of Soft Errors and Crosstalk Noise
Figure 6.2 Soft Error Mitigation under Process Uncertainty
Figure 6.3 First Order Model on Soft Errors of Logic Circuits with Varying Gate Size
Figure 6.4 Modeling Crosstalk Noise using Graph Clustering based on Rent’s Exponent Values
Figure 6.5 Simulation Flow for Reliability-centric Gate Sizing Under Process Variations
Figure 6.6 Average Timing Yield at Different Timing Margins
Figure 6.7 SER Reduction for ISCAS85 Benchmarks
Figure 6.8 Improvement in SER, Crosstalk Noise and Power
Figure 7.1 Vulnerability of Different Cache Organizations for SPECINT2000.
Figure 7.2 Vulnerability of Different Cache Organizations for SPECFP2000.
Figure 7.3 Illustrating Inclusion Property and Fine Grain Dirtiness
Figure 7.4 Illustrating Reliability-centric Replacement and Small Value Duplication
Figure 7.5 Hardware Architecture for Small Value Detection and Duplication
Figure 7.6 Vulnerability of the L2 Cache for Various Schemes Proposed for SPECINT2000.
Figure 7.7 Vulnerability of the L2 Cache for Various Schemes Proposed for SPECFP2000.
Figure 7.8 Global Miss Rates of the L2 Cache for Various Schemes Proposed for SPECINT2000.
Figure 7.9 Global Miss Rates of the L2 Cache for Various Schemes Proposed for SPECFP2000.
Figure 7.10 IPCs for Various Schemes Proposed for SPECINT2000.
Figure 7.11 IPCs for Various Schemes Proposed for SPECFP2000.
Figure 7.12 Write Back Traffic Rate to the Main Memory for Various Schemes Proposed for SPECINT2000.
Figure 7.13 Write Back Traffic Rate to the Main Memory for Various Schemes Proposed for SPECFP2000.
Figure 7.14 Area Overhead for a L2 Cache with Redundancy Based Error Protection Compared to a Baseline L2 Cache without Error Protection
Figure 7.15 Average Dynamic Power Consumption for a L2 Cache with a 8KB ECC Cache Compared with Baseline L2 Cache without Error Protection
Figure 7.16 Average Leakage Power Consumption for a L2 Cache with Small ECC Cache with Fixed Number of Blocks for Different Sized Multi-bit Errors
ARCHITECTURES AND ALGORITHMS FOR MITIGATION OF SOFT ERRORS IN
NANOSCALE VLSI CIRCUITS
Koustav Bhattacharya
ABSTRACT
The occurrence of transient faults like soft errors in computer circuits poses a significant
challenge to the reliability of computer systems. A soft error, which occurs when energetic neutrons
coming from space or alpha particles arising out of packaging materials hit the transistors, may
manifest itself as a bit flip in a memory element or as a transient glitch generated at any
internal node of combinational logic, which may subsequently propagate to and be captured in a
latch. Although soft errors were earlier a concern only for space applications, aggressive
technology scaling trends have extended the problem to modern VLSI systems, even in
terrestrial applications.
In this dissertation, we explore techniques at all levels of the design flow to reduce the
vulnerability of VLSI systems to soft errors without compromising other design metrics like delay,
area and power. We propose new models for estimating soft errors in storage structures and
combinational logic. While soft errors in caches are estimated using the vulnerability metric, soft
errors in logic circuits are estimated using two new metrics called the glitch enabling probability
(GEP) and the cumulative probability of observability (CPO). These metrics, based on signal
probabilities of nets, accurately model soft errors in radiation-aware synthesis algorithms and help
in efficient exploration of the design solution space during optimization. At the physical design
level, we leverage larger netlengths to provide larger RC ladders for effectively filtering out
transient glitches. Towards this, a new heuristic has been developed to selectively assign larger
wirelengths to certain critical nets. This limits the delay and area overhead while improving the
immunity to soft errors. Based on this, we propose two placement algorithms based on simulated
annealing and quadratic programming which significantly reduce the soft error rates of circuits.
At the circuit level, we develop techniques for hardening circuit nodes using a novel radiation
jammer technique. The proposed technique is based on the principles of an RC differentiator and is
used to isolate the driven cell from the driving cell which is being hit by a radiation strike. Since
the blind insertion of radiation blocker cells on all circuit nodes is expensive, candidate nodes are
selected for insertion of these cells using a new metric called the probability of radiation blocker
circuit insertion (PRI). At the logic level, we investigate a gate sizing algorithm in which we
simultaneously optimize both the soft error rate (SER) and the crosstalk noise, besides the power and
performance of circuits, while considering the effect of process variations. The reliability-centric
gate sizing technique has been formulated as a mathematical program and is efficiently solved.
At the architectural level, we develop solutions for the correction of multi-bit errors in large L2
caches by controlling or mining the redundancy in the memory hierarchy, and methods to increase
the amount of redundancy in the memory hierarchy by employing a redundancy-based replacement
policy, in which the amount of redundancy is controlled using a user-defined redundancy threshold.
The novel architectures and the new reliability-centric synthesis algorithms proposed for the
various design abstraction levels have been shown to achieve significant reduction of soft error rates
in current nanometer circuits. The design techniques, algorithms and architectures can be integrated
into existing design flows. A VLSI system implementation can leverage the architectural
solutions for the reliability of the caches, while the custom hardware synthesized for the VLSI system
can be protected against radiation strikes by utilizing the circuit level, logic level and layout level
optimization algorithms that have been developed.
CHAPTER 1
INTRODUCTION
The race to innovate has led to unprecedented progress in the field of electronic computing. This
has contributed to the ubiquitous use of VLSI systems in personal computers and large scale servers,
portable and mobile electronic systems like laptop computers, cellular phones and music players, and
various embedded computing systems deployed in televisions, cars and almost every consumer
electronic system. As this revolution continues, the cost of computing decreases even further
and applications which were economically infeasible slowly become practical. The high rate of
growth in VLSI technology is sustained by scaling the minimum feature sizes of transistors to
smaller and smaller dimensions along with the continuous reduction in the operating supply and
threshold voltages. This trend in technology scaling has helped the design of modern VLSI systems
for higher performance and lower power consumption. Higher integration densities, increase in
operating frequencies and reduction of supply voltages, however, make reliability of these systems
a key concern.
High electric fields in scaled devices give rise to effects like hot carrier injection (HCI)
and negative bias temperature instability (NBTI), which often manifest themselves as an increase in
the threshold voltage and can lead to device slowdown and eventually result in the timing failure of
the circuit. The strong electric fields in wires often cause momentum transfer during collisions
between conducting electrons and metal atoms (electromigration), which can lead to "shorts" or
"opens" in interconnects. However, HCI, NBTI, electromigration and other effects due to high electric
fields develop over time, causing permanent failures, and hence impact only the availability and
lifetime of the designs.
Transient faults, on the other hand, cause intermittent errors in VLSI systems. These can occur
due to several reasons like soft errors, power supply and interconnect noise, and electromagnetic
interference. Soft errors occur when energetic neutrons coming from space or alpha particles
arising out of packaging materials hit the transistors. When such high energy particles strike the
diffusion region of VLSI circuits, a voltage spike may be generated on the affected circuit node. A
voltage spike of high magnitude can result in a soft error in the circuit. A soft error may manifest
itself as a bit flip in a latch or memory element. Additionally, soft errors can occur in any internal
node of combinational logic and subsequently propagate to and be captured in a latch. Radiation
induced soft errors are one of the biggest contributors to transient faults and present the biggest
challenge to the reliability of VLSI systems implemented with current nanometer technologies.
Figure 1.1 Impact of SER due to Gate Length Scaling in nm (Source Intel) [8]
1.1 Motivation
In the past, soft errors have been a significant problem only in space applications. However,
recent trends in technology scaling have greatly decreased the radiation immunity of electronic
circuits and have made nanometer designs highly susceptible to transient faults. Figure 1.1
illustrates that the soft error rate of several VLSI processor systems designed at Intel has grown
exponentially due to technology scaling trends. Moreover, while space applications could use advanced
fabrication technology and packaging material to reduce soft errors, such techniques are typically
quite expensive to implement in low cost commercial systems.
VLSI systems are increasingly being designed as System-on-Chip implementations. System-on-Chips
are being designed for a wide range of applications, from general purpose computing
to special purpose VLSI systems. General purpose computing implementations on System-on-Chips
require large caches and off-the-shelf processors to enhance the performance of the applications. On
the other hand, special purpose applications implemented on System-on-Chips have a large amount
of special purpose hardware which is synthesized from system level RTL code. The chip
manufacturers typically set budgets on soft error rates for such systems, which should be met by the
resulting design with low overheads in performance and power.
Figure 1.2 Design Flow for Soft Error Tolerance in VLSI Systems
An effective approach for mitigation of soft error effects for such a system is to implement steps
for reliability against soft errors in the design flow itself. However, such a unified approach for soft
error mitigation at various design abstraction levels has not been attempted in prior research.
Addressing soft error issues in such a unified design flow gives the system designer the opportunity
to weigh the implications of dedicating more resources to soft error detection and prevention
against the corresponding impact on delay, power and area.
Memory structures have been considered as dominant sources of transient errors in VLSI
systems [1], [2], [3], [4], [47]. These include on-chip caches, DRAMs, register files, and other on-chip
memory structures in VLSI systems. Although, with technology scaling, the soft error rate in
SRAMs has remained constant for a given cache size, the rate of multi-bit errors has increased
significantly across technology generations as device feature sizes shrink. Moreover, multi-bit
errors tend to develop over time in large caches, which are typical in current VLSI systems with
high memory integration density. Architectural strategies for prevention of soft errors in such large
caches have not been explored in prior research.
Further, technology trends like smaller feature sizes, lower voltage levels, higher operating
frequency and reduced logic depth are projected to increase the soft error rate (SER) in combinational
logic as well [28], [8]. In a recent study [32], the SER of logic circuits was quantified in
technology nodes from 600nm to 50nm and it was projected that by 2011, the SER in logic circuits
will increase by nine orders of magnitude and will essentially be comparable to that of unprotected
memory. Thus, there is an imminent need for novel algorithms for synthesizing soft error tolerant
combinational logic circuits in a design flow. The current work fills a major gap in this direction.
Thus, in summary, the motivation for this dissertation is to explore the core issues in the problem
of soft errors, and to develop a design flow framework that optimizes the soft error rate at various
design abstraction levels and in particular encompasses:
• Models for efficient estimation and optimization of soft errors at all the design abstraction
levels.
• Layout level optimization schemes geared towards mitigation of soft errors during automated
physical design.
• Novel circuit level techniques for mitigation of soft errors with low overheads in delay, area and
power.
• Low overhead techniques for mitigation of soft errors at the logic level, especially under the
influence of other uncertain noise effects.
• Efficient architectural solutions that handle multi-bit errors in hardware storage structures
found in current VLSI systems.
A design flow framework incorporating the architectural solutions for the reliability of the
storage structures and novel soft-error aware synthesis algorithms at various design abstraction levels
can be used to implement VLSI systems which are completely immune to radiation induced
reliability issues.
1.2 Contributions and Significance
In this dissertation, we investigate the development of a unified design flow framework for
mitigation of soft errors. Several design and circuit optimization techniques applicable at various
levels of hardware design are explored to improve the reliability of computing systems. We have
experimented significantly in developing novel techniques at each level of the design flow to reduce
the impact of soft errors in VLSI systems. Figure 1.2 illustrates a design flow framework in VLSI
System on Chips. As shown in the figure, the techniques developed as part of this dissertation can
be integrated into such a unified design flow framework to significantly reduce the SER of a VLSI
system with low overheads in delay, area and power.
Figure 1.3 List of Contributions
The theme and the major research works pertaining to this dissertation are summarized in Figure
1.3. The key contributions of this dissertation can be described as follows:
• We propose efficient metrics for estimation of soft errors at various design abstraction levels.
Soft errors of storage structures like caches are estimated using the vulnerability metric, while
soft errors in logic circuits are estimated by using two new metrics called the glitch enabling
probability (GEP) and the cumulative probability of observability (CPO), defined based on the
signal probabilities of the nets. These metrics accurately model soft errors in radiation-aware
synthesis algorithms and help in efficient exploration of the design solution space.
• We develop new algorithms for radiation aware automatic physical design by intelligently
modifying the placement stage in cell based designs. Larger netlengths can provide larger
RC ladders to effectively filter out the transient glitches. Towards this, a new heuristic has
been developed to selectively assign larger wirelengths to certain critical nets to increase
the radiation immunity of circuits with low delay and area overhead. Based on this, we
propose two placement algorithms based on simulated annealing and quadratic programming
that significantly reduce the SER in standard cell based designs of logic circuits.
• We propose a transistor level circuit optimization technique based on a radiation blocker
circuit which significantly reduces the propagation of random glitches due to radiation strikes.
The radiation blocker circuit can fight transient glitches on standard cell outputs due to
random radiation strikes by using an RC differentiator circuit. The circuit is used to isolate the
driven cell from the driving cell which is being hit by a radiation strike. An algorithm based
on ranking circuit nodes using a new metric called the probability of radiation blocker circuit
insertion (PRI) has been developed. The radiation blocker cells are inserted only on the top
few nodes in the sorted list of PRI values.
• We develop a gate sizing algorithm at the logic level of design abstraction that can jointly
optimize the circuit against both radiation induced soft errors and the compounded noise
effects of capacitive crosstalk. Based on a first order model of the size dependence of logic
gates for soft errors and efficient modeling of crosstalk noise and process variations, a
reliability-centric gate sizing algorithm under process variation has been formulated. This
multi-metric gate sizing framework has been formulated as a non-linear mathematical program and
is efficiently solved.
• We propose architectural level techniques for correction of multi-bit errors in the L2 cache by
exploiting the redundancy existing between the write-through L1 cache and the L2 cache and
the redundancy existing between the clean data lines of the L2 cache and the main memory.
We also develop methods to increase the amount of redundancy in the memory hierarchy
by employing a redundancy-based replacement policy, in which the amount of redundancy
is controlled using a redundancy threshold. We also investigate techniques to mine
redundancy at the word level by duplicating small memory values in the upper half of the
memory word.
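The word-level redundancy idea in the last contribution can be illustrated with a minimal sketch. It assumes a 64-bit memory word, treats a value as "small" when it fits in the lower 32 bits, keeps a hypothetical per-word flag marking duplicated words, and stores a single parity bit of the small value so that the intact half can be identified. None of these parameters or function names come from the dissertation, and the hardware described in Chapter 7 may differ.

```python
from typing import Optional, Tuple

WORD_BITS = 64                      # assumed memory word width
HALF_BITS = WORD_BITS // 2
HALF_MASK = (1 << HALF_BITS) - 1


def parity(x: int) -> int:
    """Even/odd parity (0 or 1) of the set bits in x."""
    return bin(x).count("1") & 1


def encode_word(value: int) -> Tuple[int, bool]:
    """Return (stored_word, duplicated_flag) for an unsigned word value."""
    if not 0 <= value < (1 << WORD_BITS):
        raise ValueError("value does not fit in the assumed word width")
    if value <= HALF_MASK:
        # Small value: the upper half would be all zeros, so reuse it for a copy.
        return (value << HALF_BITS) | value, True
    return value, False


def recover_small_value(stored: int, value_parity: int) -> Optional[int]:
    """Return the half whose parity still matches, or None if neither does.

    Parity only detects an odd number of flipped bits in a half; this is a
    simplification chosen for the sketch.
    """
    lower = stored & HALF_MASK
    upper = stored >> HALF_BITS
    if parity(lower) == value_parity:
        return lower
    if parity(upper) == value_parity:
        return upper
    return None
```

In this sketch a bit flip in one half of a duplicated word leaves the other half usable, which is the sense in which small values carry their own redundancy; large values gain nothing and must rely on other copies in the memory hierarchy.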
Thus, we have explored both novel architectures and novel reliability-centric synthesis
algorithms for reducing the vulnerability to soft errors and achieved significant reduction of soft
error rates in VLSI systems. The design techniques, algorithms and architectures developed can be
integrated into existing design flows. A VLSI system implementation can leverage the
architectural solutions for the reliability of the caches, while the custom hardware synthesized for the
VLSI system can be protected against radiation strikes by utilizing the circuit level, logic level and
layout level optimization algorithms that have been developed.
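The layout level optimization relies on an interconnect behaving as a low-pass filter for short glitches. As a rough illustration only, the sketch below lumps a wire into a single RC stage (the dissertation models it as an RC ladder), drives it with an ideal rectangular glitch, and reports the peak voltage reached at the far end. Wire resistance and capacitance are assumed proportional to length; the per-unit values, the load capacitance and the function name are illustrative assumptions, not data from this work.

```python
import math


def peak_glitch_voltage(v_glitch, width_s, length_um,
                        r_per_um=0.1, c_per_um=0.2e-15, c_load=1e-15):
    """Peak far-end voltage of a rectangular glitch through one lumped RC stage.

    For a pulse of amplitude v_glitch and duration width_s, a first-order RC
    stage charges to v_glitch * (1 - exp(-width_s / tau)) by the end of the pulse.
    All default parameter values are made up for illustration.
    """
    if length_um <= 0:
        raise ValueError("length_um must be positive")
    resistance = r_per_um * length_um             # ohms, grows with length
    capacitance = c_per_um * length_um + c_load   # farads, grows with length
    tau = resistance * capacitance                # time constant in seconds
    return v_glitch * (1.0 - math.exp(-width_s / tau))


if __name__ == "__main__":
    for length in (100, 1_000, 10_000):
        peak = peak_glitch_voltage(1.0, 100e-12, length)
        print(f"{length:>6} um -> peak {peak:.3f} V")
```

Because both R and C grow with length, the time constant of this simplified model grows roughly quadratically with wirelength, so a longer net attenuates a glitch of fixed width more strongly. The same growth in the time constant also increases signal delay, which is why longer wires are worth assigning only to selected critical nets.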
1.3 Outline of Dissertation
The remainder of this dissertation is organized as follows. Chapter 2 describes the related work
pertaining to our research. In Chapter 3, we develop novel metrics for modeling soft errors in
VLSI circuits. These metrics are used extensively throughout the dissertation. In Chapter 4, we
show that nets with larger wirelengths act as larger RC ladders and can effectively filter
out transient glitches due to radiation strikes. Based on this, we present two placement algorithms
that place standard cells so as to provide larger wirelengths for soft error critical nets while
simultaneously constraining the chip area and the total wirelength. We show that such placement
algorithms can significantly reduce the SER of logic circuits. In Chapter 5, we propose a circuit level
technique based on an RC differentiator circuit which can be inserted at the output of a logic cell to
prevent the generation of transient glitches due to radiation strikes. The radiation blocker circuit
has the configuration of an RC differentiator and is used to cut off the driven cell from the driving
cell which is hit by a radiation strike. Further in that chapter, we propose an algorithm to insert
radiation blocker cells only on selected nodes in a logic circuit. The algorithm ranks the circuit
nodes using a new metric called the probability of radiation blocker circuit insertion
(PRI) and inserts these cells only on the top few nodes in the sorted list of PRI values. In Chapter
6, we develop a first order model of the soft error phenomenon in logic circuits and incorporate
power and delay metrics to formulate a convex programming based reliability-centric gate sizing
technique. In Chapter 7, we investigate in detail the multi-bit soft error rates in large L2 caches and
propose a framework of solutions for their correction based on the amount of redundancy present
in the memory hierarchy. We investigate several new techniques for reducing multi-bit errors in
large L2 caches, in which the multi-bit errors are detected using simple error detection codes and
corrected using the data redundancy in the memory hierarchy. We also propose several techniques
to control/mine the redundancy in the memory hierarchy to further improve the reliability of the
L2 cache. The concluding remarks, suggested future work in terms of extensions to the
problems addressed in this dissertation, and other ideas for further refinements are given in Chapter
8.
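The detect-then-recover approach outlined above for Chapter 7 can be sketched as follows. This is a simplified software model under stated assumptions, not the dissertation's architecture: per-byte parity stands in for the "simple error detection codes", a line present in a write-through L1 is assumed to hold the same data as the L2 line, and a clean L2 line is assumed to match main memory. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


def byte_parities(data: bytes) -> Tuple[int, ...]:
    """One parity bit per byte; a stand-in for a simple detection code."""
    return tuple(bin(b).count("1") & 1 for b in data)


@dataclass
class L2Line:
    data: bytes
    dirty: bool
    check: Tuple[int, ...]   # detection code computed when the line was written


def read_with_recovery(addr: int,
                       l2: Dict[int, L2Line],
                       l1: Dict[int, bytes],
                       memory: Dict[int, bytes]) -> Optional[bytes]:
    """Read an L2 line, repairing it from a redundant copy if corruption is detected.

    Returns None when an error is detected but no redundant copy exists.
    """
    line = l2[addr]
    if byte_parities(line.data) == line.check:
        return line.data                  # no error detected
    if addr in l1:
        good = l1[addr]                   # write-through L1 holds the same data
    elif not line.dirty:
        good = memory[addr]               # clean line: main memory copy is valid
    else:
        return None                       # dirty and not in L1: unrecoverable here
    l2[addr] = L2Line(good, line.dirty, byte_parities(good))
    return good
```

The unrecoverable case, a dirty line with no L1 copy, is the one that the redundancy-controlling techniques mentioned above aim to make rare.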
CHAPTER 2
RELATED WORK
The trends in technology scaling have helped the design of modern microprocessors for higher
performance and lower power consumption through the rapid shrinking of the minimum feature size
as well as the reduction of supply voltages [8]. At the same time, microprocessors are being built
with a higher degree of spatial parallelism and deeper pipelines to increase the clock frequency [6].
Unfortunately, however, these trends make them more susceptible to transient faults [7–11]. Several
different strategies have been investigated in the past to avoid, detect and recover from soft errors
[12]. These solutions are applied at various levels of the system, from the process technology and
circuit levels to the microarchitecture level. In this chapter, we first review the previous works found
in the literature for soft error mitigation at various design abstraction levels and then present the
context of this dissertation in the light of these previous works.
2.1 Previous Work
Memory structures have been considered as dominant sources of transient errors in computer
systems [1–4, 47]. These include on-chip caches, DRAMs, register files, and other on-chip memory
structures. As shown in the taxonomy diagram given in Figure 2.1, L2 caches have traditionally
been protected against soft errors using error correction codes (ECC) [1, 2, 4]. The tasks of
detecting and correcting soft errors using ECC, however, incur a large penalty in area.
For example, double error correction and double error detection (DECDED) codes require 14 bits,
for each 64-bit memory word, corresponding to a 22% area overhead. Multi-bit error protection
using sophisticated ECC protection will also require more bit lines and wider sense amplifiers thus
increasing the cache access latency and power consumption. Spatial multi-bit errors can also be
avoided by using layout level techniques like physical interleaving [15]. However, with higher
interleaving factors, multiple word lines need to be driven and data need to be re-grouped or routed
for read/write operations, thus increasing the cache access latency. Multi-bit errors can be avoided
by correcting single-bit errors during scrubbing, before they develop into temporal multi-bit errors
through another particle strike. However, choosing the right scrub interval is often difficult [16]. Most
importantly, scrubbing cannot eliminate spatial multi-bit errors since spatial multi-bit errors occur
due to a single particle strike rather than evolving over time.

Figure 2.1 Taxonomy Diagram: Works Related to Transient Faults on Caches.
Several schemes have been proposed in the literature to reduce the area overhead associated
with protecting memory by ECC codes [17]. In [18], error protection is suggested for frequently
accessed cache lines. In [19], the authors described the use of a dead block prediction technique
to hold the copy of data found in active cache blocks. A larger ECC word can also be used to
compensate for the area overhead [20]. However, since the unit of memory read/write is based on
word granularity, each memory read/write requires reading several data words to generate SECDED
check bits. In [21], a small fully associative "replication cache" is used to hold replicas
of writes, which are used to detect and correct errors. In [22], the authors mention the use of
redundancy for area efficient error protection. However, detailed results in the context of multi-bit
errors have not been provided. Recently, in [23–25], several techniques have been proposed for
area efficient multi-bit error correction. In [23], the authors have proposed to reduce the multi-bit
soft-errors of L1 caches using last store prediction. In [24], the authors have proposed the use of
two-dimensional error codes which can correct clustered 32x32 errors with significantly smaller
overheads in area, performance and power.
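The area argument behind wider ECC words can be illustrated with a short calculation. The sketch below assumes a standard Hamming-based SECDED code (an assumption; the cited works may use different codes), for which the number of check bits is the smallest r satisfying 2^r >= k + r + 1 for k data bits, plus one overall parity bit. The function name is illustrative only.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits of a Hamming-based SECDED code protecting `data_bits` bits."""
    r = 1
    # Smallest r whose syndrome can address every data and check bit position.
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit turns SEC into SECDED


for k in (32, 64, 128, 256, 512):
    c = secded_check_bits(k)
    print(f"{k:4d} data bits: {c:2d} check bits, overhead {100.0 * c / k:5.2f}%")
```

Under this assumption a 64-bit word needs 8 check bits (12.5% overhead) while a 256-bit word needs 10 (about 3.9%), which shows why a larger ECC word lowers the area overhead at the cost of touching more data on each access.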
The soft errors that do not affect the program output are considered benign as no error is ob-
served by the user. This situation can occur, for example, in branch prediction logic or in the
instructions from the mis-speculated execution sequences which never commit and thus, will never
lead to visible error states. Soft errors which affect the program output are typically defined in terms
of failures in time (FIT) [13, 14]. The chip manufacturers typically set budgets on soft error rates
which should be met by the design.
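As a brief illustration of the FIT metric, the sketch below uses the standard definition of one FIT as one failure per 10^9 device-hours. The component names and FIT values are hypothetical, and summing per-component rates assumes independent failures.

```python
HOURS_PER_FIT_UNIT = 1e9      # one FIT = one failure per 10^9 device-hours
HOURS_PER_YEAR = 24 * 365


def fit_to_mttf_years(fit: float) -> float:
    """Mean time to failure, in years, for a given FIT rate."""
    return HOURS_PER_FIT_UNIT / fit / HOURS_PER_YEAR


def chip_fit(component_fits: dict) -> float:
    """Total FIT of a chip, assuming independent component failures."""
    return sum(component_fits.values())


# Hypothetical per-component soft error rates, for illustration only.
budget = {"l2_cache": 600.0, "register_file": 150.0, "logic": 250.0}
total = chip_fit(budget)
print(f"{total:.0f} FIT corresponds to an MTTF of {fit_to_mttf_years(total):.1f} years")
```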
Single event transients can also occur in any internal node of a combinational logic circuit and
subsequently propagate to and be captured in a latch. Although soft errors have been a greater concern
for memory elements, technology trends like smaller feature sizes, lower voltage levels, higher op-
erating frequency and reduced logic depth, are projected to increase the soft-error rate (SER) in
combinational logic beyond that of unprotected memory elements [8, 28]. A taxonomy diagram
illustrating the various approaches for protecting logic circuits against soft errors is shown in Figure
2.2. As shown, soft errors can be prevented in logic circuits by using various circuit level opti-
mization techniques. In [35], time redundancy is exploited to detect and recover from soft-errors.
In [47], a technique for correction of logic soft errors using c-elements has been proposed. In [85],
concurrent error detection circuits are added to nodes in logic circuits which have high soft error
susceptibility. In [36], soft error protection in domino logic and sequential cells is achieved by
explicitly adding capacitors to the feedback node. In [40], the idea has been extended to combinational
logic circuits. However, as the stored charge in the keeper becomes smaller due to technology
scaling, the technique becomes inefficient in fighting transient glitches due to radiation strikes.
In [39], gates are locally duplicated and the duplicate nodes are connected by a voltage clamper
circuit. This prevents the output node of the gate and its duplicate node from deviating in voltage
due to a radiation strike. This technique, however, doubles the area and power overhead. The effect
is especially severe for complex cells or cells with higher drive strengths. Adding shadow gates
for such cells with a large silicon footprint makes the area and power overhead significant. In [37],
the logic gates that are bombarded by radiation strikes are isolated using complementary pass gates.
The complementary pass gates act as a low pass filter and filter out transient voltage pulses due to
a radiation strike. In [41], a class of soft error masking circuits is proposed using a Schmitt trigger
circuit. These techniques, however, achieve only a marginal reduction in the radiation induced glitch
magnitude and cannot completely eliminate the transient.

Figure 2.2 Soft Error Protection Using Circuit Level Optimizations
Placement is the process of arranging the circuit components on a layout surface to minimize
a certain cost metric. This cost metric may be the overall chip area, which is simply the area of
the bounding box enclosing the circuit components, or the total wirelength. Good total wirelength costs
not only predict routability and routing area but also provide an "easy to compute" rough estimate of
the circuit delay. In general, the placement problem is an NP-complete problem even for the simplest
case of 1D placement [49]. Therefore, several heuristic algorithms have been proposed that can give
"good" solutions in a reasonable amount of time. An efficient and unique representation of a valid
placement configuration is through the use of sequence pairs [53]. Placement algorithms have been
used for improving power, delay [49], crosstalk [57], routability [51], parametric yield [50], etc.
However, we show that placement can also be used for optimizing the netlength distribution and can
serve as an effective tool for reducing circuit SER. The reduction of SER can be achieved by selectively
optimizing netlengths for soft error critical nets. Towards this, we have developed new placement
algorithms for radiation immunity of logic circuits using a standard cell based design flow. To the
best of our knowledge, SER reduction using optimizations at the placement stage is attempted for
the first time in this dissertation.

Figure 2.3 Taxonomy Diagram of Works in Gate Sizing Based on Optimization Metrics
Gate sizing is a simple yet effective technique for optimizing circuits for performance, power
and reliability. Figure 2.3 summarizes the various metrics used in circuit optimization during gate
sizing. Traditionally, the gate sizing problem has been formulated as an unconstrained delay
minimization problem or as an area and power minimization problem under delay constraints [81]. Many
gate sizing formulations have attempted to improve power, area or noise under an acceptable parametric
yield. In [83], the optimization uses a penalty function to improve the slacks of critical paths
to improve yield. A stochastic programming approach with chance constraints is used in [76] to
incorporate yield in the gate sizing problem formulation. Recently in [72], the joint optimization of
power and delay under process variations has been attempted. The continuous shrinking of noise
margins due to feature size scaling has made nanometer circuits increasingly vulnerable to reliability
issues like soft errors and crosstalk noise [103]. In [84], asymmetric logical masking probability
of nodes in a logic circuit is exploited to selectively resize gates. In the pioneering works [100,104],
the authors have proposed the use of probabilistic computation using Markov Random Fields to
provide immunity against soft faults. Flip-flop selection is used to reduce the impact of soft errors
in [30]. In [97], an iterative gate sizing algorithm has been proposed to perform coupling noise
reduction. In [99], the authors propose a two pass method to resize the gates such that the noise
constraints are satisfied without violating the timing constraints. It is pointed out in [56] that the
above techniques involve high design overhead and lack scalability. More importantly, despite ex-
tensive research in single noise sources, few works have focused on development of analysis and
joint optimization techniques for crosstalk noise and soft error effects. However, as we discuss
later, a deeper inter-relationship exists between radiation induced noise and capacitively coupled
noise. All the above techniques, in general, apply to a single noise source and cannot address the
evolving reality of multiple interacting noise sources under process variations.
The impact of parameter variations on performance, power and reliability has been increasing
with each technology generation. The variations are caused either by environmental effects like
changes in power supply voltage and temperature or by physical effects like changes
in transistor width, channel length, oxide thickness and interconnect dimensions. The uncertainty
in the process parameters due to the imprecision of the fabrication process deeply impacts timing,
power and noise characteristics of circuits. Thus, identically designed circuits can have huge dif-
ferences in these characteristics leading to loss in the parametric timing, power, and noise yield in
these circuits. In this dissertation, we investigate a challenging problem to address the evolving
reality of multiple interacting noise sources under process variations using gate sizing.
2.2 Dissertation Context in Light of Past Works
As the SER in current nanometer circuits is increasing exponentially, there is an imminent
need for novel algorithms and architectures for synthesizing soft error tolerant circuits in a design
flow. The correlating impact of soft error mitigation along with the associated overhead in a unified
manner is not considered in previous works. This is especially true for current nanometer chips
since VLSI systems are increasingly being designed as System-on-Chip implementations. An ef-
fective approach for mitigation of soft error effects for such a system is to implement steps for
reliability against soft errors in the design flow itself. However, such a unified approach for soft er-
ror mitigation at various design abstraction levels has never been tried before in any prior research.
Addressing soft error issues in such a unified design flow gives the system designer the opportunity
to weigh up the implications of dedicating more resources for soft error detection and prevention
against the correlating impact on delay, power and area. Moreover, multi-bit errors
in large caches in these systems and strategies for prevention of such multi-bit soft errors have not
been explored in prior research.
In this dissertation, we explore techniques at all levels in the design flow to reduce the vulnerability
of VLSI systems to soft errors without compromising on other design metrics like delay,
area and power. We propose new metrics to model the glitch masking in a circuit using the signal
probabilities of the nets. We leverage the use of larger netlengths as we propose two placement
algorithms based on simulated annealing and quadratic programming that significantly reduce the
soft error rates of circuits. To the best of our knowledge, this is the first work for SER reduction
using optimizations at the placement stage. Further, we explore approaches for hardening of se-
lective nodes within a circuit which significantly reduces the probability of generation of random
radiation induced glitches. The technique achieves a superior reduction in SER at lower overheads
than any of the works listed above. We developed a reliability-centric gate sizing algorithm considering
compound noise effects under process variation using a multi-metric optimization framework.
Simultaneous optimization of soft errors along with other design metrics under process variations
is a challenging problem and has not been attempted before. We also explore efficient architectural
solutions that handle multi-bit errors in hardware storage structures. Unlike the previous works, our
architectural techniques target multi-bit errors in large caches and achieve high SER reduction at
minimum area and power overheads and with virtually no performance penalty. The design tech-
niques, algorithms and architectures can be integrated into existing design flows for VLSI systems
using system-on-chips. The current work is thus unique and fills the void in designing reliable and
soft error tolerant VLSI systems implemented as system on chips in a unified manner.
CHAPTER 3
ESTIMATION MODELS
In this chapter, we describe the preliminaries on soft errors and develop models for estimation
of soft errors in circuits. We also describe the estimation of timing slack on circuit nodes which we
use extensively throughout this dissertation.
3.1 Background
The occurrences of random radiation induced energetic neutron strikes are generally distributed
fairly uniformly in space and time. The probability of a particle strike in a circuit node is thus
roughly proportional to its active area. The device level raw SER can be expressed by the following
empirical model [27],
$$\mathrm{SER}_{device} \propto F \cdot A_d \cdot K \cdot e^{-Q_{crit}/Q_s} \qquad (3.1)$$
where F is the total neutron flux within the whole energy spectrum, Ad are diffusion areas
which are sensitive to the particle strikes, K is a technology dependent fitting parameter, Qcrit is the
critical charge, and Qs is the charge collection efficiency of the device. The threshold critical charge,
Qcrit, marks the onset of the double exponential current pulse behavior described above. As the
technology scales, the charge Qcrit required to create a soft error upset decreases considerably. For
memory elements, this glitch voltage is fed back, creating a metastable condition that finally results
in a bit flip in the information stored in the memory element. Memories have been traditional victims
of radiation induced soft errors. This is due to the dense layouts of memory cells comprising
small transistors, leading to lower capacitances and very little charge being held to represent a state.
Although the SER in SRAMs has remained constant for a given cache size with technology scaling,
the rate of multi-bit errors has increased significantly with the shrinking device geometries. Spatial
multi-bit errors occur when a single particle strike upsets multiple adjacent cells. A higher packing
of the cells in the same active area can now cause a single radiation strike to affect multiple cells
simultaneously, potentially leading to multi-bit errors. Temporal multi-bit errors can also occur in
large caches when multiple independent particles affect bits in the same word at different times,
primarily due to the larger mean lifetime of cache lines in bigger caches.
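To make the exponential dependence on the critical charge in Equation 3.1 concrete, the following sketch evaluates its right-hand side up to the proportionality constant. All numeric values are hypothetical and dimensionless, chosen only to show the trend; they are not measured data.

```python
import math


def relative_device_ser(flux, area, q_crit, q_s, k=1.0):
    """Right-hand side of Eq. 3.1, up to the proportionality constant."""
    return flux * area * k * math.exp(-q_crit / q_s)


# Hypothetical scaling step: the sensitive area halves, and so does Qcrit.
base = relative_device_ser(flux=1.0, area=1.0, q_crit=10.0, q_s=2.0)
scaled = relative_device_ser(flux=1.0, area=0.5, q_crit=5.0, q_s=2.0)
print(f"relative change in device SER: {scaled / base:.2f}x")
```

With these illustrative numbers the exponential term outweighs the smaller sensitive area, so the modeled per-device rate rises even though the device shrinks. How the two effects balance in a real process depends on the actual values of Qcrit and Qs.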
For combinational logic circuits, a particle strike on a circuit node can only manifest into a soft
error depending on the circuit topology. Though the characteristics of a transient pulse at a node
depends on the energy distribution of the incident particle, the drive strength of the gate, and the
critical charge, various masking factors determine whether the transient pulse can actually propagate
to the primary outputs/latches/flip-flops and result in a soft error [84]. The charge deposition at a
particular circuit node is traditionally modeled in simulations by a double exponential current pulse
Iin(t) [27], which can be represented as,
$$I_{in}(t) = \frac{Q}{\tau_\alpha - \tau_\beta}\left(e^{-t/\tau_\alpha} - e^{-t/\tau_\beta}\right) \qquad (3.2)$$
where Q is the charge deposited as a result of a particle strike, τα is the collection time-constant
of the junction, and τβ is the ion-track establishment time constant. τα and τβ are generally defined
by process parameters.
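The pulse shape of Equation 3.2 can be explored numerically with a minimal sketch such as the one below. The charge and time constants are hypothetical placeholders, not values from a particular process. The peak time follows from setting the derivative of Equation 3.2 to zero.

```python
import math


def strike_current(t, q, tau_alpha, tau_beta):
    """Eq. 3.2: injected current at time t >= 0 for deposited charge q."""
    scale = q / (tau_alpha - tau_beta)
    return scale * (math.exp(-t / tau_alpha) - math.exp(-t / tau_beta))


def peak_time(tau_alpha, tau_beta):
    """Time at which Eq. 3.2 reaches its maximum (from dI/dt = 0)."""
    return tau_alpha * tau_beta / (tau_alpha - tau_beta) * math.log(tau_alpha / tau_beta)


# Hypothetical values: 100 fC deposited, 200 ps collection, 50 ps track establishment.
Q, TAU_ALPHA, TAU_BETA = 100e-15, 200e-12, 50e-12
t_peak = peak_time(TAU_ALPHA, TAU_BETA)
print(f"peak at {t_peak * 1e12:.1f} ps, "
      f"{strike_current(t_peak, Q, TAU_ALPHA, TAU_BETA) * 1e6:.1f} uA")
```

Integrating the pulse over time recovers the deposited charge Q, which is a convenient sanity check when the current source is used in a circuit simulation.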
Next, we describe the three primary factors that can potentially mask radiation induced transient
current pulses.
• Logical masking occurs when there is no sensitized path from the gate node where the transient
pulse occurs to any of the primary outputs. The transient pulse is filtered when it arrives
at an input of a gate for which any of the other inputs is at a controlling logic value.
• Electrical masking occurs due to electrical attenuation of the transient pulse along a sensitized
path, from its occurrence at a particular gate node to any of the primary outputs. Thus, the
extent of electrical masking depends on the electrical properties of the gates in the sensitized
path.
• Timing-window masking occurs when the transient pulse does arrive at the primary outputs
with sufficient strength to cause a soft error but is sufficiently separated in time from the
arrival of the clock edge. As the latch only samples its input on the clock edge, and as the
transient pulse is momentary, it does not lead to a soft error.

Figure 3.1 Masking in Logic Circuits
We illustrate these masking effects using the example circuit shown in Figure 3.1. As shown
in the figure, a transient pulse is generated on net B due to a radiation strike on the active area of
gate G2. The transient pulse is logically masked at the output of gate G5 as its other input is at its
controlling value (0). However, the transient pulse is sensitized through gate G4 and suffers some
electrical attenuation. The pulse is further attenuated through gate G6. If the transient pulse at the
primary output is of sufficient strength it may lead to a soft error provided it is within a timing
window of the clock edge, i.e., the pulse must arrive some time before the clock edge (set-up time
constraint) and stay till some time after the clock edge (hold-time constraint).
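The timing-window condition stated above can be written as a simple predicate. This is a simplified sketch that treats the transient as an ideal rectangular pulse. The function name, parameter names and the picosecond values are illustrative assumptions.

```python
def is_latched(pulse_arrival, pulse_width, clock_edge, t_setup, t_hold):
    """True if the pulse spans the whole window [edge - setup, edge + hold]."""
    pulse_end = pulse_arrival + pulse_width
    arrives_in_time = pulse_arrival <= clock_edge - t_setup   # set-up constraint
    stays_long_enough = pulse_end >= clock_edge + t_hold      # hold constraint
    return arrives_in_time and stays_long_enough


# Hypothetical timings in picoseconds: clock edge at 1000, setup 50, hold 30.
print(is_latched(900, 200, 1000, 50, 30))   # pulse covers the window
print(is_latched(400, 200, 1000, 50, 30))   # pulse ends long before the edge
```

A pulse that fails either check is masked by the timing window even if it reaches the latch input with full amplitude.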
These masking effects thus give the internal circuit nodes varying levels of susceptibility
to soft errors [84]. Thus, the SER of the overall circuit is often quite different from the
accumulated device level SER as given by equation 3.1. The system level SER at the architectural
level can be calculated as,
$$\mathrm{SER}_{system} = \mathrm{SER}_{ckt} \times \mathrm{Vulnerability} \qquad (3.3)$$
Vulnerability depends on the target applications of the system and the architectural choices used
to implement the system. The vulnerability of cache structures is studied in detail in chapter 7. Soft
error rates also depend on environmental factors like altitude; however, we do not model this in our
formulation.
3.2 Metrics to Estimate Soft Error Masking Effects
The observability metric is inversely proportional to the masking effect of each circuit node.
Thus, the nodes with high observability have lower soft error masking ability than the nodes with
low observability and vice-versa.
3.2.1 Estimating Logical Masking Effects
The glitch enabling probability (GEP) of each net connected to a gate input is defined as the
probability that a glitch at the gate input will propagate to the gate output. The GEP of gate input is
computed as the product of the probabilities that all other inputs of a gate are at the gate’s enabling
value. Thus, mathematically, the GEP of input i of gate j can be computed as,
$$GEP_{ij} = \prod_{k \in inputs(j),\; k \neq i} P_{enab}(k) \qquad (3.4)$$
where inputs(j) is the set of all inputs to gate j and Penab(k) is the probability that input k is at
its enabling value. The enabling value for a gate input depends on the type of gate. For example,
for an AND function the enabling value is logic 1 and the enabling probability is given by,
$$P_{enab}(k) = P_s(k) \qquad (3.5)$$
where Ps(k) is the signal probability of input k, i.e., the probability that the input k is at logic 1.
For the OR function the enabling value is logic 0 and the enabling probability is given by,
$$P_{enab}(k) = 1 - P_s(k) \qquad (3.6)$$
Given the signal probabilities of the primary inputs, the signal probabilities of the internal circuit
nodes can be calculated [44]. Thus, depending on the function of a particular input and the type of
gate itself, the GEP values of each gate input can be calculated. Thus, the signal probabilities and
the GEP values of all internal nodes can be calculated using a forward pass through the circuit
by visiting circuit nodes in the topologically sorted order from the primary inputs to the primary
outputs.
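The forward pass described above can be sketched as follows. The netlist format, net names and gate set are hypothetical choices made for illustration, and signal probabilities are treated as independent. The sketch relies on the standard library modules graphlib (Python 3.9 or later) and math.

```python
from graphlib import TopologicalSorter
from math import prod

# Hypothetical netlist: gate output net -> (gate type, input nets).
NETLIST = {
    "n1": ("AND", ["a", "b"]),
    "n2": ("NAND", ["c", "n1"]),
    "n3": ("OR", ["n1", "n2"]),
}
PRIMARY_INPUT_PROB = {"a": 0.5, "b": 0.5, "c": 0.5}


def enabling_prob(gate_type, p_one):
    # Eq. 3.5 for AND/NAND (enabling value 1), Eq. 3.6 for OR/NOR (enabling value 0).
    return p_one if gate_type in ("AND", "NAND") else 1.0 - p_one


def output_prob(gate_type, input_probs):
    """Probability that the gate output is at logic 1, assuming independent inputs."""
    if gate_type in ("AND", "NAND"):
        p = prod(input_probs)                        # all inputs at logic 1
        return p if gate_type == "AND" else 1.0 - p
    p = 1.0 - prod(1.0 - q for q in input_probs)     # at least one input at logic 1
    return p if gate_type == "OR" else 1.0 - p


def forward_pass(netlist, pi_prob):
    """Return signal probabilities per net and GEP per (input net, gate output net)."""
    prob, gep = dict(pi_prob), {}
    graph = {out: ins for out, (_, ins) in netlist.items()}
    for net in TopologicalSorter(graph).static_order():
        if net not in netlist:
            continue                                 # primary input, already known
        gate_type, ins = netlist[net]
        prob[net] = output_prob(gate_type, [prob[i] for i in ins])
        for i in ins:
            # Eq. 3.4: product of enabling probabilities of all other inputs.
            gep[(i, net)] = prod(enabling_prob(gate_type, prob[k]) for k in ins if k != i)
    return prob, gep


prob, gep = forward_pass(NETLIST, PRIMARY_INPUT_PROB)
print(prob["n1"], prob["n2"], gep[("c", "n2")])
```

In this toy circuit the AND gate output has signal probability 0.25 and the NAND gate output 0.875. Net n1 fans out to two gates that reconverge at n3, so the value computed for n3 is only approximate under the independence assumption.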
It should be noted that in the above formulation we have implicitly ignored the correlation
in signal probabilities of the nets [44]. Calculating GEP values while considering signal
correlations would make the estimation metrics considerably more compute intensive. Thus, in order
to reduce the computational complexity of our optimization schemes using these metrics, we have
assumed independence of signal probabilities of nets at the cost of slight inaccuracy [59]. A more
accurate computation of GEP values that considers correlation in signal probabilities can always
be added to our proposed metrics without significant modification of our formulation.
The logical observability of each net is defined as the probability that a glitch on that net will
propagate to any primary output of the circuit. The computation of the logical observability of a net
is based on the GEP values for gate inputs. The logical observability of a net is 1 for primary output
nets. Given the logical observability of the output net of a gate, the logical observability at an input
net i of the gate j is given as,
$$LogicalObserv(i) = GEP_{ij} \times LogicalObserv(j) \qquad (3.7)$$
Thus, the logical observability of each input of a gate is calculated recursively by multiplying
the logical observability of its output net with the GEP of the corresponding input net. The logical
observability of the stem of a fanout node is computed by considering the maximum logical observability
of all its branches. Thus, the logical observability of a net is computed using a backward
pass of the structural netlist in the topologically sorted order from the primary outputs towards the
primary inputs. The logical observability thus obtained is finally normalized by dividing it by the
maximum logical observability of all nets in the circuit.

Figure 3.2 Illustrating Computation of Logic Observability: (A) Signal Probabilities of Nets, (B)
GEP Values at Internal Gate Inputs, (C) Logical Observability Values at Various Gate Outputs
In Figure 3.2, we illustrate the computation of the logical observability for an example circuit.
The signal probabilities of internal nets are computed using the signal probabilities of the primary
inputs. The signal probability of G2 is computed by taking the product of the signal probabilities of inputs
I3 and I4, which is 0.25. As G3 is a NAND gate, the product of the signal probabilities gives the
probability of the output being at logic 0. Since the probability of the output of gate G3 being at logic
0 is 0.125, the signal probability of that net has the value 0.875. Thus a forward pass through the
structural netlist in the topologically sorted order provides the signal probabilities of all nets, as
shown in Figure 3.2(A). Since the circuit consists of just NAND and AND gates, the enabling value of
all gate inputs is logic 1. So for this circuit, the probability of an input being at its enabling value is
the same as its signal probability. The GEP of each gate input can thus be computed by using equation
3.4 and the previously computed signal probabilities. The computed GEP values of all gates
with internal nets as inputs are shown in Figure 3.2(B). The logical observability values for gates with
internal nets as their outputs are then computed using a backward pass of the structural netlist, using
equation 3.7. As previously discussed, the logical observability for a stem is computed by taking the
maximum of the logical observability of the branches. Thus, the logical observability of the output
of gate G1 is computed by taking the maximum of 0.508 and 0.117, which is 0.508. The computed
logical observability values of all gates with outputs as internal nets are shown in Figure 3.2(C).
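The backward pass of Equation 3.7, including the maximum taken over fanout branches and the final normalization, can be sketched as below. The three-gate circuit and its GEP values are hypothetical and do not correspond to the circuit of Figure 3.2. The data layout is an illustrative assumption.

```python
from graphlib import TopologicalSorter

# Hypothetical circuit: gate output net -> input nets.
FANIN = {"n1": ["a", "b"], "n2": ["c", "n1"], "n3": ["n1", "n2"]}
# Hypothetical GEP values keyed by (input net, gate output net).
GEP = {("a", "n1"): 0.5, ("b", "n1"): 0.5, ("c", "n2"): 0.25,
       ("n1", "n2"): 0.5, ("n1", "n3"): 0.125, ("n2", "n3"): 0.75}


def logical_observability(fanin, gep, primary_outputs):
    """Normalized logical observability of every net (Eq. 3.7 with max at stems)."""
    order = list(TopologicalSorter(fanin).static_order())
    obs = {net: 0.0 for net in order}
    for po in primary_outputs:
        obs[po] = 1.0                                 # primary outputs are fully observable
    for net in reversed(order):                       # primary outputs towards primary inputs
        for i in fanin.get(net, []):
            # One fanout branch of net i; a stem keeps the maximum over its branches.
            obs[i] = max(obs[i], gep[(i, net)] * obs[net])
    peak = max(obs.values())
    return {net: value / peak for net, value in obs.items()}


obs = logical_observability(FANIN, GEP, ["n3"])
print(obs["n1"], obs["n2"])
```

Here net n1 is a stem with two branches (0.125 through n3 directly and 0.5 × 0.75 = 0.375 through n2), so it is assigned 0.375.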
3.2.2 Estimating Electrical Masking Effects
The strength of electrical masking for a particular gate can be estimated by creating noise re-
jection curves (NRC) for that gate type. The NRC for an inverter cell with gate length of 180 nm,
reported in [59], is shown in Figure 3.3. The x-axis denotes the input noise width and the y-axis
denotes the input noise height. All radiation induced SETs which lie below the NRC are noise-immune.
In other words, they either have a width below the corresponding NRC or have a height to
the left of the NRC. Radiation induced SETs which lie above the NRC are noise-sensitive. Thus,
the area under the NRC divided by the area over the NRC corresponds to the electrical masking of
a gate. It should be noted that, for a particular noise pulse with given width and height, the electrical
masking is higher for a gate with higher fanout load. We estimate the electrical observability (which
has an inverse relation to electrical masking) of a gate i at its output net as follows,
$$ElectricalObserv(i) = \frac{W(i)}{C_L(i)} \qquad (3.8)$$
where W(i) is the size of gate i and CL(i) is the capacitive load at node i. In general, the NRC
can be expressed analytically [60], and the inverse relationship to fanout load can also
be shown mathematically under some simplifying assumptions. The maximum electrical
observability in the circuit is used to normalize the node electrical observability.
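Equation 3.8 and its normalization amount to a few lines of code. In the sketch below the gate names, sizes and loads are hypothetical and expressed in arbitrary units.

```python
def electrical_observability(size, load):
    """Eq. 3.8 per gate, normalized by the maximum value in the circuit."""
    raw = {gate: size[gate] / load[gate] for gate in size}
    peak = max(raw.values())
    return {gate: value / peak for gate, value in raw.items()}


# Hypothetical gate sizes W(i) and capacitive loads CL(i), arbitrary units.
SIZE = {"g1": 2.0, "g2": 1.0, "g3": 4.0}
LOAD = {"g1": 4.0, "g2": 1.0, "g3": 16.0}
print(electrical_observability(SIZE, LOAD))
```

Gate g3 is the largest gate but also drives the largest load, so it ends up with the lowest electrical observability of the three.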
3.2.3 Estimating Timing Window Masking Effects
Figure 3.3 NRC for an Inverter at Varying Capacitive Loads [59]

We determine a pessimistic estimate of a timing window such that noise existing in that timing
window (TW) will reach the primary outputs and get latched in the output flops. We estimate the TW
observability at each node by computing the difference between the maximum and the minimum
delay from that node to any primary output. Thus, the TW observability of a gate i at its output net
can be expressed mathematically as follows,
$$TwObserv(i) = \max_{j \in PO} PathToPO(i,j) - \min_{j \in PO} PathToPO(i,j) \qquad (3.9)$$
where PathToPO(i, j) is the path delay from the node i to the primary output j and PO is the set
of primary outputs of the circuit. TW observability can be computed recursively by computing the
maximum and minimum PathToPO(i, j) from the sink (primary outputs) to the source (primary inputs)
while visiting nodes in the reverse topological order. The maximum and minimum PathToPO(i, j) at
the gate outputs connected directly to the primary outputs are set to 0. The gate delay is added while
going from the gate output to the gate input. When a stem is encountered, the maximum (minimum)
of the max (min) of PathToPO(i, j) at the branches is computed. Thus, this pessimistic metric assigns
higher values of TW observability to nodes which have different path delays to the primary outputs.
Intuitively, this makes sense, as the radiation induced noise pulse can occur in a wider time window
and still get captured in the output flops, making the node more vulnerable. This metric is also
normalized by dividing the TW observability at a gate output by the maximum TW observability
found in the circuit.
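The recursive computation of TW observability can be sketched as a single reverse topological traversal. The circuit, the gate delays and the data layout below are hypothetical, and the sketch assumes that every gate output reaches at least one primary output.

```python
from graphlib import TopologicalSorter

# Hypothetical circuit: gate -> gates driving its inputs, and per-gate delays.
FANIN = {"g1": [], "g2": ["g1"], "g3": ["g1", "g2"], "g4": ["g2"]}
DELAY = {"g1": 1.0, "g2": 2.0, "g3": 1.0, "g4": 3.0}


def tw_observability(fanin, delay, primary_outputs):
    """Eq. 3.9 at every gate output, normalized by the circuit maximum."""
    order = list(TopologicalSorter(fanin).static_order())
    fanout = {g: [] for g in order}
    for g, drivers in fanin.items():
        for d in drivers:
            fanout[d].append(g)
    longest, shortest = {}, {}
    for g in reversed(order):                          # sink towards source
        # Add the delay of each fanout gate when stepping back across it.
        maxes = [longest[s] + delay[s] for s in fanout[g]]
        mins = [shortest[s] + delay[s] for s in fanout[g]]
        if g in primary_outputs:                       # directly observable: path delay 0
            maxes.append(0.0)
            mins.append(0.0)
        longest[g], shortest[g] = max(maxes), min(mins)
    raw = {g: longest[g] - shortest[g] for g in order}
    peak = max(raw.values()) or 1.0                    # guard against an all-zero circuit
    return {g: value / peak for g, value in raw.items()}


tw = tw_observability(FANIN, DELAY, {"g3", "g4"})
print(tw)
```

In this example g1 reaches the outputs through paths of delay 5 and 1, giving the widest window, while the gates driving the primary outputs directly have a TW observability of 0.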
Finally, a cumulative probability of observability (CPO) for each gate output is computed which
captures all the three masking effects cumulatively. The CPO of gate i at its output can thus be
expressed as,
$$CPO(i) = LogicalObserv(i) \times ElectricalObserv(i) \times TwObserv(i) \qquad (3.10)$$
It should be noted that while the logical observability has higher values for gates near the
primary outputs, the TW observability there is quite low. The internal gates which are farther away from
the primary outputs have more unbalanced delay paths to the primary outputs and hence have higher
values of TW observability. However, for these nodes, the logical observability is quite low.
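A minimal sketch of how the three normalized metrics combine into the CPO is given below. It assumes the CPO is the product of the three observability values, and the per-gate numbers are hypothetical.

```python
def cumulative_observability(logical, electrical, timing):
    """CPO per gate as the product of the three normalized observabilities."""
    return {g: logical[g] * electrical[g] * timing[g] for g in logical}


# Hypothetical normalized metrics for three gates.
LOGICAL = {"g1": 0.25, "g2": 0.5, "g3": 1.0}
ELECTRICAL = {"g1": 1.0, "g2": 0.5, "g3": 0.5}
TIMING = {"g1": 1.0, "g2": 0.5, "g3": 0.0}

cpo = cumulative_observability(LOGICAL, ELECTRICAL, TIMING)
ranking = sorted(cpo, key=cpo.get, reverse=True)       # most vulnerable gate first
print(ranking, cpo)
```

The example reflects the trade-off noted above: g3 sits at a primary output and has the highest logical observability, yet its zero TW observability gives it the lowest CPO.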
3.2.4 Estimation of Timing Slack
A combinational circuit without feedback can be modeled as a directed acyclic graph (DAG).
The DAG can be made polar by adding a dummy source node connected to all primary inputs and
a dummy sink node connected to all primary outputs. The earliest arrival time (EAT) of each net can
then be computed by traversing the DAG in topological order from the source and assigning the
EAT of a gate output as the maximum of the EATs of its inputs plus the delay of the gate. Similarly,
the latest arrival time (LAT) of each net can be computed by traversing the DAG in reverse
topological order from the sink and assigning the LAT of a gate input as the minimum, over the gates
it drives, of the LAT of the gate output minus the delay of that gate. The difference between the LAT
and the EAT gives the slack for each net.
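The two traversals described above can be sketched in Python as follows. This is only an illustrative sketch: the netlist representation (a dictionary from each gate output net to its delay and input nets), the function name `timing_slack`, and the single `required_time` applied at all primary outputs are assumptions made here, not part of the original flow.

```python
from collections import defaultdict, deque

def timing_slack(gates, primary_inputs, primary_outputs, required_time):
    """gates maps an output net to (gate_delay, [input nets]).

    Returns (eat, lat, slack), each a dict keyed by net name.
    """
    fanout = defaultdict(list)   # net -> output nets of the gates it drives
    pending = {}                 # gate output -> number of inputs not yet visited
    for out, (_, ins) in gates.items():
        pending[out] = len(ins)
        for net in ins:
            fanout[net].append(out)

    # Topological order (Kahn's algorithm), starting from the primary inputs.
    order, queue = [], deque(primary_inputs)
    while queue:
        net = queue.popleft()
        order.append(net)
        for out in fanout[net]:
            pending[out] -= 1
            if pending[out] == 0:
                queue.append(out)

    # Forward pass: EAT of a gate output = max EAT of its inputs + gate delay.
    eat = {}
    for net in order:
        if net in gates:
            delay, ins = gates[net]
            eat[net] = max(eat[i] for i in ins) + delay
        else:
            eat[net] = 0.0       # primary input

    # Backward pass: LAT of a net = min over driven gates of (LAT(output) - delay).
    lat = {}
    for net in reversed(order):
        bounds = [lat[out] - gates[out][0] for out in fanout[net]]
        if net in primary_outputs:
            bounds.append(required_time)
        lat[net] = min(bounds) if bounds else required_time

    slack = {net: lat[net] - eat[net] for net in order}
    return eat, lat, slack

# Hypothetical three-gate circuit: g3 = f(g1, g2), g1 = f(a, b), g2 = f(a).
gates = {"g1": (2.0, ["a", "b"]), "g2": (1.0, ["a"]), "g3": (2.0, ["g1", "g2"])}
eat, lat, slack = timing_slack(gates, ["a", "b"], {"g3"}, required_time=4.0)
```

In this toy circuit only the net `g2` lies off the critical path, so it is the only net with non-zero slack.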
CHAPTER 4
RADIATION IMMUNITY AT PHYSICAL DESIGN LEVEL
The rates of radiation induced soft errors have been significantly increasing due to the aggressive
scaling trends in the nanometer regime. Several circuit optimization techniques have been proposed
in the literature for preventing such transient faults in logic circuits. These include the inclusion of
concurrent error detection circuits on selected nodes, selective gate sizing, and dual-VDD assignment.
As described in Chapter 5, selective node hardening at the transistor level can also be used.
In this chapter, we show that transient glitches due to radiation strikes can be substantially reduced
by modifying the placement stage in cell based designs to selectively assign larger
wirelengths to certain critical nets. Longer nets provide larger RC ladders that effectively
filter out the transient glitches. Towards this, we propose two placement algorithms based on (i)
simulated annealing and (ii) quadratic programming that significantly reduce the soft error rates
of logic circuits. The soft error masking effects are captured by using a new metric called the
cumulative probability of observability (CPO) of each net, which is defined as the probability that
a transient glitch at the net will result in a soft error for the logic circuit. The cost function for
simulated annealing (SA) is modeled as the summation of the masking probability weighted with
the netlength for each net, while simultaneously constraining the total area and the total wirelength.
The quadratic programming based placement algorithm for radiation immunity provides a more
computationally efficient alternative for soft error reduction during placement. Both algorithms
try to assign higher wirelengths to nets with low masking probability for higher glitch reduction,
while maintaining low delay and area penalty for the overall circuit. To the best of our knowledge,
the reduction of soft error rate during placement is being attempted for the first time. The proposed
algorithms were implemented using the FreePDK 45nm Process Design Kit and the OSU standard
cell library and tested on the ISCAS85 benchmark circuits. Experimental results indicate that the
proposed algorithms significantly improve the radiation immunity of logic circuits without much
delay and area overhead.
Figure 4.1 Interconnects Modeled as RC Ladder (driver input, four series R/C segments with tap
voltages V(n1) to V(n4), and a fanout load CL)
The rest of the chapter is organized as follows. In Section 4.1, we explain how interconnect
length can be an effective way to reduce the propagation of transients due to radiation. Section 4.2
describes the SA based placement algorithm to reduce circuit SER. Section 4.3 provides a faster
alternative implementation of radiation immune placement using quadratic programming. Finally,
Section 4.4 describes our experimental setup and illustrates the results.
4.1 Glitch Filtering in Interconnects
In this section, we show how the interconnect length can be leveraged to filter out glitches
resulting from random radiation strikes. Let us consider the case in which a simple inverter cell
is driving a fixed fanout load. The wire connecting the driving inverter cell to the driven fanout
load can be approximated as an RC ladder as shown in Figure 4.1. The RC ladder is modeled
using four resistance/capacitance blocks in series, with each additional block to the right modeling
a further increase in interconnect length. For example, the effective RC ladder at node n1 models 1X
the interconnect length and the effective RC ladder at node n2 models 2X the interconnect
length. Similarly, the effective RC ladders at nodes n3 and n4 model interconnect lengths of 3X and
4X respectively.
Figure 4.2 Effect of Wirelengths on Glitch Reduction (HSPICE waveforms of v(n1), v(n2), v(n3) and
v(n4); voltage versus time, both on linear axes)
A radiation strike on a cell can be modeled as a transient current source. We modeled radiation
strikes of deposited charges in the range of [60fC, 135fC] with current sources as defined in equation
1 with a τα of 10ps and τβ of 5ps. The range of deposited charges was chosen based on the typical
radiation flux at sea level [29]. We experimented with varying the interconnect lengths for the
circuit in Figure 4.1 in SPICE to estimate the reduction in glitches due to random radiation strikes.
We used the FreePDK 45nm technology kit and the unit resistance and capacitance available from
the technology library. The results of our experiments are shown in Figure 4.2. As shown in the
figure, a greater interconnect length acts as a higher order RC ladder and therefore has a stronger
low pass filtering effect. The transient response also shifts progressively towards the right due to the
higher RC delay incurred. We note that the glitch reduction is more pronounced when the interconnect
length is increased from 1X to 2X, or from 2X to 3X, than from 3X to 4X. Thus, increasing the
interconnect length beyond a certain threshold does not greatly reduce the circuit SER but does worsen
the circuit delay. Also, as discussed in the previous chapter, different signal nets have different CPO
values and therefore, increasing the interconnect lengths for nets having low CPO only reduces the
circuit performance but does not effectively reduce the circuit SER. Based on this observation, in
the subsequent sections, we describe placement algorithms that selectively increase the interconnect
length for soft error critical nets while not impacting the delay critical nets.
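The filtering behaviour can be illustrated with a small linear simulation. The sketch below is not the SPICE setup used for Figure 4.2: it assumes the commonly used double-exponential form for the strike current, models the driver as a plain resistor holding the node low, and uses made-up element values (`r_drv`, `c_drv`, `r_seg`, `c_seg`, `c_load` are all hypothetical). Because the model is linear and does not clamp at the supply rails, only the relative attenuation between taps is meaningful.

```python
import math

def strike_current(t, q, tau_a, tau_b):
    """Double-exponential current pulse; its integral over t >= 0 equals q."""
    return q / (tau_a - tau_b) * (math.exp(-t / tau_a) - math.exp(-t / tau_b))

def ladder_peaks(n_seg=4, q=100e-15, tau_a=10e-12, tau_b=5e-12,
                 r_drv=1e3, c_drv=1e-15, r_seg=1e3, c_seg=5e-15,
                 c_load=2e-15, dt=0.05e-12, t_end=400e-12):
    """Forward-Euler simulation of a struck driver node feeding an RC ladder.

    Returns the peak voltage at the driver node followed by taps n1..n_seg.
    """
    v = [0.0] * (n_seg + 1)                 # v[0] is the struck driver output
    cap = [c_drv] + [c_seg] * n_seg
    cap[-1] += c_load                       # fanout load at the far end
    peaks = [0.0] * (n_seg + 1)
    for k in range(int(t_end / dt)):
        t = k * dt
        # Current flowing from node j to node j + 1 through each wire segment.
        flow = [(v[j] - v[j + 1]) / r_seg for j in range(n_seg)]
        dv = [0.0] * (n_seg + 1)
        dv[0] = (strike_current(t, q, tau_a, tau_b)
                 - v[0] / r_drv - flow[0]) / cap[0]
        for j in range(1, n_seg + 1):
            outgoing = flow[j] if j < n_seg else 0.0
            dv[j] = (flow[j - 1] - outgoing) / cap[j]
        for j in range(n_seg + 1):
            v[j] += dt * dv[j]
            peaks[j] = max(peaks[j], v[j])
    return peaks

peaks = ladder_peaks()
for tap, peak in enumerate(peaks[1:], start=1):
    print(f"n{tap}: peak / peak at struck node = {peak / peaks[0]:.3f}")
```

Under these assumptions the glitch peak falls monotonically from n1 to n4, which is the qualitative trend described above.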
4.2 Placement for Radiation Immunity using Simulated Annealing
Block placement algorithms which are based on sequence pairs typically use SA, and efficient
algorithms have been proposed in the literature to compute the unique placement configuration from
a given sequence pair by computing the longest common subsequence [54] of the sequence pair.
In this section, we describe how the SA based placement algorithm can be used to generate a
radiation immune placement of standard cells while simultaneously constraining the total area and
wirelength. Given a sequence pair, a vertical and a horizontal constraint graph can be obtained by
applying the {leftof, rightof, aboveof, belowof} relations on the sequence pair [49]. The weighted
constraint graphs can be made polar by adding a dummy source and sink node. The longest path
from the source to any of the nodes in the horizontal and vertical constraint graphs gives, respectively,
the horizontal and vertical co-ordinates of the block in the corresponding placement. The longest
path from the source to the sink gives the width and height of the placement and hence the resulting
area. The wirelength (WL) of a signal net can be computed by taking the semi-perimeter of the
bounding box enclosing the blocks that are connected by the net. We note that the wirelength can also
be estimated more accurately using the spanning tree method or the Steiner tree method. However,
the semi-perimeter metric is fast and is a close approximation to the more accurate Steiner tree
method, especially for nets with smaller fanout.
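As an illustration, the following sketch turns a sequence pair into block co-ordinates and evaluates the semi-perimeter wirelength of a net. It uses a simple quadratic-time longest-path evaluation rather than the faster longest common subsequence method of [54], and the function names, block sizes and the example net are hypothetical. The usual sequence pair convention is assumed: a block that precedes another in both sequences lies to its left, and a block that follows another in the first sequence but precedes it in the second lies below it.

```python
def place_from_sequence_pair(gamma_pos, gamma_neg, size):
    """size maps a block name to (width, height).

    Returns (coords, width, height), where coords holds lower-left corners.
    """
    pos = {b: i for i, b in enumerate(gamma_pos)}
    x, y = {}, {}
    # Visiting blocks in gamma_neg order guarantees that every block already
    # placed precedes the current one in the second sequence.
    for b in gamma_neg:
        x_b = max((x[a] + size[a][0] for a in x if pos[a] < pos[b]), default=0.0)
        y_b = max((y[a] + size[a][1] for a in y if pos[a] > pos[b]), default=0.0)
        x[b], y[b] = x_b, y_b
    width = max(x[b] + size[b][0] for b in x)
    height = max(y[b] + size[b][1] for b in y)
    coords = {b: (x[b], y[b]) for b in x}
    return coords, width, height

def hpwl(net, coords, size):
    """Semi-perimeter of the bounding box of the block centers on a net."""
    cx = [coords[b][0] + size[b][0] / 2 for b in net]
    cy = [coords[b][1] + size[b][1] / 2 for b in net]
    return (max(cx) - min(cx)) + (max(cy) - min(cy))

size = {"a": (2, 1), "b": (1, 3), "c": (2, 2)}
coords, width, height = place_from_sequence_pair(["a", "b", "c"],
                                                 ["b", "a", "c"], size)
print(coords, width * height, hpwl(["a", "b", "c"], coords, size))
```

Here block `b` ends up below `a`, and both lie to the left of `c`.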
Simulated annealing is used in placement as an iterative improvement algorithm. Each placement
configuration is represented as a sequence pair, and moves in the space of sequence pairs are
probabilistically accepted depending upon the cost gradient and the current temperature, which
decreases with the iteration count. Given an initial sequence pair we have allowed three types of
moves, including (a) exchange of 2 block positions in the first sequence alone and (b) exchange of 2
block positions in both the sequences. Moves that increase the cost have a higher probability of
acceptance at initial iterations for better state space exploration, while at later iterations the
algorithm greedily tries to minimize the cost.
Traditionally, the cost function is simply the bounding box area or the total wirelength. We have
modified the cost function so that it maximizes the CPO weighted wirelength while simultaneously
constraining the total area and the total wirelength. At higher temperatures, the radiation
immune placement algorithm that we have developed minimizes the following cost function,
$$ CostFunc1 = 0.5 \, Area_{Total} + 0.5 \, \frac{AREA_{OPT}}{WL_{OPT}} \, WL_{Total} \qquad (4.1) $$
where AREA_OPT is the optimal cost obtained by the placement algorithm when minimizing just
the bounding box area, WL_OPT is the optimal cost obtained by the placement algorithm when
minimizing just the total wirelength, and Area_Total and WL_Total are the bounding box area and total
wirelength for the current placement configuration. Thus initially, the cost function is just a
normalized and weighted combination of area and wirelength.
Figure 4.3 Cost Function with Penalty for High Area and Total Wirelength. (Note: All values are in
generic units)
At lower temperatures, the cost function is changed as follows,
$$ CostFunc2 = - \sum_{i \in Nets} CPO_i \, WL_i + e^{\left( \frac{Area_{Total}}{AREA_{OPT}} \right)^{M}} + e^{\left( \frac{WL_{Total}}{WL_{OPT}} \right)^{N}} \qquad (4.2) $$
where CPO_i and WL_i are the CPO value and wirelength for net i in the current placement
configuration, while Nets is the set of all signal nets in the circuit. M and N are user defined
constants controlling the penalty for high area and wirelength and are set to values of 5.0 and 14.0
respectively after extensive experimentation. As shown in Figure 4.3, the cost increases exponentially
if the area and total wirelength are too high compared to the optimal area and total wirelength
costs obtained during placement for just area or wirelength. Maximizing the CPO weighted
wirelength selectively improves the SER for soft error critical nets while not affecting the delay
critical nets.
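The two cost functions can be sketched as below. The first follows equation 4.1 directly. For the second, the exact placement of M and N in the exponential penalty terms is an assumption of this sketch (they are used here as powers of the normalized area and wirelength), as is taking the total wirelength to be the sum of the per-net wirelengths. All function and parameter names are hypothetical.

```python
import math

def cost_func1(area_total, wl_total, area_opt, wl_opt):
    """High-temperature cost: normalized, equally weighted area and wirelength."""
    return 0.5 * area_total + 0.5 * (area_opt / wl_opt) * wl_total

def cost_func2(cpo, wl, area_total, area_opt, wl_opt, m=5.0, n=14.0):
    """Low-temperature cost (assumed form).

    cpo and wl map each net to its CPO value and current wirelength.
    The negative sum rewards long wires on high-CPO nets; the two exponential
    terms penalize area and total wirelength far above their optima.
    """
    wl_total = sum(wl.values())
    reward = sum(cpo[net] * wl[net] for net in wl)
    area_penalty = math.exp((area_total / area_opt) ** m)
    wl_penalty = math.exp((wl_total / wl_opt) ** n)
    return -reward + area_penalty + wl_penalty

cpo = {"n1": 0.8, "n2": 0.1}
short_critical = cost_func2(cpo, {"n1": 30.0, "n2": 20.0}, 100.0, 100.0, 50.0)
long_critical = cost_func2(cpo, {"n1": 35.0, "n2": 15.0}, 100.0, 100.0, 50.0)
print(short_critical, long_critical)
```

With the total wirelength held fixed, shifting length from the low-CPO net `n2` to the high-CPO net `n1` lowers the cost, which is the behaviour the annealer is meant to reward.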
The overall SA based radiation immune placement algorithm is shown in Algorithm 1. CostFunc1
and CostFunc2 are as defined in equations 4.1 and 4.2. The algorithm assumes that
AREA_OPT and WL_OPT have been previously obtained by using the cost function of just the area or just
the total wirelength. The pin assignment to I/O terminals is done by first dividing the placement area
into three horizontal regions. If most of the centers of the blocks connected to the I/O pin lie in
the top-most horizontal region, the I/O pin is assigned a position at the center of the top edge of
the placement boundary. Similarly, if most block centers lie in the bottom-most horizontal region,
the I/O pin is assigned a position at the center of the bottom edge of the placement. Otherwise, the
middle horizontal region is further divided into left and right regions. If most block centers lie in
the left vertical region, the I/O pin is assigned a position at the center of the left edge; otherwise
the I/O pin is assigned a position at the center of the right edge.
Algorithm 1 Radiation Immune Placement Using SA
temp = INIT-TEMP;
place = INIT-PLACEMENT;
AnnealTime = INIT-ANNEAL-TIME;
while temp > FREEZING-TEMP do
    CurAnnealTime = 0;
    while CurAnnealTime < AnnealTime do
        new-place = PERTURB(place);
        if temp > LOW-TEMP-THRESHOLD then
            ∆C = CostFunc1(new-place) - CostFunc1(place);
        else
            ∆C = CostFunc2(new-place) - CostFunc2(place);
        end if
        if ∆C < 0 then
            place = new-place;
        else if RANDOM(0,1) < e^(-∆C/temp) then
            place = new-place;
        end if
        CurAnnealTime = CurAnnealTime + 1;
    end while
    AnnealTime = AnnealTime × ANNEAL-RATE;
    temp = temp × COOLING-RATE;
end while
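The annealing loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under several assumptions: the cost functions are passed in as callables, `anneal_rate` is treated as a plain multiplier on the inner-loop length, and `perturb_sequence_pair` implements only the two move types (a) and (b) spelled out earlier. The toy problem at the end (an integer "placement" with a quadratic cost) is purely illustrative and stands in for a real placement cost.

```python
import math
import random

def perturb_sequence_pair(pair, rng):
    """Swap two random blocks in the first sequence, or in both sequences."""
    gamma_pos, gamma_neg = list(pair[0]), list(pair[1])
    a, b = rng.sample(gamma_pos, 2)
    i, j = gamma_pos.index(a), gamma_pos.index(b)
    gamma_pos[i], gamma_pos[j] = b, a            # move (a)
    if rng.random() < 0.5:                       # move (b): also swap in gamma_neg
        i, j = gamma_neg.index(a), gamma_neg.index(b)
        gamma_neg[i], gamma_neg[j] = b, a
    return gamma_pos, gamma_neg

def anneal(place, perturb, cost_hi, cost_lo, init_temp, freezing_temp,
           low_temp_threshold, cooling_rate, anneal_time, anneal_rate, rng):
    """Two-phase simulated annealing following the structure of Algorithm 1."""
    temp = init_temp
    while temp > freezing_temp:
        # CostFunc1 at high temperature, CostFunc2 once below the threshold.
        cost = cost_hi if temp > low_temp_threshold else cost_lo
        steps = 0
        while steps < anneal_time:
            new_place = perturb(place, rng)
            delta = cost(new_place) - cost(place)
            # Accept improvements always, uphill moves with Metropolis probability.
            if delta < 0 or rng.random() < math.exp(-delta / temp):
                place = new_place
            steps += 1
        anneal_time *= anneal_rate
        temp *= cooling_rate
    return place

# Toy stand-in problem: walk an integer towards the minimum of (x - 7)^2.
def toy_cost(x):
    return (x - 7) ** 2

def toy_move(x, rng):
    return x + rng.choice((-1, 1))

final = anneal(0, toy_move, toy_cost, toy_cost, init_temp=50.0,
               freezing_temp=0.05, low_temp_threshold=4.0, cooling_rate=0.9,
               anneal_time=50, anneal_rate=1.0, rng=random.Random(1))
print(final)
```

For an actual placement run, `perturb` would be `perturb_sequence_pair` and the two cost callables would evaluate equations 4.1 and 4.2 on the placement derived from the sequence pair.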
4.3 Fast SER Aware Placement using Quadratic Programming
As shown in the previous section, SA based placement can be tuned to SER optimization by
appropriately choosing the cost function used. However, SA based placement has a large
computational overhead, especially when optimizing wirelength or CPO weighted wirelength.
The execution time can be slightly improved using the notion of "floorplan slacks" as proposed
in [62]. Thus, while the SA based approach provides good results in terms of SER reduction, its
large runtimes lead us to investigate a computationally efficient placement algorithm for SER
optimization based on Quadratic Programming (QP).
The objective function for the placement problem can be formulated as a weighted sum of the
squared distance among the connected cells and can be expressed as,
$$ f(x, y) = \frac{1}{2} \sum_{i \in N} \sum_{j \in N} w_{ij} \, c_{ij} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right] \qquad (4.3) $$
where N denotes the set of modules to be placed, c_ij denotes the connectivity between modules i
and j, x_i and y_i denote the co-ordinates of the center of module i, while w_ij denotes the user defined
weight for the connection between modules i and j. Traditionally for timing driven placement, the
weights w_ij are chosen to be a function of the criticality of the corresponding net joining modules i
and j. To incorporate an SER driven placement scheme we modified the weight function as,
$$ w_{ij} = \left[ w_1 (1 - slack_{ij}) + w_2 (1 - CPO_{ij}) \right] \qquad (4.4) $$
where CPO_ij is the CPO of the net connecting modules i and j, while slack_ij is the corresponding
timing slack at that net. w1 and w2 are user defined constants in the range [0, 1] controlling the
relative weighting for delay and SER optimization respectively.
The objective function f(x, y) given in equation 4.3 is a separable function and can be re-written
as,
$$ f(x, y) = f(x) + f(y) \qquad (4.5) $$
which makes the analysis of f(x) and f(y) identical. The function f(x) can be expressed in a
compact matrix form as,
$$ f(x) = \frac{1}{2} x^{T} Q x + d^{T} x + constant \qquad (4.6) $$
where x is a vector denoting the x co-ordinates of cell locations, Q is the positive definite con-
nectivity matrix weighted with SER and delay metrics as in equation 4.4, while vector d originates
from the contribution of the I/O pad cells which can be treated as fixed modules.
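A small sketch of the unconstrained problem in one dimension is given below. It assumes slack and CPO values already normalized to [0, 1], builds the weighted connectivity system from two-pin connections, and solves the stationarity condition Qx = -d with a plain conjugate gradient iteration. The matrix built here matches Q of equation 4.6 up to a constant factor, which does not change the minimizer. The helper names and the two-cell example are hypothetical.

```python
def net_weight(slack, cpo, w1, w2):
    """Connection weight of equation 4.4 (slack and cpo assumed in [0, 1])."""
    return w1 * (1.0 - slack) + w2 * (1.0 - cpo)

def build_system(n_cells, edges, pad_edges):
    """edges: (i, j, weight) between movable cells.
    pad_edges: (i, pad_x, weight) between movable cell i and a fixed pad.
    Returns (q, rhs) such that the optimal x solves q x = rhs, i.e. rhs = -d.
    """
    q = [[0.0] * n_cells for _ in range(n_cells)]
    rhs = [0.0] * n_cells
    for i, j, w in edges:
        q[i][i] += w
        q[j][j] += w
        q[i][j] -= w
        q[j][i] -= w
    for i, pad_x, w in pad_edges:
        q[i][i] += w              # fixed pads make q positive definite
        rhs[i] += w * pad_x
    return q, rhs

def conjugate_gradient(q, rhs, tol=1e-12, max_iter=1000):
    """Solve q x = rhs for a symmetric positive definite q."""
    n = len(rhs)
    x = [0.0] * n
    r = rhs[:]                    # residual for the initial guess x = 0
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs < tol:
            break
        qp = [sum(q[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * qp[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * qp[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

pads = [(0, 0.0, 1.0), (1, 10.0, 1.0)]          # pads at x = 0 and x = 10
w_ser = net_weight(slack=0.5, cpo=0.9, w1=0.1, w2=0.9)
q, rhs = build_system(2, [(0, 1, w_ser)], pads)
x = conjugate_gradient(q, rhs)
print(x, x[1] - x[0])
```

With a high-CPO net and SER-skewed weights the connection weight is small, so the two cells are pulled towards their pads and the net between them becomes longer than with a unit weight.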
The allowable placement regions for a set of modules are updated after each iteration of
bipartitioning. The centers of the placement regions at the t-th level of partitioning are given by,
$$ A^{(t)} x = b^{(t)} \qquad (4.7) $$
where the vector b^{(t)} denotes the center coordinates of the placement regions at the corresponding
iteration step t, and the entries a_cr of matrix A^{(t)} are computed as follows,
$$ a_{cr} = \begin{cases} \dfrac{area_c}{\sum_{c' \in R_r^{(t)}} area_{c'}}, & \text{if } c \in R_r^{(t)} \\ 0, & \text{otherwise} \end{cases} $$
where area_c is the area of a cell c and R_r^{(t)} is a partitioned region r of the placement region R in
partition iteration t. Since, by construction, matrix Q is positive definite and the constraints in
the form of equation 4.7 are linear, the overall placement problem is a convex quadratic optimization
problem with a unique global minimum f(x*).
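The center-of-gravity constraints and the resulting constrained problem can be sketched as follows. The sketch stores one row of A per region and one column per cell, and solves the equality-constrained QP through its KKT system with dense Gaussian elimination. This is chosen here only for brevity and is not the solver package used in this work; all names and the two-cell example are hypothetical.

```python
def constraint_matrix(areas, region_of, n_regions):
    """Row r holds area_c / (total area in region r) for cells in region r."""
    total = [0.0] * n_regions
    for c, r in enumerate(region_of):
        total[r] += areas[c]
    return [[areas[c] / total[r] if region_of[c] == r else 0.0
             for c in range(len(areas))] for r in range(n_regions)]

def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def solve_constrained_qp(q, rhs, a, b):
    """Minimize 0.5 x'Qx - rhs'x subject to A x = b via the KKT system
    [[Q, A'], [A, 0]] [x; lam] = [rhs; b]."""
    n, k = len(q), len(a)
    kkt = [q[i][:] + [a[r][i] for r in range(k)] for i in range(n)]
    kkt += [a[r][:] + [0.0] * k for r in range(k)]
    return solve_linear(kkt, list(rhs) + list(b))[:n]

areas = [1.0, 3.0]
a = constraint_matrix(areas, region_of=[0, 0], n_regions=1)   # [[0.25, 0.75]]
x = solve_constrained_qp([[2.0, -1.0], [-1.0, 2.0]], [0.0, 10.0], a, [5.0])
print(x, a[0][0] * x[0] + a[0][1] * x[1])
```

The printed area-weighted centroid equals the prescribed region center, which is exactly what equation 4.7 enforces.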
Algorithm 2 Radiation Immune Placement Using QP
R = whole chip area.
Compute the CPO of all nets.
Compute the slack of all nets.
Compute the weight function for all nets (equation 4.4).
Compute the Q matrix using the weighted connectivity matrix.
Solve the initial unconstrained QP for x.
Solve the initial unconstrained QP for y.
while each cell is not assigned a region do
Alternate between sorting cells using x or y co-ordinates.
Bipartition the placement region R into Rt using sorted co-ordinates.
Construct the constraint matrix A and vector b using the bipartition.
Solve CQOP for x.
Solve CQOP for y.
end while
Legalize the final placement.
The overall algorithm for QP based placement is given in Algorithm 2. The algorithm progresses
by alternating global optimization and partitioning phases. However, unlike other partitioning based
placers, the algorithm maintains simultaneity across all optimization steps [63]. The QP based
placement scheme is based upon solving a series of constrained quadratic optimization problems
(CQOP). The algorithm initially solves the global optimization problem by imposing one constraint
on all modules, forcing the centroid of the cells to the chip center. The solution of this problem
provides the initial spatial co-ordinates of the cell locations. The cells are then sorted based on
their x or y co-ordinates and the placement region is partitioned into two regions.
Figure 4.4 Placement of c432 Benchmark Using QP (A) SER optimized, w1=0.1 w2=0.9, (B) delay
optimized, w1=0.9 w2=0.1
Figure 4.5 Placement of c432 Benchmark Using QP (A) SER optimized, w1=0.01 w2=0.99, (B) delay
optimized, w1=0.99 w2=0.01
We performed the bipartitioning by alternating between sorting based on x co-ordinates and
sorting based on y co-ordinates on each iteration. The x and y co-ordinates obtained after solving
the CQOP in the previous step are used to do the bipartitioning of the next step. The cells belonging
to each of the two regions are used to compute the centroids of the two new regions, and the centers
of these two new regions are then used to impose new constraints. Subsequently, the next global
optimization step is performed by solving a CQOP with these new constraints for all regions. This
alternating global optimization and partitioning step is carried out until each cell is assigned to its
own region. As the CQOP formulation for placement considers all cells as point masses, a final
legalization step is necessary to remove minor overlaps.
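One bipartitioning step can be sketched as below. The details are assumptions of this sketch rather than the exact procedure used in this work: the sorted cell list is split at its median, the region rectangle is cut along the sorting axis in proportion to the cell area on each side, and the centers of the two sub-rectangles become the entries of b for the next CQOP.

```python
def bipartition(cells, coord, area, region, axis):
    """Split cells into two groups along one axis.

    coord maps a cell to (x, y); region = (x_min, y_min, x_max, y_max);
    axis 0 sorts and cuts along x, axis 1 along y.
    Returns ((low_cells, low_region), (high_cells, high_region)).
    """
    ordered = sorted(cells, key=lambda c: coord[c][axis])
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[half:]
    # Cut the rectangle in proportion to the cell area assigned to each side.
    frac = sum(area[c] for c in low) / sum(area[c] for c in ordered)
    lo_corner, hi_corner = list(region[:2]), list(region[2:])
    cut = lo_corner[axis] + frac * (hi_corner[axis] - lo_corner[axis])
    low_hi = hi_corner[:]
    low_hi[axis] = cut
    high_lo = lo_corner[:]
    high_lo[axis] = cut
    return (low, tuple(lo_corner + low_hi)), (high, tuple(high_lo + hi_corner))

def center(region):
    """Center of a rectangle given as (x_min, y_min, x_max, y_max)."""
    return ((region[0] + region[2]) / 2, (region[1] + region[3]) / 2)

coord = {"c0": (1, 5), "c1": (9, 5), "c2": (3, 2), "c3": (7, 8)}
area = {c: 1.0 for c in coord}
(low, low_region), (high, high_region) = bipartition(
    list(coord), coord, area, region=(0, 0, 10, 10), axis=0)
print(low, center(low_region), high, center(high_region))
```

Repeating this step on each new region, while alternating `axis` between 0 and 1, continues until every region holds a single cell.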
4.4 Experimental Results
The proposed algorithms were implemented on a 1.5 GHz UltraSparc processor with 4GB of memory
running SunOS 5.8. The results were validated on the ISCAS'85 benchmark circuits
using the FreePDK 45nm technology kit and the OSU standard cell library built on it [58]. The
dimensions of each cell were extracted from the DEF file of the standard cell library. Synopsys Design
Compiler was used to do the initial technology mapping and to compute the enabling probability
of the nets. The CPO for each net was calculated by a separate C script using the method
discussed in Chapter 3. The technology mapped structural netlist is converted into the GSRC bookshelf
format using a converter C script. Many soft error estimation tools have been reported in the
literature [26,29,33]. The SEAT-LA tool [29] models the entire spectrum of neutron strikes (with charge
values in the [10fC, 150fC] range) and is quite close in accuracy to actual SPICE simulations. We
extended this path-based tool to a node-based formulation as described in [56] for our SER estimation.
The various controlling parameters used for the SA based placement algorithm are summarized
in Table 4.1. The proposed algorithm reads the blocks and the netlist in the GSRC format. In
Figure 4.6, we plot the absolute areas for SER optimized placement, area optimized placement and
wirelength optimized placement. We also plot the absolute wirelengths for the three placement
schemes in Figure 4.7. As shown, the total area and the wirelength increase only marginally
for the SER optimized placement scheme compared to the area optimized and wirelength optimized
placements respectively.
Table 4.1 Simulated Annealing Parameters
Parameter Name        Value
INIT-TEMP             5 million
FREEZING-TEMP         0.1
INIT-ANNEAL-TIME      100
INIT-PLACEMENT        Identity permutation on sequence pair
LOW-TEMP-THRESHOLD    4.0
COOLING-RATE          90%
ANNEAL-RATE           2%
Table 4.2 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight
Combination 0.99 and 0.01
Benchmark    % SER reduction    % Delay Overhead    % WL Overhead
c432         53.79              25.09               19.02
c499         30.54              3.47                -0.83
c880         37.88              28.22               16.21
c1355        41.97              4.61                -1.45
c1908        27.83              8.34                3.96
c2670        101.0              4.53                1.74
c3540        77.82              3.57                -3.66
c5315        37.65              12.63               10.01
c6288        57.46              15.85               7.34
c7552        11.92              0.1                 -0.24
AVG          47.79              10.62               5.21
Figure 4.6 Area Comparison for Different Placement Schemes. (Note: All values are in generic
units)
Figure 4.7 Wirelength Comparison for Different Placement Schemes. (Note: All values are in
generic units)
In Table 4.3, we illustrate the results of SER reduction on the ISCAS85 benchmarks for our
radiation immune SA based placement. As shown in the table, our SER optimized
placement reduces the SER by 27.12% on average while incurring an area overhead of only 18.86%
compared to area optimized placement. Compared to wirelength optimized placement, the radiation
immune placement scheme reduces the SER by 72.29% on average with a delay overhead of only
9.26%. For some benchmarks, such as c2670, an SER reduction of almost 95% was
achieved. As shown, the radiation immune SA based placement scheme in general selectively
increases the netlengths of soft error critical nets while keeping the total area and total wirelength
in check.
The QP based placement formulation was implemented in C. We used the GNU Scientific Library
[64] for solving the initial unconstrained QP problem. The CQOP problems are solved using a
quadratic programming solver package [65]. The QP based placement formulation requires pin
assignment for the I/O pins, which was done by performing an initial standard cell placement using
Cadence SoC Encounter. The DEF file with the pin locations is converted to GSRC format using Capo [66].
The radiation immune QP placer reads the GSRC format files containing the pin locations and
outputs the placed co-ordinates for all movable cells, again in GSRC format. Capo is used to legalize
this placement solution and create the final placement. We performed timing driven QP based
placement by using higher values for w1 and lower values for w2 in equation 4.4. The weights are
assigned values in the range [0, 1]. For placement skewed towards SER optimization we used higher
values for w2 and lower values for w1. A weight of 1.0 for w1 and 0.0 for w2, or vice versa,
makes the matrix Q in equation 4.6 singular. So we achieved timing driven placement in the limit
by providing weights of 0.9, 0.99 and 0.9999 to w1 and 0.1, 0.01 and 0.0001 to w2. Similarly, we
approached SER aware placement by providing weights of 0.9, 0.99 and 0.9999 to w2 and 0.1, 0.01
and 0.0001 to w1. Figures 4.4, 4.5 and 4.8 show placements of the c432 benchmark for various values
of w1 and w2.
Table 4.3 SER Reduction for Radiation Immune SA Based Placement with Associated Delay and
Area Overhead
             Compared with area optimized placement    Compared with wirelength optimized placement
Benchmark    % SER reduction    % Area Overhead        % SER reduction    % Delay Overhead
c17          20.93              5.26                   71.26              2.0
c432         20.83              23.83                  67.70              10.72
c499         23.89              30.77                  64.28              9.45
c880         13.12              14.57                  69.75              10.13
c1355        30.09              31.01                  76.15              9.66
c1908        29.81              25.27                  88.43              12.07
c2670        49.36              19.59                  94.72              9.91
c3540        20.39              11.17                  59.79              10.91
c5315        35.90              8.26                   62.0               9.7
c6288        20.79              3.70                   67.67              9.2
c7552        33.26              11.33                  73.53              8.14
AVG          27.12              18.86                  72.29              9.26
Figure 4.8 Placement of c432 Benchmark Using QP (A) SER optimized, w1=0.0001 w2=0.9999, (B) delay
optimized, w1=0.9999 w2=0.0001
We computed the SER reduction and delay overhead for SER aware placement as the relative
decrease in SER and increase in delay with respect to the timing driven QP based placement with the
mirrored weight combination. In Tables 4.4, 4.2 and 4.5, we summarize the results of SER reduction
and the corresponding delay and wirelength overhead on the ISCAS85 benchmarks for these weight
combinations. For example, Table 4.4 compares SER reduction, delay and WL overhead by assigning
w1 of 0.9 and w2 of 0.1 for timing driven placement and by assigning w1 of 0.1 and w2 of 0.9 for
SER aware placement. Similarly, in Table 4.2 we compare a weight combination of w1 of 0.99, w2
of 0.01 with w1 of 0.01, w2 of 0.99, while in Table 4.5 we compare a weight combination of w1
of 0.9999, w2 of 0.0001 with w1 of 0.0001, w2 of 0.9999. As shown, the average SER reduction
saturates at about 48% for the more skewed weight combinations, while the delay and WL overheads
are about 11% and 5% respectively. Large reductions in SER can be achieved for c2670 and c3540.
This can be explained by the large spread of CPO values in these benchmarks. c7552 shows a smaller
reduction in SER as the distribution of CPO values in this benchmark is very tight. Overall, the
radiation immune QP based placement scheme consistently reduces SER by selectively increasing
the netlengths of soft error critical nets while keeping the delay and total wirelength in check. It
should be noted that although increasing interconnect lengths for soft error critical nets with timing
slack does not worsen circuit performance, overall interconnect power may increase. Although
power has not been considered in either of our placement schemes, it can
be incorporated into our placement frameworks by suitably changing the cost function used
during optimization.
We compared the proposed simulated annealing based radiation immune placement with the
quadratic programming based radiation immune placement. Overall, we saw a loss of SER reduction
of about 13-14% in the QP based scheme compared to that of the SA based radiation immune
placement. However, there was an orders-of-magnitude difference in runtime between the SA based
and the QP based radiation immune placement schemes. In Figure 4.9, we plot
the speedup in the runtimes of QP based placement for radiation immunity compared to the SA based
placement for radiation immunity. The experiments were performed with a varied number of cells in
the design by using a subset of the ISCAS85 benchmarks and certain large benchmarks from the
OpenSparc T1 designs [61]. As shown in the figure, on average there can be a speedup of more
than 100X. Thus, QP based radiation immune placement trades some solution
quality for a much superior runtime. The SA based radiation immune placement provides better
SER reduction, but for large designs the scheme is prohibitive in terms of runtime. The primary
reason for the superior runtimes of QP based placers is that the iterative solution methods
used for solving the CQOP exploit the sparsity of the Q matrix efficiently. Furthermore, with the
increasing number of constraints in the CQOP, the solutions of the previous iteration can be used to
guide the solution of the next iteration [63]. Therefore, the number of iterations required to solve the
CQOP decreases rapidly. It should also be noted that the QP based placement scheme could easily
be modified into a force directed placement scheme. However, we found that such force directed
placers only improved execution time marginally while producing many cell overlaps, putting a lot
of pressure on the placement legalizer.
Table 4.4 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight
Combination 0.9 and 0.1
Benchmark    % SER reduction    % Delay Overhead    % WL Overhead
c432         38.41              18.09               14.31
c499         25.03              2.91                -0.39
c880         24.17              19.0                11.37
c1355        31.30              4.29                0.0
c1908        27.84              9.57                5.92
c2670        78.16              4.32                2.27
c3540        59.65              2.83                -0.90
c5315        25.76              8.73                7.35
c6288        48.11              12.38               5.61
c7552        10.21              1.06                0.34
AVG          36.00              6.90                3.32
Table 4.5 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight
Combination 0.9999 and 0.0001
Benchmark % SER reduction % Delay Overhead %WL Overhead
c432 54.35 25.12 18.40
c499 30.71 3.18 -1.23
c880 38.75 27.67 12.95
c1355 42.87 4.57 -1.71
c1908 23.83 7.76 3.27
c2670 103.1 4.95 2.08
c3540 75.41 2.93 -4.21
c5315 43.43 21.09 19.39
c6288 55.73 12.56 2.54
c7552 11.84 0.0 -0.21
AVG 48.01 10.91 5.12
Figure 4.9 Speedup Comparison of QP Based and SA Based Radiation Immune Placement Schemes
CHAPTER 5
SOFT ERROR MITIGATION AT CIRCUIT LEVEL
A soft error may manifest itself as a bit flip in a memory element, or it can occur in any internal
node of combinational logic and subsequently propagate to and be captured in a latch. In the
past, soft errors have been handled at the circuit level using Schmitt triggers, adding duplicate cells
and clamping the voltage between the duplicate nodes, and adding pass transistors to filter the
random glitches that lead to soft errors. However, these approaches for avoiding soft errors in logic
circuits often incur significant overheads in terms of delay, area and power.
In this chapter, we propose a novel circuit which can be inserted at the output of a logic cell
to prevent the generation of transient glitches due to radiation strikes. The circuit is based on
an RC differentiator to detect the occurrence of such transient glitches. The large voltage swing
across the resistance of the differentiator is used to control the gate to body voltage of depletion
type NMOS and PMOS gates placed in series. Thus, the very characteristic of transient pulses is
exploited to cut off the cell hit by the strike from the driven cell. Experimental results indicate that
the insertion of these radiation blocking cells on the gate output nodes can significantly reduce the
generation of transient glitches. However, blind insertion of these cells on circuit nodes can incur
delay and area penalties. Based on this observation, we propose an algorithm to insert the cells
only on selected nodes in a logic circuit. The algorithm ranks the circuit nodes by
a new metric called the Probability of Radiation-blocker circuit Insertion (PRI) and inserts
the radiation blocking cells only on the top few nodes in the sorted list of PRI values. The PRI
metric is computed by taking a weighted combination of the glitch masking effect on a circuit node
and the slack available at that particular node. Thus, the algorithm inserts radiation blocking cells
selectively on soft error vulnerable nodes on the non-critical paths of a circuit. We experimented
with the framework using the NCSU 45nm Process Design Kit and the Nangate standard cell library.
[Figure 5.1: circuit schematic. Labeled elements: driving standard cell, driven standard cell,
transistors M1, M2 and M3, depletion mode MOS devices M4 and M5, nodes n1, n2 and n3, signals Vr
and Vr_bar, and vdd.]
Figure 5.1 Schematic of Radiation Induced Glitch Blocker Circuit
Experimental results on ISCAS85 benchmark circuits indicate that logic circuits optimized with
selective insertion of these radiation blocking cells can significantly reduce soft error rates with
marginal overheads in terms of delay and area.
The rest of the chapter is organized as follows. In Section 5.1, we present the transistor level
circuit that can reduce the propagation of transients generated due to radiation. In Section
5.2, we describe an algorithm to insert the glitch blocking cells on selected circuit nodes so as to
incur very low delay and area costs. Section 5.3 describes the experimental setup and presents
the results. Finally, we compare with some related works in Section 5.4.
5.1 Radiation Induced Glitch Blocker Circuit
In this section, we describe the proposed circuit level technique for countering transient faults
in a standard cell based design flow. The technique is based on an RC differentiator circuit to counter
transient glitches due to radiation strikes occurring on the active area of the cells. The circuitry,
which is attached to a standard cell output, will be referred to as the radiation blocker circuit
throughout the chapter.
The transistor level schematic of the radiation blocker circuit is shown in Figure 5.1. As shown
in the figure, the circuit consists of an RC differentiator implemented with MOS transistors M1, M2
and M3. A small, always on, NMOS transistor M1 provides enough resistance to obtain a large
voltage swing during radiation strikes. The small resistance can also be implemented using a simple
Figure 5.2 Plotting the Voltages across M1
poly strip, eliminating the need for M1. The NMOS and PMOS transistors M2 and M3 provide a
good constant capacitance value across a voltage range. The use of this configuration is motivated
by the idea that the current that flows through M1 is proportional to the rate of change of the voltage
at node n1. In particular, the current flowing through M1, acting as a resistor, is given by
$$I(t) = C_{eff}\,\frac{d V_{n1}}{dt} \qquad (5.1)$$
where C_eff is the effective capacitance due to the NMOS and PMOS transistors M2 and M3.
The voltage swing across M1 is proportional to this current. As shown in the figure, we denote
the voltage, with respect to ground, on the drain and source of M1 as Vr and Vr_bar respectively.
During a radiation strike the voltage across M1 is very high. This voltage is used to control the
gate-body voltage of the depletion type NMOS and PMOS transistors M4 and M5. As M4 and M5
are depletion mode devices, they are normally on and a negative voltage has to be applied to make them
go into cutoff. During a regular logic transition, the voltage on node n1 changes in a comparatively
slower ramp and therefore the voltage across M1 is small. Thus, during a regular logic transition
the voltage across M1 is not enough to cut off M4 or M5.
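The contrast between a regular logic transition and a radiation induced transient can be illustrated with a first order estimate based on equation 5.1. The sketch below assumes a linear voltage ramp on node n1 and an RC time constant much smaller than the edge time, so that the drop across M1 is approximately R * C_eff * dV/dt. The resistance, capacitance, supply voltage and edge times are hypothetical values chosen only for illustration; they are not taken from the simulated circuit.

```python
def resistor_drop(r_ohm, c_eff_farad, delta_v, edge_time_s):
    """First order estimate of the drop across M1: V_R = R * C_eff * dV/dt.

    Assumes a linear ramp of delta_v volts over edge_time_s seconds and
    R * C_eff much smaller than the edge time.
    """
    return r_ohm * c_eff_farad * (delta_v / edge_time_s)


# Hypothetical values, for illustration only.
R_M1 = 1e3      # ohms, resistance of the always-on transistor M1
C_EFF = 1e-15   # farads, effective capacitance of M2 and M3
VDD = 1.0       # volts, swing on node n1

logic_drop = resistor_drop(R_M1, C_EFF, VDD, 100e-12)   # slow logic edge
strike_drop = resistor_drop(R_M1, C_EFF, VDD, 10e-12)   # fast transient edge

print(f"logic transition drop:  {logic_drop:.3f} V")
print(f"radiation strike drop:  {strike_drop:.3f} V")
```

Under these assumptions the drop scales inversely with the edge time, which is the property the blocker relies on to separate strikes from regular switching.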
However, during a single event transient due to a radiation strike, the change in voltage of node
n1 is exponential which leads to a large voltage drop across M1. This voltage drop is sufficient to
Figure 5.3 (A) Transient Pulses on Inverter Cell for Radiation Strikes of Varying Strength,
(B) Corresponding Results on an Inverter Cell Protected with Radiation Blocker Circuit
cut off depletion mode transistors M4 or M5. In Figure 5.2, we plot the transient node voltages
denoted by Vr and Vr_bar. It should be noted that during the rising phase of the radiation induced
transient pulse, the voltage swing is positive and the depletion mode PMOS device M5
cuts off, while if the voltage swing is negative, the depletion mode NMOS device M4 cuts off.
It should also be noted that a small positive or negative voltage appears across M1 during regular
logic transitions as well. Since the magnitude of this voltage is small, owing to the comparatively
slower voltage ramp of the output node during regular logic transitions, it does not cut off the
depletion mode transistors M5 or M4 respectively. We experimented with the radiation blocker
circuit using the FreePDK 45nm Process Design Kit on a simple inverter cell hit by a radiation strike.
Figure 5.3(A) illustrates the transient glitches generated due to radiation strikes of varying strength
on the inverter cell. Figure 5.3(B) shows the corresponding results for the inverter cell with radiation
blocker circuitry, which shows significant reduction in the transient pulses generated due to radiation
strikes. The passgate could always be replaced by a transmission gate to avoid a threshold drop. This
is especially relevant as supply voltage is reduced due to scaling trends, which decreases
noise margins. However, since this requires the addition of two more transistors, we chose to stick with
the passgate solution in order to reduce the area overhead.
The radiation blocker circuit, though effective in reducing transients due to radiation strikes, has
an impact on area and delay when attached to the output of a standard cell. A generalized approach
to reduce the area overhead is the use of more compound/complex gate realizations. A compound
gate is formed by the combination of series and parallel MOS structures with complementary
pull-up and pull-down logic. As these gates are built using static CMOS style, they are called static
CMOS complex gates (SCCG) [82]. The limitation of SCCG gates is that if the number of transistors
in series exceeds an upper limit in any path of the pull-up or pull-down logic, the propagation
delay of the gate is adversely affected. Typically, this upper bound can be safely fixed
at three or even four transistors. Thus, complex logic gates can be provided as input during the
technology mapping phase, and the nodes in the corresponding circuit can be protected with the
radiation blocker circuit. We used a mix of simple standard cells such as inverter, nand2 and nor2
with SCCG standard cells such as AOI222 and OAI222. Although simple cells were required to enable
technology mapping to cover all types of logic functions, a large number of SCCG gates could be found in
the mapped netlist. This technique does reduce the overheads of our proposed approach; however,
if we protect all logic gates with the radiation blocker circuit, the area and delay overhead of
the entire circuit can still be significant. The area and delay overheads for various protected standard
cells are shown in Table 5.1, which shows that the area overhead can be more than 60%. Based
on this observation, we developed an algorithm that exploits the asymmetric distribution of masking
probability to optimize only selected nodes of a logic circuit. Next, we present metrics to estimate
the various soft error masking effects on the circuit nodes.
5.2 Selective Insertion Algorithm
As discussed in Section 5.1, the SER savings from using the radiation blocker circuit may be
nullified by the overheads in delay and area. The overheads of protecting standard cells with
the radiation blocker circuit may be reduced by enforcing the use of SCCG gates. However, blindly
protecting all gate output nodes with the radiation blocker will still result in significant overheads.
In this section, we propose an algorithm for selective insertion of radiation blocker circuits on cell
outputs to reduce circuit SER with very low performance and area overheads. The
CPO metric developed in Chapter 3 is leveraged to selectively optimize vulnerable
Table 5.1 Overhead of Adding Radiation Blocker Circuit for Various Standard Cells
Cell Name % Area Overhead % Delay Overhead
INVX16 74.14 3.4
NAND2X4 74.14 3.5
NOR2X4 74.14 3.2
AOI21X4 52.95 2.6
AOI211X4 41.19 3.4
AOI22X4 41.19 2.3
AOI221X4 30.89 3.4
AOI222X4 26.48 2.7
OAI33X1 52.95 2.2
OAI21X4 52.95 3.2
OAI211X4 41.19 3.1
OAI22X4 37.07 2.4
OAI221X4 30.89 3.0
OAI222X4 26.48 2.2
circuit nodes. Thus, the asymmetric distribution of SER masking probability is used to provide high
SER savings for the circuit while marginally impacting delay and area.
A combinational circuit without feedback can be modeled as a directed acyclic graph (DAG).
The DAG can be made polar by assigning a dummy source node connected to all primary inputs and
a sink node connected to all primary outputs. The earliest arrival time (EAT) of each net can now be
computed by traversing the DAG in the topologically sorted order from the source and assigning the
EAT of a gate output as the maximum of the EATs of its inputs plus the delay of the gate. Similarly,
the latest arrival time (LAT) of each net can be computed by traversing the DAG in the topologically
sorted order from the sink and assigning the LAT of a gate input as the minimum of the LATs of its
outputs minus the delay of the gate. The difference of the LAT and the EAT provides the slack for
each net.
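The traversal described above can be sketched in Python as follows. This is a minimal illustration under our own assumptions: each node stands for a gate together with its output net, the hypothetical dictionaries `fanin` and `delay` describe the netlist, primary inputs arrive at time 0, and the required time at every primary output is set to the largest EAT in the circuit. The function and variable names are illustrative and do not come from the implementation used in this work.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def compute_slack(fanin, delay):
    """Return (eat, lat, slack) dictionaries for a combinational DAG.

    fanin: maps each gate to the list of gates driving its inputs.
    delay: maps each gate to its propagation delay.
    """
    # static_order() lists every gate after all of its fan-in gates.
    order = list(TopologicalSorter(fanin).static_order())

    fanout = {g: [] for g in order}
    for gate, drivers in fanin.items():
        for d in drivers:
            fanout[d].append(gate)

    # Forward pass: EAT = max EAT of the inputs + gate delay.
    eat = {}
    for g in order:
        eat[g] = max((eat[d] for d in fanin.get(g, ())), default=0.0) + delay[g]

    # Assumed required time at the primary outputs.
    required = max(eat.values())

    # Backward pass: LAT = min over fan-out gates of (their LAT - their delay).
    lat = {}
    for g in reversed(order):
        lat[g] = min((lat[s] - delay[s] for s in fanout[g]), default=required)

    slack = {g: lat[g] - eat[g] for g in order}
    return eat, lat, slack


# Hypothetical four gate example: c is driven by a and b, d is driven by a.
fanin = {"a": [], "b": [], "c": ["a", "b"], "d": ["a"]}
delay = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 1.0}
eat, lat, slack = compute_slack(fanin, delay)
print(slack)
```

In this example the path through b and c is critical and has zero slack, while a and d retain positive slack and would be preferred candidates for insertion.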
The probability for radiation blocker circuit insertion (PRI) of each gate output is now computed
by taking a weighted combination of the slack and the CPO at each cell output net. Thus the PRI at
the output of a gate i can be expressed as,
$$PRI(i) = W_{SER} \cdot CPO(i) + W_{slack} \cdot slack(i) \qquad (5.2)$$
Algorithm 3 Steps for the Proposed Selective SER Optimization Using Radiation Blocker Cells
(i) Perform technology mapping to structural netlist.
(ii) Read in structural netlist as a graph and create a polar DAG with source and sink nodes.
(iii) Estimate the logical observability of all the nets using the signal probability at the primary
inputs.
(iv) Populate the load caps at each internal node and compute the electrical observability of all
internal nets.
(v) Populate node delays using the gate type and the load cap and compute timing window ob-
servability for each net.
(vi) Compute the CPO of each net.
(vii) Perform a topological sort from the source node and calculate the EAT of each net.
(viii) Perform a topological sort from the sink node and calculate the LAT of each net.
(ix) Compute the slack of each net.
(x) Compute the PRI value of each net as the weighted combination of the CPO and the slack.
(xi) Select the top M% of the gates based on their PRI values (GM).
(xii) Insert radiation blocker cells at the outputs of the gates selected in GM.
where CPO(i) and slack(i) indicate the CPO and the slack at the corresponding gate output,
while WSER and Wslack are user defined weights to trade off SER optimization against the corresponding
delay overhead. A higher value of PRI at a gate output indicates that the corresponding net has
a high slack and is highly susceptible to soft error upsets at the registers or primary outputs due
to radiation strikes on the active area of the gate. Thus, protecting selected circuit nodes having
higher PRI values ensures that radiation blocker cells protect nodes that are on non-critical paths
but are highly susceptible to soft errors. We select a set of M% of the gate nodes, GM, by
sorting the gate output nets based on their PRI values. Choosing various values of M can be
used to trade off SER reduction against area and delay overhead. We have experimented with different
values of M. We provide detailed results of the tradeoff between SER reduction and area and delay
overhead with varying WSER, Wslack and M in the experimental results section.
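The ranking and selection step can be sketched as below. This is only an illustration: it assumes that the CPO and slack values have already been computed and normalized to a common range (the normalization is our assumption), and the function name, the dictionaries and the rounding up of the M% count are hypothetical choices.

```python
import math


def select_gates(cpo, slack, w_ser=0.9, w_slack=0.1, m_percent=10.0):
    """Rank gate outputs by PRI and return the top M% together with all PRI values.

    cpo, slack: dictionaries keyed by gate name, assumed normalized to [0, 1].
    """
    # Weighted combination of CPO and slack for every gate output.
    pri = {g: w_ser * cpo[g] + w_slack * slack[g] for g in cpo}

    # Highest PRI first.
    ranked = sorted(pri, key=pri.get, reverse=True)

    # Number of gates to protect, rounded up (an illustrative choice).
    count = math.ceil(len(ranked) * m_percent / 100.0)
    return ranked[:count], pri


# Hypothetical normalized values for four gates.
cpo = {"g1": 0.9, "g2": 0.2, "g3": 0.6, "g4": 0.1}
slack = {"g1": 0.5, "g2": 1.0, "g3": 0.0, "g4": 0.2}

selected, pri = select_gates(cpo, slack, m_percent=50.0)
print(selected)
```

With WSER much larger than Wslack, the ranking is dominated by the CPO, so g3 is selected ahead of g2 even though g2 has more slack.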
The proposed algorithm for the selective insertion of radiation blocker circuits for SER
optimization is summarized in Algorithm 3. As shown, the algorithm starts with an initial
technology mapped netlist and then computes the CPO and slack values for each net. The PRI values are
then computed for each gate output. The top M% of the gates based on PRI values are selected for
insertion of the radiation blocker circuit. The computational complexity of the algorithm (not
considering technology mapping) is dominated by the topological sort, which is used
for the computation of CPO (steps iii and v in Algorithm 3) and of the slack at the gate outputs
(steps vii and viii in Algorithm 3). The computational complexity of topological sort depends on
running DFS on the circuit graph and is roughly O(n^2), where n is the number of gates in the circuit.
Steps ii, iv, vi, ix and x are linear time and hence O(n), while step xi is constant time
(O(1)). Thus the overall computational complexity of the algorithm is quadratic in the number of
gates.
[Figure 5.4: flow diagram. Block labels: behavioral ISCAS benchmarks; Nangate standard cell library
(.lib); Synopsys Design Compiler; technology mapped netlist; extract node signal probabilities;
extract node caps based on the structural netlist; compute CPO and slack for all circuit nodes;
select top M% of the nodes; radiation blocker cell layout; Nangate standard library; cell layouts;
FreePDK 45nm technology kit; Calibre RC extractor; extracted spice level netlist with RC parasitics
of standard cells and radiation blocker cell; HSPICE simulations; estimate glitch reduction factor
and % increase in delay and area/power; estimate area and delay overhead; ASSA methodology for SER
estimation (calibration engine, NRC curve database, node logical masking calculator, node electrical
masking calculator, node timing window masking calculator, cumulative circuit SER calculator);
estimate reduction in circuit SER.]
Figure 5.4 Simulation Flow: SER Reduction by Using Radiation Blocker Circuits
5.3 Experimental Results
The proposed algorithm was implemented on a 1.5 GHz UltraSparc processor with 4 GB of memory
running SunOS 5.8. The results were validated using the ISCAS’85 benchmark circuits.
We experimented with our proposed approach using the FreePDK 45nm Process Design Kit [31]
and the Nangate standard cell library [34] based on the 45nm technology. Synopsys Design Compiler
was used to do the initial technology mapping and for computing the enabling probability of
Figure 5.5 Layout of the Radiation Blocker Circuitry
the nets. We modeled radiation strikes with deposited charges in the range [60fC, 135fC] using
current sources as defined in equation 1, with a τα of 10ps and τβ of 5ps as in [39]. The range of
deposited charges was chosen based on the typical radiation flux at sea level [28]. The layout
of the radiation blocker circuitry was created in Cadence Virtuoso (shown in Figure 5.5), and the
netlist with extracted parasitics was then simulated in SPICE for the original standard cell and the
standard cell protected with the radiation blocker circuitry at its output node. We estimated the delay
and area overheads of adding the radiation blocker circuit to each standard cell using the SPICE
simulations and the layouts.
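Equation 1 is not reproduced in this chapter. As a rough sketch, and assuming it has the double-exponential form commonly used to model particle strikes, I(t) = Q / (τα - τβ) * (exp(-t/τα) - exp(-t/τβ)), the injected current for the time constants quoted above can be generated as follows. The pulse form, the function names and the integration settings are assumptions made for illustration only.

```python
import math


def strike_current(t, q, tau_a=10e-12, tau_b=5e-12):
    """Assumed double-exponential strike current in amperes at time t (seconds)."""
    if t < 0:
        return 0.0
    return q / (tau_a - tau_b) * (math.exp(-t / tau_a) - math.exp(-t / tau_b))


def deposited_charge(q, t_end=200e-12, steps=20000):
    """Trapezoidal integral of the pulse, which should recover the charge q."""
    dt = t_end / steps
    total = 0.0
    for k in range(steps):
        i0 = strike_current(k * dt, q)
        i1 = strike_current((k + 1) * dt, q)
        total += 0.5 * (i0 + i1) * dt
    return total


q_low, q_high = 60e-15, 135e-15   # limits of the deposited charge range
ratio_low = deposited_charge(q_low) / q_low
ratio_high = deposited_charge(q_high) / q_high
print(f"recovered charge fraction: {ratio_low:.6f}, {ratio_high:.6f}")
```

Integrating the pulse over a window much longer than τα recovers the deposited charge, which is a convenient sanity check before using such a source in a SPICE deck.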
Many soft error estimation tools have been reported in literature [26, 29, 33, 59]. The ASSA
methodology [59] yields results quite close in accuracy to actual SPICE simulations and is sig-
nificantly faster than other tools for SER estimation. We have implemented a version of this tool
in-house for our SER estimation. The overall simulation flow is given in Figure 5.4.
Table 5.2 Experimental Results for ISCAS’85 Benchmark Circuits
        % Reduction in SER       % Delay Overhead         % Area Overhead
Ckt     M=1%    M=5%    M=10%    M=1%    M=5%    M=10%    M=1%    M=5%    M=10%
c432    6.20    18.76   32.54    0.18    0.37    0.54     2.72    9.53    19.07
c499    4.95    15.41   28.85    0.00    0.05    0.23     2.33    9.35    18.70
c880    17.74   44.46   61.76    0.07    0.20    0.43     2.18    9.48    18.97
c1355   4.07    13.67   25.33    0.00    0.06    0.06     2.39    9.56    19.13
c1908   14.25   36.70   53.00    0.09    0.09    0.28     2.31    9.84    19.11
c2670   18.39   64.91   76.27    0.07    0.07    0.32     2.17    9.54    18.66
c3540   16.71   39.49   54.01    0.00    0.02    0.05     2.13    9.43    18.56
c5315   18.81   43.36   59.42    0.00    0.00    0.00     1.93    9.45    18.72
c6288   30.54   50.33   62.43    0.00    0.00    0.00     1.90    9.33    18.58
c7552   14.14   41.58   59.38    0.02    0.33    0.42     1.94    9.35    18.57
AVG     14.58   36.87   51.30    0.04    0.12    0.23     2.20    9.49    18.81
In Table 5.2, the results for ISCAS85 benchmarks are shown for reduction in SER along with
the corresponding overheads in delay and area. The results are reported for varying values of M with
WSER and Wslack fixed at 0.9 and 0.1 respectively. The SER reduction was calculated as the
decrease in SER of the selectively protected circuit compared with the SER of the original circuit,
divided by the original SER. Similarly, the delay (area) overhead was calculated as the increase in
delay (area) of the selectively protected circuit compared with the delay (area) of the original circuit,
divided by the original delay (area). The proposed approach reduced SER particularly well for
c2670, c6288 and c880. In particular, for c6288, even when only 1% of the cells were selected for the
insertion of the radiation blocker circuit, more than 30% reduction in SER was achieved at
no delay penalty and an area overhead of only 1.9%. c432 was the most susceptible to increased
delay overhead as M increased, while the increase in area overhead was more or less similar across
the various benchmarks. As shown in Table 5.2, on average, our proposed approach can achieve
an SER reduction of as much as 51% with an area overhead of 18% and a delay overhead of only 0.2%. As
power scales very well with the circuit area, we believe that the power overhead of our proposed
approach will also be quite low.
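The percentages reported in Table 5.2 follow the definitions given above, which can be written compactly as below. The symbols SER_orig, SER_prot, D_orig and D_prot are our own shorthand for the SER and delay of the original and the selectively protected circuit; the area overhead is defined in the same way as the delay overhead.

```latex
\[
\%\,\text{SER reduction} = \frac{SER_{orig} - SER_{prot}}{SER_{orig}} \times 100,
\qquad
\%\,\text{delay overhead} = \frac{D_{prot} - D_{orig}}{D_{orig}} \times 100
\]
```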
We also experimented with varying the weights for providing relative importance to SER and
slack (WSER and Wslack) during computation of PRI values at various values of M. The results of
Figure 5.6 Comparison of SER Reduction for Different User Defined Parameters
these comparisons are shown in Figures 5.6 to 5.8. As shown in Figure 5.6, the rate of SER
reduction increases significantly with higher values of WSER, while, as shown in Figure 5.7, the
increase in delay overhead is only marginal. The effect is especially pronounced at higher values of M.
As shown in Figure 5.8, changing the weights WSER and Wslack at a fixed value of M does not affect
the area overhead, which is expected.
5.4 Comparison with Related Works
We understand that comparing our work to other circuit level works for improving SER
is not straightforward, since the base simulation platforms for the methods are quite different. For
example, the shadow gates technique [39] uses 65 nm BPTM technology for its simulations, while
the Schmitt trigger based technique in [38] uses 0.35 um technology libraries for its experiments.
We therefore provide a qualitative comparison of our work with other circuit level techniques for
node hardening. The shadow gates with diode clamper based technique incurs a low delay overhead
for hardening circuit nodes [39]. However, the duplication of entire cells leads to high area overheads.
This is especially true for complex standard cells with many transistors. The area overhead of our
proposed approach, on the other hand, is independent of the type of standard cell. In fact, as we have
shown, the SCCG gates using a radiation blocker circuit incur relatively low area overheads. Also,
due to process variations, duplicate gates in the shadowing technique may not have exactly the same
delay as the
Figure 5.7 Comparison of Delay Overhead for Different User Defined Parameters
original gate. This in turn may affect the performance of the hardened standard cell. Our approach
does not suffer from such a limitation.
Complementary pass gates can act as a low pass filter for glitches induced by radiation strikes
[37]. However, the method can only eliminate transient pulses with low or moderate magnitudes.
High amplitude pulses are attenuated but are not completely eliminated. Hence, protection can be
achieved against only a subset of radiation induced SETs. Otherwise, large sized pass gates or a
chain of pass gates need to be used. This can make the method expensive in terms of delay for the
realistic radiation flux found at sea level. The Schmitt trigger based technique [38] uses explicit
feedback of stored charge to fight the transient charges injected during a radiation strike. This idea
has also been used in [36] in the context of dynamic gates or latches and in [40] for static CMOS circuits.
However, due to technology scaling, very little charge is stored in the feedback node.
This seriously impacts the glitch reduction capability of these circuits at scaled technology nodes,
precisely where the soft error problem becomes an even bigger challenge. In contrast, our approach
uses the characteristic of the radiation induced transient itself to detect the occurrence of a radiation
strike and cuts off the cell hit by the strike from providing input drive to the driven cell.
We also note that many works exist on selective sizing of the gates of a circuit [84], [42], [45] and
on simultaneous sizing and flip-flop selection [30] for SER reduction. However, we felt that it was not
fair to compare these logic level sizing approaches to our approach, which predominantly depends
Figure 5.8 Comparison of Area Overhead for Different User Defined Parameters
on circuit level hardening. We strongly feel that such sizing approaches can be applied over and
above our proposed approach to further reduce the circuit SER.
CHAPTER 6
LOGIC LEVEL RELIABILITY-CENTRIC GATE SIZING
The trends in technology scaling, the shrinking device dimensions, the environmental noise fac-
tors and the uncertainty due to process variations have significantly impacted the reliability and
yield of integrated circuits in the nanometer regime. The transient faults, also known as soft er-
rors, induced by particles arising from radiation strikes could occur not only in memory elements
but also in the internal nodes of combinational and sequential logic, which can propagate to other
nodes posing a significant threat to the signal integrity in circuits. Further, crosstalk noise due to
the cross coupling capacitance among wires placed proximally close is another major challenge to-
wards achieving high signal quality. The presence of uncertainty due to process variations makes
it difficult to analyze and estimate noise in circuits. In this work, we investigate a new approach
for reliability centric gate sizing in which the objective is the simultaneous optimization of both the
soft error rate (SER) and the crosstalk noise besides the power and performance of circuits while
considering the effect of process variations. In the proposed approach, the soft error rate for a gate
is modeled as a first order function of the gate size and the sizes of the gates in its transitive fan-
in. The glitch masking effects are accurately captured by using two new metrics called the glitch
enabling probability (GEP) and the cumulative probability of observability (CPO) defined based
on the signal probabilities of the nets. The crosstalk induced noise is modeled at the logic level
based on the clustering of the structural netlists using Rent’s rule. While the clustering algorithm
iterates until the difference in Rent’s exponent values between any pair of clusters falls below a
user defined threshold, the crosstalk noise is optimized by minimizing the pair-wise differences in
sizes of all cells within a cluster. Further, maximizing the variance in the gate sizes results in maxi-
mizing the available slack which in turn minimizes the delay uncertainty due to process variations,
thus improving the timing yield. The first order modeling of SER and the crosstalk noise at the
logic level reduces the number of decision choices thus reducing the search space resulting in an
efficient optimization algorithm. The resulting gate sizing problem is formulated as a non-linear
mathematical program which is solved using the interior point method. Experimental results on
ISCAS’85 benchmark circuits indicate significant improvement in SER, crosstalk noise, power and
timing yield compared to the corresponding constrained optimization formulations.
6.1 Interaction of Various Noise Sources under Process Uncertainty
In the nanometer regime, the interconnect geometries in integrated circuits have aggressively
scaled down. While this has reduced the self capacitance of the wires, the coupling capacitance
between the wires has become a challenge to the reliability of the systems. Two closely spaced nets
are each treated as a victim or an aggressor with respect to the other net. A victim net is often adversely
affected by the transitions on the aggressor net due to the coupling capacitance between them, which
may lead to functional failures (wrong logic computation or longer circuit delay) and hence to a timing
failure. If the victim switches in the same direction as the aggressor, the signal transition in the
victim is hastened, leading to hold time violations, while if the victim switches in the opposite direction
to the aggressor, the signal transition is delayed, leading to setup time violations and timing failure.
If the victim line is at steady state and switching in the aggressor induces a signal higher (lower)
than its high (low) logic level, a bootstrap noise results. In general, crosstalk noise depends primarily
on the coupling capacitance between the interconnects, the spacing between the wires, and the sizes
of the victim and the aggressor gates.
In this section, we illustrate how the effects of crosstalk noise and soft errors are compounded
due to their simultaneous presence in a circuit. Consider the occurrence of a radiation induced glitch
on the output of gate G4 in Figure 6.1. The glitch may be sufficiently separated in time from the
arrival of the clock edge. However, the induced crosstalk due to Cz causes a spurious clock pulse and
the glitch is latched. Note that G4 may have high masking probability due to its large timing window
masking factor but a soft error still results due to the coupled capacitive effect. Again, consider the
case that due to logic transitions on aggressors G1 and G5 a crosstalk noise appears on victim G3
through coupling capacitors Cx and Cy. The amplitude of such crosstalk noise may be too small to
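[Figure 6.1: circuit diagram. Labeled elements: gates G1 to G6, coupling capacitors Cx, Cy and Cz,
signal transitions (1 -> 0 and 0 -> 1), and a clocked latch with Clock, D, Q and Q' terminals.]
Figure 6.1 Interaction of Soft Errors and Crosstalk Noise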
cause functional failure. The glitch will be intensified, however, if a radiation strike occurs on gate
G3 at the same time, ultimately leading to a functional failure. Thus, it can be seen that although a
single noise source may not affect the functionality of a circuit, the simultaneous presence of various
noise sources could intensify the effects of each other. As the vulnerability of the system is more
severely affected in the presence of multiple noise sources, the analysis and optimization of circuits
considering the effect of a single noise source could be inefficient and pessimistic.
Next, we illustrate how deterministic noise avoidance techniques are rendered ineffective in the
presence of extreme process variations in current designs. A popular strategy for soft error detection
is based on cost effective partial duplication [85]. However, as shown in Figure 6.2, the delay of the
duplicated logic block may be quite different from the original logic due to process variations. Thus,
if the delays of the original and the duplicated blocks are significantly separated in time, the above
approach is no longer applicable for error detection. Further, the signal switching delay of a victim
net is affected by switching on its coupled aggressor nets. The delay of the signal in the victim
net could be larger (smaller) when the aggressors are switching in the opposite (same) direction as
the victim. Thus, crosstalk noise can create delay uncertainty. Delay uncertainty also arises in
the presence of process variations, due to uncertainty in gate length, oxide thickness, gate threshold
voltage, etc. A unified scheme is thus required which can handle delay uncertainty due to both
crosstalk noise and manufacturing variations. The uncertainty in the propagation delay of signals can
cause violations of set-up and hold timing constraints, resulting in timing failure of the design [72].
[Figure 6.2: block diagram. Labeled blocks: Primary Inputs, Function Logic, Cutset Logic, Cutset
Logic', Checker, Error Indication, Primary Outputs.]
Figure 6.2 Soft Error Mitigation under Process Uncertainty
[Figure 6.3: two panels, (A) Transient Pulse Generation and (B) Transient Pulse Propagation.
Labels in the circuits: inputs I1 to I4, A, B, output O1, and the radiation strike location.]
Figure 6.3 First Order Model on Soft Errors of Logic Circuits with Varying Gate Size
6.2 Logic Level Modeling of the Design Metrics
In this section, we describe our methodology for modeling at the logic level the various metrics
for design optimization like SER, crosstalk noise, power and delay. In Section 6.2.1, we provide
some background on soft errors in logic circuits followed by a first order model for the optimization
of the soft error rate. In Section 6.2.2, we present a Rent’s rule based clustering method to model
crosstalk noise effects in the logic level. Finally, in Section 6.2.3, we describe the power and timing
models.
6.2.1 First Order Modeling of Glitch Masking Effects
We have developed a first order model of the soft error in a circuit node by considering only
the effect of the size of the gate and the sizes of the gates in the transitive fan-in of the gate. It can
be shown that such an approximation leads to negligible error in SER estimation [30]. As shown in
Figure 6.3(A), higher gate sizes have an adverse effect in propagating transient pulses. This is due
to the fact that such a signal is amplified to a greater extent by a larger gate. Thus, smaller sized
gates are good for filtering out transient pulses due to radiation strikes and hence can effectively
reduce circuit SER. However, as shown in Figure 6.3(B), smaller sized gates have a lower Qcrit , and
are easily vulnerable to transient pulse generation following a radiation strike. Larger gates have
sufficiently higher amount of stored charges and hence their inherent inertia prevents the generation
of transient pulses.
As described in Chapter 3, the CPO and GEP accurately capture the masking phenomenon at
the logic level. We select a set of gate nodes, GM, by sorting the various gate output nets based on
their CPO and selecting M% of the topmost nets. The value of M was empirically selected to be 10%.
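The selection of GM can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name `select_top_nets`, the net names and the CPO values are hypothetical, and the sketch assumes that "topmost" means the nets with the highest CPO.

```python
import math


def select_top_nets(cpo_by_net, m_percent=10.0):
    """Return the top m_percent of nets, ranked by CPO (highest first).

    Assumption: a higher CPO marks a more important net; at least one
    net is always selected when the input is non-empty.
    """
    if not cpo_by_net:
        return []
    count = max(1, math.ceil(len(cpo_by_net) * m_percent / 100.0))
    ranked = sorted(cpo_by_net, key=lambda net: cpo_by_net[net], reverse=True)
    return ranked[:count]


# Hypothetical CPO values for ten gate output nets.
example_cpo = {
    "n0": 0.05, "n1": 0.40, "n2": 0.12, "n3": 0.33, "n4": 0.08,
    "n5": 0.91, "n6": 0.27, "n7": 0.02, "n8": 0.66, "n9": 0.19,
}
selected = select_top_nets(example_cpo, m_percent=10.0)
```

With M = 10% and ten nets, a single net is selected; raising `m_percent` widens GM at the cost of a larger optimization problem.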
6.2.2 Crosstalk Noise Modeling at the Logic Level
Gate (driver and receiver) sizing can be an effective technique to reduce crosstalk noise on the
nets. However, if the size of one gate is increased to reduce the crosstalk noise on its output net,
the noise induced by it on the neighboring net increases. This makes the aggressor and victim gates
interchange roles thus resulting in a cyclic dependency in sizing order. In general, crosstalk noise
may be estimated very easily in the post routing stage. However, in this case sizing has to be limited
only to free/dead space around the cell which may not yield the best solution. Alternatively, the
entire place and route needs to be repeated for the new sizing solution, which in turn may lead to
very different crosstalk estimates, and the whole process needs to be iterated until convergence is
achieved. To alleviate this problem, we choose to estimate crosstalk noise at the gate level.
It should be noted, however, that modeling crosstalk noise at the logic level is a challenging
problem. As no layout information is available, neighbouring aggressor nets of a victim net and the
degree of their overlap is unknown. This makes estimation of coupling capacitance at the logic level
Figure 6.4 Modeling Crosstalk Noise using Graph Clustering based on Rent’s Exponent Values
a very difficult problem and an efficient heuristic is required to estimate the subsequent placement
and routing phases. We have modeled the effect of crosstalk noise by exploiting Rent’s rule which
relates the number of external signal connections to the gate count of a logic block and is given by,
\[ T = t\,G^{p} \tag{6.1} \]
where T denotes the number of external connections, G is the gate count of the logic block, t
corresponds roughly to the average pin count of each gate and p is the Rent’s exponent. The Rent’s
exponent is often used to derive placement models [102]. We model crosstalk effects by clustering
the structural netlist based on Rent’s exponent values. The clustering algorithm is iterated until
the difference in Rent’s exponent values between any pair of clusters falls below a user defined
threshold. Rent’s exponent values are computed as in [102] for large clusters and by brute force
for small clusters. As shown in Figure 6.4 for an example circuit (c17), we found that a generated
cluster is quite accurate in providing an estimate of the local placement around each cell in the
placement phase that follows. The nets for the cells of the same cluster are thus highly probable to
be coupled together and hence can induce crosstalk noise. The crosstalk noise can then be optimized
by minimizing the difference in sizes of all the cells in the same cluster. Our simulations suggest
that such a Rent’s exponent based clustering approach along with routing hints is quite effective in
modeling crosstalk noise at the logic level.
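To make the role of the Rent's exponent concrete, the following Python sketch estimates t and p of equation 6.1 from (G, T) samples by a least-squares fit in log-log space. This is a generic fit shown only for illustration; it is not the procedure of [102] used in this work, and the function name and the synthetic samples are assumptions.

```python
import math


def fit_rent_parameters(samples):
    """Fit log T = log t + p * log G by least squares; return (t, p).

    samples: iterable of (gate_count, external_connections) pairs,
    with at least two distinct gate counts.
    """
    xs = [math.log(g) for g, _ in samples]
    ys = [math.log(t) for _, t in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p = sxy / sxx                        # slope is the Rent's exponent
    t = math.exp(mean_y - p * mean_x)    # intercept gives the pin-count term
    return t, p


# Synthetic data generated from T = 3 * G^0.6 (hypothetical values).
samples = [(g, 3.0 * g ** 0.6) for g in (4, 16, 64, 256)]
t_est, p_est = fit_rent_parameters(samples)
```

Two clusters whose fitted exponents differ by less than the user defined threshold would be treated as similar by the clustering criterion described above.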
6.2.3 Power and Timing Models
The dynamic power consumption of a gate i is given as,
\[ P_i = \frac{1}{2}\, f\, V_{dd}^{2}\, E_i \left( C_i + C_{wire} \right) + P_{sc} \tag{6.2} \]
where, Pi is the total dynamic power consumed by gate i, f is the clock frequency, Vdd is the
supply voltage for the gate, Ei is the average switching activity of the gate, Ci is the intrinsic gate
capacitance, Cwire is the sum of the capacitances of all the interconnects that fan out from gate i and Psc is the short-
circuit power. Reducing the size (Si) of the gate reduces the intrinsic gate capacitance of gate i,
power consumption and fan-in load capacitances of the gates in the transitive fan-out of the gate i.
A non-linear delay model of a gate i is given by the logical effort model [68, 81],
\[ d_i = a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i} \tag{6.3} \]
where Si refers to the size of gate i, fo(i) is the set of gates that fan out from gate i, and the constant
coefficients ai and bi are empirically determined for each type of cell by processing the vendor specific
library file. The delay values for various load values with different drive strengths are extracted from
the library file and the non-linear coefficients are determined by using a curve-fitting program.
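The two models can be evaluated numerically as in the sketch below. It assumes equation 6.2 has the form P = (1/2) f Vdd^2 E (C + Cwire) + Psc and equation 6.3 the form d = a + b * (sum of fan-out sizes) / S; the function names and all numeric values are hypothetical.

```python
def dynamic_power(f, vdd, activity, c_gate, c_wire, p_sc=0.0):
    """Dynamic power of one gate under the assumed form of equation 6.2."""
    return 0.5 * f * vdd ** 2 * activity * (c_gate + c_wire) + p_sc


def gate_delay(a, b, size, fanout_sizes):
    """Gate delay under the assumed form of equation 6.3.

    a, b: fitted coefficients for the cell type.
    size: size of the driving gate.
    fanout_sizes: sizes of the gates driven by this gate.
    """
    return a + b * sum(fanout_sizes) / size


# Hypothetical operating point: 1 GHz, 1.0 V, 10% activity, 2 fF + 3 fF load.
power = dynamic_power(1e9, 1.0, 0.1, 2e-15, 3e-15)

# Doubling the driver size halves the load-dependent part of the delay.
delay_small = gate_delay(1.0, 0.5, 2.0, [1.0, 3.0])
delay_large = gate_delay(1.0, 0.5, 4.0, [1.0, 3.0])
```

The example shows the tension exploited by the sizing formulation: a larger driver is faster, but it presents more capacitance to its own fan-in gates and consumes more power.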
6.3 Gate Sizing Formulation
The minimum delay in a circuit is achievable by respecting the static timing analysis constraints
and hence solving the unimetric constrained optimization problem as shown below,
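\[
\begin{aligned}
\min \quad & T_{max} \\
\text{s.t.} \quad & S_{min,i} \le S_i \le S_{max,i} \\
\text{and} \quad & D_g \le T_{out}, \quad \forall g \in G \\
\text{and} \quad & T_{out} \le T_{max}, \quad \forall out \in PO \\
\text{and} \quad & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i}
\end{aligned}
\tag{6.4}
\]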
where, Tin and Tout are the specified timing targets of the input and output of a gate, g denotes a
particular gate in a circuit which belongs to the set of all gates G, and PO is the set of all primary
outputs. Smin,i and Smax,i for gate i account for the size restrictions for that gate type in the library.
If we only target power minimization, then all the gates can be set to minimum size. The mini-
mum power in a circuit is thus obtained by solving the following unimetric constrained optimization
problem,
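\[
\begin{aligned}
\min \quad & \sum_{i} P_i \\
\text{s.t.} \quad & S_{min,i} \le S_i \le S_{max,i} \\
\text{and} \quad & D_g \le T_{out}, \quad \forall g \in G \\
\text{and} \quad & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i}
\end{aligned}
\tag{6.5}
\]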
where, Pi is the same as defined in equation 6.2, while the other notations are the same as defined in
equation 6.4.
Using the first order soft error analysis described in Section 6.2.1, a nonlinear mathematical
program for unimetric constrained SER aware gate sizing can be formulated as shown below,
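\[
\begin{aligned}
\max \quad & \sum_{k \in G_M} X_k S_k \\
\text{s.t.} \quad & X_j \Big( \sum_{i \in fi(j)} S_i \cdot GEP_{ij} - S_j \Big) \le -1, \quad \forall j \in G_M \\
\text{and} \quad & -1 \le X_j \le 1, \quad \forall j \in G_M \\
\text{and} \quad & S_{min,i} \le S_i \le S_{max,i} \\
\text{and} \quad & D_g \le T_{out}, \quad \forall g \in G \\
\text{with} \quad & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i}
\end{aligned}
\tag{6.6}
\]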
where fi(i) is the set of gates that are in the transitive fan-in of gate i, GEPij is the GEP for a net
i for gate j, GM is the selected set of M gates as described previously and Xi are slack variables
for the mathematical program for the M selected gates. Our convex program accounts for the first
order soft error analysis by selectively choosing to minimize Si (by maximizing −Si) or to maximize
Si. If Xj happens to be positive to satisfy the first constraint in equation 6.6, simultaneously
maximizing XjSj and bounding Xj to be in [-1,1] pushes Xj towards the value of 1 and leads the
objective function to maximize Si. Similarly, if Xj happens to be negative to satisfy the constraint,
simultaneously maximizing XjSj and bounding Xj to be in [-1,1] pushes Xj towards the value -1
and leads the objective function to maximize −Si. The formulation of the first constraint in equation
6.6 actually captures whether there is a higher probability of glitch occurring at the gates in the
transitive fan-in of the gate i or if there is a higher probability of glitch occurring on the gate i itself.
Thus, simultaneously bounding the slack variables, along with maximizing Xk × Sk, forces selective
minimization or maximization of gate sizes of the gates depending on the higher probability of the
occurrence of a glitch, either on transitive fan-in of the gate or on the gate itself.
Further, using the Rent’s rule based graph clustering for crosstalk noise modeling described
in 6.2.2, a mathematical program for optimizing crosstalk noise effects during gate sizing can be
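represented as,
\[
\begin{aligned}
\min \quad & \sum_{C} \sum_{i,j \in C} \left( S_i - S_j \right)^{2}, \quad \forall C \in Clusters_{graph} \\
\text{s.t.} \quad & S_{min,i} \le S_i \le S_{max,i} \\
\text{and} \quad & D_g \le T_{out}, \quad \forall g \in G \\
\text{with} \quad & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i}
\end{aligned}
\tag{6.7}
\]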
where Clustersgraph denotes the set of clusters obtained using hierarchical graph clustering on the
structural netlist graph. We can now provide a multi-metric gate sizing scheme for simultaneous
optimization of SER, crosstalk noise and power under delay constraints as follows,
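\[
\begin{aligned}
\min \quad & -c_1 \frac{\sum_{k \in G_M} X_k S_k}{R_u} + c_2 \frac{\sum_{i} P_i}{P_u} + c_3 \frac{\sum_{C} \sum_{i,j \in C} \left( S_i - S_j \right)^{2}}{N_u} \\
\text{s.t.} \quad & X_j \Big( \sum_{i \in fi(j)} S_i \cdot GEP_{ij} - S_j \Big) \le -1, \quad \forall j \in G_M \\
\text{and} \quad & -1 \le X_j \le 1, \quad \forall j \in G_M \\
\text{and} \quad & D_g \le T_{out}, \quad \forall g \in G \\
\text{and} \quad & D_p \le T_{spec}, \quad \forall p \in PO_{nodes} \\
\text{with} \quad & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i}
\end{aligned}
\tag{6.8}
\]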
where, POnodes are the gates connected to the primary outputs, Tspec is the specified timing target
based upon solving equation 6.4, while Pu, Ru and Nu are the solutions obtained after solving
the unimetric optimization problems given by equations 6.5-6.7, and help in normalizing the various
metrics in the multi-metric objective function. c1, c2 and c3 are user-defined weighting coefficients
controlling the optimization of SER, power and crosstalk noise respectively.
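The normalization can be illustrated with a short Python sketch of the objective value. It assumes the objective of equation 6.8 is the weighted sum −c1 (SER term)/Ru + c2 (power)/Pu + c3 (crosstalk term)/Nu; the function name, argument names and the numbers are hypothetical.

```python
def multi_metric_objective(ser_term, power, crosstalk,
                           r_u, p_u, n_u, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Value of the normalized multi-metric objective (to be minimized).

    ser_term:  sum of X_k * S_k over the selected gates (maximized, hence
               it enters with a negative sign).
    power:     sum of P_i over all gates.
    crosstalk: sum of squared size differences within clusters.
    r_u, p_u, n_u: values of the same quantities at the corresponding
               unimetric optima, used as normalization factors.
    """
    c1, c2, c3 = weights
    return -c1 * ser_term / r_u + c2 * power / p_u + c3 * crosstalk / n_u


# At the unimetric optima each normalized term equals 1.
value_at_optima = multi_metric_objective(
    2.0, 4.0, 8.0, r_u=2.0, p_u=4.0, n_u=8.0, weights=(0.5, 0.25, 0.25))

# A solution with 20% more power than the power-only optimum.
value_more_power = multi_metric_objective(
    2.0, 4.8, 8.0, r_u=2.0, p_u=4.0, n_u=8.0, weights=(0.5, 0.25, 0.25))
```

Dividing by Ru, Pu and Nu makes the three terms dimensionless, so the weights express relative priorities rather than compensating for different units.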
As discussed previously, process parameter variations impact the gate delays of the cells in a
design thus impacting the overall circuit delay and hence affecting the timing yield of the design.
We minimize the timing yield loss due to process variations by maximizing the delay variance at
each node in the timing graph. The maximum delay variance is achieved by adding slack variables
for each node in the node based STA formulation and then maximizing the sum of such slack vari-
ables for all nodes. Further, our Rent’s exponent based clustering formulation to optimize crosstalk
noise can also be used effectively to model spatial correlations among process parameters. Nodes
belonging to the same cluster are likely to be placed closer together by the placer and hence will
have similar delay variance characteristics. Thus, a mathematical program for optimizing timing
yield under delay constraints can be stated as follows,
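\[
\begin{aligned}
\max \quad & \sum_{g} \sigma_g \\
\text{s.t.} \quad & S_{min,i} \le S_i \le S_{max,i} \\
\text{and} \quad & D_g \le T_{out}, \quad \forall g \in G \\
\text{and} \quad & T_{out} \le T_{max}, \quad \forall out \in PO \\
\text{and} \quad & D_p \le T_{spec}, \quad \forall p \in PO_{nodes} \\
\text{and} \quad & \sum_{i,j \in C} \left( \sigma_i - \sigma_j \right)^{2} \le \delta, \quad \forall C \in Clusters_{graph} \\
\text{and} \quad & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i} + \sigma_g
\end{aligned}
\tag{6.9}
\]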
where δ is a user-defined threshold parameter that models spatial correlation in the delay variance of a
gate. We can now provide a multi-metric gate sizing formulation towards simultaneous optimization
of power, SER, crosstalk noise and parametric yield under delay constraints as follows,
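\[
\begin{aligned}
\min \quad & -c_1 \frac{\sum_{k \in G_M} X_k S_k}{R_u} + c_2 \frac{\sum_{i} P_i}{P_u} + c_3 \frac{\sum_{C} \sum_{i,j \in C} \left( S_i - S_j \right)^{2}}{N_u} - c_4 \frac{\sum_{g} \sigma_g}{V_u} \\
\text{s.t.} \quad & X_j \Big( \sum_{i \in fi(j)} S_i \cdot GEP_{ij} - S_j \Big) \le -1, \quad \forall j \in G_M \\
\text{and} \quad & -1 \le X_j \le 1, \quad \forall j \in G_M \\
\text{and} \quad & D_g \le T_{out}, \quad \forall g \in G \\
\text{and} \quad & D_p \le T_{spec}, \quad \forall p \in PO_{nodes} \\
\text{and} \quad & \sum_{i,j \in C} \left( \sigma_i - \sigma_j \right)^{2} \le \delta, \quad \forall C \in Clusters_{graph} \\
\text{with} \quad & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i} + \sigma_g
\end{aligned}
\tag{6.10}
\]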
where, Vu is the normalization factor obtained after solving equation 6.9 and c4 is a user-defined
parameter controlling the optimization of timing yield.
Algorithm 4 Steps in the Proposed Gate Sizing Algorithm
(i) Perform technology mapping to structural netlist.
(ii) Read in structural netlist as a graph.
(iii) Estimate the CPO and GEP of all the nets using the enabling probability of the nets.
(iv) Select the M topmost gates based on CPO values.
(v) Solve unimetric delay minimization problem to determine the timing target.
(vi) Solve unimetric power minimization problem.
(vii) Solve unimetric SER maximization problem.
(viii) Solve unimetric crosstalk minimization problem.
(ix) Solve unimetric delay variance maximization problem.
(x) Use the solutions for the above four problems to form a multi-metric optimization problem
and decide a timing target using the solution of the unimetric delay minimization problem.
(xi) Solve the corresponding problem using KNITRO.
(xii) Discretize the solution into a structural netlist.
The steps for the proposed reliability-centric gate sizing algorithm under process variations are
illustrated in Algorithm 4. The algorithm initially solves the unimetric optimization problems for
reliability (both for crosstalk noise and SER), power and delay. These solutions are used to
formulate the unified optimization problem. The continuous gate sizes are discretized using a nearest
neighbor function, as in [78, 81], to produce a sized structural netlist.
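The final discretization step can be sketched in Python as follows. This is a minimal illustration of nearest neighbor rounding, not the routine of [78, 81]; the function name, gate names and drive strengths are hypothetical.

```python
def discretize_sizes(continuous_sizes, library_sizes):
    """Map each continuous gate size to the nearest available library size.

    continuous_sizes: dict of gate name -> size returned by the solver.
    library_sizes:    drive strengths available for the cell in the library.
    On an exact tie, the size that appears first in library_sizes is chosen.
    """
    return {
        gate: min(library_sizes, key=lambda s: abs(s - size))
        for gate, size in continuous_sizes.items()
    }


# Hypothetical solver output and library drive strengths.
solver_sizes = {"g1": 1.3, "g2": 2.9, "g3": 6.5}
sized = discretize_sizes(solver_sizes, [1, 2, 4, 8])
```

Because rounding can move a gate away from its optimal continuous size, the timing constraints would normally be re-checked on the discretized netlist.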
6.4 Experimental Results
The proposed gate sizing algorithm was implemented on a 1.5 GHz UltraSparc processor with
4GB of memory and running SunOS 5.8. The results were validated using the ISCAS’85 benchmark
circuits [70]. We used a subset of cells (INV, NOR2, NAND2, XOR2) from the TSMC 90
nm technology library for our simulations. Synopsys Design Compiler was used to do the initial
technology mapping and for computing the enabling probability of the nets. The Rent’s exponent
based clustering was performed by a separate C script. The convex programs were solved using the
KNITRO optimization solver [71] available from the NEOS server. The weighting coefficients were
chosen empirically in our formulation, and were set to equal weights of 0.25 each.
Many soft error estimation tools have been reported in literature [29, 33]. We chose the SEAT-
LA tool [29] for soft error rate estimation, primarily for its speed and accuracy and also because it
models the entire spectrum of neutron strikes (from charge values in the [10fC,150fC] range). The
Figure 6.5 Simulation Flow for Reliability-centric Gate Sizing Under Process Variations
(Flow stages: behavioral ISCAS benchmark and 90 nm technology library (.lib); Synopsys Design Compiler; technology mapped netlist (Verilog); delay characterization using curve fitting; estimation of CPO and GEP of nets; script to convert to AMPL format; delay optimization; single metric optimization of SER, power, crosstalk noise and timing yield with other metrics as constraints; multi-metric gate sizing considering process variations under timing constraints; optimized structural netlist; SER estimation with SEAT-LA over N random input vectors; place and route in SoC Encounter with Fire'n'Ice SPEF extraction for crosstalk estimation; timing yield estimation with the SSTA tool; power calculation using Design Compiler.)
Table 6.1 Experimental Results on Benchmark Circuits
Timing Yield Calculation at Different Timing Margins
ISCAS'85      Deterministic Worst Case       Proposed Approach
Benchmark     at 5%    at 15%   at 30%       at 5%    at 15%   at 30%
c17           79.48    82.99    99.91        95.57    99.08    99.91
c432          51.74    87.45    99.1         61.72    92.1     99.32
c499          41.79    84.24    97.81        56.57    88.53    98.9
c880          52.76    87.51    98.38        73.92    93.03    98.41
c1355         42.56    84.56    95.68        62.75    91.31    97.51
c1908         43.04    84.38    97.84        66.42    89.58    98.66
c2670         62.96    81.27    94.19        71.6     84.02    96.84
c3540         47.59    86.96    96.63        54.29    88.91    97.46
c5315         48.33    88.52    99.23        83.26    99.04    99.93
c7552         42.94    85.4     98.08        57.88    95.17    99.63
tool, however, has a path-based approach which was impractical for large circuits due to the expo-
nential number of paths. The overall circuit SER was therefore computed by a top-down procedure.
We partitioned the larger circuits into sub-circuits and the SER was computed for each sub-
circuit using SEAT-LA. We computed the product of the SER of the sub-circuit closer to primary
outputs with the SER of the sub-circuit at its fan-in multiplied by its GEP. The SER for a top level
circuit was computed as the sum of such products for all sub-circuits at its transitive fan-in. The
entire flow is repeated for several random input vectors to compute an average SER rate.
Crosstalk noise was estimated by placement and routing of the structural netlist in Cadence
Encounter and then using Cadence Fire’n’Ice for extracting the parasitic resistance and capacitance
values in SPEF format. Fire’n’Ice also provides the net lengths of the coupled nets to a particular
net in the SPEF file. The top few aggressor nets were identified for each net using a C script.
Subsequently, the average crosstalk noise is estimated as discussed in [69, 82] with a separate C
script.
We estimated the timing yield of the netlist using an in-house implementation of a SSTA tool,
which is based on propagating discrete probability distributions through the structural netlist as
described in [73]. The variances of the individual gate distributions were modified to model spatial
correlations using the information from the placement tool.
Figure 6.6 Average Timing Yield at Different Timing Margins
We estimated the timing yield for a deterministic delay optimized gate sizing formulation with
worst corner-case values and for a gate sizing approach as provided in equation 8 at different timing
targets. The timing targets were decided based on adding a T% margin on the critical delay obtained
using traditional gate sizing for delay optimization. The timing yield for various values of the timing
margin T is shown in Table 6.1. As shown in Figure 6.6, our gate sizing methodology considering
process variations can improve the timing yield of designs by more than 15% for designs with timing
margins of less than 5%.
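The average improvement at the 5% timing margin can be checked directly from the values in Table 6.1. The short Python sketch below does this arithmetic; the variable names are ours, and the difference is expressed in percentage points of timing yield, which is our reading of the improvement figure.

```python
# Timing yield (%) at a 5% timing margin, copied from Table 6.1 in the order
# c17, c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c7552.
deterministic_5 = [79.48, 51.74, 41.79, 52.76, 42.56,
                   43.04, 62.96, 47.59, 48.33, 42.94]
proposed_5 = [95.57, 61.72, 56.57, 73.92, 62.75,
              66.42, 71.6, 54.29, 83.26, 57.88]

avg_deterministic = sum(deterministic_5) / len(deterministic_5)
avg_proposed = sum(proposed_5) / len(proposed_5)

# Average gain in timing yield, in percentage points.
avg_gain = avg_proposed - avg_deterministic
```

The averages come to roughly 51.3% and 68.4%, a gain of about 17 percentage points, which is consistent with the improvement of more than 15% stated above.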
We illustrate the results of SER reduction on ISCAS85 benchmarks for single-metric SER op-
timization and multi-metric optimization technique considering simultaneous optimization of SER,
delay and power. As shown in Figure 6.7, for single-metric SER optimization, the average SER
reduction over unimetric delay optimization is 52% and 25% over unimetric power optimization.
For multi-metric optimization, the average SER reduction over unimetric delay optimization is 45%
and 11% over unimetric power optimization. We compute the delay overhead of the single-metric
SER optimization and the multi-metric optimization as the percentage increase in normalized delay
compared to unimetric delay optimization.
We also estimated the percentage reduction in dynamic power, crosstalk noise and SER rate
for the multi-metric gate sizing formulation considering process variation with the corresponding
constrained optimization with a single objective while the other metrics were constrained to the
Figure 6.7 SER Reduction for ISCAS85 benchmarks
value obtained using the multi-metric sizing scheme. For example, for comparing power reduction
we constrained the SER, crosstalk noise and parametric yield to be the same as obtained in the
multi-metric optimization and formulated the power optimization problem under these constraints.
As shown in figure 6.8, on average the multi-metric optimization methodology showed about 39%
improvement in dynamic power reduction. For comparing SER reduction we constrained the power,
crosstalk noise and parametric yield to be the same as obtained in the multi-metric optimization and
formulated the SER optimization problem under these constraints. As shown in figure 6.8, on
average the multi-metric optimization methodology showed about 21% improvement in soft error
rates.
Figure 6.8 Improvement in SER, Crosstalk Noise and Power
We also compared reduction in average crosstalk noise by constraining the power, SER and para-
metric yield to be the same as obtained in the multi-metric optimization and formulated the crosstalk
optimization problem under these constraints. As shown in figure 6.8, on average the multi-metric
optimization methodology showed about 26% improvement in average crosstalk noise. We also
noted that the methodology, when tested on larger benchmarks, failed due to memory and size
limitations of the KNITRO solver. This is inherently a limitation of the solver used and not of the proposed
approach. Further, this is not severe because of the increasing trend towards shallower logic depths
in integrated circuits of the nanometer regime due to high clock frequency requirements.
CHAPTER 7
SOFT ERROR TOLERANCE AT ARCHITECTURAL LEVEL
With the continuous decrease in the minimum feature size and increase in the chip density due
to technology scaling, on-chip L2 caches are becoming increasingly susceptible to multi-bit soft
errors. The increase in multi-bit errors could lead to higher risk of data corruption and potentially
result in the crashing of application programs. Traditionally, the L2 caches have been protected from
soft errors using techniques such as (i) error detection/correction codes, (ii) physical interleaving of
cache bit lines to convert multi-bit errors into single-bit errors and (iii) cache scrubbing. While the
first two methods incur large area overheads for multi-bit errors, identifying the time interval for
scrubbing could be tricky.
In this chapter, we investigate in detail the multi-bit soft error rates in large L2 caches and pro-
pose a framework of solutions for their correction based on the amount of redundancy present in the
memory hierarchy. We investigate several new techniques for reducing multi-bit errors in large L2
caches, in which, the multi-bit errors are detected using simple error detection codes and corrected
using the data redundancy in the memory hierarchy. We also propose several techniques to con-
trol/mine the redundancy in the memory hierarchy to further improve the reliability of the L2 cache.
The proposed techniques were implemented in the Simplescalar framework and validated using the
SPEC 2000 integer and floating point benchmarks for L2 cache vulnerability, global cache miss-rate,
average cycle count and main memory write back rate, considering the area and power overheads.
Experimental results indicate that the vulnerability of L2 caches can be decreased by 40% on the
average for integer benchmarks and 32% on the average for floating point benchmarks, with an av-
erage multi-bit error coverage of about 96%, with significantly less area and power overheads and
with virtually no performance penalty.
Figure 7.1 Vulnerability of Different Cache Organizations for SPECINT2000.
The rest of the chapter is organized as follows. In Section 7.1, we model the vulnerability of
the L2 caches due to multi-bit errors using a probabilistic formulation characterized by extensive
simulations for multi-bit errors in various L2 cache organizations. In Section 7.2, we present several
schemes to reduce the vulnerability of the L2 cache based on exploiting the redundancy present in the
memory hierarchy. In Section 7.3, we present schemes to control the redundancy for reducing the
vulnerability of the L2 cache. Section 7.4 details the experimental methodology and the results.
Finally, Section 7.5 compares our redundancy based multi-bit error protection framework with several
recent works in literature.
7.1 Characterization of Multi-bit Errors in Conventional Caches
In this section, we provide a characterization of the multi-bit error rate in conventional caches.
In particular, we are interested in characterizing how the multi-bit error rate changes with cache
size, associativity and cache line size. We assume that error detection codes (EDC) like CRC or
Hamming distance are maintained which require much less area overhead than error detection and
correction codes like ECC. Multi-bit errors in the dirty bit lines of the L2 caches can be detected
Figure 7.2 Vulnerability of Different Cache Organizations for SPECFP2000.
using these EDC codes. However, unlike clean cache lines, the multi-bit errors in the dirty cache
lines cannot be corrected, as no duplicate of the correct data is maintained. We therefore define the
vulnerability of the L2 cache as the percentage of dirty cache lines within a given time interval. Next
in Section 7.1.1, we model the vulnerability of the L2 caches due to multi-bit errors using a
probabilistic formulation. In Section 7.1.2, we characterize the probabilistic model through extensive
simulations for multi-bit errors in various L2 cache organizations.
7.1.1 Probabilistic Characterization of Multi-bit Error Rate
As discussed previously, the vulnerability of the L2 cache is given by the expected number of
dirty cache lines in a time interval. The expected number of dirty cache lines (represented as E(D))
in a time interval of T, is the joint probability that a block with address X will be written and will
not be replaced. This can be represented mathematically as:
\[ E(D) = N \int_{0}^{T} p\big( (Blk = B) \cap (Wrt) \cap (Evict)^{c} \big)\, dt \tag{7.1} \]
where N is the number of blocks in the cache. Let p(blk = B) represent the probability that a
particular block B is accessed, p(Wrt) represent the probability that a write occurs at that block, and
p(Ev) represent the probability that the block is evicted during the time period T. Assuming that
the events are independent, we obtain from the above equation:
\[ E(D) = N \int_{0}^{T} p(blk = B)\, p(Wrt)\, \big( 1 - p(Ev) \big)\, dt \tag{7.2} \]
A block B is evicted from the cache if the same set address as that of block B is generated, a tag
match does not occur for any of the blocks in the set and the block B is selected for replacement by
the replacement scheme. Representing this mathematically and again assuming independence we
have:
\[ p(Ev) = p\big( SetAd = set(B) \big)\, \big( 1 - p(M) \big)\, p(blkEv = B) \tag{7.3} \]
In the above equation, p(SetAd = set(B)) is the probability that a set address is generated that has
the same set as block B, p(M) gives the probability that a tag match succeeds and p(blkEv = B) is
the probability that the block B in that set is selected for replacement by the replacement algorithm.
Based on a LRU replacement policy, for example, p(blkEv = B) gives the probability that the oldest
block in the set is B.
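As a worked illustration of equations 7.2 and 7.3, the Python sketch below evaluates the model under the simplifying assumption that all probabilities are constant over the interval, so that the integral reduces to a product with T. The function names and the numeric values are hypothetical and are not measured quantities from this work.

```python
def eviction_probability(p_same_set, p_tag_match, p_block_selected):
    """p(Ev) as in equation 7.3, assuming independent events."""
    return p_same_set * (1.0 - p_tag_match) * p_block_selected


def expected_dirty_lines(n_blocks, interval, p_block, p_write, p_evict):
    """E(D) as in equation 7.2 for a constant integrand.

    With constant probabilities the integral over [0, T] is simply the
    integrand multiplied by the interval length T.
    """
    return n_blocks * interval * p_block * p_write * (1.0 - p_evict)


# Hypothetical values: 1% chance of hitting the same set, 90% tag match
# rate, and a 4-way set in which each block is equally likely to be the
# replacement victim.
p_ev = eviction_probability(0.01, 0.9, 0.25)
e_d = expected_dirty_lines(1024, 1.0, 0.001, 0.3, p_ev)
```

Under these assumptions a lower eviction probability leaves more written blocks resident, which increases the expected number of dirty lines and hence the vulnerability.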
Based on the above equations, we can thus characterize the change in cache vulnerability due to
changes in cache size. However, characterizing L2 cache vulnerability directly from the probabilis-
tic model, due to changes in associativity and cache line size is difficult. Therefore, we performed
extensive simulations on SPEC2000 benchmarks to characterize L2 cache vulnerability against
changes in cache line size and associativity. Based on this study, we estimated the probabilities
for our model.
7.1.2 Vulnerability of Conventional Cache Organizations
In this subsection, we describe the experiments conducted to study the vulnerability due to
multi-bit errors for various L2 cache organizations for estimating the probabilities of the model
described in the previous subsection. Figure 7.1 shows the results for the SPEC2000 integer bench-
marks. We varied cache sizes from 16KB to 64KB and 256KB and cache line sizes from 16 bytes
to 32 bytes and 64 bytes, assuming direct and set-associative mapping. The vulnerabilities of the
16KB, 64KB, and 256KB caches were obtained to be 28%, 37%, and 46%, on the average,
respectively. Also, as shown in Figure 7.1(D), changing associativity does not affect the
vulnerability much. The 2-way and 4-way caches show slightly lower vulnerability than the direct-mapped
cache. The results for the floating point benchmarks were similar to those of the integer
benchmarks as in Figure 7.2. The vulnerability is observed to be 39%, 43%, and 49% for the 16KB,
64KB, and 256KB caches, respectively. The floating-point benchmarks show higher vulnerability
in small cache configurations than the integer benchmarks. The above results are used to estimate
the probabilities of the model described in previous subsection.
7.2 Redundancy-based Error Protection
In this section, we present two new schemes that can exploit the inherent redundancy existing
in the memory hierarchy to reduce the vulnerability of the L2 cache. In Section 7.2.1, we present a
scheme to exploit the redundancy existing between the write through L1 cache and the L2 cache
to reduce the vulnerability of the L2 cache. In Section 7.2.2, we describe a scheme to exploit the
redundancy between the L2 cache and the main memory to reduce the vulnerability of the L2 cache.
7.2.1 Exploiting L1/L2 Redundancy
The redundancy inherent in the memory hierarchy of high performance processors can be
exploited to improve the reliability of the L2 cache against soft errors [92]. Most commercial proces-
sors support a write-through L1 cache and a write-back L2 cache. We assumed that the L1 cache
supports a no-write allocate policy and a merging write buffer exists between the L1 cache and the
L2 cache which prevents bandwidth and power bottlenecks for the write-through L1 cache [86]. As
the L1 cache is write-through, the write operations are performed on both the L1 and the L2 cache
thus maintaining redundant copies of the data. Also, there are many cache lines that reside in both
the L1 cache and the L2 cache since they are placed in both of them on L2 cache read misses. We
Figure 7.3 Illustrating Inclusion Property and Fine Grain Dirtiness
(Diagram labels: L1 Cache, L2 Cache, Main Memory, Inclusion bit, Dirty bits (2). Legend: L1 / Memory Redundancy, L1 Redundancy, Memory Redundancy, No Redundancy and dirty; Redundancy: Non-vulnerable, Vulnerable.)
define this implicit redundancy between the L1 and the L2 cache lines as the inclusion property of
the L2 cache.
Soft errors become effective when the data items with errors are replaced from the L2 cache and
written into the main memory. If the data items are referenced again from the main memory, the
errors will be effective and affect program output. This however can be avoided as redundant correct
data is present in the L1 cache. Thus when an L2 cache line is replaced, it has to be checked
for soft errors. All multi-bit errors can be detected using conventional error detecting codes and
corrected by fetching non-corrupt data from the L1 cache.
In order to support the above scheme, an inclusion bit is maintained with each L2 cache line.
On a read operation, with a L1 cache miss but a L2 cache hit, the inclusion bit is set to 1 for the
corresponding L2 cache block. Also, the L1 cache block that is being replaced due to the miss will
cause the corresponding L2 cache block to have no duplicates in the L1 cache. So the inclusion bit
of the L2 cache block corresponding to the replaced block from the L1 cache is reset to zero. On
a write operation, with a miss on both the L1 cache and the L2 cache, the inclusion bit is reset to
zero for the L2 cache block (no write-alloate policy for L1). The L1 cache line is also invalidated
corresponding to the replaced L2 cache block. On a read operation, with a miss on both the L1 and
L2 cache, the inclusion bit is set to 1 for the new cache line.
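The inclusion-bit rules above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the class name `InclusionTracker`, its method names, and the use of block addresses as dictionary keys are hypothetical and do not come from the simulator used in this work. It tracks nothing but the inclusion bit and the set of blocks resident in L1.

```python
class InclusionTracker:
    """Hypothetical model: tracks only the inclusion bit of each L2 block."""

    def __init__(self):
        self.inclusion = {}     # L2 block address -> inclusion bit (0 or 1)
        self.l1_blocks = set()  # block addresses currently held in L1

    def _evict_l2(self, l2_victim):
        # A replaced L2 block disappears; its L1 copy (if any) is invalidated.
        if l2_victim is not None:
            self.inclusion.pop(l2_victim, None)
            self.l1_blocks.discard(l2_victim)

    def _fill_l1(self, addr, l1_victim):
        # The block pushed out of L1 no longer has a duplicate in L1,
        # so the inclusion bit of its L2 copy is reset to 0.
        if l1_victim is not None:
            self.l1_blocks.discard(l1_victim)
            if l1_victim in self.inclusion:
                self.inclusion[l1_victim] = 0
        self.l1_blocks.add(addr)
        self.inclusion[addr] = 1

    def read_l1_miss_l2_hit(self, addr, l1_victim=None):
        self._fill_l1(addr, l1_victim)

    def read_miss_both(self, addr, l2_victim=None, l1_victim=None):
        self._evict_l2(l2_victim)
        self._fill_l1(addr, l1_victim)

    def write_miss_both(self, addr, l2_victim=None):
        # No write-allocate in L1: the new L2 block has no L1 duplicate.
        self._evict_l2(l2_victim)
        self.inclusion[addr] = 0
```

A write that hits in the cache is not modeled here because the text does not state that it changes the inclusion bit.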
7.2.2 Fine Grain Dirtiness
The redundancy between the L2 cache and the main memory takes the form of clean L2 cache lines.
Errors in clean L2 cache lines can be corrected by re-fetching them from the main memory, whereas
errors in dirty cache lines are not correctable. Treating a dirty line as uncorrectable, however,
assumes that all the data in the cache line have been modified. In the standard cache architecture,
even when only one word is modified, the dirty bit for the entire cache line containing that word
is set to one. Thus, we lose the information that the other words in the cache line are clean. This
problem can be alleviated by adding more dirty bits for each cache line. We define this as
supporting fine-grain dirtiness in the L2 cache. Fine-grain dirtiness can be supported, for example,
by allocating one dirty bit for each memory word. Only the dirty bit corresponding to the modified
memory word is set to one and the other dirty bits are not affected. When an error is detected in a
clean L2 cache word during a cache read or a cache line replacement, the error can be corrected by
re-fetching the word from the main memory. Thus, we can correct multi-bit soft errors in the L2
cache and improve the recoverability of the L2 cache. The area overhead of fine-grain dirtiness is
small: one dirty bit for each memory word, which is the same overhead as that of a parity check
code.
Supporting a dirty bit for each memory word does not increase the complexity of the cache
hierarchy. On a read miss in the L2 cache, all dirty bits of the filled line are reset to zero. On
an L2 cache write, the dirty bit corresponding to the modified memory word is set to one. From
CACTI simulation [87], the latency and power overheads due to the additional dirty bits are much
lower than 1% for a 256KB L2 cache with 32B cache lines.
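A minimal Python sketch of per-word dirty bits follows. It assumes four words per line, matching the configuration evaluated later in this chapter; the class `L2Line` and its methods are hypothetical names chosen for illustration, and error detection itself is assumed to be done elsewhere.

```python
WORDS_PER_LINE = 4  # assumption: four words per L2 line, as in the experiments


class L2Line:
    """Hypothetical L2 cache line with one dirty bit per memory word."""

    def __init__(self, words):
        self.words = list(words)
        # On a fill (read miss in L2) all dirty bits are reset to zero.
        self.dirty = [False] * len(self.words)

    def write_word(self, index, value):
        # Only the dirty bit of the modified word is set.
        self.words[index] = value
        self.dirty[index] = True

    def recover_word(self, index, memory_words):
        """Try to repair a word in which an error was detected.

        Returns True if the word was clean and could be re-fetched from
        main memory, False if it was dirty (no clean copy in memory).
        """
        if self.dirty[index]:
            return False
        self.words[index] = memory_words[index]
        return True
```

With a single dirty bit per line, the write to word 1 in the usage below would have made all four words unrecoverable; with per-word bits only word 1 is.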
Figure 7.3 illustrates our memory hierarchy, which utilizes the inclusion property and supports
fine-grain dirtiness. Without loss of generality, the L1 and the L2 cache line sizes have been
assumed to be the same. Often, the larger L2 cache line size is assumed to be a multiple of the L1
cache line size, as in [5, 30, 31]. In this case, to support the inclusion property, we consider the
L2 cache line to be divided into blocks of size equal to the L1 cache line size and provide one
inclusion bit for each of these blocks. As illustrated in the figure, a multi-bit error in the right
half of the L2 cache line with inclusion bit 0 and dirty bits 10 can be corrected by re-fetching the
matching data from the main memory since the right half has not been modified. A multi-bit error in
the L2 cache line with inclusion bit 1 and dirty bits 00 will cause no writeback when the line is
replaced, so the error does not reach the main memory. All L2 cache lines with their inclusion bits
set to 1 can be recovered from soft errors by re-fetching the corresponding L1 cache lines.
[Figure 7.4 Illustrating Reliability-centric Replacement and Small Value Duplication. Labels: LRU
bits, Dirty bits, NMW bits, Threshold, Value bits, Hybrid Replacement Policy, Victim Cache Line,
Cleaned Cache Line, Duplication of small values. Legend: NMW : No more write bit.]
Since the L1 cache lines are a small percentage of the L2 cache lines, the vulnerability of the L2
cache does not reduce significantly using this scheme. Also, correcting a clean cache word by
accessing the corresponding memory word can create a performance bottleneck. Therefore, we suggest
more aggressive techniques in the next sections which, combined with the techniques already
proposed, significantly reduce the vulnerability of the L2 cache.
7.3 Improving Reliability by Controlling Redundancy
In this section, we propose two new schemes to mine/control the additional redundancy in the
memory hierarchy. In Section 7.3.1, we propose a cache line replacement policy biased towards
reliability. The dirty cache blocks which have no duplicates in the memory hierarchy are selected
for replacement on a cache miss, thus implicitly increasing redundancy and improving reliability. In
Section 7.3.2, we exploit small data values in cache lines to increase redundancy at the word level
and hence further improve the reliability of the L2 cache.
7.3.1 Reliability-centric Replacement Policy
Conventional cache line replacement policies aim at improving memory performance by
reducing miss rates. They are generally based on the access history of cache lines, such as the
recency and frequency of cache line accesses. For example, LRU (Least Recently Used), MFU (Most
Frequently Used), and FIFO (First In First Out) use recency or frequency information. The cache
line replacement policy can be adapted to improve the reliability of the L2 cache. In addition to
recency and frequency information, we can also include the dirtiness of the cache blocks in the
process of selecting a victim cache line. If a dirty cache line is chosen as a victim, the number
of dirty cache lines in the L2 cache per cycle will reduce and, thus, the vulnerability of the L2
cache will reduce. As blind replacement of dirty cache lines may affect performance adversely, a
hybrid replacement policy has been developed by combining the conventional LRU policy with the
dirtiness-based replacement policy. When there is no dirty cache line in the accessed set of the L2
cache, the LRU cache line is replaced. When the LRU cache line is clean and the next LRU cache line
is dirty, the next LRU line is selected as the victim. Only the LRU replacement policy is
considered when the number of dirty blocks in the L2 cache, as a fraction of all blocks, is below a
vulnerability threshold. The estimated number of dirty cache lines, $E[D]$, derived from the
probabilistic model discussed in Section 4.1, is used to determine the vulnerability threshold,
$V_T$, as follows:
$$ V_T = K_V \, \frac{E[D]}{N} \tag{7.4} $$
where $K_V$ is a user-defined constant and $N$ is the total number of blocks in the cache. Thus,
the vulnerability threshold depends on the target application workload, which in our case is the
SPEC2000 benchmarks, while a user-defined soft-error budget can be specified by controlling $K_V$.
Using the probabilistic model, the average number of vulnerable blocks can be estimated from the
cache design parameters and can therefore be used to set the vulnerability threshold appropriately.
The probabilistic formulation decouples vulnerability, which is a characteristic of the application
and the cache architecture, from the soft error rate, which is a characteristic of the environment
in which the system is operating. Performance can also be traded for higher reliability of the L2
cache by controlling $K_V$.
Algorithm 5 The Algorithm for L2-cache access for Multi-bit Soft Error Protection
if CACHE HIT then
    if cmd == WRITE then
        if value generated is small then
            set the corresponding small value bit /* Small Value Detection */
        end if
        if matched block in set-address(addr).dirty-bit == TRUE then
            set-address(addr).written-bit (NMW bit) = TRUE
        end if
        set-address(addr).dirty-bit = TRUE
    end if
else
    if (No. of dirty blocks / Total No. of blocks) < VT then
        /* Use LRU replacement */
    else
        Select a block for replacement such that
            set-address(addr).dirty-bit == TRUE and
            set-address(addr).written-bit (NMW bit) == FALSE and
            set-address(addr).inclusion-bit == FALSE
        If other blocks in this set are found with this property, write these to lower level
        as well /* Clustered Cleaning */
    end if
end if
/* Maintain inclusion property */
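The miss path of Algorithm 5 can be sketched in Python as shown below. This is an illustrative reading of the algorithm, not the simulator code: the `Block` record, the `lru_age` field, and the function names are hypothetical. The fallback to plain LRU when no block in the set qualifies, and the choice of the least recently used qualifying block as the victim, follow the description in the text.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    lru_age: int             # larger value means less recently used
    dirty: bool = False
    nmw: bool = False        # NMW (written) bit
    inclusion: bool = False  # True if a duplicate exists in L1


def is_cleanable(block):
    # Dirty, not predicted to be written again soon, and no L1 duplicate.
    return block.dirty and not block.nmw and not block.inclusion


def select_victim(cache_set, dirty_fraction, vt):
    """Return (victim, extra_blocks_to_clean) for a miss in cache_set.

    dirty_fraction is (number of dirty blocks / total number of blocks)
    for the whole L2 cache; vt is the vulnerability threshold.
    """
    lru = max(cache_set, key=lambda b: b.lru_age)
    if dirty_fraction < vt:
        return lru, []                      # plain LRU replacement
    candidates = [b for b in cache_set if is_cleanable(b)]
    if not candidates:
        return lru, []                      # nothing qualifies: fall back to LRU
    victim = max(candidates, key=lambda b: b.lru_age)
    # Clustered cleaning: the remaining qualifying blocks are written back too.
    extra = [b for b in candidates if b is not victim]
    return victim, extra
```

The caller is assumed to write back the victim and every block in `extra`, clear their dirty bits, and then update the inclusion bits as described in Section 7.2.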
The hybrid replacement policy is supported by the addition of one bit per cache line called "No
More Write" (NMW). The NMW bit exploits the generational behavior of cache lines [88, 89]: cache
lines are brought in from the main memory on cache misses, used frequently for a short period of
time, and then not used (dead) until they are evicted by another cache miss. The NMW bit of a cache
line is maintained as follows. The NMW bit is reset to 0 when a line is brought into the L2 cache.
When the cache line is written more than once, its NMW bit is set to 1, indicating that it is
likely to be modified again soon. The NMW bits of L2 cache lines are reset to zero periodically,
resembling the popular CLK algorithm implemented to maintain LRU bits. Thus, the NMW bit acts as a
1-bit predictor of whether the cache line will be written soon. Vulnerable cache lines which are
dirty but have their NMW bit at 0 are in their dead write time and can be cleaned and made
non-vulnerable. The LRU bits along with the NMW bit are used for selecting the victim cache line so
that the replaced cache lines are close to (or already in) their dead time. Cache lines with their
NMW bit set are likely to be written very soon and thus would become vulnerable again if cleaned.
If the prediction is incorrect, i.e., a cache line has not yet reached its dead time but has an NMW
bit of 0 (and becomes a candidate for eviction), a later access to the evicted block will suffer a
cache miss, thus causing a performance penalty.
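The life cycle of the NMW bit can be summarized with a small Python state sketch. The class and method names are hypothetical; the period of the reset and the mechanism that triggers it are not modeled.

```python
class NMWLine:
    """Hypothetical per-line state for the 1-bit NMW write predictor."""

    def __init__(self):
        # State right after the line is brought into the L2 cache.
        self.dirty = False
        self.nmw = False

    def write(self):
        # A second (or later) write to an already dirty line sets NMW,
        # predicting that more writes are likely to follow soon.
        if self.dirty:
            self.nmw = True
        self.dirty = True

    def periodic_reset(self):
        # NMW bits are cleared periodically so that stale predictions expire.
        self.nmw = False

    def in_dead_write_time(self):
        # Dirty but not predicted to be written again: safe to clean.
        return self.dirty and not self.nmw
```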
The hybrid replacement policy can be extended to further improve the reliability of the L2 cache
by cleaning other dirty cache lines on a replacement. When there are dirty cache lines in the same
set as the replaced cache line and they are not expected to be modified for a long time, they can
be cleaned together with the victim cache block. This does not increase the cache miss rate but
makes the L2 cache more immune to errors by reducing the average number of dirty cache lines per
cycle. When an L2 cache line is replaced, the other lines in the same set are also checked for
their NMW bits. The dirty cache lines with their NMW bits at 0 are written back together to the
main memory since these lines are not likely to be modified soon. If this clustered cleaning of
dirty cache lines is accurate, i.e., the lines are not modified for a fairly long time and are then
replaced, the vulnerability of the L2 cache is reduced with no performance penalty.
7.3.2 Exploiting Small Data Value Size
It is commonly known that a large percentage of memory values are small [90, 91, 93]. Small
memory values use at most half of the memory word bits. These small memory values can be
exploited to increase redundancy and improve the reliability of the L2 cache. A small memory
value can be duplicated in the upper half of the memory word bits, which increases the degree of
redundancy in the L2 cache. If the value of the memory word is small, a detected multi-bit error
in the lower half bits can be corrected by using the duplicate data found in the upper half bits.
To implement the duplication of small memory values, each memory word requires a "small value
bit" indicating that the value stored in the word is small and, thus, duplicated in the upper half
bits of the memory word. The area overhead due to the duplication is the same as that of a parity
bit: one bit for each memory word.
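The encoding can be illustrated with a short Python sketch. It assumes unsigned 32-bit memory words and that error detection reports whether the lower half is corrupted; the word width and the function names `encode` and `decode` are assumptions made for illustration only.

```python
WORD_BITS = 32                 # assumed word width
HALF = WORD_BITS // 2
LOW_MASK = (1 << HALF) - 1


def encode(value):
    """Return (stored_word, small_value_bit) for an unsigned word value."""
    if value >> HALF == 0:     # zero detector on the upper half
        # Small value: copy the lower half into the otherwise unused upper half.
        return (value << HALF) | value, 1
    return value, 0


def decode(stored, small_value_bit, lower_half_corrupt=False):
    """Recover the original value from a stored word."""
    if not small_value_bit:
        return stored
    if lower_half_corrupt:
        return stored >> HALF  # use the duplicate held in the upper half
    return stored & LOW_MASK   # un-duplicate: the upper half reads as zero
```

For example, the value 0x1234 is stored as 0x12341234 with its small value bit set, and a detected error confined to the lower 16 bits is repaired from the upper 16 bits.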
[Figure 7.5 Hardware Architecture for Small Value Detection and Duplication. Labels: Data In, Data
Out, Read/Write Buffer, Small Value Detector, Error Protection Circuit, MUX1, MUX2, MUX3, AND2,
high (H) and low (L) half words, constant '0'. Annotation: "If small value and original data in
lower half is corrupted".]
The tasks of detecting, duplicating, and un-duplicating small memory values in the L2 cache
require hardware overhead. Detecting small memory values can be performed by adding zero
detectors that check the upper half bits of the memory word. Duplicating memory values can be done
with multiplexers that select between the lower half bits and the upper half bits of the memory
value to drive the upper half bits of the result. Similarly, un-duplicating small memory values can
be performed with multiplexers that select between zeros and the upper half bits of the memory
value to drive the upper half bits of the result. When the small value (duplicate) bit read from
the L2 cache is 1, zero is selected as the output of the multiplexer. A typical hardware
architecture for this scheme is shown in Figure 7.5. The tasks of zero detection, duplication and
un-duplication are performed between the L2 cache and the main memory to augment L2 cache line
fills and replacements, and between the L1 data cache and the L2 cache to support write-through
requests from the write buffer. An outline of the cache line access/replacement algorithm to
control or mine redundancy presented in this section is provided in Algorithm 5. It should be noted
that status bits such as the additional dirty bits, small value bits and NMW bits are themselves
vulnerable to soft error strikes. However, we have assumed that such status bits are specially
designed using radiation hardened latches [46]. Radiation hardened latches or SRAMs add a slight
area overhead compared to regular storage structures. However, as shown in our experimental results
section, the overall area overhead of using such schemes in large caches is minimal.
7.4 Experimental Setup and Results
Table 7.1 Description of the Schemes Used in Experiments
Scheme      Description
Baseline    Conventional L2 cache
I           Exploit inclusion property
IM          Add multiple dirty bits
            Exploit inclusion property
D           Replace a dirty cache line with NMW bit 0
DC          Replace a dirty cache line with NMW bit 0
            Clean dirty cache lines with NMW bit 0 in the same set
IDC-T1      Exploit inclusion property
            Replace a dirty cache line with NMW bit 0
            Clean dirty cache lines with NMW bit 0 in the same set
            Enabled when the L2 cache vulnerability is higher than VT, with VT = 0.25
IDC-T2      Exploit inclusion property
            Replace a dirty cache line with NMW bit 0
            Clean dirty cache lines with NMW bit 0 in the same set
            Enabled when the L2 cache vulnerability is higher than VT, with VT = 0.1
IMDC        Exploit inclusion property
            Add multiple dirty bits
            Replace a dirty cache line with NMW bit 0
            Clean dirty cache lines with NMW bit 0 in the same set
IMSDC       Exploit inclusion property
            Add multiple dirty bits
            Duplicate small memory values
            Replace a dirty cache line with NMW bit 0
            Clean dirty cache lines with NMW bit 0 in the same set
            Enabled when the L2 cache vulnerability is higher than VT, with VT = 0.25
In this section, we describe our experimental setup and the results of the various schemes
proposed in this work for improving the reliability of the L2 cache. Table 7.1 summarizes the
schemes that we have experimented with in our simulations.
In the table, the scheme termed 'Baseline' indicates our baseline processor configuration without
any of the proposed schemes. The scheme termed 'I' exploits the inclusion property. The scheme
termed 'IM' employs multiple dirty bits for each cache line along with exploiting the inclusion
property. The scheme termed 'D' (Dirty Line First) implements our reliability-centric replacement
policy. The scheme termed 'DC' adds clustered cleaning, which cleans dirty cache lines that are not
likely to be modified soon, to the reliability-centric replacement policy. The scheme termed
'IDC-T1' exploits the inclusion property along with the reliability-centric replacement policy and
clustered cleaning with a vulnerability threshold of 25%. The scheme 'IDC-T2' is the same as scheme
'IDC-T1' but uses a threshold of 10%. The scheme termed 'IMDC' employs multiple dirty bits for each
cache line along with exploiting the inclusion property and supporting the reliability-centric
replacement policy with clustered cleaning. The scheme termed 'IMSDC' applies all proposed schemes
together: multiple dirty bits, detection and duplication of small values, and the inclusion
property, along with the reliability-centric replacement policy and clustered cleaning with a
vulnerability threshold of 25%.
7.4.1 Experimental Setup
We modified the SimpleScalar version 3 tool suite [94] for this study. Since we target
high-performance embedded and/or desktop processors, our baseline processor models an out-of-order
four-issue superscalar processor with a split transaction memory bus. Table 7.2 summarizes the
simulation parameters of this processor. Since SimpleScalar models a write back L1 cache, we
modified SimpleScalar to support a write-through L1 cache. We also implemented a fully associative,
8-entry merging write buffer between the L1 and L2 caches, with each entry of the buffer containing
four words. The inclusion property is maintained between the L1 and L2 caches. When an L2 cache
line with its inclusion bit set to one is replaced, the corresponding cache line in the L1 cache is
invalidated to maintain the inclusion property. The replacement policy for the L2 cache can be
easily extended to implement reliability-centric replacement; we only add an NMW bit for each L2
cache line and modify the finite state machine of the replacement function to take into account
the dirtiness, the NMW bit, and the inclusion bit of the cache line. If the number of dirty cache
lines is larger than the dirtiness threshold, the reliability-centric replacement policy is
enabled, while the conventional LRU policy is used otherwise. Small values are detected dynamically
and maintained using a small value bit. Multiple dirty bits for each cache line are maintained to
implement fine-grain dirtiness. Our simulations have been performed with a subset of the SPEC2000
benchmarks [95]. These were compiled with the DEC C V5.9-008, Compaq C++ V6.2-024, and Compaq
FORTRAN V5.3-915 compilers using a high optimization level. Eight programs from each of the
floating-point and integer benchmark suites were randomly chosen for our evaluation. All the
benchmarks were fast-forwarded for one billion instructions to avoid initial start-up effects and
then simulated for another one billion committed instructions. We also simulated another one
billion instructions after fast-forwarding 10 billion instructions. For all simulations, the
reference input sets were used.
Table 7.2 Baseline Processor Configuration
Parameter                        Configuration
Issue window                     64-entry RUU, 32-entry LSQ
Decode, issue and commit rate    4 instructions per cycle
Functional units                 4 INT add, 1 INT mult/div, 1 FP add, 1 FP mult/div
L1 instruction cache             16KB, 4-way, 32B line, 1-cycle
L1 data cache                    16KB, 4-way, 32B line, 1-cycle
L2 cache                         unified 256KB, 4-way, 32B line, 10-cycle
Main memory                      8B-wide, 100-cycle
Branch prediction                2-level, 2K BTB, 32-entry RAS
Instruction TLB                  64-entry, 4-way
Data TLB                         128-entry, 4-way
7.4.2 Simulation Results
We measure the vulnerability of the L2 cache by computing the average number of dirty blocks
per cycle without any duplicates in the memory hierarchy. Figures 7.6 and 7.7 present the
vulnerability of the L2 cache for the various schemes listed in Table 7.1, including the baseline
cache. Vulnerability of the L2 cache for the baseline configuration is 64.6% and 61.4%, on the
average, for the floating-point and integer benchmarks, respectively. The mesa, gcc and gzip
benchmarks show higher than 90% vulnerability. Scheme 'I' reduces vulnerability to 61.4% and 58%,
on the average, for the floating-point and integer benchmarks, respectively. These percentages are
53.9%, 43.4%, 41%, and 39.5%, on the average, for schemes 'D', 'DC', 'IDC-T1', and 'IDC-T2',
respectively, for the floating-point benchmarks, and 51.3%, 43.1%, 40.6%, and 38.3% for the integer
benchmarks. The maximum benefit from scheme 'I' is limited to 6.25% since at most 16KB of dirty
data can be redundant between the 16KB L1 data cache and the 256KB L2 cache in our baseline
processor configuration. Scheme 'D' does not show good results when the baseline vulnerability is
high. For example, in mesa, applu, gcc, and gzip, scheme 'D' shows small reductions in
vulnerability. This is because, in these benchmarks, most of the cache lines are dirty and, thus,
there is little difference between our reliability based replacement and the LRU policy. In
contrast, scheme 'D' works well with ammp; vulnerability of the L2 cache reduces to 26.2% from
82.9% in the baseline. Since the L2 cache miss rate is very high (28.8%) and, thus, IPC is very low
(0.1) in ammp, cache lines remain dirty while pipelines are stalled for a long time due to the L2
cache misses, increasing vulnerability per cycle. Scheme 'D' makes those dirty cache lines
non-vulnerable by evicting them from the L2 cache, reducing vulnerability per cycle. Scheme 'DC'
consistently reduces vulnerability by 10.5% and 8.3%, on the average, over and above scheme 'D'.
Scheme 'DC' works very well for mesa and parser, in which scheme 'D' was not effective in reducing
the vulnerability. Scheme 'IDC-T1' reduces vulnerability by an additional 2.4% and 2.5% for the
floating-point and integer benchmarks, respectively. A vulnerability threshold of 10% further
reduces the vulnerability of the L2 cache, by 1% and 2.3% for the floating-point and integer
benchmarks, respectively. The ammp, bzip2, and crafty benchmarks benefit most from the 10%
threshold.
Figure 7.6 Vulnerability of the L2 Cache for Various Schemes Proposed for SPECINT2000.
Figure 7.7 Vulnerability of the L2 Cache for Various Schemes Proposed for SPECFP2000.
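The 6.25% upper bound quoted for scheme 'I' follows directly from the ratio of the two cache capacities in the baseline configuration of Table 7.2. As a worked check, assuming every L1 data cache line is duplicated in the L2 cache:

```latex
\[
\frac{\text{L1 data cache capacity}}{\text{L2 cache capacity}}
  = \frac{16\,\text{KB}}{256\,\text{KB}}
  = \frac{1}{16}
  = 6.25\%
\]
```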
The fine-grain dirtiness based method was implemented with four dirty bits per cache line
for a cache line size of four words. Scheme 'IM' reduces the vulnerability of the L2 cache to 43%
and 39.6%, on the average, for the floating-point and integer benchmarks, respectively. Reductions
in vulnerability are 18% and 21%, respectively, compared to 'Baseline'. Scheme 'IM' is comparable
to scheme 'IDC-T1' in reducing vulnerability, as can be observed in Figures 7.6 and 7.7. In most
floating-point benchmarks, scheme 'IM' shows better results than scheme 'IDC-T1'. Only mesa and
galgel show worse results with scheme 'IM'. Half of the integer benchmarks show better results and
the other half show worse results with scheme 'IM'. Scheme 'IMDC' further reduces the
vulnerability to 33.5% and 32.4%, on the average, for the floating-point and integer benchmarks,
respectively. The applu, mgrid, gzip, and parser benchmarks show large reductions in vulnerability
with scheme 'IMDC' compared to scheme 'IM'. As shown in the figures, we have also experimented
with our proposed scheme to exploit small memory values. Our combined optimization scheme 'IMSDC'
reduces L2 cache vulnerability by 40%, on the average, for the integer benchmarks. Floating-point
benchmarks show a smaller decrease in vulnerability, of about 32%, primarily because
floating-point values include sign, exponent and mantissa fields and hence cannot be detected by
the small value detector. As discussed later, with the significant reduction of the vulnerability
by exploiting/mining redundancy and with the addition of a small direct-mapped ECC cache for the
error correction of the vulnerable blocks, an average multi-bit error coverage of about 96% can be
achieved with our approach.
Figure 7.8 Global Miss Rates of the L2 Cache for Various Schemes Proposed for SPECINT2000.
Figure 7.9 Global Miss Rates of the L2 Cache for Various Schemes Proposed for SPECFP2000.
As discussed previously, the NMW bit provides a 1-bit prediction of whether the cache line will
be written soon. We also experimented with 2-bit predictors but did not notice any significant
change from the 1-bit predictor case; we do not show these results here for brevity. We also
measured the change in L2 cache miss rate, since our proposed schemes use either the conventional
LRU policy or the proposed reliability-centric policy depending on the vulnerability of the L2
cache at a particular time and the chosen vulnerability threshold. Figures 7.8 and 7.9 present the
L2 cache miss rates for the various schemes proposed in this work. We use the global cache miss
rate in the figures. Cache miss rates increase by 0.4%, 0.1%, 0.1%, and 0.4%, on the average, for
schemes 'D', 'DC', 'IDC-T1', and 'IDC-T2', respectively, for the floating-point benchmarks. These
percentages are 12.6%, 10.7%, 10.7%, and 10.7% for the integer benchmarks. The gcc benchmark shows
a decrease in miss rate, which demonstrates that the conventional LRU policy is not optimal for all
benchmarks. As shown in the figure, the miss rate reduces significantly when the replacement policy
is changed from the LRU replacement policy to a replacement policy favoring replacement of dirty
lines. We note that a replacement scheme based on the LRU policy relies on the approximation that
the least recently used block will not be used in the near future. As we also select for
replacement the dirty cache line in the set that is oldest in terms of LRU, our simulations show
that the replacement scheme using such a technique predicts cache lines in their dead time very
accurately and hence has a significantly lower miss rate compared to LRU.
Figure 7.10 IPCs for Various Schemes Proposed for SPECINT2000.
Figures 7.10 and 7.11 plot the IPC results for the various schemes proposed in this work. IPC
reductions are 0.2%, 0.2%, 0.3%, and 0.3%, on the average, for schemes 'D', 'DC', 'IDC-T1', and
'IDC-T2', respectively, for the floating-point benchmarks. These percentages are 0.1%, 0.1%, 0.1%,
and 0.1% for the integer benchmarks. IPC reduces slightly due to the additional write back traffic
in our schemes. The gcc benchmark shows an IPC increase of 25% for scheme 'D'. This benchmark
showed a high miss rate reduction in Figure 7.8, which translated directly into improved
performance. The other benchmarks show slight decreases or increases in IPC. Our proposed scheme,
especially 'IDC-T1', reduces vulnerability by 23.6%, on the average, for the floating-point
benchmarks with a 0.3% performance penalty. For the integer benchmarks, vulnerability reduces by
23.1%, on the average, with less than 0.1% performance loss.
Figure 7.11 IPCs for Various Schemes Proposed for SPECFP2000.
Since our replacement policies favor dirty cache lines, we also measured the write back traffic
rate from the L2 cache to the main memory, as shown in Figures 7.12 and 7.13. The write back
traffic rate is measured as the ratio of the number of writes from the L2 cache to all L2 cache
accesses. The write back traffic is increased by 1.1%, 191.7%, 2.5%, and 2.8%, on the average, for
schemes 'D', 'DC', 'IDC-T1', and 'IDC-T2', respectively, for the floating-point benchmarks. These
percentages are -10.8%, 163.3%, 0.9%, and 0.8% for the integer benchmarks. Scheme 'DC' increases
memory write traffic significantly since it performs clustered cleaning of dirty cache lines. In
contrast, scheme 'IDC-T1' shows little difference in write back traffic since it takes inclusion
bits into account. Since the cache lines that are redundant between the L1 and L2 caches are the
most active cache lines, they are likely to be modified frequently. Cleaning these redundant cache
lines in scheme 'DC' does not help reduce the number of vulnerable cache lines. Scheme 'IDC-T1'
does not clean redundant cache lines since they are highly likely to be written soon. Schemes 'D'
and 'IDC-T1' even decrease memory write back traffic for the integer benchmarks, mainly because of
gcc, where the cache miss rate decreases significantly for schemes 'D' and 'IDC-T1', which reduces
dirty write backs from the L2 cache. 'IDC-T2' shows a behavior similar to 'IDC-T1'.
Figure 7.12 Write Back Traffic Rate to the Main Memory for Various Schemes Proposed for
SPECINT2000.
Figure 7.13 Write Back Traffic Rate to the Main Memory for Various Schemes Proposed for
SPECFP2000.
As previously discussed, we assumed that a small ECC cache is maintained for error correction
of the vulnerable cache blocks, i.e., those dirty cache blocks that have no duplicates in the
memory hierarchy. The multi-bit error correction codes for only the vulnerable blocks are
maintained in this small ECC cache. A multi-bit soft error is always detected by the low cost error
detection codes. If an L2 cache block is non-vulnerable, it is corrected by exploiting the
redundancy existing in the memory hierarchy, while vulnerable blocks are corrected using the small
ECC cache. The significant reduction in vulnerability of the L2 cache by exploiting/mining the
redundancy in the memory hierarchy allows an ECC cache of significantly smaller size. We found that
a direct-mapped ECC cache of size 8KB was sufficient for up to 6-bit error protection using our
redundancy based approach for most SPEC2000 benchmarks on a 256KB L2 cache. Our simulations suggest
that an ECC cache of size 4% of that of the L2 cache, together with our redundancy based approach,
can provide a multi-bit error coverage of about 96% with significantly less area/power overhead
and with a marginal performance penalty.
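The division of labor between the redundancy mechanisms and the ECC cache can be sketched as a simple decision function. The order in which the sources are tried is an assumption made for illustration (the text does not fix a priority), and the function name and return strings are hypothetical.

```python
def correction_source(inclusion_bit, word_dirty, small_value_bit):
    """Pick where a detected multi-bit error in an L2 word is corrected from.

    Assumed priority: L1 duplicate, then clean copy in main memory, then
    the in-word duplicate of a small value, and finally the ECC cache.
    """
    if inclusion_bit:
        return "L1 cache"              # duplicate held in the write-through L1
    if not word_dirty:
        return "main memory"           # clean word: re-fetch from memory
    if small_value_bit:
        # Only helps when the error is confined to the lower half of the word.
        return "upper-half duplicate"
    return "ECC cache"                 # vulnerable: dirty with no duplicate
```

Only the last case needs an entry in the ECC cache, which is why reducing the number of vulnerable blocks allows the ECC cache to be small.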
We estimated the area overhead of our redundancy based scheme with a small ECC cache for
multi-bit error protection of the vulnerable blocks. The area overhead of multi-bit error
correction coding was estimated using the following formulation. As the sets of words obtained by
multi-bit errors from each valid codeword must be disjoint from each other for correction to a
distinct valid codeword, we have, for a p-bit error correction scheme for an m-bit word requiring
r check bits,
$$ (N + 1) \cdot 2^{m} \le 2^{m+r} \tag{7.5} $$
where N is the number of possible ways a p-bit error can happen on an m-bit word. Since this is the
same as the number of ways of choosing 1, 2, ..., p objects from m + r objects (the m data bits
plus the r check bits),
$$ N = C^{m+r}_{1} + C^{m+r}_{2} + \cdots + C^{m+r}_{p} \tag{7.6} $$
Solving these equations for r gives us an estimate of the area overhead for complete multi-bit
error protection. The area overhead of our redundancy based multi-bit error correction approach was
estimated by considering the area overhead of the small ECC cache and adding the number of status
bits (inclusion bit, small value bits etc.) required for implementing our redundancy based
approach. The area overhead of our redundancy based multi-bit error protection for a fixed number
of ECC cache blocks is shown in Figure 7.14. As shown in the figure, our redundancy based multi-bit
error protection for 6-bit errors on an L2 cache with a line size of 32B incurs only a marginal
area overhead of 6%.
Figure 7.14 Area Overhead for an L2 Cache with Redundancy Based Error Protection Compared to
a Baseline L2 Cache without Error Protection
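Equations (7.5) and (7.6) have no closed-form solution for r, but the smallest r that satisfies them is easy to find by search. The following Python sketch does this; the function name `min_check_bits` is hypothetical. Note that the inequality is the sphere-packing (Hamming) bound, so the result is a lower bound on the number of check bits, and a practical code may need more.

```python
from math import comb


def min_check_bits(m, p):
    """Smallest r such that (N + 1) * 2**m <= 2**(m + r),
    where N = C(m + r, 1) + C(m + r, 2) + ... + C(m + r, p).
    """
    r = 1
    while True:
        n_patterns = sum(comb(m + r, i) for i in range(1, p + 1))
        if (n_patterns + 1) * 2**m <= 2**(m + r):
            return r
        r += 1
```

For single-bit correction this reproduces the familiar Hamming code sizes, for example 3 check bits for a 4-bit word and 7 check bits for a 64-bit word.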
Figure 7.15 Average Dynamic Power Consumption for an L2 Cache with an 8KB ECC Cache Compared
with Baseline L2 Cache without Error Protection
We also estimated the power overhead of our redundancy based multi-bit error protection. We
ported the Simplescalar based framework implementing our approach to the Wattch 2.0 [96]
framework. Wattch performs architectural level power analysis for the cache by maintaining
counters for the number of read/write accesses to the cache and multiplying them by the average
power required for a single read/write cache access in a particular process technology. We also
estimated the leakage power at the architectural level, which is a significant portion of total
power in current technology nodes. We used CACTI 4.2 [87], a detailed cache access and power
analysis tool, for this analysis. The cache size estimates made previously were provided to CACTI
to obtain estimates of leakage power. With a 70nm process technology model, the dynamic power
overhead of our redundancy based multi-bit error correction scheme for different SPEC2000
benchmarks is plotted in Figure 7.15. As shown in the figure, our scheme increases the dynamic
power by only about 13.75%. As shown in Figure 7.16, only a marginal overhead is incurred in
leakage power for our area efficient multi-bit error correction scheme for different sized
multi-bit errors. This makes the total power overhead of our approach much smaller than that of
most works found in the literature for moderately sized multi-bit errors.
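The access-counting style of power estimation described above can be sketched as follows. This is
only a minimal illustration and not Wattch's actual code: the function names, the use of a
per-access energy in nanojoules, and all numeric inputs are hypothetical placeholders.

```python
def dynamic_power_watts(reads, writes, e_read_nj, e_write_nj, cycles, clock_hz):
    """Average dynamic power from access counters and per-access energies.

    All parameters are illustrative assumptions: the counters would come from
    the simulator, the per-access energies from a technology-specific model.
    """
    # Total energy = (number of accesses) x (energy per access), in joules.
    energy_j = (reads * e_read_nj + writes * e_write_nj) * 1e-9
    # Simulated run time in seconds.
    runtime_s = cycles / clock_hz
    return energy_j / runtime_s


def overhead_percent(protected, baseline):
    """Relative increase of the protected cache over the unprotected baseline."""
    return 100.0 * (protected - baseline) / baseline
```

Calling `overhead_percent` with the protected and baseline power figures gives the kind of
percentage reported in this section.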
7.5 Comparison with Related Works
We note that many competing solutions have been proposed in the literature for protecting
caches against multi-bit errors with low area/performance overhead. For comparison of our work
with recent works, we have used data reported in the results of the corresponding research papers
and interpolated the results according to our simulation setup. We have assumed an average IPC of
2.5 and that the instruction mix contains 30% memory reference instructions and 10% stores as is
typical of most SPEC benchmarks [95].
Figure 7.16 Average Leakage Power Consumption for an L2 Cache with Small ECC Cache with
Fixed Number of Blocks for Different Sized Multi-bit Errors
"In-Cache Replication (ICR)" has been proposed in [19] to exploit "dead" blocks in the data
cache to store the replicas of the "hot" blocks. These duplicate blocks can be used to correct
multi-bit errors in the active blocks. Although an area overhead of less than 1% has been reported
with a modest performance penalty of 3.6%, the parity based multi-bit error protection scheme
provides an error coverage of only 65%. Our redundancy based approach has an error coverage of 96%
with a performance penalty of less than 1%. "Shadow Caching" [18] maintains copies of MFU (Most
Frequently Used) cache lines in separate shadow caches. In the context of error correction, at
least two shadow caches should be maintained to support correction using majority voting. Although
significantly better than blind NMR (N-Modular Redundancy), the approach incurs significant area
overhead. For example, a 4-way associative shadow cache of 128 entries has an area overhead of
about 28%. Also, as multiple copies of data are read for error detection, the cache access latency
is increased, resulting in a performance overhead of about 40% with a modest error coverage of
about 85%. These overheads scale exponentially as higher order multiple-bit errors are considered.
In comparison, our redundancy based approach can achieve 96% error coverage with about 6% area
overhead and very little performance penalty.
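The majority voting mentioned above for the shadow cache scheme can be illustrated with a bitwise
vote over three copies of a word (the original line plus two shadow copies). This is a generic
sketch of triple-copy voting, not the implementation from [18]; the function name and the example
values are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies of the same word.

    Each output bit takes the value held by at least two of the three copies,
    so any bit position corrupted in only one copy is restored.
    """
    return (a & b) | (a & c) | (b & c)


# Example: a multi-bit error (four flipped bits) in one copy is outvoted.
good = 0xDEADBEEF
corrupted = good ^ 0x00F0
restored = majority_vote(corrupted, good, good)
```

The vote fails only when the same bit position is corrupted in two of the three copies, which is
why additional copies are needed as the expected error multiplicity grows.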
The "R-cache" approach [21] maintains a small fully associative "replication cache" to detect
and correct multi-bit errors using copies of dirty data. The method achieves 100% multi-bit error
coverage. However, as multi-bit error detection is achieved by parallel access of the R-cache and
the data cache, a large latency overhead is incurred. For example, with a 2 cycle load latency for
reads as reported in the work, a performance penalty of about 7.31% can be incurred. As illustrated
in Section 4.4, the performance penalty for our redundancy based multi-bit error protection scheme
is less than 1% with high multi-bit error coverage. The "Last Store" prediction technique [23]
proposes the use of an accurate program-counter (PC) trace based predictor which immediately
initiates a writeback after observing a PC trace with a sequence of store instructions. However,
hardware structures like the "history table" and the "signature table" incur an area overhead of
about 8%. Assuming that updating these hardware structures takes at least 1 cycle of latency during
stores, a performance penalty of about 15% can be incurred, which is quite high compared to our
redundancy based approach. Our approach also achieves a higher error coverage than that reported in
their work. "Punctured Error Recovery Cache (PERC)" [86] decouples error detection and correction
by maintaining the "punctured" error correction codes in a separate cache. As error detection and
correction are separated, little performance overhead is incurred. However, as the number of
vulnerable blocks is not actively reduced, complete multi-bit error coverage requires about 19%
area overhead.
Table 7.3 Comparison with Recent Works in Literature
Scheme                       Area overhead   Performance penalty   Error coverage
ICR [19]                     < 1%            3.6%                  65%
Shadow Cache [18]            28%             40%                   85%
Last Store Prediction [23]   7.98%           15%                   88%
R-cache [21]                 10%             7.31%                 100%
PERC [86]                    19%             < 1%                  100%
This work                    6%              < 1%                  96%
We note that the techniques proposed in this work, although primarily targeted at single core
processors, can be extended and applied to multi-core processors. The bandwidth required by the
write-through L1 cache used in our approach can be significantly reduced by employing a merging
write-buffer between the L1 and the L2 caches. For example, when a fully associative merging write
buffer with 8 entries, each entry of the size of four words, is placed between the L1 and the L2
cache, a negligible increase of write-through bandwidth is observed. Also, the techniques to mine
redundancy using small values and reliability centric replacement can be applied to a cache
hierarchy with non-inclusive L1 caches. In multi-core systems, the cache coherence protocol between
the local L1 caches and a shared L2 cache, which also acts as a synchronization point, can lead to
increased exploitation of the inclusion property between the L1 and the L2 cache. A cache read of
data in the local L1 cache of a processor that has been modified by another processor in a
multi-core environment will lead to invalidation requests by the cache controller and re-fetching
of the modified data from the shared L2 cache. As the inclusion property is naturally enforced
between the L1 and the L2 caches, this leads to increased redundancy between the local L1 caches
and the shared L2 cache.
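The effect of the merging write buffer described above can be illustrated with a small behavioral
model. This is a sketch under stated assumptions rather than the simulated hardware: the class name,
the 4-byte word size, and the oldest-first drain policy are all hypothetical choices made for the
example; only the 8-entry, four-words-per-entry, fully associative organization comes from the text.

```python
from collections import OrderedDict

WORD_BYTES = 4  # assumed word size


class MergingWriteBuffer:
    """Fully associative merging write buffer between a write-through L1 and L2."""

    def __init__(self, entries=8, words_per_entry=4):
        self.entries = entries
        self.block_bytes = words_per_entry * WORD_BYTES
        self.buffer = OrderedDict()  # block address -> {word offset: value}
        self.l2_writes = 0           # number of entries written to the L2

    def store(self, addr, value):
        block = addr // self.block_bytes
        offset = (addr % self.block_bytes) // WORD_BYTES
        if block not in self.buffer:
            if len(self.buffer) == self.entries:
                # Buffer full: drain the oldest entry to the L2 (assumed policy).
                self.buffer.popitem(last=False)
                self.l2_writes += 1
            self.buffer[block] = {}
        # Stores to an already buffered block merge into the existing entry.
        self.buffer[block][offset] = value

    def flush(self):
        """Drain all remaining entries to the L2."""
        self.l2_writes += len(self.buffer)
        self.buffer.clear()
```

In this model, sixteen consecutive word stores covering four blocks reach the L2 as four writes
instead of sixteen, which is the bandwidth reduction the buffer is meant to provide.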
CHAPTER 8
CONCLUSIONS
Aggressive scaling trends have significantly impacted the susceptibility of nanometer designs to
transient faults. Transient faults occur due to several reasons, such as soft errors, power supply and
interconnect noise, and electromagnetic interference. Soft errors occur when the energetic neutrons
coming from space or the alpha particles arising out of packaging materials hit the transistors.
A soft error may manifest itself as a bit flip in a latch or memory element. Additionally, soft
errors can occur in any internal node of a combinational logic and subsequently propagate to and be
captured in a latch. In this dissertation, we have investigated the development of a unified design
flow framework for mitigation of soft errors. Several design and circuit optimization techniques
applicable at various levels of hardware design have been explored to improve the reliability of
nanoscale VLSI systems.
In chapter 3, we presented some preliminaries of soft errors in memory and logic circuits.
Different circuit nodes in a logic circuit have different soft error criticality depending on the
various masking effects. The masking effects depend on the circuit topology and the underlying
cells realizing the logic circuit. To capture this, we developed several metrics that estimate the
masking effects in logic circuits. We showed that it is possible to accurately capture the soft
error masking effects by using a new metric called the cumulative probability of observability
(CPO). The metrics developed in this chapter are used extensively in chapters 5-6 for selectively
optimizing circuit nodes and nets.
In chapter 4, we showed that the interconnects realizing the signal nets can act as RC ladders
and can effectively filter out glitches due to random radiation strikes. We leveraged the fact that
different circuit nodes can be quite different in their ability to mask soft errors based on the
circuit topology and the logic cells of the circuit. Based on this, we developed a simulated
annealing (SA) based placement algorithm that places standard cells so as to provide higher
wirelengths for soft error critical nets while simultaneously constraining the chip area and the
total wirelength. SA based placement schemes produce good reductions in soft error rate (SER) but
suffer from large runtimes. To address this, we proposed a fast quadratic programming based
standard cell placer which is orders of magnitude faster than the SA based placement scheme, with
some loss in solution quality in terms of SER reduction and its associated delay and area
overheads. Experimental results on ISCAS85 benchmark circuits using the FreePDK 45nm technology kit
and the OSU standard cells indicate that our radiation immune placement algorithms can reduce the
SER in logic circuits significantly with low delay and area overheads.
In chapter 5, we proposed a transistor level circuit which significantly reduces the propagation
of random glitches due to radiation strikes. The circuit is based on an RC differentiator
implemented in CMOS, which utilizes the exponential voltage spike generated during a radiation
strike to detect the occurrence of a single event transient (SET) and disconnects the driving cell
from the driven cell. The high voltage drop across the resistor (implemented using NMOS) is
provided as the gate-to-body voltage of two depletion mode NMOS and PMOS transistors arranged in
series. During the high positive (negative) voltage swing due to the SET, the "normally on"
depletion mode PMOS (NMOS) is cut off, disconnecting the driver from the driven cell. The circuit
incurs some overhead in terms of area and delay. To limit this overhead, we developed an algorithm
for selective insertion of these radiation blocker cells on critical circuit nodes. The algorithm
ranks circuit nodes using a new metric called the Probability of Radiation-blocker circuit
Insertion (PRI) and inserts the radiation blocker cells on the top few nodes in the sorted list of
PRI values. The PRI metric is computed as the product of the glitch observability of a node and the
slack available at that node. Thus, the algorithm inserts the radiation blocker cells selectively
on highly soft error vulnerable nodes on the non-critical paths of a circuit. We experimented with
the proposed framework using the FreePDK 45nm Process Design Kit and the Nangate standard cell
library based on the 45nm technology. Experimental results indicate that our methodology can reduce
the SER in logic circuits by as much as 51% with area overheads of about 18% and a delay overhead
of only 0.2%.
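The PRI based selection summarized above can be sketched in a few lines. The sketch only encodes
what is stated here (PRI as the product of glitch observability and slack, followed by picking the
top few nodes); the function names, the dictionary layout, and the example values are hypothetical
and do not reproduce the normalization or tie-breaking used in chapter 5.

```python
def rank_by_pri(nodes):
    """Rank nodes by PRI = glitch observability x available slack, highest first.

    `nodes` maps a node name to an (observability, slack) pair; both the layout
    and the units are assumptions made for this illustration.
    """
    pri = {name: obs * slack for name, (obs, slack) in nodes.items()}
    return sorted(pri, key=pri.get, reverse=True)


def select_insertion_nodes(nodes, top_k):
    """Return the top_k nodes on which radiation blocker cells would be inserted."""
    return rank_by_pri(nodes)[:top_k]


# Hypothetical example: n1 is highly observable but has no slack (critical path),
# so it ranks last and is not selected.
example = {"n1": (0.9, 0.0), "n2": (0.5, 2.0), "n3": (0.8, 1.0), "n4": (0.1, 3.0)}
chosen = select_insertion_nodes(example, top_k=2)
```

Using the product means a node needs both high observability and available slack to be selected,
which is how the delay overhead is kept off the critical paths.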
In chapter 6, we developed a reliability-centric gate sizing formulation that jointly optimizes
the circuit against both radiation induced soft errors and capacitive crosstalk noise under process
uncertainty. A first order model is developed for soft errors in logic gates by considering only
the effect of the size of a gate and the sizes of the gates in its transitive fan-in. Based on
this, we developed a fast and accurate method for optimizing SER during gate sizing. Crosstalk
noise is modeled by clustering the structural netlist based on Rent’s exponent values and by
equalizing the drive strengths of all cells in a cluster. Timing yield loss due to process
variations is optimized by maximizing the delay variance for each gate. These models are
incorporated along with delay and power metrics to develop a reliability-centric gate sizing
technique based on mathematical programming.
Finally, in chapter 7, we modeled the vulnerability of the L2 caches to multi-bit errors using
a probabilistic formulation characterized by extensive simulations for multi-bit errors in various
L2 cache organizations. Based on this study, we proposed a framework of solutions based on
redundancy for the correction of multi-bit soft errors. In our approach, simple error detection
codes like Hamming distance or Cyclic Redundancy Codes (CRC) are used to detect the multiple-bit
errors, and they are corrected using the redundancy existing in the memory hierarchy. We
demonstrated that multi-bit errors in the L2 cache can be corrected by exploiting the redundancy
existing between the write-through L1 cache and the L2 cache and the redundancy existing between
the clean data lines of the L2 cache and the main memory. We found that the bandwidth and power
requirement of the write-through L1 cache can be sufficiently reduced by the addition of a small
merging write buffer between the L1 and L2 cache. We investigated methods to increase the amount of
redundancy in the memory hierarchy by employing a redundancy-based replacement policy, with the
amount of redundancy controlled by a redundancy threshold which is estimated using our
probabilistic model. Finally, we investigated how redundancy can be mined at the word level by
duplicating small memory values in the upper half of the memory word. Multi-bit errors in the lower
half of the word are corrected using the duplicate copy in the upper half. The multi-bit errors
which cannot be corrected using the inherent redundancy are corrected by using a small ECC cache.
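The word-level duplication of small values can be illustrated as follows. This is a minimal sketch
assuming an unsigned 32-bit word split into two 16-bit halves and a per-word "small value" status
bit; the function names and the half-word width are assumptions made for the example, and error
detection itself (for instance by CRC) is taken as given.

```python
HALF_BITS = 16                      # assumed half-word width for a 32-bit word
HALF_MASK = (1 << HALF_BITS) - 1


def encode(value):
    """Return (stored_word, small_flag) for an unsigned 32-bit value.

    A value that fits in the lower half is duplicated into the upper half.
    """
    if value <= HALF_MASK:
        return (value << HALF_BITS) | value, True
    return value, False


def decode(stored, small_flag):
    """Recover the logical value from a stored word."""
    return stored & HALF_MASK if small_flag else stored


def recover_lower_half(stored, small_flag):
    """Repair the lower half after the detection code has flagged an error there."""
    if not small_flag:
        raise ValueError("no word-level duplicate; other redundancy must be used")
    upper = stored >> HALF_BITS
    return (upper << HALF_BITS) | upper
```

For example, the small value 0x1234 is stored as 0x12341234; if several bits of the lower half are
flipped, `recover_lower_half` rebuilds the word from the intact upper half.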
Thus, in this dissertation, we explored techniques at all levels in the design flow to reduce the
vulnerability of VLSI systems to soft errors without compromising on other design metrics like
delay, area and power. The design techniques, algorithms and architectures can be integrated into
existing design flows, and prototype chips can be implemented on a reliable VLSI system. The
implementation can leverage the architectural solutions for the caches, while the custom hardware
synthesized for the VLSI system can utilize the various circuit optimization algorithms that are
developed at various design abstraction levels.
REFERENCES
[1] DC Bossen, JM Tendler, and K. Reick. Power4 system design for high reliability. In IEEE
Micro, volume 22, pages 16–24, 2002.
[2] N. Quach. High availability and reliability in the itanium processor. In IEEE Micro, volume 20,
pages 61–69, 2000.
[3] R. Phelan. Addressing soft errors in arm core-based soc. In ARM White Paper, Dec, 2003.
[4] KC Yeager. The mips r10000 superscalar microprocessor. In IEEE Micro, volume 16, pages
28–41, 1996.
[5] S.S. Mitra, N.M.Z.Q.S. Kee, and S. Kim. Robust system design with built-in soft-error re-
silience. In IEEE Computer, volume 38, pages 43–52, 2005.
[6] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The mi-
croarchitecture of the pentium 4 processor. In Intel Technology Journal, volume 1, 2001.
[7] P. Hazucha and C. Svensson. Impact of cmos technology scaling on the atmospheric neutron
soft error rate. In Trans. on Nuclear Science, volume 47, pages 2586–2594, 2000.
[8] T. Karnik, B. Bloechel, K. Soumyanath, V. De, and S. Borkar. Scaling trends of cosmic ray
induced soft errors in static latches beyond 0.18 µ. In Digest of Symp. on VLSI Circuits, pages
61–62, 2001.
[9] N. Seifert, D. Moyer, N. Leland and R. Hokinson. Historical trend in alpha-particle induced
soft error rates of the alpha microprocessor. In Intl. Reliability Physics Symposium, pages
259–265, 2001.
[10] C. Constantinescu. Trends and challenges in vlsi circuit reliability. In IEEE Micro, volume 23,
pages 14–19, 2003.
[11] N. Jha and S. Kundu. Testing and Reliable Design of CMOS circuits. Kluwer Academic
Publishers, 1990.
[12] H. Asadi, V. Sridharan, MB Tahoori, and D. Kaeli. Balancing performance and reliability in
the memory hierarchy. In Proc. of Symp. on Performance Analysis of Systems and Software,
pages 269–279, 2005.
[13] A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S.S. Mukherjee, and R. Rangan. Computing
architectural vulnerability factors for address-based structures. In Proc. of ISCA, pages 532–
543, 2005.
[14] V. Narayanan and Y. Xie. Reliability concerns in embedded system designs. In IEEE Com-
puter, volume 39, pages 118–120, 2006.
[15] CW Slayman. Cache and memory error detection, correction, and reduction techniques for
terrestrial servers and workstations. In Trans. on Device and Materials Reliability, volume 5,
pages 397–404, 2005.
[16] SS Mukherjee, J. Emer, T. Fossum, and SK Reinhardt. Cache scrubbing in microprocessors:
myth or necessity? In Proc. of Intl. Symp. on Dependable Computing, pages 37–42, 2004.
[17] W. Zhang, M. Kandemir, A. Sivasubramaniam, and MJ Irwin. Performance, energy, and
reliability tradeoffs in replicating hot cache lines. In Proc. of the Intl. Conf. on Compilers,
architectures and synthesis for embedded systems, pages 309–317, 2003.
[18] S. Kim and A.K. Somani. Area efficient architectures for information integrity in cache mem-
ories. In Proc. of the ISCA, 1999.
[19] W. Zhang, S. Gurumurthi, M. Kandemir, and A. Sivasubramaniam. Icr: in-cache replica-
tion for enhancing data cache reliability. In Proc. of Intl. Conf. on Dependable Systems and
Networks, pages 291–300, 2003.
[20] T. Tanzawa, T. Tanaka, K. Takeuchi, R. Shirota, S. Aritome, H. Watanabe, G. Hemink,
K. Shimizu, S. Sato, Y. Takeuchi, et al. A compact on-chip ECC for low cost flash memo-
ries. In Journal of Solid-State Circuits, volume 32, pages 662–669, 1997.
[21] W. Zhang. Enhancing data cache reliability by the addition of a small fully-associative repli-
cation cache. In Proc. of Intl. Conf. of Supercomputing, pages 12–19, 2004.
[22] V. Sridharan, H. Asadi, MB Tahoori, and D. Kaeli. Reducing data cache susceptibility to soft
errors. In Trans. on Dependable and Secure Computing, volume 3, pages 353–364, 2006.
[23] B.T. Gold, M. Ferdman, B. Falsafi, and K Mai. Mitigating multi-bit soft errors in l1 caches
using last-store prediction. In Proc. of Federated Computing Research Conf., 2007.
[24] J. Kim, N. Hardavellas, K. Mai, B. Falsafi, and J. Hoe. Multi-bit error tolerant caches using
two-dimensional error coding. In IEEE Micro, 2007.
[25] K. Bhattacharya, S. Kim, and N. Ranganathan. Improving the reliability of on-chip l2 cache
using redundancy. In Proc. of the ICCD, pages 224–229, 2007.
[26] Y. Dhillon, A. Diril, and A. Chatterjee. Soft-error tolerance analysis and optimization of
nanometer circuits. Proc. of DATE, pages 288–293, 2005.
[27] G. Messenger. Collection of charge on junction nodes from ion tracks. Trans. of Nuclear
Science, 29(6):2024–2031, 1982.
[28] S. Mitra, T. Karnik, N. Seifert, and M. Zhang. Logic soft errors in sub-65nm technologies
design and CAD challenges. Proc. of DAC, pages 2–4, 2005.
[29] R. Rajaraman, J. Kim, N. Vijaykrishnan, Y. Xie, and M. Irwin. SEAT-LA: A soft error analysis
tool for combinational logic. Proc. of VLSID, pages 499–502, 2006.
[30] R. Rao, D. Blaauw, and D. Sylvester. Soft error reduction in combinational logic using gate
resizing and flipflop selection. Proc. of ICCAD, pages 502–509, 2006.
[31] FreePDK 45nm Technology Kit. http://www.eda.ncsu.edu/wiki/FreePDK.
[32] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi. Modeling the effect of tech-
nology trends on the soft error rate of combinational logic. Proc. of DSN, pages 389–398,
2002.
[33] B. Zhang, W. Wang, and M. Orshansky. FASER: Fast analysis of soft error susceptibility for
cell-based designs. Time, 1(66):2–10, 2003.
[34] Nangate Standard Cell library. http://www.si2.org/openeda.si2.org/projects/nangatelib.
[35] M. Nicolaidis. Time Redundancy Based Soft-Error Tolerance to Rescue Nanometer Technolo-
gies. Trans. on VTS, 99:86–94, 1999.
[36] T. Karnik, S. Vangal, V. Veeramachaneni, P. Hazucha, V. Erraguntla and S. Borkar. Selective
node engineering for chip-level soft error rate improvement. Proc. of Symp. On VLSI Circuits,
pages 204–205, 2002.
[37] J. Kumar and M. Tahoori. Use of pass transistor logic to minimize the impact of soft errors in
combinational circuits. Proc. of Workshop on SELSE, 2005.
[38] Y. Sasaki, K. Namba, and H. Ito. Circuit and Latch Capable of Masking Soft Errors with
Schmitt Trigger. Journal of Electronic Testing, 24(1), pages 11–19, 2008.
[39] R. Garg, N. Jayakumar, S.P. Khatri and G. Choi. A design approach for radiation-hard digital
electronics. Proc. of the DAC, pages 773–778, 2006.
[40] K. Bhattacharya and N. Ranganathan. RADJAM: A Novel Approach for Reduction of Soft
Errors in Logic Circuits. Proc. of the VLSI Design, pages 453–458, 2009.
[41] Y. Sasaki, K. Namba and H. Ito. Soft Error Masking Circuit and Latch Using Schmitt Trigger
Circuit. Proc. of the Symp. on DFT, 327–335, 2006.
[42] K. Bhattacharya and N. Ranganathan. Reliability-centric Gate Sizing with Simultaneous Op-
timization of Soft Error Rate, Delay and Power. Proc. of the ISLPED, 99–104, 2008.
[43] N. Hanchate and N. Ranganathan. LECTOR: A Technique for Leakage Reduction in CMOS
Circuits. IEEE Trans. on VLSI Systems, 12(2), 196–205, 2004.
[44] K. Roy and S. Prasad. Low Power CMOS VLSI: Circuit Design. Wiley-Interscience, 2000.
[45] J. Cazeaux, D. Rossi, M. Omaña, A. Chatterjee and C. Metra. On Transistor-Level Gate Sizing
for IC Design Robust To Transient Faults. Proc. of Intl. On-Line Testing Symposium, 2005.
[46] C. Nagpal, R. Garg and S. Khatri. A delay-efficient radiation-hard digital design approach
using CWSP elements. Proc. of DATE, pages 354–359, 2008.
[47] S. Mitra, M. Zhang, S. Waqas, N. Seifert, B. Gill and K. Kim. Combinational logic soft error
correction. Proc. of ITC, pages 824–832, 2006.
[48] M. Choudhury, Q. Zhou, and K. Mohanram. Design optimization for single-event upset ro-
bustness using simultaneous dual-vdd and sizing techniques. Proc. of ICCAD, pages 204–209,
2006.
[49] N. Sherwani. Algorithms for VLSI Physical Design Automation. Kluwer Academic Publishers,
Boston, 1995.
[50] V. Mahalingam and N. Ranganathan. Variation Aware Timing Based Placement Using Fuzzy
Programming. Proc. of ISQED, pages 327–332, 2007.
[51] C. Li, M. Xie, C. Koh, J. Cong, P. Madden. Routability-Driven Placement and White Space
Allocation. IEEE Trans. on CAD, 26(5), 858–871, 2007.
[52] K. Bhattacharya and N. Ranganathan. A New Placement Algorithm for Reduction of Soft
Errors in Macro Cell based Design of Nanometer Circuits. Proc. of ISVLSI, pages 91–96,
2009.
[53] H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani. VLSI module placement based on
rectangle-packing by the sequence pair. IEEE Trans. on CAD, 15(12), pages 1518–1524,
1996.
[54] X. Tang, R. Tian, and D. Wong. Fast Evaluation of Sequence Pair in Block Placement by
Longest Common Subsequence Computation. Proc. of the DATE, pages 106–111, 2000.
[55] P. Fernando and S. Katkoori. An Elitist Non-Dominated Sorting Based Genetic Algorithm
for Simultaneous Area and Wirelength Minimization in VLSI Floorplanning. Proc. of VLSI
Design, 337–342, 2008.
[56] C. Zhao, S. Dey and X. Bai. Soft-Spot Analysis: Targeting Compound Noise Effects in
Nanometer Circuits. Trans. on Design and Test, 362–375, 2005.
[57] J. Lou and W. Chen. Cross talk driven placement. Proc. of ASP-DAC, 735–740, 2003.
[58] J. Stine, J. Grad, I. Castellanos, J. Blank, V. Dave, M. Prakash, N. Iliev, and N. Jachimiec.
A Framework for High-Level Synthesis of System-on-Chip Designs. Proc. of MSE, 11–12,
2005.
[59] C. Zhao, X. Bai and S. Dey. A scalable soft spot analysis methodology for compound noise
effects in nano-meter circuits. Proc. of DAC, pages 894–899, 2004.
[60] V. Jain and P. Zarkesh-Ha. Analytical Noise-Rejection Model Based on Short Channel MOS-
FET. Proc. of ISQED, pages 401–406, 2008.
[61] I. Parulkar, A. Wood, J. Hoe, B. Falsafi, S. Adve, J. Torrellas and S. Mitra. OpenSPARC: An
Open Platform for Hardware Reliability Experimentation. Proc. of SELSE, 2008.
[62] S. Adya and I. Markov. Fixed-outline Floorplanning Through Better Local Search. Proc. of
ICCD, pages 328–333, 2001.
[63] J. Kleinhans, G. Sigl, F. Johannes and K. Antreich. GORDIAN: VLSI placement by quadratic
programming and slicing optimization. IEEE Trans. on CAD of Integrated Circuits and Sys-
tem, 10(3), pages 356–365, 1991.
[64] M. Galassi, J. Davies, J. Theiler, B. Gough, G. Jungman, M. Booth and F. Rossi. GNU
scientific library. Network Theory Ltd., 2002.
[65] L. Gaspero. QuadProg++. http://www.diegm.uniud.it/digaspero/ index.php?page=software-
and-data.
[66] J. Roy, D. Papa, S. Adya, H. Chan, A. Ng, J. Lu and I. Markov. Capo: robust and scalable
open-source min-cut floorplacer. Proc. of ISPD, pages 224–226, 2005.
[67] M. Berkelaar and J. Jess. Gate sizing in mos digital circuits with linear programming. Proc.
of EDAC, pages 217–221, 1990.
[68] J. P. Fishburn and A. E. Dunlop. TILOS : A posynomial programming approach to transistor
sizing. IEEE Trans. on CAD, pages 326–336, 1985.
[69] N. Hanchate and N. Ranganathan. Simultaneous interconnect delay and crosstalk noise opti-
mization through gate sizing using game theory. IEEE Trans. on Computers, 55(8):1011–1023,
2006.
[70] http://courses.ece.uiuc.edu/ece543/iscas85.html. ISCAS’85 benchmark circuits.
[71] http://www neos.mcs.anl.gov/neos/solvers/cp:KNITRO/AMPL.html. KNITROS solver from
neos server.
[72] K. Chopra, S. Shah, A. Srivastava, D. Blaauw and D. Sylvester. Parametric yield maximization
using gate sizing based on efficient statistical power and delay gradient computation. Proc. of
ICCAD, pages 1023–1028, 2005.
[73] J. Liou, K. Cheng, S. Kundu, and A. Krstic. Fast statistical timing analysis by probabilistic
event propagation. Proc. of DAC, pages 661–666, 2001.
[74] M. Hashimoto and H. Onodera. A Performance Optimization Method by Gate Sizing using
Statistical Static Timing Analysis. Proc. of ISPD, pages 111–116, 2000.
[75] L. Macchiarulo, E. Macii and M. Poncino. Wire Placement for Crosstalk Energy Minimization
in Address Buses. Proc. of DATE, pages 158–162, 2002.
[76] M. Mani, A. Devgan and M. Orshansky. An Efficient Algorithm for Statistical Minimization
of Total Power under Timing Yield Constraints. Proc. of DAC, pages 309–314, 2005.
[77] M. Mani and M. Orshansky. A New Statistical Optimization Algorithm for Gate Sizing. Proc.
of ICCD, pages 272–277, 2004.
[78] A. Murugavel and N. Ranganathan. Gate Sizing and Buffer Insertion using Economic models
for Power Optimization. Proc. of VLSI Design, pages 195–200, 2004.
[79] K. Bhattacharya and N. Ranganathan. A linear programming formulation for security-aware
gate sizing. Proc. of GLSVLSI, pages 273–278, 2008.
[80] S. Bhardwaj, Y. Cao and S. Vrudhula. Statistical leakage minimization through joint selection
of gate sizes gate lengths and threshold voltage. Proc. of ASP-DAC, 2006.
[81] S. Sapatnekar, V. Rao, and P. Vaidya. An exact solution to the transistor sizing problem for
cmos circuits using convex optimization. IEEE Trans. on CAD, 12(11):1621–1634, 1993.
[82] N. Weste, D. Harris, and A. Banerjee. CMOS VLSI Design: A circuits and systems perspec-
tive. Pearson/Addison-Wesley, 2005.
[83] X. Bai, C. Visweswariah, P. Strenski and D. Hathaway. Uncertainty-Aware Circuit Optimiza-
tion. Proc. of DAC, pages 58–63, 2002.
[84] Q. Zhou and K. Mohanram. Gate sizing to radiation harden combinational logic. IEEE Trans.
on CAD, 25(1):155–166, 2006.
[85] K. Mohanram and N. Touba. Cost-Effective Approach for Reducing Soft Error Failure Rate
in Logic Circuits. Proc. of ITC, 893–901, 2003.
[86] N. Sadler and D. Sorin. Choosing an error protection scheme for a microprocessor’s l1 data
cache. In Proc. of the ICCD, 2006.
[87] G. Reinman and N. Jouppi. An integrated cache timing and power model. In Compaq WRL
Report, 1999.
[88] S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: Exploiting generational behavior to reduce
cache leakage power. In Proc. of the ISCA, pages 240–251, 2001.
[89] S. Gopal, T. Vijaykumar, J. Smith, and G. Sohi. Speculative versioning cache. In Proc. of the
ISCA, pages 195–205, 1998.
[90] D. Brooks and M. Martonosi. Dynamically exploiting narrow width operands to improve
processor power and performance. In Proc. of the HPCA, 1999.
[91] J. Hu, S. Wang, and S. Ziavras. In-register duplication: Exploiting narrow-width value for
improving register file reliability. In Proc. of the Intl. Conf. on Dependable Systems and
Networks, volume 0, pages 281–290, 2006.
[92] M. Kadiyala and L. Bhuyan. A dynamic cache sub-block design to reduce false sharing. In
Proc. of ICCD, 1995.
[93] H. Lee, G. Tyson, and M. Farrens. Eager writeback-a technique for improving bandwidth
utilization. In Proc. of the Symp. on Microarchitecture, pages 11–21, 2000.
[94] D. Burger and T. Austin. The simplescalar tool set, version 2.0. In ACM SIGARCH Computer
Architecture News, volume 25, pages 13–25, 1997.
[95] SPEC2000 benchmarks. In http://www.specbench.org/osg/cpu2000/.
[96] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power
analysis and optimizations. In Proc. of ISCA, pages 83–94, 2000.
[97] D. Sinha and H. Zhou. Gate sizing for crosstalk reduction under timing constraints by La-
grangian relaxation. Proc. of ICCAD, pages 14–19, 2004.
[98] D. Sinha and H. Zhou. Yield driven gate sizing for coupling-noise reduction under uncertainty.
Proc. of ASP-DAC, pages 192–197, 2005.
[99] T. Xiao and M. Marek-Sadowska. Crosstalk reduction by transistor sizing. Proc. of ASP-DAC,
pages 137–140, 1999.
[100] K. Nepal, R. Bahar, J. Mundy, W. Patterson and A. Zaslavsky. MRF Reinforcer: A Proba-
bilistic Element for Space Redundancy in Nanoscale Circuits. Proc. of IEEE MICRO, pages
19–27, 2006.
[101] V. Mahalingam, N. Ranganathan and J. Harlow III. A Fuzzy Optimization Approach for
Variation Aware Power Minimization During Gate Sizing. IEEE Trans. on VLSI, 16(8):975–
984, 2008.
[102] P. Verplaetse, J. Dambre, D. Stroobandt and J. Campenhout. On partitioning vs. placement
rent properties. Proc. of SLIP, 33–40, 2001.
[103] K. Bhattacharya and N. Ranganathan. Reliability-centric gate sizing with simultaneous opti-
mization of soft error rate, delay and power. Proc. of ISLPED, pages 99–104, 2008.
[104] R. Bahar. Nanoscale Circuits and Architectures for Probabilistic Computation in the Presence
of Noise. Proc. of FNANO, Invited paper, 2006.
[105] International Technology Roadmap for Semiconductors.
http://www.itrs.net/Links/2001ITRS/Home.htm.
[106] J. Singh, V. Nookala, Z. Luo and S. Sapatnekar. Robust gate sizing by geometric program-
ming. Proc. of DAC, 315–320, 2005.
[107] N. Hanchate and N. Ranganathan. Statistical Gate Sizing for Yield Enhancement at Post
Layout Level. Proc. of ISVLSI, 245–252, 2007.
[108] N. Ranganathan, U. Gupta and V. Mahalingam. Simultaneous optimization of total power,
crosstalk noise, and delay under uncertainty. Proc. of GLSVLSI, pages 171–176, 2008.
[109] S. Wang, J. Hu and S. Ziavras. Self-Adaptive Data Caches for Soft-Error Reliability. IEEE
Transactions on Computer Aided Design of Integrated Circuits and Systems, 27(8), pages
1503–1507, 2008.
[110] J. Hu, F. Li, V. Degalahal, M. Kandemir, N. Vijaykrishnan and M. Irwin. Compiler-Assisted
Soft Error Detection under Performance and Energy Constraints in Embedded Systems. ACM
Transactions on Embedded Computing Systems, 8(4), pages 1–30, 2009.
[111] L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir and M. Irwin. Soft error and energy
consumption interactions: a data cache perspective. Proc. of the ISLPED, pages 132–137,
2004.
LIST OF PUBLICATIONS
  K. Bhattacharya, N. Ranganathan and S. Kim, "A Framework for Correction of Multi-bit
Soft Errors in L2 Caches Based on Redundancy", IEEE Trans. on VLSI Systems, 17(2), pp.
194-206, 2009.
  V. Mahalingam, K. Bhattacharya, N. Ranganathan, H. Chakravarthula, R. Murphy and K.
Pratt, "A VLSI Architecture and Algorithm for Lucas-Kanade Based Optical Flow Computation",
IEEE Trans. on VLSI Systems, to appear, 2009.
  K. Bhattacharya and N. Ranganathan, "A New Placement Algorithm for Reduction of Soft
Errors in Macro Cell based Design of Nanometer Circuits", Proc. of Annual Symp. on VLSI
(ISVLSI), pp. 91-96, 2009.
  K. Bhattacharya and N. Ranganathan, "RADJAM: A Novel Approach for Reduction of Soft
Errors in Logic Circuits", Proc. of the 22nd Intl. Conf. VLSI Design (VLSID), pp. 453-458,
2009.
  K. Bhattacharya and N. Ranganathan, "A Unified Gate Sizing Formulation for Optimizing
Soft Error Rate, Crosstalk Noise and Power under Process Variations", Proc. of the 10th Intl.
Symp. on Quality Electronic Design (ISQED), pp. 388-393, 2009.
  K. Bhattacharya, M. Venkataraman and N. Ranganathan, "A VLSI System Architecture for
Optical Flow Computation", Proc. of the 42nd Intl. Symp. on Circuits and Systems (ISCAS),
pp. 357-360, 2009.
  R. Hyman Jr., K. Bhattacharya and N. Ranganathan, "A Strategy for Soft Error Reduction in
Multi-core Designs", Proc. of the 42nd Intl. Symp. on Circuits and Systems (ISCAS), pp.
2217-2220, 2009.
  N. Ranganathan and K. Bhattacharya, "Methodology and Apparatus for Reduction of Soft
Errors in Logic Circuits", Provisional Patent Application filed on June 13, 2008.
  K. Bhattacharya and N. Ranganathan, "A Linear Programming Formulation for Security-Aware
Gate Sizing", Proc. of the 19th Great Lakes Annual Symp. on VLSI Design (GLSVLSI),
pp. 273-278, 2008. (Nominated for Best Paper Award: Ranked among top 6 out of 220
submissions and 40 accepted papers).
  K. Bhattacharya and N. Ranganathan, "Reliability-centric Gate Sizing with Simultaneous
Optimization of Soft Error Rate, Delay and Power", Proc. of the 13th Intl. Symp. on Low
Power Electronics and Design (ISLPED), pp. 99-104, 2008.
  K. Bhattacharya, S. Kim, and N. Ranganathan, "Improving the Reliability of On-chip L2
cache Using Redundancy", Proc. of the 25th Intl. Conf. on Computer Design (ICCD), pp.
224-229, 2007.
  K. Bhattacharya and N. Ranganathan, "A Novel Radiation Blocker Circuit and its Selective
Insertion for Soft Error Mitigation", IEEE Trans. on VLSI Systems (2nd Review).
  K. Bhattacharya and N. Ranganathan, "Placement for Radiation Immunity in Cell Based Design
of Nanometer Circuits", IEEE Trans. on VLSI Systems (2nd Review).
ABOUT THE AUTHOR
Koustav Bhattacharya received his B.Tech degree in Computer Engineering from Kalyani
University, West Bengal, India in 2002 and his Master’s degree in Computer Technology from the
Indian Institute of Technology, Delhi, India in 2004. In 2004, he worked as a design engineer in
ST Microelectronics, Noida, India. He was awarded the Richard E. Merwin Scholarship in 2007.
His research interests include Design Automation, VLSI Design and Test, Computer Architecture,
Design for Reliability, Design for Manufacturability and FPGA Design.
