University of Central Florida

STARS
Electronic Theses and Dissertations, 2004-2019
2015

Adaptive Architectural Strategies for Resilient Energy-Aware
Computing
Rizwan Arshad Ashraf
University of Central Florida

Part of the Computer Engineering Commons

This Doctoral Dissertation (Open Access) is brought to you for free and open access by STARS. It has been accepted
for inclusion in Electronic Theses and Dissertations, 2004-2019 by an authorized administrator of STARS. For more
information, please contact STARS@ucf.edu.

STARS Citation
Ashraf, Rizwan Arshad, "Adaptive Architectural Strategies for Resilient Energy-Aware Computing" (2015).
Electronic Theses and Dissertations, 2004-2019. 5009.
https://stars.library.ucf.edu/etd/5009

ADAPTIVE ARCHITECTURAL STRATEGIES FOR RESILIENT ENERGY-AWARE
COMPUTING

by

RIZWAN ARSHAD ASHRAF
M.Sc. University of Central Florida, 2013
B.Sc. University of Engineering and Technology, Lahore, 2007

A dissertation submitted in partial fulfilment of the requirements
for the degree of Doctor of Philosophy
in the Department of Electrical Engineering and Computer Science
in the College of Engineering and Computer Science
at the University of Central Florida
Orlando, Florida

Summer Term
2015
Major Professor: Ronald F. DeMara

© 2015 Rizwan A. Ashraf


ABSTRACT

Reconfigurable logic devices, or Field-Programmable Gate Arrays (FPGAs), can dynamically adapt the computational circuit based on user-specified or operating-condition requirements. Such hardware platforms are utilized in this dissertation to develop adaptive techniques for achieving reliable and sustainable operation while autonomously meeting these requirements. In particular, the properties of resource uniformity and in-field reconfiguration via on-chip processors are exploited to implement Evolvable Hardware (EHW). EHW utilizes genetic algorithms to realize logic circuits at runtime, as directed by the objective function. However, the size of problems solved using EHW, as compared with traditional approaches, has been limited to relatively compact circuits. This is because the complexity of the genetic algorithm increases with circuit size. To address this research challenge of scalability, the Netlist-Driven Evolutionary Refurbishment (NDER) technique was designed and implemented herein to enable on-the-fly permanent fault mitigation in FPGA circuits. NDER has been shown to achieve refurbishment of relatively large benchmark circuits as compared to related works. Additionally, Design Diversity (DD) techniques, which are used to aid such evolutionary refurbishment techniques, are also proposed, and the efficacy of various DD techniques is quantified and evaluated.
Similarly, there exists a growing need for adaptable logic datapaths in custom-designed nanometer-scale ICs, for ensuring operational reliability in the presence of Process, Voltage, and Temperature (PVT) and transistor-aging variations owing to decreased feature sizes for electronic devices. Without such adaptability, excessive design guardbands are required to maintain the desired integration and performance levels. To address these challenges, the circuit-level technique of Self-Recovery Enabled Logic (SREL) was designed herein. At design-time, vulnerable portions of the circuit identified using conventional Electronic Design Automation tools are replicated to provide post-fabrication adaptability via intelligent techniques. In-situ timing sensors are utilized in a feedback loop to activate suitable datapaths based on current conditions that optimize performance and energy consumption. Primarily, SREL is able to mitigate the timing degradations caused by transistor aging effects in sub-micron devices by using power-gating to reduce the stress induced on active elements. As a result, fewer guardbands need to be included to achieve comparable performance levels, which leads to considerable energy savings over the operational lifetime.
The need for energy-efficient operation in current computing systems has given rise to Near-Threshold Computing, as opposed to the conventional approach of operating devices at nominal voltage. In particular, the goal of the exascale computing initiative in High Performance Computing (HPC) is to achieve 1 EFLOPS under a power budget of 20 MW. However, this comes at the cost of increased reliability concerns, such as the increase in performance variations and soft errors. This has given rise to increased resiliency requirements for HPC applications in terms of ensuring functionality within given error thresholds while operating at lower voltages. My dissertation research devised techniques and tools to quantify the effects of radiation-induced transient faults in distributed applications on large-scale systems. A combination of compiler-level code transformation and instrumentation is employed for runtime monitoring to assess the speed and depth of application state corruption as a result of fault injection. Finally, fault propagation models are derived for each HPC application that can be used to estimate the number of corrupted memory locations at runtime. Additionally, the tradeoffs between performance and vulnerability and the causal relations between compiler optimization and application vulnerability are investigated.


DEDICATION

It is my proud pleasure to dedicate this dissertation to my late father (May Allah Bless his Soul with Peace and Mercy)
Professor Dr. Muhammad Ashraf,
the then Dean of University of Engineering and Technology Lahore,
who put me on the path of pursuit of higher learning abroad in my chosen field of study. To him,
indeed, the credit of this doctoral dissertation goes, and to him I owe a heavy debt of gratitude.


ACKNOWLEDGMENTS

I am pleased to acknowledge with gratitude the continuous support and guidance which my academic advisor Dr. Ronald F. DeMara provided to me throughout the course of my research study
presented in this dissertation. To Drs. Mingjie Lin, Jun Wang, Sumit K. Jha and Mark Johnson I
offer my sincere thanks for serving on my dissertation committee.
Besides, I would also like to acknowledge the support and mentorship provided by Dr. Roberto
Gioiosa during my internship at Pacific Northwest National Laboratory (PNNL). My thanks are
also due to Drs. Gokcen Kestor from PNNL, Chen-Yong Cher and Pradip Bose from IBM Research
for their valuable technical feedback. Part of this dissertation research was conducted during the
internship, whereas afterwards continuous support through access to PNNL computing resources
made this research possible.
I also extend my thanks to the former and current colleagues in the Computer Architecture Laboratory at UCF, whose company has been both enjoyable and productive. Thanks are due to all
those who have contributed in some way towards this research study.
I am honored to have been a part of the Department of Electrical Engineering and Computer Science at
UCF, where my graduate school experience was enriched by the professional exposure I gained here.
Finally, my sincere thanks are due to my family and friends for their continuous support, especially
my mother Nighat Yasmeen Ashraf, younger brothers Rehan Ashraf and Salman Ashraf, and sister
Maria Ashraf, for their moral support and love, which kept my spirit kindled to move forward and
complete this research study. Their support, encouragement and patience throughout this phase of
my life were the bedrock of my study during my absence from their company.


TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

LIST OF ACRONYMS

CHAPTER 1: INTRODUCTION
  1.1 Need for Adaptability in VLSI circuits
  1.2 Reconfigurable logic vs Custom logic with Emphasis on Adaptability
  1.3 Permanent Faults in Nanometer-scaled VLSI devices
  1.4 Energy-Efficiency through Near-Threshold Computing
  1.5 Implications of Transient Faults on Large-Scale Systems
  1.6 Contributions of the Dissertation

CHAPTER 2: PREVIOUS WORKS FOR FAULT MITIGATION IN VLSI CIRCUITS
  2.1 Techniques for Refurbishment of Reconfigurable Hardware
    2.1.1 Genetic Algorithm-based Refurbishment Techniques
    2.1.2 Exhaustive Testing of Resource Configurations
    2.1.3 Use of Design-Diversity to Overcome Common-Mode-Failures
  2.2 Techniques for Circuit-level Anti-Aging
    2.2.1 Transistor Aging Models and Mitigation through Power-Gating
    2.2.2 Worst-case Design Techniques for Aging-Compensation
    2.2.3 Dynamic Operating Conditions for Aging-Mitigation
    2.2.4 Adaptive Resource Management for Aging-Resilience
  2.3 Soft Error Masking in Logic Paths
  2.4 Summary

CHAPTER 3: DESIGN DIVERSITY APPROACH TO FAILURE MITIGATION
  3.1 Design-For-Diversity for Improved Fault-Tolerance of TMR Systems on FPGAs
  3.2 Design-For-Diversity Techniques
    3.2.1 Template-Based Method
    3.2.2 NAND/NOR-Based Method
    3.2.3 Case-Based and Inverted-Output Methods
  3.3 Experimental Setup
    3.3.1 Simulation Objectives, Tools and Workflow
    3.3.2 Experimental Configurations
  3.4 Experimental Results and Analysis
    3.4.1 Quantifying Diversity for Proposed Techniques
    3.4.2 Diversity Metric for Multiple TMR Arrangements
  3.5 Summary

CHAPTER 4: NETLIST-DRIVEN EVOLUTIONARY REFURBISHMENT FOR FAULT-TOLERANCE IN RECONFIGURABLE HARDWARE
  4.1 Scalable FPGA Refurbishment Using Netlist-Driven Evolutionary Algorithms
    4.1.1 NDER Research Objectives
    4.1.2 Organization of the Chapter
  4.2 Fault Isolation via Back Tracing
    4.2.1 Aggressive Pruning Heuristic (HA)
    4.2.2 Exhaustive Pruning Heuristic (HE)
    4.2.3 Hybrid Pruning Heuristic (HAE)
    4.2.4 Dynamic Pruning Heuristic (HD)
    4.2.5 Evaluating the Diagnostic Performance
  4.3 Architecture Supporting NDER
  4.4 NDER Technique
    4.4.1 Representation
    4.4.2 Marking of LUTs
    4.4.3 Fitness Evaluation
    4.4.4 Mutation Operation
    4.4.5 Single-point Crossover Operation
    4.4.6 Exit Criteria
  4.5 Experiments and Results
    4.5.1 NDER Recovery Performance
    4.5.2 Discussion of NDER Results
    4.5.3 Scalability of Evaluation
  4.6 Summary

CHAPTER 5: SELF-RECOVERY ENABLED LOGIC FOR ANTI-AGING IN ASICs
  5.1 Autonomous Circuit-level Adaptation for Resilience and Lifetime Energy Reduction of Logic Paths
    5.1.1 Proactive Aging Recovery Techniques for SRAM Arrays
    5.1.2 Research Objectives of the Proposed Technique
    5.1.3 Power-gating Approaches to Mitigate Aging
    5.1.4 Organization of the Chapter
  5.2 Self-Recovery Enabled Logic (SREL)
    5.2.1 Alternating Critical Paths (ACP)
    5.2.2 Competing Critical Paths (CCP)
  5.3 Aging-Sensitive Logic Domain
    5.3.1 Identifying Paths for Aging Mitigation
    5.3.2 Design-time Tradeoffs
    5.3.3 Gate-level Redundancy for Aging-Mitigation
    5.3.4 Controlled Inclusion of Merging Points
    5.3.5 Resource-Constrained Anti-Aging
  5.4 Scope and Applicability of Proposed SREL Approaches
  5.5 Experiments and Results
    5.5.1 ACP Reduction of Delay Degradation
    5.5.2 Benefit of ACP with Tighter Constraints
    5.5.3 Benefits of Autonomous Operation with CCP
  5.6 Summary

CHAPTER 6: COSTS AND LIMITS OF SOFT ERROR MASKING AT REDUCED SUPPLY VOLTAGE
  6.1 Introduction
    6.1.1 Organization of the Chapter
  6.2 NMR Systems at Near-Threshold Voltage
  6.3 Energy Cost of Mitigating Variability in NMR Arrangements
  6.4 Experiments and Results
    6.4.1 Iso-Performance Energy Consumption for NMR Arrangements
    6.4.2 Impact of Technology Scaling
    6.4.3 Cost of Increased Reliability at NTV
  6.5 Summary

CHAPTER 7: UNDERSTANDING THE IMPACT OF TRANSIENT ERRORS IN HPC APPLICATIONS FOR LARGE-SCALE SYSTEMS
  7.1 Introduction
    7.1.1 Implications of Compiler Optimizations on Application Vulnerability
    7.1.2 Importance of Fault-Propagation Analysis
    7.1.3 Organization of the Chapter
  7.2 Fault Model and Software-Implemented Fault Injection
    7.2.1 Categorization of Application Outcomes
    7.2.2 Architecture-level Fault Injection
    7.2.3 Fault-Injection Coverage
  7.3 Experimental Setup
  7.4 Effect of Software Transformations on Application Vulnerability
  7.5 Framework to Quantify the Propagation of Faults into Distributed Application State
  7.6 Experimental Results for Fault-Propagation Analysis
    7.6.1 The Black-Box Analysis
    7.6.2 Fault Propagation Analysis
  7.7 Fault Propagation Modeling
  7.8 Related Works on HPC Fault Tolerance
  7.9 Summary

CHAPTER 8: CONCLUSIONS AND FUTURE WORK
  8.1 Summary of Developed Techniques and Tools
  8.2 Lessons Learned and Limitations
  8.3 Future Work

LIST OF REFERENCES

LIST OF FIGURES

2.1 Power-gating helps to reduce the degradation in Vth due to BTI.
3.1 TMR system.
3.2 System block diagram for Template-Based approach.
3.3 Inter-Design Diversities in CMF.
3.4 Reliability evaluation for TMR systems based on uniform and diverse implementations (combinational circuit).
3.5 Reliability evaluation for TMR systems based on uniform and diverse implementations (sequential circuit).
4.1 Motivating example for search space tradeoffs through multiple heuristics.
4.2 RARS architecture as utilized in NDER.
4.3 Algorithmic flow chart describing NDER.
4.4 Example FPGA configuration with GA encoding.
4.5 GA phenotype in NDER.
4.6 Flow chart describing the Marking of LUTs in NDER.
4.7 Mutation operation in NDER.
4.8 Crossover operation in NDER with parents having similar Markings.
4.9 Crossover operation in NDER with parents having dissimilar Markings.
4.10 Distribution of fan-out for LUTs in included benchmark circuits.
4.11 Design-time allocation of redundancy.
4.12 Fitness versus generations for 5xp1 benchmark circuit.
4.13 Distributions of the size of subset of PI(s) required for fitness evaluation.
5.1 Design objectives and challenges of the devised SREL techniques.
5.2 Effect of Vth degradation on logic gate with different ST configurations.
5.3 Effect of BTI recovery by changing the sleep interval using N = 2 instances.
5.4 Use of Timing Sensors in CCPs.
5.5 Path distribution for Arithmetic Logic Unit of an OpenSPARC core.
5.6 Influence of critical path replication on critical and near-critical POs.
5.7 Insertion of merging point to accommodate replication of critical paths.
5.8 Parameter P can be traded-off against area overhead incurred.
5.9 Delay degradation for VM and ACP designs over a lifetime of 10 years.
5.10 Delays of multiple critical paths over time for alu4 using ACP design.
5.11 Energy savings with ACP as compared to VM for a lifetime of 10 years (Dspec = 5%).
5.12 Energy savings with ACP as compared to VM for lifetimes of 3 and 10 years (Dspec = 0%).
5.13 Comparison of energy savings for ACP as compared to VM realizing different timing specifications with a lifetime of 10 years.
5.14 Energy savings for ACP, CCP with ADS=1%, and CCP with ADS=2% as compared to VM (Dspec = 0%).
6.1 Propagation delay of TMR system under increased PV.
6.2 Mean delay of NMR systems increases with scaling voltage down to the Near-threshold region.
6.3 Delay variations decrease with increasing N for NMR systems.
6.4 Delay distributions of NMR systems at Near-threshold Voltage of 0.55V with 45nm technology node.
6.5 The variation of simplex systems composed of different types of logic gates.
6.6 The variability of TMR systems composed of modules with design-diversity.
6.7 Simulation framework to estimate the delay and energy for NMR systems in the presence of PV.
6.8 Mean energy consumption of spatial redundancy systems with various operational voltages.
7.1 Faults are injected uniformly throughout the execution of LULESH.
7.2 Performance impact of compiler optimizations.
7.3 Statistical breakdown of application vulnerability.
7.4 Correlation between application vulnerability and stores/instruction.
7.5 Correlation between application vulnerability and loads/instruction.
7.6 Statistical breakdown of program crashes.
7.7 Fault propagation in Matrix-Vector multiplication.
7.8 Primary and secondary chain of instructions for instrumentation in fault-propagation framework.
7.9 FPM transformation and instrumentation using sample LLVM-IR program.
7.10 MPI message handling within the fault-propagation framework.
7.11 Outcome of fault injection with single fault into a single MPI process.
7.12 Fault propagation plots demonstrating the number of corrupted memory locations over time.
7.13 Fault propagation across different MPI processes.
7.14 Fault-Masking levels investigated by related works.
8.1 Perspective area overhead of SREL as compared to related works.

LIST OF TABLES

1.1 Potential benefits of Self-Recovery Enabled Logic.
2.1 Comparison of NDER with GA-based fault-handling schemes for FPGAs.
2.2 Design Diversity related work.
2.3 Responses to an injected fault pair for two diverse designs M1 and M2.
2.4 Effect of temperature variation on Vth degradation due to standby mode.
2.5 Comparison of proposed SREL schemes with related works.
3.1 Diversity value obtained through comparison with the BASE design.
3.2 Intra-Design Diversity values.
3.3 TMR Reliability evaluation for combinational circuit.
3.4 TMR Reliability evaluation for sequential circuit.
4.1 Listing of LUTs Marked through presented Heuristic approaches.
4.2 Percentage of LUTs marked and the probability of covering the faulty LUT with Heuristic HA.
4.3 Nomenclature for NDER.
4.4 Fitness and generations for refurbishment via Conventional and Netlist-Driven approaches.
4.5 Speedup in recovery time achieved via NDER.
4.6 Utilization of spare LUTs for 100% recovery via NDER.
4.7 Performance for 100% recovery via NDER with single vs. multiple corrupt output lines.
4.8 Multiple output discrepancies reduces recovery time.
5.1 Design-time Area/Energy/Delay overheads with increasing P (N = 2) for alu4 benchmark at nominal voltage.
5.2 Initial time Area/Energy/Delay overheads with P = 10% (N = 2) at nominal voltage.
5.3 Minimum switching intervals for CCP when ADS=1% and ADS=2%.
6.1 Mean energy consumption for NMR systems with same performance at specified NTV of simplex system.
6.2 Impact of increased PV due to technology node scaling on energy consumption for NMR systems.
7.1 Optimizations applied at each level by clang.
7.2 Impact of compiler optimizations on application performance and other characteristics for single node.
7.3 Fault propagation depends on instruction type and its operands.
7.4 Fault propagation speed factors.
8.1 Summary of technical achievements for circuit-level fault-tolerance techniques.
8.2 Summary of technical achievements for vulnerability characterization of large-scale systems (Chapter 7).
8.3 Summary of HPC applications' vulnerability characteristics.

LIST OF ACRONYMS

ACP Alternating Critical Path
ASICs Application Specific Integrated Circuits
AVF Architectural Vulnerability Factor
BIST Built-in Self Test
BTI Bias Temperature Instability
CCP Competing Critical Path
CED Concurrent Error Detection
CLBs Configuration Logic Blocks
CMF Common Mode Failure
COTS Commodity Off-The-Shelf
CRR Competitive Runtime Reconfiguration
DD Design Diversity
DFS Dynamic Frequency Scaling
DMR Dual Modular Redundancy
DVF Data Vulnerability Factor
DVS Dynamic Voltage Scaling
ECC Error Correcting Code
EDA Electronic Design Automation
EFLOPS Exa-Floating Point Operations Per Second
EHW Evolvable Hardware
EM Electromigration
FPGAs Field Programmable Gate Arrays
FPM Fault Propagation Module
GA Genetic Algorithm
GPU Graphics Processing Unit
HCI Hot Carrier Injection
HPC High Performance Computing
ILP Instruction Level Parallelism
LUT Lookup Table
MBU Multi-Bit Upset
MPI Message Passing Interface
MTTF Mean-Time-To-Failure
NBTI Negative Bias Temperature Instability
NDER Netlist-Driven Evolutionary Refurbishment
NMR N-Modular Redundancy
NTV Near-Threshold Voltage
PBTI Positive Bias Temperature Instability
PE Processing Element
PI Primary Input
PO Primary Output
PR Partial Reconfiguration
PTM Predictive Technology Models
PV Process Variation
PVF Program Vulnerability Factor
PVT Process, Voltage and Temperature
RD Reaction-Diffusion
SA Stuck-At
SDC Silent Data Corruption
SER Soft Error Rate
SET Single Event Transient
SEU Single Event Upset
SRAM Static Random Access Memory
SREL Self-Recovery Enabled Logic
SWIFI Software Implemented Fault Injection
TD Trapping/Detrapping
TDDB Time-Dependent Dielectric Breakdown
TDP Thermal Design Power
TMR Triple Modular Redundancy
TVF Timing Vulnerability Factor
VLSI Very Large Scale Integration
VR Voltage Regulator


CHAPTER 1: INTRODUCTION

Reliability is a growing concern in the embedded and high performance computing domains. This dissertation evaluates and presents the need for adaptable and autonomous fault-tolerance techniques in the nanometer-scale computing era.
This chapter discusses the need and motivation for the techniques and tools presented in the dissertation, and concludes with a list of contributions to the state-of-the-art as a result of this work.
In the dissertation, adaptable techniques to achieve energy-aware resilience in VLSI circuits are proposed. In particular, computer architectures are proposed which can adapt autonomously at runtime to achieve user-specified performance objectives, while dealing with both hard faults, i.e., transistor aging effects, and soft faults, i.e., radiation-induced transient effects. Specifically, fault-tolerance techniques are proposed for Field Programmable Gate Array (FPGA) and Application-Specific platforms utilized in embedded systems. The latter part of this dissertation covers reliability issues that arise from operation at Near-Threshold Voltage (NTV) in processors operating in distributed environments utilized in High Performance Computing (HPC) applications.

1.1 Need for Adaptability in VLSI circuits

A continuous push to miniaturize electronic devices in order to achieve higher integration and performance levels has given rise to challenges such as resiliency and power constraints. Traditionally, reliable operation has mainly been a concern for embedded systems operating in harsh environments, wherein the systems are prone to hard faults such as Local Permanent Damage. However, with continuous scaling of technology nodes, reliability is also becoming an important design parameter in commodity electronic systems, where hard faults due to transistor aging effects such as Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI) are gaining importance in the design process. This has caused a substantial increase in the design margins required to maintain correct functionality. Moreover, manufacturing-induced Process Variations (PV) can further increase the design margins needed to accommodate the worst case. Such margins are usually determined using predictive models at design-time. These models are mostly based on estimates of the workload in the field, or on the expected worst-case behavior. As a result, a continuous energy and/or performance overhead is incurred by the embedded system. The conventional design policy of 'one design fits all' needs to be replaced by approaches that adapt to actual operating conditions at runtime, so as to minimize overheads that will otherwise continue to grow for future electronic devices.
Furthermore, there is a need for energy-aware resiliency techniques, as increased density has given rise to thermal power constraints, which limit the on-chip resources that can be powered on simultaneously at any given time. Thus, a certain portion of the chip area cannot be utilized, which is referred to as Dark Silicon [144]. As a result, traditional approaches such as Triple Modular Redundancy (TMR), which utilize additional resources to provide reliability, may no longer be feasible. Thus, goals such as reliability and energy-efficiency need to be addressed simultaneously.
More and more application domains are converging to requirements similar to embedded systems.
One such major example is a push for resiliency and low-energy operation in HPC systems, which
traditionally have been focused on performance alone. Now, with goals such as achieving
1 EFLOPS under a power budget of 20 MW, soft-errors are becoming a prominent
concern in such systems [87]. The Mean-Time-To-Failure (MTTF) for such systems has dropped
from months and weeks to the order of days and hours, due to the sheer number of components involved.
To sustain throughput, adaptable techniques need to be developed for such systems. Addressing
these design challenges has a broad impact on society at large, as these systems are usually
used to conduct large-scale scientific simulations [132].

1.2 Reconfigurable logic vs Custom logic with Emphasis on Adaptability

FPGAs provide a significant benefit for the embedded application domain, because of their significantly shorter time-to-deployment as compared to Application-Specific Integrated Circuits (ASICs)
and improved performance as compared to general-purpose processors. In addition, these platforms provide features such as in-field reconfigurability which is ideal to implement adaptable
computing architectures. The SRAM-based configurable logic blocks which are used to implement the computational part in FPGAs, can provide multiple functionalities based on the configuration bitstream generated for the device. These configurations can be generated via automated
techniques such as Genetic Algorithms (GAs), where desired constraints such as power and performance can be specified via the objective function in addition to operational correctness. For
example, GA-based techniques can be utilized to generate diverse configurations, where the same
functionality is maintained across all designs but a unique physical implementation can provide benefits such as avoiding the faulty resources in the FPGA fabric.
On the other end of the spectrum, provisions need to be made at design-time for ASIC
designs to be adaptable. For instance, a usual practice is to include data multiplexers in the datapath
to choose a computational circuit at runtime. Power-gating can also be utilized to minimize the
leakage energy overheads due to unutilized resources in the circuit. While both platforms provide
the ability to manage resiliency through variation of operating frequency and/or voltage, the overall
energy cost and throughput degradation can be high. Thus, there is a need to develop adaptable
techniques which can lower energy consumption with minimal impact on throughput. In this
dissertation, the benefit of FPGA reconfigurability is demonstrated using both GAs at runtime
and design-diversity techniques at design-time. On the other end of the spectrum, the Dark Silicon
in ASIC platforms is utilized to mitigate performance degradation over time as demonstrated later.

1.3 Permanent Faults in Nanometer-scaled VLSI devices

For 90nm technologies and below, transistor aging effects such as Negative Bias Temperature
Instability (NBTI) and HCI have been largely considered to be the threats to reliable circuit operation [168]. In addition, technology advancements such as the use of High-K / Metal Gate have also
given rise to Positive Bias Temperature Instability (PBTI) along with NBTI [129]. These effects
are still prevalent in recent 22nm 3-D Tri-gate transistor devices [130]. Generally, the above transistor aging effects cause a cumulative degradation in the threshold voltage Vth of both nMOS
and pMOS transistors to varying degrees. If uncompensated, the Vth shift results in the circuit
becoming slower over its lifetime, i.e., the delay increases with age. If such delay degradation is
not budgeted or compensated for, then the circuit eventually fails to perform correctly within timing constraints. The long-standing practice of adding guardbands [38] at design-time provides a
worst-case design approach to compensate for aging effects including process variations. However,
such worst-case designs incur continual energy and/or performance overhead, which are strongly
influenced by the accuracy of aging models utilized.
Both BTI and HCI are strongly dependent on various factors such as supply voltage, temperature
and switching activity factor [41]. Thus, design-time prediction of actual circuit degradation in
the field can be a complex task. For instance, determination of input signal probabilities for all
transistors in the circuit is dependent on the total number of transistors and size of the input space.
Similarly, the temperature of each transistor is dependent on its spatial location and workload. In
this case, the availability of a mechanism to intrinsically measure the timing performance of the
circuit and adapt when a timing violation is expected to take place is highly desirable. This can be
achieved by placing timing sensors into the circuit which provide feedback to a control circuitry
which has the provision to trigger circuit-level adaptation to mitigate aging effects as demonstrated
in this dissertation.

1.4 Energy-Efficiency through Near-Threshold Computing

Improvement in power efficiency of CMOS circuits is desired due to thermal limitations associated with technology scaling. This restricts the number of simultaneously activated components,
e.g., the percentage of active cores within a many-core processor. In this regard, supply voltage
(VDD ) scaling is widely recognized as the most effective lever to reduce power dissipation / energy
consumption for CMOS logic devices.
Total energy consumption is determined by dynamic and static energy components which are both
dependent on VDD . The dynamic energy has a quadratic relationship with VDD , whereas the static
energy has a linear dependence. Taken to an extreme, VDD reduction would encounter theoretical
limitations around a value that is twice the thermal voltage, i.e., about 36mV [55]. However, downscaling the supply voltage below the transistor threshold voltage Vth leads to an operating region
exhibiting a highly undesirable exponential increase in delay. Because leakage accumulates over this longer computation time, leakage energy begins to
dominate at a certain point, such that designs utilizing this Sub-Threshold region have much more
limited applicability. Hence, operation in the Near-Threshold region is sought, as it provides an
energy-efficient operating point. Here, the VDD is set to be slightly above the Vth of the transistors
to provide a 10X delay improvement compared to operation in the sub-threshold region with only
a 50% reduction in energy savings [55]. Taking all of these factors into consideration, NTV can be
preferred to provide up to 6X energy savings as compared to operating at nominal voltage [55].

1.5 Implications of Transient Faults on Large-Scale Systems

Large supercomputers are mainly built out of commodity off-the-shelf (COTS) components [2].
COTS components may be fairly reliable, for example the MTTF of a single memory module
with double-bit error-correction capability is over 100 years [58]. However, the sheer number
of components assembled in current supercomputers dramatically reduces the net MTTF to a few
days. For exascale systems, the expected MTTF is in the order of hours [87]. The goal of the future
application developer will be to employ fault resilient techniques at various abstraction layers
e.g., circuit-level, architecture-level, or system-level to meet end-user requirements, essentially
providing a reliable application utilizing unreliable devices [22], [134], [27].
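The scaling argument above can be made concrete with a small sketch. It assumes independent components with constant (exponential) failure rates, where any single component failure counts as a system failure; under that assumption the failure rates add, so the system MTTF is the component MTTF divided by the component count. The function name and the component count of 10,000 are hypothetical values chosen for illustration and are not taken from the cited works.

```python
HOURS_PER_YEAR = 24 * 365


def system_mttf_hours(component_mttf_hours: float, num_components: int) -> float:
    """System MTTF for identical, independent components in a series model.

    Assumes constant failure rates, so the system failure rate is the sum of
    the component failure rates.
    """
    if num_components <= 0:
        raise ValueError("num_components must be positive")
    return component_mttf_hours / num_components


# A module with a 100-year MTTF, as quoted in the text.
component_mttf = 100 * HOURS_PER_YEAR

# Hypothetical machine built from 10,000 such modules.
mttf_hours = system_mttf_hours(component_mttf, 10_000)
mttf_days = mttf_hours / 24
```

With these assumed numbers the net MTTF comes out at under four days, which is in line with the "few days" figure quoted for current supercomputers.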
Faults occur at hardware level as the result of physical phenomena such as exposure to alpha particles, transient timing violations, or localized temperature variations. With the development of
NTV technology [55] and the need for higher temperature tolerance required to achieve exascale
efficiency [87], transient errors are becoming predominant. In this dissertation, implications of
transient faults on distributed applications are investigated. In particular, emphasis is given on
faults that escape hardware correction and detection due to the infeasibility of complete fault coverage for millions of chips consisting of billions of transistors each. The characteristics of how
such faults propagate through the application’s memory state are analyzed.

1.6 Contributions of the Dissertation

The main contributions of this dissertation are highlighted below.
Design Diversity and its benefits for FPGA devices: Design diversity is a technique to generate
multiple unique implementations of the same design specification by exploiting the flexible characteristics of the underlying platform. In this case, FPGAs are utilized as the experimental platform,
due to their reconfiguration capabilities. The uniformity of resources allows the same functional
design to utilize different portions of the FPGA fabric. Hence, avoiding a faulty resource can be
as simple as reconfiguring with a diverse FPGA configuration, which defines the logic values and
inter-connectivity of underlying SRAM cells. This is achieved via unique placement and routing
by exploiting current Electronic Design Automation (EDA) tools. Additionally, other synthesis
techniques are also utilized to achieve diverse designs. The efficacy of such methodologies is
tested using TMR arrangements, where it is shown to achieve better performance than uniform
design based TMR under the Common Mode Failure (CMF) model. The efficacy of these schemes
is quantified using the design diversity metric. This work is presented in Chapter 3.¹
Scalable Refurbishment for Evolvable Hardware: The scalability issue of Evolvable Hardware (EHW) techniques is addressed via the proposed Netlist-Driven Evolutionary Refurbishment
(NDER) technique. EHW is characterized by the use of GA techniques to generate FPGA configurations at runtime to meet user-specifications of functionality and other constraints via the objective
function. NDER augments the conventional genetic algorithm to utilize the failure signature observed
at runtime. Then, a subset of FPGA resources which are presumed to cause the corrupt output(s)
is selected for evolutionary refurbishment via GA operators such as mutation and crossover. Thus,
a substantial reduction in the search space of the evolutionary algorithm causes it to converge more
quickly, which results in considerable reductions in recovery times. As a result, the novel contributions
made in the field of EHW make it possible to recover from hard faults in benchmark circuits, which
is not otherwise possible using conventional approaches. The highlights of the work presented in
Chapter 4² are given below:

(a) a Fault-Handling approach to which the EHW paradigm is tractable,
(b) refurbishment of functionality on-the-fly for configurations using failed components,
(c) a fault pruning method which can be employed by other recovery schemes, and
(d) a framework for GAs to cater for dynamic encoding of individuals.

¹ © 2011 IEEE. Reprinted, with permission, from [12].
² © 2013 IEEE. Reprinted, with permission, from [10].

Table 1.1: Potential benefits of Self-Recovery Enabled Logic.

| Need Addressed | Approach | Benefit |
|---|---|---|
| Transistor Aging | Power-gating of critical elements | Reduced lifetime delay degradation through stress reduction |
| Transistor Aging | In-situ timing assessment and autonomous control of Sleep Interval | Avoids the need for complicated modeling of aging degradation at design-time |
| Energy Consumption | Reduced voltage guardbands | Low energy requirement with narrower margins for longer periods |
| Energy Consumption | De-emphasized role of voltage regulators | Circumvent conversion inefficiencies and switching losses |
| Energy Consumption | Selective Redundancy | Power-gating lowers the leakage energy overheads |

Fault-tolerance against transistor aging-induced timing degradations: The notion of adaptability is introduced in ASICs by the proposed Self-Recovery Enabled Logic (SREL) techniques,
whose potential benefits are listed in Table 1.1. In the sub-micron era, a major design challenge is
to overcome performance degradations over time induced via transistor aging-related mechanisms.
These cause timing degradation over the device lifetime; hence users must introduce worst-case
guardbands which result in wasted energy/performance. Moreover, since this is done at design-time, it requires prediction of aging behavior using models and expected runtime inputs. This can
result in over-estimations. Hence, there is a need to adapt based on the actual timing degradation induced in the field. By introducing selective redundancy in the aging-critical logic domains
at design-time and autonomously switching between these domains at runtime, SREL is able to
adapt according to real-time conditions and hence achieve a substantial reduction in the energy over the device
lifetime which otherwise would have been wasted. The favorable recovery characteristics of
transistor aging are accelerated through power-gating the inactive logic domain. Both proactive
and reactive strategies for control of sleep-transistors used for power-gating are introduced, which
enable post-fabrication adaptation through the novel control knob of sleep-interval introduced as a
result of SREL techniques. This work is presented in Chapter 5.³
Resiliency characterization for HPC applications: There is a need to adopt adaptable fault-tolerance techniques for the HPC domain, as it pushes towards the goal of exascale computing
with low-power budgets. Technology scaling and the market shift towards mobile computing have
caused a major design shift in this domain. Like every domain, there is a definite need to adopt the
latest technologies for sustaining future growth. For example, HPC has already adopted the use of
Graphics Processing Units (GPUs), which were originally designed for the video gaming market. In
the near future, with wide adoption of low-voltage technologies such as NTV and to meet the exascale computing goals, the HPC domain will need to utilize these technologies. This will require
fault-tolerance techniques, which traditionally have been the focus of embedded applications, to be
adopted in the HPC domain. Thus, there is a need to characterize the behavior of large-scale scientific applications in the presence of increased failure rates, which are expected due to operation
at NTV. Based on these characterizations, applications’ characteristics can be leveraged to configure the most power-efficient fault-tolerance mechanism that guarantees the correct termination
of HPC applications. The circuit-level effects of soft-errors are investigated in Chapter 6 and the
vulnerability of HPC applications to these effects is investigated in Chapter 7.⁴

³ © 2015 IEEE. Part of this chapter is reprinted, with permission, from [9].
⁴ © 2015 ACM. Part of this chapter is reprinted, with permission, from [11].

CHAPTER 2: PREVIOUS WORKS FOR FAULT MITIGATION IN VLSI CIRCUITS

This chapter presents comprehensive techniques which are used to mitigate permanent faults in
FPGAs, with an emphasis on autonomous nature of refurbishment techniques for Evolvable Hardware. Next, circuit-level techniques are presented to mitigate transistor aging effects in nanometer
scaled ICs. Finally, techniques for soft-error masking are presented.

2.1 Techniques for Refurbishment of Reconfigurable Hardware

This section describes the previous research efforts on evolutionary and non-evolutionary refurbishment techniques exploiting the reconfigurability of reprogrammable hardware devices such
as FPGAs. It establishes the motivation to focus on the reconfiguration algorithms employed for
autonomous self-healing systems, which are plausible using these approaches. Thus, comparison of the proposed Netlist-Driven Evolutionary Refurbishment technique with other techniques is
presented below.
Many techniques have been explored to maximize lifetime in the presence of circuit aging with
the aim of increasing the period prior to initial failure [155],[110],[157]. For example, in [110], different circuit-level parameters such as supply voltage, operating clock frequency and cooling power
are tuned using optimization algorithms to achieve power efficiency over lifetime as compared to
traditional approach of incorporating worst-case delays. In other efforts for FPGAs, the in-field
reconfigurability and uniformity of available resources is exploited. For example, [155] employs
periodic replacement of the same configuration and rerouting of interconnects with high switching activities (based on design-time estimates) so as to age components of the FPGA in a uniform manner.
Similarly, wear-leveling strategies for FPGAs such as alternative placements to form multiple configurations at design-time are introduced in [157]. A process variation and NBTI aware placement
strategy for FPGAs is introduced in [29]. While these failure handling techniques rely on technology-specific parameters such as aging models and power consumption models to optimize circuit
parameters or placement of components in reconfigurable devices, the proposed NDER technique
is technology independent. Although there has also been extensive research on effective techniques to alleviate radiation-induced transient faults in SRAM-based FPGAs, such as scrubbing Single Event Upsets (SEUs) [124], smaller feature sizes can cause aging-induced failures to become
prominent in multi-year missions. Thus, the reconfigurability of FPGAs is exploited dynamically
for the notion of survivable systems.
Table 2.1: Comparison of NDER with GA-based fault-handling schemes for FPGAs.

| Approach / Previous Work | Size of Circuits used for Experiment | Pruning the Search Space of GA | Modular Redundancy | Assumption of Fault-free Fitness Evaluation |
|---|---|---|---|---|
| Vigander et al. [171] | 4x4 Multiplier | No | Triple | Yes |
| Garvie et al. [65] | MCNC benchmark circuit cm42a having 10 LUTs | No | Triple | Yes |
| CRR [51] | 3x3 Multiplier; MCNC benchmark circuits having < 20 LUTs [52] | No | Dual | No |
| RARS [7] | Edge-Detector Circuit | No | Triple/Dual (Dynamic) | Yes |
| Salvador et al. [136] | Image Filtering Circuit | No | Not Addressed | Not Addressed |
| NDER (developed herein) | MCNC benchmark circuits having up to 1252 LUTs | Yes | Triple/Dual (Dynamic) | Yes |

2.1.1 Genetic Algorithm-based Refurbishment Techniques

Previous works establish the successful use of Evolutionary Algorithms for adaptive self-recovery
of hardware systems based on reconfigurable logic, especially FPGAs [125], [52], [80], [101] and
[171]. Table 2.1 lists a comparison of the proposed work with other GA-based fault-handling schemes
for reconfigurable logic devices. In [125], a survey of techniques ranging from passive to dynamic
in classification is presented to tackle hard faults in SRAM-based FPGAs for small circuit case
studies. For example, modular redundancy is exploited in [171] for achieving fault recovery of
a 4-bit x 4-bit multiplier. Experiments are conducted utilizing a TMR arrangement, where three
modules perform the same computation and the final output is produced via majority voting. Faults
are assumed in all three modules indicating a worst-case scenario, and the GA is used as a tool to
partially refurbish all three modules assuming the existence of a fault-free output used for fitness
calculation. Successful recovery of the overall majority output is achieved, based on the hypothesis that different modules fail in different manners. Alternatively, a (1+1) Evolutionary Strategy is
proposed for implementation in hardware to self-recover hard faults in [65]. This work also utilizes
a TMR arrangement and evolves the faulty module to find a fault-free configuration without the
requirement of an external fitness assessment scheme as the output status of the healthy modules is
used to evaluate its fitness based on the assumption of a single module failure. To reduce power and
area overheads of TMR, the Competitive Runtime Reconfiguration (CRR) technique introduced
in [51] utilizes a fitness assessment scheme based on pairwise discrepancy detection between competing configurations without the requirement of additional test vectors. Thus, it replaces absolute
fitness calculation with comparison to consensus behavior of the current population. In [101], a
genetic representation is presented for evolutionary fault recovery in Xilinx Virtex FPGAs and a
quadrature decoder having 16 LUTs is successfully evolved in the presence of a stuck-at fault.
There has also been work to reduce the overall evolution runtime by operating directly on the bitstream used to configure the reconfigurable device, e.g., [123] demonstrates self-recovery of a 4-bit
x 4-bit adder on a Xilinx Virtex II Pro FPGA device in as little as 0.4 µseconds.
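The TMR-based recovery flow described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the implementation of [171] or [65]: each module is modeled as a small truth table (a hypothetical stand-in for a LUT configuration), a fault is modeled as corrupted table entries, and the names `majority`, `fitness` and `one_plus_one_es` are introduced here for illustration only. The voter masks a single faulty module, while a (1+1) evolutionary strategy repairs the faulty module using the agreeing outputs of the two healthy modules as its fitness reference.

```python
import random
from typing import List, Sequence


def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over the outputs of three redundant modules."""
    return (a & b) | (a & c) | (b & c)


def fitness(candidate: Sequence[int], reference: Sequence[int]) -> int:
    """Number of input patterns for which the candidate matches the reference."""
    return sum(1 for got, want in zip(candidate, reference) if got == want)


def one_plus_one_es(faulty: Sequence[int], reference: Sequence[int],
                    rng: random.Random, max_generations: int = 10_000) -> List[int]:
    """(1+1) strategy: one parent and one single-bit-flip child per generation."""
    parent = list(faulty)
    parent_fit = fitness(parent, reference)
    for _ in range(max_generations):
        if parent_fit == len(reference):
            break  # every input pattern now matches the reference
        child = list(parent)
        child[rng.randrange(len(child))] ^= 1  # mutate one table entry
        child_fit = fitness(child, reference)
        if child_fit >= parent_fit:  # keep the child only if it is no worse
            parent, parent_fit = child, child_fit
    return parent


# Hypothetical 2-input XOR modules, indexed by input pattern 00, 01, 10, 11.
healthy_a = [0, 1, 1, 0]
healthy_b = [0, 1, 1, 0]
faulty_c = [0, 1, 0, 1]  # two corrupted entries

# The voter masks the single faulty module on every input pattern.
voted = [majority(a, b, c) for a, b, c in zip(healthy_a, healthy_b, faulty_c)]

# The two healthy modules agree, so their output serves as the fitness reference.
repaired_c = one_plus_one_es(faulty_c, healthy_a, random.Random(0))
```

In this toy model the search space is only four bits wide, so recovery is almost immediate; the recovery times of evolutionary refurbishment discussed in this chapter grow with the size of the configuration being searched.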
A complete implementation of an EHW system on a Xilinx Virtex-5 FPGA chip is demonstrated
in [136]. It is composed of a 2D array of 16 Processing Elements (PEs) for computation on the reconfigurable logic, and the evolutionary algorithm runs as a tool for self-recovery on the embedded
microprocessor with the ability to internally reconfigure the PEs through the Internal Configuration
Access Port (ICAP) [3]. Fault tolerance experiments with an image filtering application indicate a
recovery time from hard failure of less than a minute on this platform. The evolutionary algorithm
has the ability to reconfigure the functionality of the PEs with a pre-defined set of functions and the
inputs to the PEs, which makes it application dependent and limited in scope. In contrast, our work
utilizes the evolutionary algorithm to reconfigure the logical functions of LUTs and the inputs to
the LUTs, thus recovery granularity is more fine-grained. Further, the proposed technique has wide
applicability as demonstrated through experiments on circuits from the MCNC benchmark suite.
In summary, the existing tools and platforms demonstrate that self-healing EHW systems are practically viable. Yet, there is a need to propose modifications to the evolutionary algorithm utilized
for self-recovery to become suitable for larger circuits and systems. This is addressed via the
NDER technique using properties of the failure syndrome to considerably prune the search space
while attaining complete quality of recovery.

2.1.2 Exhaustive Testing of Resource Configurations

The flexibility of FPGAs due to their dynamic reconfigurability has propelled many research efforts to develop fault recovery techniques. Here, techniques having deterministic behavior are
presented, as opposed to those based on evolutionary algorithms which may have indefinite recovery times due to their stochastic nature. Examples include techniques based on online Built-in Self Test
(BIST) [57] and those based on utilization of design-time generated alternate configurations with
spare resources [91]. In [57], all FPGA resources are tested for correctness while maintaining the application online. The fault detection takes place via the exchange of the testing area with the
functional area on the fly. In this manner, the testing area roves the entire chip checking for logic
and interconnect faults. Faults are avoided by utilizing configurations which have design-time allocated spare resources. It has been suggested to use this technique after detection, to avoid excessive
power consumption due to unlimited reconfigurations even when there is no failure. Also, the fault
detection latency can be high as noted in [125].
Alternate configurations with spares allocated at fine or coarse grain levels and compiled at design-time or runtime are also used to tolerate faults [91], [112]. Functional partitioning of the application
is done in [112] to relocate a diagnosed faulty function to spare resources using design-time alternative configurations. Fine-grained fault isolation is utilized in [91] to reconfigure affected blocks
with pre-compiled functionally equivalent blocks that avoid the faulty resources. The design-time
generated configurations have the advantage of meeting timing requirements of the design and
minimizing post-fault detection system downtime. However, there is a significant overhead to
cover all fault possibilities and only a limited number of failures can be tolerated. An alternative
is to recompute the mapping and placement & routing operations on the affected partial FPGA
configuration in the field via CAD algorithms [93], but this is a computationally expensive operation. On the other hand, the reconfiguration algorithm proposed herein is readily implementable
with low area overhead in custom hardware or on an embedded microprocessor [136], [61]. For
example, [61] reports logic utilization of 13% and memory utilization of 1% on a Virtex II-Pro
FPGA device and achieves around 5.16x speedup over analogous software implementation.
An algorithm is proposed in [25] which can be utilized to classify hard faults occurring in SRAM-based FPGAs. The algorithm is implemented in a controller, which successfully classifies all the
reported failure scenarios. The controller can be employed to trigger appropriate fault handling
strategy based on the nature of the failure.

2.1.3 Use of Design-Diversity to Overcome Common-Mode-Failures

Common-Mode-Failures (CMF) can arise in spatially redundant systems, for instance, due to transistor aging effects, whereby a similar set of resources is affected in all redundant designs in similar
ways, thus causing the system to fail in the same manner [155]. Design-diversity provides an effective
solution to mitigate CMF in spatial redundancy systems. Herein, the use of CMF in the context
of reconfigurable devices is investigated in addition to discussion of a metric which can be used to
quantify diversity for both reconfigurable and custom designs.
Some researchers made use of diversity in redundant systems, either implicitly, like in [171, 52],
or studied diversity explicitly as in [115, 116, 26]. In [171], evolutionary algorithm based repair
techniques are used to partially repair the faulty modules and these modules are used in a TMR
system to mask the faults from individual modules as diverse modules fail in a different manner.
Thus, evolutionary algorithms can potentially be looked upon as another design technique to
generate diverse designs that can be used in a TMR setup. This is noted in [80], where evolutionary
methods are used to create fault-tolerant designs of an analog multiplier and an XNOR function
on Field Programmable Transistor Array using different techniques such as population-based, and
fitness-based techniques.
In [145], diverse designs are generated through place & route technique to run Combinatorial
Group Testing methods for fault isolation. Place & Route is another design technique in which
the same module can be considered physically distinct by alternate positioning and routing among
the configurable logic blocks. The authors of [26] explore the concept of design diversity based
redundancy applied to mixed-signal circuit blocks.
All previous works provide an implicit example of the importance and usage of design diversity in
different fields (TMR, GA-based refurbishment, Fault Isolation). Table 2.2 summarizes the work
of each from the design diversity perspective. Our work uses the design diversity metric developed
by McCluskey et al. [115] to measure the improvement of the reliability of the TMR system when
redundant modules are designed using radically different techniques belonging to different classes
of design paradigms.
Table 2.2: Design Diversity related work.

| | Vigander [171] | Keymeulen [80] | Sharma [145] | McCluskey [115] |
|---|---|---|---|---|
| Use of Diversity | TMR | Self-Repair | Fault-Isolation | Measuring reliability of redundant systems |
| Diversity Method Used | GA-based | GA-based | Place&Route | Not Specified |

A mathematical model is presented in [116] to quantify design diversity among designs in a CED
configuration, with emphasis on the importance of data integrity of the system, which is defined
as its ability to either produce correct outputs or generate an error signal when incorrect outputs
are produced. Let $d_{i,j}$ be the diversity with respect to a fault pair $(f_i, f_j)$, where $f_i$ and $f_j$ are faults
present in the diverse modules M1 and M2, respectively. Thus, $d_{i,j}$ denotes the probability that
the designs do not produce identical error patterns in response to a given input sequence. This
prompts the notion of joint detectability, $k_{i,j}$, which is defined as the number of input patterns that
produce the same erroneous output pattern in both implementations M1 and M2. Assuming that
all input patterns are equally likely, $d_{i,j}$ can be specified as:

$$d_{i,j} = 1 - \frac{k_{i,j}}{2^{n}} \tag{2.1}$$

where n is the number of inputs. Assuming all fault pairs are equally probable and there are m
fault pairs $(f_i, f_j)$, the design diversity metric D for the design pair is:

$$D = \frac{1}{m} \sum_{i,j} d_{i,j} \tag{2.2}$$

The CMF is represented by $i = j$, whereas the random fault case can be expressed by $i \neq j$.
For example, for two designs M1 and M2 with the responses given in Table 2.3 due to the injection
of fault pair $(f_i, f_j)$, the value of $k_{i,j} = 2$, and hence $d_{i,j} = 1 - \frac{2}{4} = 0.5$. So, based on a single fault
pair, D = 0.5 for this pair of designs. This metric will be extensively utilized in Chapter 3 to assess
the uniqueness of different designs generated through the proposed techniques.
Table 2.3: Responses to an injected fault pair for two diverse designs M1 and M2.

| Inputs | Fault-free Outputs | M1 Outputs | M2 Outputs |
|---|---|---|---|
| 00 | 01 | 00 | 10 |
| 01 | 10 | 10 | 10 |
| 10 | 00 | 10 | 10 |
| 11 | 11 | 10 | 10 |
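The worked example for Table 2.3 can be reproduced with a short sketch of Eqs. 2.1 and 2.2. The function names are introduced here for illustration, and the response lists are the Table 2.3 entries read row by row (inputs 00, 01, 10, 11); the sketch assumes the responses are listed exhaustively, so that the list length equals $2^n$.

```python
from typing import Sequence


def joint_detectability(fault_free: Sequence[str], m1: Sequence[str],
                        m2: Sequence[str]) -> int:
    """k_ij: input patterns where both designs give the same erroneous output."""
    return sum(1 for good, out1, out2 in zip(fault_free, m1, m2)
               if out1 == out2 and out1 != good)


def pair_diversity(fault_free: Sequence[str], m1: Sequence[str],
                   m2: Sequence[str]) -> float:
    """d_ij = 1 - k_ij / 2^n, with 2^n taken as the number of listed patterns."""
    return 1 - joint_detectability(fault_free, m1, m2) / len(fault_free)


def design_diversity(pair_diversities: Sequence[float]) -> float:
    """D: average of d_ij over m equally probable fault pairs."""
    return sum(pair_diversities) / len(pair_diversities)


# Responses from Table 2.3 for inputs 00, 01, 10, 11.
FAULT_FREE = ["01", "10", "00", "11"]
M1_OUTPUTS = ["00", "10", "10", "10"]
M2_OUTPUTS = ["10", "10", "10", "10"]

k = joint_detectability(FAULT_FREE, M1_OUTPUTS, M2_OUTPUTS)  # 2
d = pair_diversity(FAULT_FREE, M1_OUTPUTS, M2_OUTPUTS)       # 0.5
D = design_diversity([d])                                     # 0.5 for one fault pair
```

Input 00 is not counted in $k_{i,j}$ because the two designs produce different erroneous outputs, and input 01 is not counted because both outputs are correct; only inputs 10 and 11 contribute.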

2.2 Techniques for Circuit-level Anti-Aging

Moving onwards from FPGAs to custom-designed circuits, this section presents a comprehensive survey
of multiple circuit-level approaches to mitigate hard-faults due to transistor aging phenomena.
Traditional approaches for handling performance degradation range from simple one-time static timing margin and voltage guardbanding, up through more complex dynamic adaptation
of supply voltage and/or frequency during operation. This section emphasizes the comparison with
the proposed Self-Recovery Enabled Logic technique, beginning with a discussion of transistor
aging models utilized in this dissertation.

2.2.1 Transistor Aging Models and Mitigation through Power-Gating

In this section, the details of analytical transistor aging models utilized in this dissertation are
discussed. Aspects which relate to the ability of power-gating to achieve aging mitigation are also
identified herein.
BTI Aging Models: For BTI, the contribution of interface traps and traps inside the dielectric
layer are addressed. Both a Charge Trapping/Detrapping (TD) model and a Reaction-Diffusion
(RD) model have been utilized in the literature to account for these effects [169]. For instance,
compact modeling of aging under DVS for both TD and RD mechanisms is presented in [160].
Vth degradation: Eq. 2.3 expresses the stress-time induced contribution of Interface Traps (IT) to
the Vth shift [167]:

$$\Delta V_{th,IT}(t) \sim \exp\left(-\frac{E_a}{KT}\right) \left[\frac{\varepsilon}{t_{ox}} (V_{gs} - V_{th})\right]^{A} \exp\left[B \cdot E(V_{gs}, V_{ds})\right] t^{C} \tag{2.3}$$

along with Oxide Traps (OT) inside the dielectric [167]:

$$\Delta V_{th,OT}(t) \sim \exp\left(-\frac{D + FT}{E(V_{gs}, V_{ds})}\right) \cdot t^{G} \tag{2.4}$$

where:

• A denotes the inversion charge exponent,
• B denotes the oxide electric field dependence,
• C denotes the stress time exponent for IT,
• E(Vgs , Vds ) denotes the electric field strength,
• F denotes the temperature dependent component, and
• G denotes the stress time exponent for OT.

Thus, Eqs. 2.3 and 2.4 govern aging behavior during intervals of stress. Meanwhile, the partial
recovery effect is modeled by taking into account the stress-stimulus duty cycle. When recovery is
taken into account, the net impact on the Vth shift becomes [167]:

∆Vth,AC (t) = H · ∆Vth (t) · exp(−J · K)

(2.5)

where:

• H denotes the transient degradation parameter,
• J denotes the duty cycle dependent exponent for transient degradation, and
• K models the effect of the duty cycle.

These relationships are utilized by the commercially-available MOS Reliability Analysis (MOSRA)
tool for Synopsys HSPICE which is used to assess the proposed SREL techniques. Next, the significant parameters which influence overall delay degradation of the circuit are identified and the
use of power-gating to reduce these effects is analyzed.

Logic-Gate Delay Degradation: The delay at the gate-level under BTI due to ∆Vth can be expressed as:
\[
D_i(t) = \frac{C_i\,V_{DD}}{\beta_i\left[V_{DD} - \left(V_{th} + \Delta V_{th}(t)\right)\right]^{\alpha}} \tag{2.6}
\]

where βi is a constant which is dependent on the area of the gate, Ci is the effective load capacitance of the gate, and α is the velocity saturation index. Clearly, an increase in a transistor’s
∆Vth from the initial value Vth results in an increase in the delay at the logic-gate level. Thus, the
effectiveness of power-gating to reduce ∆Vth due to aging is evaluated next.
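The sensitivity of gate delay to the threshold shift in Eq. 2.6 can be illustrated numerically. The following is a minimal sketch only: the function name and every parameter value (normalized Ci and βi, VDD = 1.0 V, Vth = 0.3 V, α = 1.3) are hypothetical and are not taken from the MOSRA models used in this dissertation.

```python
def gate_delay(c_load, v_dd, beta, v_th, delta_v_th, alpha):
    """Gate delay per Eq. 2.6: C_i*V_DD / (beta_i*(V_DD - (V_th + dV_th))**alpha)."""
    overdrive = v_dd - (v_th + delta_v_th)
    if overdrive <= 0:
        raise ValueError("V_DD must exceed V_th + delta_V_th for the gate to switch")
    return c_load * v_dd / (beta * overdrive ** alpha)


# Hypothetical, normalized values chosen only to show the trend.
C_LOAD, BETA, V_DD, V_TH, ALPHA = 1.0, 1.0, 1.0, 0.3, 1.3

fresh = gate_delay(C_LOAD, V_DD, BETA, V_TH, 0.0, ALPHA)
for shift_mv in (0, 10, 30, 50):
    aged = gate_delay(C_LOAD, V_DD, BETA, V_TH, shift_mv / 1000.0, ALPHA)
    increase = 100.0 * (aged / fresh - 1.0)
    print(f"dVth = {shift_mv:2d} mV -> delay increase = {increase:.2f}%")
```

Because the overdrive term appears in the denominator raised to α, the same ∆Vth causes a larger relative delay increase when VDD is closer to Vth.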
Effectiveness of Power-Gating in Aging-Mitigation: Overall, ∆Vth in Eqs. 2.3 and 2.4 is strongly
dependent on stress time t, temperature T , and supply voltage VDD (through the term (Vgs − Vth )), as
identified below.
Stress time reduction: The stress time can be expressed as p × t where p is the probability that
the transistor is under stress. Thus, the degradation of Vth can be seen to be dependent on stress
time from Eqs. 2.3 and 2.4 according to the power law as follows:

\[
\Delta V_{th}(t) \propto (p\,t)^{n} \tag{2.7}
\]

where n is the stress time exponent (n = C for IT in Eq. 2.3 and n = G for OT in Eq. 2.4).
Under ideal conditions, power-gating can reduce degradation of Vth by lowering p, e.g., if the
circuit is power-gated for half of the circuit’s lifetime, then the corresponding reduction is evident
through the results shown in Figure 2.1, where n = 0.25 is chosen and lifetime is set to 3 years.
Thus, SREL employs power-gating to achieve stabilization of degradation/recovery cycles during
device operation based on controlling the switching interval as described later.
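The power law of Eq. 2.7 can be evaluated directly to see how much an idealized reduction in stress probability helps. This is a minimal sketch under the ideal assumption stated above (no additional recovery term); the function name is hypothetical. For the same lifetime t, the ratio of the power-gated shift to the always-stressed shift reduces to p^n, so with n = 0.25 and p = 0.5 the shift is 0.5^0.25 ≈ 0.84 of the non-gated value.

```python
def relative_vth_shift(stress_probability, n=0.25):
    """Ratio of dVth at stress probability p to dVth at p = 1, for equal lifetime.

    From Eq. 2.7, dVth is proportional to (p*t)**n, so the ratio is p**n.
    """
    if not 0.0 <= stress_probability <= 1.0:
        raise ValueError("stress probability must lie in [0, 1]")
    return stress_probability ** n


for p in (1.0, 0.75, 0.5, 0.25):
    ratio = relative_vth_shift(p)
    print(f"p = {p:.2f}: dVth is {ratio:.3f} of the non-power-gated value")
```

The small exponent means the benefit grows slowly: halving the stress time removes roughly 16% of the shift rather than 50%.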

[Figure 2.1 plot: ∆Vth (a.u., 0 to 100) versus duty cycle (0 to 1) for a non-power-gated circuit and a power-gated circuit, with tlifetime = 3 years.]

Figure 2.1: Power-gating helps to reduce the degradation in Vth due to BTI.

Effect of Standby Temperature: Temperature has an exponential relationship with ∆Vth . In this
regard, the ratio of active to standby time and the standby-mode temperature greatly impact
the degradation due to NBTI [177]. Specifically, the authors show that ∆Vth is highest if the same temperature is
assumed for both standby and active modes, whereas assuming a lower
temperature for standby mode lowers ∆Vth . Experiments on a NAND2 gate using MOSRA
models exhibit similar benefits in the recovery mode, as shown by ∆Vth after one year of operation
in Table 2.4. These results indicate that more recovery can be obtained with power-gating if this
temperature variation is modeled in the simulations. In this dissertation, however, the foremost
concentration is on the development of adaptation-enabled schemes based on area-energy tradeoffs.
Next, multiple techniques related to SREL for aging-mitigation are presented.

Table 2.4: Effect of temperature variation on Vth degradation due to standby mode.

| Active Temperature | Standby Temperature | ∆Vth (Normalized w.r.t. Uniform Temperature) |
|---|---|---|
| 105 °C | 105 °C | 1 |
| 105 °C | 100 °C | 0.95 |
| 105 °C | 95 °C | 0.91 |
| 105 °C | 85 °C | 0.84 |

2.2.2

Worst-case Design Techniques for Aging-Compensation

Various worst-case approaches to compensate for aging-degradations at design-time are discussed
below. A main drawback of these approaches, as highlighted later, is their non-adaptable nature.
Voltage-Margin (VM): Guardbanding is a worst-case design technique to ensure reliable operation throughout the circuit lifetime. For instance, timing guardbanding (Frequency-Margin (FM))
selects the clock period to be longer than the propagation delay of the aged circuit. Such over-design can
also take the form of elevated supply voltage operation (Voltage-Margin (VM)), as high as 14.5%
over the nominal voltage of an unaged device [186]. This translates to about a 30% increase in lifetime energy consumption to compensate for NBTI effects alone. The significance of addressing
these overheads is increasing with the scaling of technology nodes. SREL is shown to reduce such
overheads by a significant amount.
Gate-Sizing: Other worst-case design techniques which utilize additional area to compensate for
aging effects include gate-sizing techniques [183][39]. As indicated by Eq. 2.6, the delay at the gate-level can be decreased by increasing the gate size by ∆βi from its minimum allowable size βi . Thus,
gate-sizing techniques involve finding optimal sizes for all gates within an allowable discrete or
continuous range in the circuit synthesis stage such that all the logical paths meet the desired
timing specifications throughout the lifetime. The complexity of discrete gate-sizing is known to
be NP-complete, thus mostly heuristics are utilized [161]. Furthermore, this aging-aware synthesis
is performed based on assumptions such as knowledge of the stress probabilities at all nodes in the
circuit and provision of a standard cell library having logic gates with multiple widths for each
stress level. For example, an area overhead of 19.8% is reported for gate-sizing selected portions
in a physical register file of a SPARC processor [88]. The area overheads associated with sizing
at the gate level can be reduced by adopting more fine-grained sizing and considering the impact of
adjacent gates [82]; however, this increases the complexity of the sizing problem.
In addition to the area overhead associated with gate-sizing techniques, the increased gate-widths
contribute to increasing the effective gate capacitance, thereby increasing the dynamic power consumption of the circuit. Furthermore, the gate leakage and subthreshold leakage currents are also
linearly dependent on the width of the gates. Hence, gate-sizing schemes directly contribute to
both dynamic and leakage power. Thus, the gate-sizing problem becomes more complicated as
these factors also need to be considered during the optimization process. The over-sized gates also
undergo continuous stress in the form of elevated temperature and high signal activities depending
on the input conditions, while no opportunity for forced recovery is pursued. On the other hand,
the proposed SREL technique provides an opportunity for aging-critical portions to recover with
minimal impact on the leakage power of the circuit. In addition, the proposed critical path extraction
technique is straightforward to implement using the conventional EDA design flow, as neither a change
to the standard cell library nor the solution of a multi-objective optimization problem is required.
Aging-aware Re-synthesis: An additional design-time technique of aging-aware logic synthesis
is proposed in [56] and [121] where multiple timing constraints are applied on different logic paths
based on the available timing slacks and aging rates. For example, [121] proposes to synthesize
microprocessor pipeline stages which are balanced in terms of MTTF instead of the traditional approach of delay-balanced pipelines. Here, tighter timing constraints are obtained through a combination of gate-sizing, logic path re-organization or time borrowing or by using low-Vth gates. The
iterative optimization process through the use of commercial synthesis tools is attractive, although
the authors report in [56] that it may not converge and the area overhead may be excessive in some
cases. Furthermore, such High-Level Synthesis (HLS) techniques [42] assume the availability of
a standard cell library characterized with aging delays which can guide the synthesis process to
utilize optimal-sized/-Vth logic gates such that desired lifetime is achieved. However, the need for
such assumptions is alleviated using SREL techniques.

2.2.3

Dynamic Operating Conditions for Aging-Mitigation

A significant drawback of design-time techniques is that they require accurate estimation of aging
using anticipatory models such as RD or TD as described earlier, whose accuracy is still under investigation by the research community [41]. The models have only been verified at the individual
logic-gate level or on ring oscillator circuits. The authors are not aware of any study which evaluates the accuracy of these aging models on benchmark or real-use-case circuits. Most design-time
techniques rely on these models to lower their associated overheads, such as the area overheads
of gate-sizing techniques. A summary of anti-aging circuit-level techniques is presented in Table 2.5, where MD-RoD stands for Model-Dependent Rate of Vth Degradation and VR stands for
Voltage-Regulators. The related work is categorized into three main categories: worst-case design
(Section 2.2.2), dynamic operating conditions (Section 2.2.3), and adaptive resource management
(Section 2.2.4). By using MD-RoD, over-estimation is done in most schemes to accommodate the
circuit lifetime. Herein, the devised SREL techniques provide explicit control over
the delay degradation via the novel control knob of sleep interval and hence eliminate the need for
accurate aging estimation.

Table 2.5: Comparison of proposed SREL schemes with related works.

| Technique | Anti-Aging Strategy | Design Requirements/ Parameters | Adaptability Characteristics/ Degree | Overheads: Throughput | Overheads: Power | Overheads: Area |
|---|---|---|---|---|---|---|
| Worst-case Design | | | | | | |
| VM, FM | Static Margin | MD-RoD/ ∆VDD, ∆Fnominal | None | FM: High | VM: High (Dynamic & Leakage) | None |
| Gate-Sizing | Static Margin | MD-RoD; Extended Std. Lib.; Multi-obj. Opt./ ∆βi, ∀ gates i | None | None | Medium (Dynamic & Leakage) | Low (Gate-level) |
| Re-Synthesis | Static Margin | MD-RoD annotated Std. Lib.; Aging-aware Synthesis/ ∆βi, ∆Vth,i ∀ gates i | None | None | Low-Medium (Dynamic & Leakage) | Low (Gate-level) |
| Dynamic Operating Conditions | | | | | | |
| DVFS | Dynamic Margin | Timing Sensors; Feedback Control/ ∆VDD(t), ∆F(t), ∆Vbb(t) | Yes/ Fully Autonomous | Low | Medium (Dynamic & Leakage) | Medium (On-chip VR & sensors) |
| SVS | Dynamic Margin | MD-RoD/ ∆VDD(t + ∆tstep) | Yes/ tstep | None | Medium (Dynamic & Leakage) | Medium (On-chip VR) |
| GNOMO | Static Margin + Power-Gating | MD-RoD/ (VDD,g, tidle) | None | Medium (Workload Dependent) | Medium (Dynamic & Leakage) | None |
| Adaptive Resource Management | | | | | | |
| SD | Proactive Mngt. + Power-Gating | Modular Redundancy/ Sleep Interval | Yes/ Sleep Interval | None | High (Leakage) | High (Module-level) |
| ITL schemes | Proactive Mngt. + Power-Gating | Exploit App. Redundancy/ Idle time | Yes/ Task Scheduling | Medium (Workload Dependent) | None | None |
| ACP | Proactive Fine-Grain Mngt. + Power-Gating | CPRT/ Sleep Interval | Yes/ Sleep Interval | None | Minimal (Leakage) | Low (Gate-level) |
| CCP | Reactive Fine-Grain Mngt. + Power-Gating | Timing Sensors; Feedback Control; CPRT/ ADS% | Yes/ Fully Autonomous | None | Minimal (Leakage) | Low (Gate-level & sensors) |

Dynamic Voltage and/or Frequency Scaling: To overcome the constant overheads associated
with voltage guardbanding, the voltage can also be increased gradually from its nominal value at
run-time to compensate for delay degradation due to aging (see Eq. 2.6). In this case, supply voltage can be dynamically adapted, VDD (t), to prevent any aging-related failure based on feedback
mechanisms whereby aging is monitored using canary circuits or tunable replicas [175]. These
canary circuits are assumed to emulate the wearout of the original circuit. Alternatively, timing
sensors can also be placed into the circuit for this purpose [23]. Based on the feedback provided,
multiple control policies are devised in [110],[81] to jointly tune parameters such as voltage, operating frequency and dynamic cooling to maximize lifetime energy-efficiency.
Alternatively, if a feedback mechanism is not available, then Scheduled Voltage
Scaling (SVS) can be performed whereby the voltage increments and time steps ∆tstep are determined at
design-time [186]. However, DVS is shown to achieve a lifetime (10 years) energy benefit of only
7% with respect to simple guardbanding [38]. Additionally, such schemes assume the availability
of a mechanism which can achieve voltage steps on the order of 5-10mV, which requires on-chip voltage regulators with high area overhead and poor power efficiency. In SREL approaches, similar effects
are sought through the use of spatial multiplexing at the circuit-level without the complexities of
dynamic operating conditions at the power-network level.
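To make the idea of a design-time voltage schedule concrete, the sketch below inverts Eq. 2.6 numerically: for an anticipated ∆Vth, it searches for the smallest supply voltage that restores the fresh-circuit delay. This is an illustration only and not the procedure of [186]; the function names, the bisection search, the normalized parameters, and the aging schedule are all hypothetical.

```python
def gate_delay(v_dd, v_th, delta_v_th, alpha=1.3, c_load=1.0, beta=1.0):
    """Gate delay per Eq. 2.6 with normalized (hypothetical) C_i and beta_i."""
    overdrive = v_dd - (v_th + delta_v_th)
    if overdrive <= 0:
        raise ValueError("V_DD must exceed V_th + delta_V_th")
    return c_load * v_dd / (beta * overdrive ** alpha)


def compensating_vdd(v_nominal, v_th, delta_v_th, v_max, alpha=1.3, tol=1e-6):
    """Smallest V_DD in [v_nominal, v_max] that restores the unaged delay.

    Relies on the delay of Eq. 2.6 decreasing monotonically with V_DD,
    which holds for alpha >= 1. Returns None if v_max is insufficient.
    """
    target = gate_delay(v_nominal, v_th, 0.0, alpha)
    if gate_delay(v_max, v_th, delta_v_th, alpha) > target:
        return None
    lo, hi = v_nominal, v_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gate_delay(mid, v_th, delta_v_th, alpha) > target:
            lo = mid  # still too slow, raise the voltage
        else:
            hi = mid  # fast enough, try a lower voltage
    return hi


# Hypothetical anticipated threshold shifts (in volts) at each scheduled step.
anticipated_shift = {1: 0.010, 3: 0.020, 5: 0.030, 10: 0.050}
for year, shift in anticipated_shift.items():
    v = compensating_vdd(v_nominal=1.0, v_th=0.3, delta_v_th=shift, v_max=1.2)
    print(f"year {year:2d}: dVth = {shift * 1000:.0f} mV -> V_DD = {v:.4f} V")
```

In this toy setting the required increments between steps are on the order of millivolts to tens of millivolts, which is consistent with the fine voltage resolution that such schemes are said to require.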
Yet another technique to dynamically mitigate Vth degradation is Adaptive Body Biasing [128],
where the body-biasing voltage Vbb (t) is adapted based on feedback at the transistor level. However, it can be less effective with increased levels of degradation and needs to be combined with
other techniques such as SVS and aging-aware synthesis [89].
Computational Sprinting: An effective aging-mitigation technique, Greater-than-NOMinal Operation (GNOMO), is proposed in [69]; it eliminates the need for complex feedback-based
control policies and/or on-chip voltage regulators. As proposed herein, aging recovery is similarly promoted through the use of power-gating. However, to lower the throughput degradation
due to power-gating, the circuit is operated at an elevated supply voltage, VDDg > VDD , to increase
throughput during bursts, followed by idle periods tidle generated to ensure recovery via power-gating. These circadian rhythms were shown to reduce delay degradation by about 1.3-fold to
1.8-fold. Herein, similar or greater improvements in delay degradation are desired, while reducing energy consumption using spatial multiplexing. Doing so will also avoid the complications of
peak-temperature (TDP) constraints during bursts and the corresponding design-time prediction of
the ideal greater-than-nominal voltage.
Static or dynamic guardbanding in the form of voltage/frequency margining is an effective technique to combat aging and does not require any changes at the circuit-level [38].
However, the energy overheads due to static margining are generally high, and the implementation complexities of dynamic margining schemes are also high [41]. Thus, the focus of the SREL
technique is on the development of a low-complexity and low-overhead anti-aging strategy based on
adaptive resource management at the circuit-level.

2.2.4

Adaptive Resource Management for Aging-Resilience

In this sub-section, a class of anti-aging schemes is highlighted whereby the rate of degradation
is balanced over all the resources in the circuit either through management of idle time and task
scheduling or availability of expendable resources. Aging-mitigation in these cases includes power-gating to enforce recovery, applying specific input vectors which promote recovery, or using
differential voltage scaling.
Idle-Time Leveraging (ITL) Schemes: As mentioned earlier, power-gating has been effectively
shown to mitigate transistor aging effects. Schemes such as clustered power-gating [34] have proposed effective ways to reduce the overheads of power-gating, whereby the critical and non-critical
portions of the circuit are power-gated using different-sized sleep transistors. However, such works
assume the availability or prediction of circuit downtime. For instance, tradeoff analysis between
lifetime extension and leakage reduction for a 4x4 Network-on-Chip switch [34] shows that with
the sleep probability ranging between 0.4 and 0.9, it is possible to increase the lifetime by between
2.5X and 6.67X.
In other applications like modern microprocessor cores, the inherent redundancy in certain structures like the pipelined Execution Stage which has multiple arithmetic units can be exploited [120].
Aging mitigation in this scenario can be achieved by carefully scheduling tasks onto the available resources while simultaneously recovering aging effects through power-gating. However, this
comes at a cost of throughput degradation as the power-gated units are unavailable to realize maximum Instruction-Level Parallelism (ILP). Thus, the amount and distribution of idle time are highly
dependent on application characteristics and may require changes at the instruction-sequencing
level to realize full benefits, as reported by the authors. Herein, the developed SREL scheme is more
adaptable at the circuit-level to achieve similar effects, yet with reduced or nonexistent degradation
of throughput.
Similar resource scheduling at a much coarser-grained level has been demonstrated for multicore
processors [159] or General-Purpose Graphics Processing Units (GPGPUs) [40], whereby tasks
are assigned based on stress information at core-level. At runtime, workload assignment is done
such that cores are relaxed periodically. The NBTI-aware task mapping for multicore processors
is shown to improve MTTF by 30% under considered workloads [159]. Here again, availability of
both idle time and idle cores is assumed, which is dependent on the application characteristics or
workload.
Some other techniques which assume the availability of idle (standby) time involve applying
design-time generated input vectors which promote recovery [63]. In summary, the aforementioned
schemes typically require control strategies at the hardware/software level to detect idle
times during operation and perform the required aging-mitigation.
Controlled Resource Wearout to Improve Performance: The BubbleWrap scheme [77] for
many-core systems proposes control strategies with the goal of optimizing either power or performance while managing the amount of aging at the core-level. It utilizes a combination of DVS
and task scheduling schemes to overcome the throughput degradation. A set of throughput cores,
which are utilized for parallel sections of the application, and a set of expendable cores, which are utilized for
sequential sections of the application, are assumed. These two subsets of resources are created based on
the rate of degradation of individual cores. The expendable cores, which have shorter lives, are expended earlier using elevated voltages to obtain higher application performance. Alternatively, the
set of throughput cores can be expended for the same power budget, where aging for expendable
cores is managed by using voltages below VDD . The BubbleWrap scheme has similar concerns as
found in other DVS schemes and assumes availability of extra resources which can be worn-out.
The SREL approaches look to employ all of the resources until the end-of-device-lifetime while
incurring a minimal impact on the overall power profile of the device.

2.3

Soft Error Masking in Logic Paths

This section establishes the need for mitigating soft errors in logic paths, and evaluates them for
the NTV operating region. Then, a generic discussion follows to quantify the level of protection
established through various error masking schemes.
Soft Errors in Logic Paths: The contribution of soft errors in logic paths as opposed to memory
elements is becoming significant as the supply voltage is scaled down to the threshold region. It
was predicted in [149] that the SER of logic circuits per die will become comparable to the SER
for unprotected memory elements, which was later verified through experimental data for a recent
microprocessor [54]. Operation at NTV is expected to exacerbate these trends. For instance, [54]
states that SER increases by approximately 30% for each 0.1V reduction as VDD is decreased from
1.25V to 0.5V. In the NTV region, it is shown through both simulation and experiment at the
40nm and 28nm nodes that SER doubles when VDD is decreased from 0.7V to 0.5V. Primarily, the
critical charge needed to cause a failure decreases as VDD is scaled, and SER has an exponential
dependence on critical charge [54]. Such trends are consistent with decreasing feature sizes due to
technology scaling [151].
In logic paths, there are three inherent masking mechanisms that prevent the propagation of a spurious transient pulse along a path towards the input of a flip-flop/latch, where it may be registered
to cause an error [149]:

1. Logical Masking: occurs when a transient pulse does not affect the computation in other
gates along the path towards the output for a given input vector,
2. Electrical Masking: due to the attenuation of the glitch as it passes through subsequent logic
gates, and
3. Latching-window Masking: occurs when the generated glitch does not occur within the setup
and hold time window of the flip-flop.

It is evident that the logical masking effect is not impacted by operation at lower voltages. However,
as operating frequencies at NTV are expected to be low, it has been suggested that pipeline stages
consist of fewer gates to regain lost throughput. This will consequently lower the benefit of both
logical and electrical masking. In addition, the electrical attenuation is lowered at low supply voltages as large pulse-width transients are created. However, there is a positive effect on
latching-window masking as operating frequencies are lowered. Latching-window masking
is also dependent on the flip-flop design utilized [54], where some designs show more SER immunity than others. Overall, reduced pipeline depths, technology scaling, and voltage
reduction can be anticipated to have a detrimental impact on logic SER. Thus, there is a need to
develop effective soft error mitigation techniques for reliable NTV operation.
Soft Error Masking: The SER in logic paths can be reduced by schemes such as gate-sizing [188]
or dual-domain supply voltage assignments [180] to harden components which are more susceptible to soft-errors. These techniques trade off increased area and/or power to reduce the SER of the
logic circuit, but may not be able to provide comprehensive coverage. For instance, an SER reduction
of only 33.45% is demonstrated in [180] using multiple voltage assignments. One effective option
for masking soft errors is spatial redundancy, in particular the readily-accepted use of TMR. It is considered to be appropriate for applications which
demand immunity to soft errors and are also able to accommodate its inherent overheads.
Spatial redundancy is often employed in mission-critical applications, such as autonomous vehicles, satellites, and deep space systems [37],[126],[7], to ensure system operation
even in unforeseen circumstances. It has also been employed in commercial systems such as HPC applications [58]
where a significant increase in compute node availability is sought. Here, use of compute-node level
redundancy at the processor, memory module, and network interface can improve reliability by a
factor of 100-fold to 100,000-fold.
Identical yet invalid outputs in a spatially redundant system with a multiple-bit word output
require the transient(s) to impact distinct instances at the corresponding functional locations to
manifest identically incorrect outputs. In the case of an isolated Single Event Transient (SET) in
a single instance during a computation interval, the resultant soft error is masked. However, if
more than one bit is upset then a Multi-Bit Upset (MBU) results. Spatial MBUs occur when a
single particle upsets multiple bits which reside within the same physical neighborhood. Temporal
MBUs occur when two or more particle strikes independently upset distinct instances in a spatial
redundancy system. MBUs may still generate a diagnosable error from a word-wise voted output.
In such scenarios, word-wise voting can be advantageous compared to bit-by-bit voting [114].
Finally, even though the likelihood of spatial MBUs has increased due to technology scaling, non-planar
devices offer a means to reduce SER. For example, 22nm Tri-Gate technology is shown to reduce
neutron and alpha-particle induced SER at nominal voltage on the order of 1.5-fold to 4-fold and
in excess of 10-fold respectively, compared to a 32nm planar process [139].
Under nominal operating conditions, the energy consumption of a spatially redundant system with
N modules (N-Modular Redundancy or NMR) is about N-fold that of a simplex (N = 1)
system, which lacks soft error masking capability. This work explores the tradeoffs of operating
NMR systems at NTV beyond processor caches where a low-complexity means for improved
resilience compared to error correcting schemes has been sought [143],[46]. For cache memories,
Orthogonal Latin Square Codes (OLSCs) have been employed to encode orthogonal groups of
checking bits without syndrome generation, yet enable recovery with majority voting, and further
extensions to Variable-Strength Error Correcting Codes (VS-ECCs) have been employed which
combine the use of ECC and memory tests to ensure reliable cache operation under aggressive
voltage scaling [179]. In this dissertation, the focus is on spatial redundancy for soft-error masking
in logic paths as opposed to memory elements.

2.4

Summary

Fault-handling techniques for FPGAs involving the use of genetic algorithms are shown to generate
configurations that are able to by-pass faulty resources at runtime. These techniques are oblivious
to the underlying failure mechanisms and do not require exhaustive resource testing. The proposed
NDER technique is based on these principles with major differences as highlighted in Table 2.1.
Use of design-diversity is also seen as an effective technique to mitigate CMFs. Multiple techniques can be utilized to generate physically unique but functionally identical implementations
as discussed in Table 2.2. The diversity metric devised in [116], as used later in this dissertation, is shown to successfully quantify uniqueness among diverse designs. Then, factors affecting
transistor-aging are discussed through predictive models. Multiple related works are presented to
handle these aging-degradations in ASICs as summarized in Table 2.5. Post-fabrication adaptability can be enabled in ASICs through adaptive resource management schemes which in turn can
reduce the worst-case overheads of design-time approaches. Finally, reduced voltage operation
and technology scaling have caused an increased need to mitigate soft-errors at the circuit-level and
study their effects on large-scale systems.

CHAPTER 3: DESIGN DIVERSITY APPROACH TO FAILURE
MITIGATION

This chapter investigates the ability to provide improved reliability of TMR systems at comparable
area and time cost using design diversity. Namely, multiple implementations of the same functional
design using a repository of methods (Templates, Case-Based, Inverted-Output, and NAND/NOR-Based) are evaluated. The design methods are tested on multiple benchmark circuits in different
TMR setups, for each of which design diversity and fault tolerance are examined. The results
show that extensive design diversity can be achieved at design-time using one or a combination
of these methods, and verify the increased fault-tolerance of TMR-based systems with diverse
designs in multiple failure modes at run-time. Moreover, results indicate that improved system
fault-tolerance can be achieved using designs from different classes of design techniques, rather
than using variations of the same design method, without incurring a run-time expense.

3.1

Design-For-Diversity for Improved Fault-Tolerance of TMR Systems on FPGAs

FPGAs are frequently employed in harsh environments such as within space or nuclear applications, where these devices may be affected by radiation effects and/or the aging process [155].
Many methods have been proposed to build fault-tolerant systems that can sustain single/multiple
faults, and redundancy is one of the simplest and most widely used methods [67]. Redundancy can
take many forms, like cold/hot spares, or N-Modular-Redundancy (NMR). In the latter case,
N functionally-identical modules are operated on the same set of inputs simultaneously
and a majority vote is used between the multiple outputs to produce the final output. For example,
N = 2 represents a concurrent redundant system that can detect faults instantly, but fails to isolate
the faulty module. In contrast, in a TMR system as shown in Figure 3.1, three functionally-identical
modules can detect a fault in a single module and mask it instantly through bitwise or word-wise
voting.
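The two voting styles can be sketched in software as follows. This is an illustrative model only, not the voter hardware used in this work; the function names and the 4-bit example words are hypothetical.

```python
from collections import Counter


def bitwise_vote(a, b, c):
    """Per-bit majority of three module outputs: (a AND b) OR (a AND c) OR (b AND c)."""
    return (a & b) | (a & c) | (b & c)


def wordwise_vote(a, b, c):
    """Return the word produced by at least two modules, or None if all three differ."""
    word, count = Counter((a, b, c)).most_common(1)[0]
    return word if count >= 2 else None


golden = 0b1010

# One faulty module: both voters mask the fault.
outputs = (golden, 0b1110, golden)
print(bin(bitwise_vote(*outputs)), wordwise_vote(*outputs) == golden)

# Two modules upset at different bit positions: the bitwise voter still
# produces a word, while the word-wise voter reports that no two modules agree.
outputs = (0b1011, 0b1000, golden)
print(bin(bitwise_vote(*outputs)), wordwise_vote(*outputs))
```

The second case shows the difference in behavior: a bitwise voter always emits some word, whereas a word-wise voter can signal that no majority exists, which makes the disagreement observable.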

Figure 3.1: TMR system.

The modules used in TMR systems must be functionally-identical, i.e., they should have matching
input/output responses. However, this does not require identical physical implementations. This raises
the concept of design diversity in redundant systems, in which the same functionality can be implemented using physically different, yet functionally identical designs. Granted, the meaning of
“physically different” differs when referring to FPGAs than when referring to Application Specific
Integrated Circuits (ASICs). In FPGAs, two modules are said to be physically different if most
LUTs in the same relative location on both modules do not implement the same logical function.
TMR systems based on a single design have less immunity towards Common-Mode Failures (CMFs), which
affect more than one module at the same time in the same manner, generally due to a common
cause [16]. This may be due to a design oversight, power disturbance, or, especially in sub-90nm
technology, the aging phenomenon, where components might be affected uniformly in a region, causing multiple failures in the same manner [155]. For example, HCI, TDDB and EM can
cause permanent faults, which can manifest as CMFs if identical datapaths are used
in the TMR arrangement. CMFs are quite common in redundant systems using the same designs,
as shown in [115].

Design diversity provides a solution to CMF. Techniques need to be investigated to generate diverse designs, which can be used in redundant systems as well as in other applications. For example, using a diverse population of individuals could achieve better performance in evolutionary-algorithm-based repair, as the population will offer many solutions, rather than creating new ones
gradually through genetic operators [80]. Moreover, TMR systems implemented on FPGAs can
benefit from diversity by loading different designs generated online without the requirement of
evolutionary algorithms, and hence might offer autonomous repair by “jiggling” the modules with
diverse but functionally-equivalent design configurations. The diversity techniques introduced and
the concept of Diversity-TMR for improved reliability are also applicable to ASICs, but the ability
for autonomous reconfiguration of ASICs will likely be limited and will certainly be application
dependent.
This chapter studies the synthesis of distinct designs using the Template-based (TB) method so that
they can be used in redundancy-based fault-tolerant applications. Further, the Case-based (CB),
Inverted-output (IO) and NAND/NOR-based design techniques are studied. The chapter explicitly
highlights how diversity can benefit a TMR system under different stuck-at faults in different failure
modes (CMF and random faults). It also studies the possibility of generating a better TMR system
using diverse designs generated from different classes of design techniques.

3.2

Design-For-Diversity Techniques

3.2.1

Template-Based Method

To understand the Template-Based method as proposed in [52], consider the system in Figure
3.2. The system function is implemented through blocks (templates) whose inputs/outputs are
connected accordingly to perform the required functionality. Therefore, Figure 3.2 represents a
general block diagram of the operation itself, implemented internally through blocks A, B, and C.
The number of different designs obtainable for the whole system by replacing each block with a
possible design option (DO) is a function of the number of DOs available for each block. Therefore, for any system represented by a block diagram of functional blocks (templates), the number
of possible design options to achieve the same function is calculated by the following equation:

$$ DO_{sys} = \prod_{i=A}^{C} DO_i \tag{3.1} $$

where DO_i represents the number of design options available for block i. For example, in the above
system, if DO_A = 2, DO_B = 3 and DO_C = 1, then there are 6 different design options for the full
system: DO_sys = DO_A × DO_B × DO_C = 6.
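
The count in Equation 3.1 can be checked by enumerating the combinations directly. The following is a minimal sketch, assuming each block's design options are simply named variants; the variant names (`A_v1`, `B_v2`, and so on) and both helper functions are hypothetical and do not come from the original work.

```python
from itertools import product
from math import prod

# Hypothetical design options per block; the variant names are illustrative only.
design_options = {
    "A": ["A_v1", "A_v2"],
    "B": ["B_v1", "B_v2", "B_v3"],
    "C": ["C_v1"],
}

def count_system_designs(options):
    """Equation 3.1: product of the per-block design-option counts."""
    return prod(len(variants) for variants in options.values())

def enumerate_system_designs(options):
    """List every system design, choosing one variant per block."""
    blocks = sorted(options)
    return [dict(zip(blocks, combo))
            for combo in product(*(options[b] for b in blocks))]

system_designs = enumerate_system_designs(design_options)
print(count_system_designs(design_options))  # 2 * 3 * 1 design options
for design in system_designs:
    print(design)
```

Each dictionary in `system_designs` corresponds to one selection of templates, which in the described flow would map to one combination of stored partial bit-files.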

Figure 3.2: System block diagram for Template-Based approach.

In short, the template-based technique can be applied to any system that is described by internal
blocks each implementing a sub-functionality by replacing any block with a different design template. Therefore, by generating multiple designs for each block during design-time and storing the
associated partial bit-files, the system can automatically use combinations of them online to create
multiple diverse-designs at runtime.

3.2.2 NAND/NOR-Based Method

Combinational circuits can be implemented in multiple ways. NAND and NOR functions, which are
known as the universal gates, can each implement any specified digital circuit. It is therefore
possible to convert any given digital circuit into a NAND-only or NOR-only implementation. This
makes way for design diversity: given any digital circuit, there is always an alternative
implementation in terms of NAND gates only (provided the original implementation is not already
in terms of NAND gates). This work explores the applicability of this classical design technique
for implementation in FPGA devices through the manipulation of the User Constraints File (UCF)
with the goal of achieving design diversity. A circuit implemented through this technique tends to
have a higher component count, which might achieve better reliability at the expense of increased
resource usage. Results of this approach for an FPGA implementation are presented in the results
section.
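
The universality argument can be illustrated at the gate level. The sketch below is an assumption-laden illustration, not the FPGA flow used in this work: it builds NOT, AND, OR and XOR from a two-input NAND and checks exhaustively that a NAND-only half adder matches a reference half adder. All function names are hypothetical.

```python
from itertools import product

def nand(a, b):
    """Two-input NAND on single bits (0 or 1)."""
    return 1 - (a & b)

# Every other gate below is composed of NAND gates only.
def not_(a):
    return nand(a, a)

def and_(a, b):
    return not_(nand(a, b))

def or_(a, b):
    return nand(not_(a), not_(b))

def xor_(a, b):
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

def half_adder_nand(a, b):
    """NAND-only half adder: returns (sum, carry)."""
    return xor_(a, b), and_(a, b)

def half_adder_ref(a, b):
    """Reference half adder using native operators."""
    return a ^ b, a & b

# Exhaustive equivalence check over all input patterns.
equivalent = all(
    half_adder_nand(a, b) == half_adder_ref(a, b)
    for a, b in product((0, 1), repeat=2)
)
print(equivalent)
```

The NAND-only version uses more gates than the reference description, which mirrors the higher component count noted above.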

3.2.3 Case-Based and Inverted-Output Methods

Case-Based Synthesis is an informal and very simple design technique that involves describing
the function to be implemented in the form of a truth table. This truth table is then translated
into an HDL case statement, which is fed to the synthesizer for logic extraction. This process is
straightforward for small combinational circuits. However, the length of the case statement grows
exponentially with the number of input bits. This can be overcome by automating the case statement
generation process given a functional description of the circuit, or by dividing the complex
system into smaller, more manageable sub-circuits.
Once the case statement described above is generated, an Inverted-Output description of the system
is easily produced. This is done by inverting the outputs associated with each input case.

The synthesis of logic functions in their true and complemented forms during duplication was
first proposed in [113], and, depending on the synthesizer used and the optimization parameters
set, synthesis of the Inverted-Output description of a particular function will result in a different
implementation from that obtained from the synthesis of a true case-based description.
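
Automating the case statement generation could look like the following sketch. It is not the tool used in this work: the generator function, the signal names `in_bits` and `out_bits`, and the emitted Verilog fragment layout are assumptions made for illustration. The Inverted-Output variant is obtained by complementing every output word.

```python
def case_statement(func, n_in, n_out, invert=False):
    """Emit a Verilog case statement covering all 2**n_in input patterns.

    With invert=True every output word is bitwise complemented, giving the
    Inverted-Output description of the same function.
    """
    mask = (1 << n_out) - 1
    lines = ["always @(*) begin", "  case (in_bits)"]
    for x in range(1 << n_in):
        y = func(x) & mask
        if invert:
            y ^= mask
        lines.append(f"    {n_in}'b{x:0{n_in}b}: out_bits = {n_out}'b{y:0{n_out}b};")
    lines += ["  endcase", "end"]
    return "\n".join(lines)

def mult3x3(x):
    """3x3 multiplier: upper three input bits times lower three input bits."""
    return (x >> 3) * (x & 0b111)

case_based = case_statement(mult3x3, n_in=6, n_out=6)
inverted_output = case_statement(mult3x3, n_in=6, n_out=6, invert=True)
print(case_based.splitlines()[2])       # first case entry
print(inverted_output.splitlines()[2])  # same entry with inverted output
```

The 6-input multiplier already needs 64 case entries, which shows why the statement length becomes a concern as the input width grows.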
A challenge arises when dealing with sequential circuits since the output does not depend solely
on the inputs. In [166], it is shown that given the specification of a sequential logic circuit, and an
encoding of its internal states, the problem of synthesizing the sequential circuit can be mapped to
a combinational logic synthesis problem.

3.3 Experimental Setup

3.3.1 Simulation Objectives, Tools and Workflow

Six experiments were performed on multiple TMR systems with different stuck-at failure modes. The objectives of the experiments are to:

1. Find the diversity values among designs generated using the proposed design techniques.
2. Evaluate the sensitivity of different designs to the type of stuck-at fault (Zero or One) injected.
3. Compare the performance of many TMR systems in two fault modes: Common-Mode Fault
(CMF) and Random Single Fault (RSF). The purpose is to find the best TMR system to
provide the highest reliability in the presence of CMF and RSF faults.
4. Study the effect of using different design techniques vs. uniform design techniques in a TMR
system in order to achieve an improved overall fault-tolerance.

All experiments were carried out using Xilinx ISE Design Suite 12.2 equipped with the ISim FPGA
simulator. The target configuration was that of a Xilinx Virtex-4 FPGA device. The experiments
involved multiple implementations of two benchmark circuits as a case study: a three-bit Multiplier
(combinational circuit) and an 8-state Finite State Machine as defined in the dk17 benchmark
circuit (sequential circuit) [182]. All implementations were generated using the XST synthesizer
provided with the Design Suite, and stuck-at faults at the inputs of the LUTs were injected directly
into the post-Place&Route model file before simulation. For both benchmark circuits, an
implementation referred to as the base design (BASE) was generated from behavioral Verilog HDL
code. The synthesizer had complete control over the implementation process in this case.
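
To make the fault model concrete, the sketch below emulates a stuck-at fault on one input pin of a LUT in software. It is an illustrative abstraction only: the experiments modify the post-Place&Route model file, whereas here a LUT is assumed to be a plain truth table indexed by its input bits, and the function names are hypothetical.

```python
from itertools import product

def lut_eval(truth_table, inputs, stuck_at=None):
    """Evaluate a k-input LUT given as a truth table of length 2**k.

    stuck_at=(pin, value) forces input pin `pin` to `value`, which models a
    stuck-at-0 or stuck-at-1 fault on that LUT input.
    """
    bits = list(inputs)
    if stuck_at is not None:
        pin, value = stuck_at
        bits[pin] = value
    index = sum(bit << i for i, bit in enumerate(bits))
    return truth_table[index]

def erroneous_outputs(truth_table, k, stuck_at):
    """Count the input patterns for which the faulty LUT output is wrong."""
    return sum(
        lut_eval(truth_table, pattern) != lut_eval(truth_table, pattern, stuck_at)
        for pattern in product((0, 1), repeat=k)
    )

# Hypothetical 2-input LUT contents; table index = in0 + 2 * in1.
XOR_LUT = [0, 1, 1, 0]
AND_LUT = [0, 0, 0, 1]

print(erroneous_outputs(XOR_LUT, 2, stuck_at=(0, 0)))  # pin 0 stuck at 0
print(erroneous_outputs(AND_LUT, 2, stuck_at=(0, 1)))  # pin 0 stuck at 1
```

The same fault on the same pin corrupts a different number of input patterns depending on the LUT contents, which is one reason functionally equivalent but structurally different designs can respond differently to a common-mode fault.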
In the following, Reliability is defined as the probability that the output obtained from a certain
design (or the TMR system) is not erroneous. For example, if the TMR system provides A erroneous
outputs out of 2^n outputs (where n is the number of input bits) in the presence of a fault sequence
(f_i, f_j, f_k), then the reliability R relative to the fault (f_i, f_j, f_k) is defined as:

$$ R_{(f_i, f_j, f_k)} = 1 - \frac{A}{2^n} \tag{3.2} $$

If the number of fault sequences recorded is B, then the reliability can be expressed as:

$$ R = 1 - \frac{A}{B \times 2^n} \tag{3.3} $$
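
As a worked illustration of Equations 3.2 and 3.3, consider hypothetical counts that are not taken from the experiments. The 3x3 Multiplier has n = 6 input bits, so each fault sequence is evaluated over 2^6 = 64 input patterns. The second line assumes that A in Equation 3.3 denotes the total number of erroneous outputs accumulated over all B fault sequences.

```latex
% Hypothetical counts, for illustration only.
% One fault sequence with A = 4 erroneous outputs out of 2^6 = 64:
R_{(f_i, f_j, f_k)} = 1 - \frac{4}{2^{6}} = 1 - \frac{4}{64} = 0.9375
% B = 8 fault sequences with A = 20 erroneous outputs in total:
R = 1 - \frac{20}{8 \times 2^{6}} = 1 - \frac{20}{512} \approx 0.9609
```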

3.3.2 Experimental Configurations

Experiment 1: The diversity values among the various designs are calculated in this experiment.
Two instances of each design method are used to calculate the intra/inter-design diversity values,
to gain insight into the diversity of the designs generated by the same class of techniques.
Experiment 2: TMR with Diverse Designs (BASE, Da, Db) and a single CMF per module is used
in this experiment. Da and Db are generated using the same design technique, though they are
physically distinct. The CMF study is conducted in line with the work of McCluskey et al. [115],
and it assumes that the LUTs in the two designs are affected by a stuck-at fault of the same type
and at the same input pin location.
Experiment 3: TMR with Diverse Designs (BASE, Da, Db) and a single RSF per module is used
in this experiment. This experiment utilizes the same setup as Experiment 2, but injects one
random stuck-at fault at a random location in each module. In conjunction with Experiment 2, this
will indicate the benefit (if any) of using design diversity over replicated design in a TMR system.
Experiment 4: TMR with replicated design and a single random fault per module is used in
this experiment. The TMR system is composed of three identical datapath modules. Random stuck-at
faults are injected at random locations in each module. The hypothesis of this experiment is that
the modules will produce different outputs since the fault locations are not similar, and therefore
the system might not behave significantly worse than the implementation of diverse modules.
Experiment 5: TMR with Diverse Designs (Inverted-Output, Template-Based, and NAND-Based)
and a single CMF per module is used in this experiment. The diverse designs are expected to fail
in different ways, so the correct output should be produced most of the time.

Experiment 6: TMR with Diverse Designs (Inverted-Output, Template-Based, and NAND-Based)
and a single RSF per module is used in this experiment. This experiment uses the same setup
as Experiment 5, but faults are injected at random locations in each module. The performance in
this case is expected to be better than in the CMF case. Combined with the results of
Experiments 3 and 4, this should confirm (or refute) any advantage of using diverse designs for
TMR in the presence of random faults.
Table 3.1: Diversity value obtained through comparison with the BASE design.

                          D1      D2      CB      IO      NAND    NOR
CMF                       1       0.996   1       0.987   0.965   0.958
RSF                       0.971   1       0.984   1       0.992   1
Average Diversity Value   0.986   0.998   0.992   0.994   0.979   0.979

3.4 Experimental Results and Analysis

3.4.1 Quantifying Diversity for Proposed Techniques

Table 3.1 summarizes the diversity values obtained through CMF and RSF when compared to the
BASE design for a 3x3 Multiplier. D1 & D2 are designs obtained through the Template-based method,
CB and IO are obtained through the Case-based and Inverted-Output methods, and NAND & NOR are
obtained through the NAND-based and NOR-based methods, respectively.
The diversity value of the different designs was calculated through the injection of m1 = 64 CMF
faults and m2 = 8 RSFs, for a total of mtotal = m1 + m2 = 72 fault pairs. Table 3.1 shows
high values of design diversity for the different designs when compared to the same BASE design,
indicating the success of the proposed techniques in creating diversity.

Intra-design diversity values are calculated among selected members of the same design class to
gain insight into how diverse the designs generated by the same technique can be. Table 3.2 shows
the intra-design diversity values for the 3x3 Multiplier.
Table 3.2: Intra-Design Diversity values.

                          D1 & D2   CB & IO   NAND & NOR
CMF                       0.9956    0.9868    0.9731
RSF                       0.9961    0.9844    0.9883
Average Diversity Value   0.9959    0.9856    0.9807

Results indicate that all design techniques provide very diverse designs in response to fault pairs
injected in different failure modes. The D1 & D2 designs represent the extremes that template-based
designs can reach, in which D2 is obtained by replacing all the Half and Full adders with design
templates different from the ones used in D1. Their diversity values were therefore high, and this
sets the upper limit on the diversity value the Template-based technique can provide for the
3x3 Multiplier design.
To further comprehend the inter-design diversity of the designs generated from different classes, a
CMF test was carried out. Results for the 3x3 Multiplier are reported in Figure 3.3.

3.4.2 Diversity Metric for Multiple TMR Arrangements

To understand the effect of the generated designs on the overall TMR system reliability, the following TMR system topologies are utilized with a bitwise-voter output:

• TMR BASE: Implemented by three modules using the same (BASE) design.
• TMR TB: One module uses the BASE design and the other two modules use different designs
(D1 and D2) obtained using the template-based method.
• TMR DD: Each module is implemented using a different design technique: module1 is
implemented using the template-based method, module2 using inverted output, and module3
using NAND functions only.

Figure 3.3: Inter-Design Diversities in CMF.

All TMR arrangements mentioned above are evaluated using actual physical (post-P&R) designs
synthesized with the aforementioned Xilinx toolset on the Virtex-4 FPGA device. 64 test runs are
made in CMF for TMR BASE and TMR TB systems, while 12 test runs are made for TMR DD.
All RSF tests were composed of 8 runs per TMR system. In all test runs, an exhaustive evaluation
is done for all possible input patterns and erroneous outputs are recorded and counted. Note, the
dk17 benchmark has 2 inputs and 8 possible states, thus it is evaluated for all 32 state transitions.
Using equation 3.3, the reliability of each TMR system is then calculated and reported in Tables
3.3 and 3.4 for 3x3 Multiplier and dk17 benchmark (n = 5) respectively.
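
The exhaustive evaluation with a bitwise voter can be sketched in software as follows. This is a simplified illustration under stated assumptions, not the post-P&R simulation used in this work: all three modules compute the same 3x3 multiplication, and a fault is modeled as one output bit of a module being stuck at 0 or 1 (the actual experiments inject faults at LUT inputs). The function names and fault sequences are hypothetical.

```python
def vote(a, b, c):
    """Bitwise majority of three equally wide output words."""
    return (a & b) | (a & c) | (b & c)

def mult3x3(x):
    """Fault-free 3x3 multiplier on a 6-bit input word."""
    return (x >> 3) * (x & 0b111)

def with_stuck_output(func, bit, value):
    """Hypothetical fault model: force one output bit of a module to 0 or 1."""
    def faulty(x):
        y = func(x)
        return (y | (1 << bit)) if value else (y & ~(1 << bit))
    return faulty

def tmr_reliability(fault_sequences, n=6):
    """Equation 3.3 over all 2**n input patterns and all fault sequences.

    Each fault sequence holds one (bit, value) fault per module.
    """
    erroneous = 0
    for faults in fault_sequences:
        modules = [with_stuck_output(mult3x3, bit, value) for bit, value in faults]
        erroneous += sum(
            vote(*(m(x) for m in modules)) != mult3x3(x) for x in range(1 << n)
        )
    return 1 - erroneous / (len(fault_sequences) * (1 << n))

# Same fault in every module (common-mode-like) vs. faults on different bits.
print(tmr_reliability([((0, 0), (0, 0), (0, 0))]))
print(tmr_reliability([((0, 0), (1, 0), (2, 0))]))
```

When every module fails on the same output bit the voter cannot mask the error, whereas faults that manifest on different bits are outvoted bit by bit. This only illustrates the voting mechanism; it does not model how structurally diverse designs respond to a common-mode fault.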
Results indicate that TMR systems based on diversely designed modules provide higher reliability
under Common-Mode Failures. This is consistent with the conclusion of McCluskey et al. [115] about
Common-Mode Failures in diverse-design redundant systems, and thus supports the design hypothesis
underlying the objectives of this research.
Table 3.3: TMR Reliability evaluation for combinational circuit.

        TMR BASE   TMR TB   TMR DD
CMF     0.9473     0.989    0.9674
RSF     0.9746     0.9316   0.9844

Table 3.4: TMR Reliability evaluation for sequential circuit.

        TMR BASE   TMR TB   TMR DD
CMF     0.8307     0.9479   0.9922
RSF     0.8724     0.8880   0.9167

Figures 3.4 and 3.5 show reliability values of different TMR systems in different failure modes
(CMF and RSF) for the 3x3 Multiplier and the dk17 benchmark, respectively. In these figures, TMR IO
is implemented by replicating the inverted output design. Similarly, TMR NAND uses the NAND
design, while TMR CB IO uses the Case-Based and Inverted-Output designs besides the BASE
design. The first three sets of bars represent a TMR system utilizing the same design replicated,
while the last three represent a TMR system utilizing diverse designs from different design
techniques. Results clearly show that a TMR system using diverse designs is superior in reliability
to a replicated design, irrespective of the design technique used. This is consistent for both
benchmarks evaluated in this chapter.

Figure 3.4: Reliability evaluation for TMR systems based on uniform and diverse implementations
(combinational circuit).

3.5 Summary

The Template-based technique is studied and evaluated to automatically generate different designs
during runtime. Additionally, the Case-based, Inverted-output and NAND/NOR-based design
techniques are evaluated. The generated designs showed a high degree of design diversity, even
though additional design-time effort is required. To measure diversity, McCluskey’s diversity
metric was used by injecting fault pairs and recording exact fault responses of the design pairs.
In some cases, the metric behaves counterintuitively: identical designs show high design diversity
when fault pairs are injected at different locations, which hints that this metric might be
valid for CMFs only. Several TMR-based systems were implemented using different topologies
from different design techniques, and their reliability was studied accordingly. Results indicate
that diverse-design-based TMR systems show higher reliability under CMF exposure, and using
different design techniques offers improved reliability for randomly injected faults at minimal
additional cost and effort.

Figure 3.5: Reliability evaluation for TMR systems based on uniform and diverse implementations
(sequential circuit).

CHAPTER 4: NETLIST-DRIVEN EVOLUTIONARY REFURBISHMENT
FOR FAULT-TOLERANCE IN RECONFIGURABLE HARDWARE

In this chapter, FPGA reconfigurability is exploited for autonomous fault recovery in
mission-critical applications by facilitating EHW feasibility at runtime. A stand-alone
evolutionary algorithm with modifications to dynamically reduce its search space is developed as a
technique to guide the necessary hardware reconfiguration. The proposed technique utilizes
design-time information from the circuit netlist to constrain the search space of the algorithm by
up to 98.1% in terms of chromosome length representing reconfigurable logic elements. This allows
refurbishment of relatively large FPGA circuits as compared to previous works. Hence, the
scalability issue associated with evolvable hardware is addressed. Experiments are conducted with
multiple circuits from the MCNC benchmark circuit suite, and successful refurbishment of the apex4
circuit, having a total of 1252 LUTs with 10% spares, is achieved in as few as 633 generations on
average when subjected to randomly injected single stuck-at faults. This demonstrates that a small
amount of selected design-time information can achieve a significant increase in the tractability
of dynamic EHW approaches.

4.1 Scalable FPGA Refurbishment Using Netlist-Driven Evolutionary Algorithms

Evolvable Hardware has the benefit of providing a self-adapting architecture and/or behavior on
the fly to meet varying mission objectives and/or changing environmental conditions. The focus
of this work is to extend EHW to mitigate hard faults incurred due to the effect of transistor
aging [65], [7], [99]. TDDB and EM due to aging are major sources of permanent or
non-recoverable faults [155], [158]. Further, future trends in VLSI devices such as process
scaling and lower operating voltages for power savings increasingly impact the reliability of
systems employing such devices [110], [153].
SRAM-based FPGAs are the most commonly used reprogrammable logic platform for EHW due
to their dynamic reconfiguration capability [35] as opposed to one-time-programmable FPGAs.
The basic architecture of an FPGA is composed of an array of logic blocks for digital logic implementation and interconnect resources for connecting the logic blocks. The main components of a
logic block are a k-input Lookup Table (LUT) and a Flip-Flop (FF). A typical FPGA is comprised
of millions of Logic Blocks in conjunction with Input/Output Blocks (IOBs). All of the logic and
interconnect resources are SRAM-based, providing it with extensive reconfigurable properties.
Whereas the LUTs implement a Boolean logic function of k inputs, the FFs realize sequential or
memory capabilities. In this chapter, combinational circuits are considered, although NDER can
be extended to sequential circuits by decomposing them into pipelined stages of combinational
logic and memory elements, and then applying the developed methods individually to each stage.
A survey of the objectives and techniques for handling hard faults in SRAM-based FPGAs is presented
in [125]. GA-based fault handling schemes have been successfully employed as a reconfiguration
mechanism for autonomous self-adaptive EHW systems [35], [52], [136]. The GA is utilized
to devise a functional configuration for the FPGA in the presence of faults, with the benefit of
operating without information about the location and nature of the faults. This technique, which
mimics the natural phenomenon of evolution, utilizes multiple candidate solutions (the Population)
to meet the desired objective(s). A fitness score based on an objective function is assigned to
each candidate solution, hereafter referred to as an Individual. In every generation, genetic
operators such as mutation and crossover modify the selected individuals to form the new
population, which is utilized in the next generation. The process is repeated for a maximum number
of generations or until a suitable individual has been found in the population. In this chapter,
the widely established concern of scalability associated with the GA-based refurbishment technique
is addressed [35], [66], [70], [156]. Namely, with increasing sizes of the
configurations/applications to be regenerated on the FPGA, the search space also increases,
resulting in an intractable problem for the GA to solve. The goal of this work is to extend the
feasibility of using EHW techniques by constraining the search space.
Aging-induced degradation due to TDDB, EM, NBTI and HCI increases the probability of failure
within 3-5 years of operation, as indicated by results obtained by modeling a 65-nm technology
FPGA device [155]. Similar results obtained via accelerated life testing using elevated
temperature and voltage on a 45-nm FPGA in [158] demonstrate a reduction of up to 15% in maximum
operating frequency over a period of 75 days. Furthermore, aging aggravates fabrication-induced
process variation (PV), which is expected to become worse with continued scaling of technology
nodes [29], [104]. Experiments conducted in [29] to determine the effect of PV alone with 15
Xilinx Virtex-II Pro FPGA devices based on 130-nm technology demonstrate considerable intra-die
and die-to-die variations. Thus, there is a growing need to address these issues in the context of
device aging.

4.1.1 NDER Research Objectives

In this dissertation, a real-time approach to refurbish reconfigurable circuits is developed using
only selected design-time information. The NDER technique isolates the functioning outputs of the
FPGA configuration from unwanted evolutionary modification. This allows the EHW technique to
focus exclusively on the discrepant outputs. Thus, the genetic operators which modify the design to
avoid faulty resources are only applied to selected elements of the FPGA configuration which are
suspected of corrupting the current output. Suspect resource assessment is performed at runtime
using design netlist information to form the pool of suspect resources. This significantly decreases
the search space for the GA leading to more tractable refurbishment times.
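
The idea of applying genetic operators only to suspect elements can be sketched as a restricted mutation operator. This is a minimal sketch under assumptions that are not taken from the dissertation: the chromosome is modeled as a list of LUT truth tables, mutation flips truth-table bits, and the function name, rate parameter and example sizes are hypothetical.

```python
import random

def mutate_suspects(chromosome, suspect_indices, rate, rng):
    """Return a mutated copy in which only suspect LUTs may change.

    chromosome: list of LUT truth tables (lists of 0/1 bits); assumed encoding.
    suspect_indices: positions of LUTs marked as suspect by fault diagnosis.
    rate: per-bit flip probability applied inside suspect LUTs only.
    """
    child = [list(lut) for lut in chromosome]
    for idx in suspect_indices:
        for bit in range(len(child[idx])):
            if rng.random() < rate:
                child[idx][bit] ^= 1
    return child

rng = random.Random(0)
# Hypothetical configuration: 20 four-input LUTs, two of them suspect.
parent = [[rng.randint(0, 1) for _ in range(16)] for _ in range(20)]
suspects = [3, 7]
child = mutate_suspects(parent, suspects, rate=0.1, rng=rng)

mutable_bits = len(suspects) * 16
total_bits = len(parent) * 16
print(f"search restricted to {mutable_bits} of {total_bits} chromosome bits")
```

Because the non-suspect genes are copied unchanged, the functioning parts of the configuration cannot be disturbed by the operator, and the effective chromosome length seen by the GA shrinks to the suspect subset.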

Research Objectives: Having presented the research problem, the following approaches are adopted:

(i) Research Objective I (RO-I) Attain Reduced Search Space: enables refurbishment in fewer
generations of GA.
(ii) Research Objective II (RO-II) Partition the LUTs into suspect and non-suspect subsets: the
non-suspect subsets do not get inadvertently modified by the GA during recovery.

4.1.2 Organization of the Chapter

The remainder of the chapter is structured as follows: Section 4.2 highlights the ability of the
proposed methodology to prune the search space of the genetic algorithm responsible for fault
recovery/reconfiguration, as illustrated by a motivating example. Section 4.3 introduces the
proposed architecture of the overall autonomous system composed of the computing units and the
reconfiguration controller. Section 4.4 gives details of the refurbishment algorithm. Section 4.5
discusses the experimental setup and results to establish the effectiveness of the proposed fault
recovery algorithm with circuits from the MCNC benchmark suite. Finally, the last section draws
some conclusions on the proposed methodology of employing evolutionary-algorithm-based
reconfiguration engines for achieving autonomous self-adaptive resilience.

4.2 Fault Isolation via Back Tracing

First, a set of heuristics is presented to prune the search space for Evolutionary Algorithms
utilized for self-healing in EHW. Generally, fault diagnosis in the form of knowing the location
and nature of faults is not a requirement for such algorithms, which is often considered one of
the strengths of this technique [35]. However, to address the issue of scalable refurbishment of
EHW systems, fault diagnosis can be leveraged to increase scalability, as demonstrated through the
presented work. The pruning of the search space is achieved by selecting a subset of resources
from the entire pool of resources based on both runtime and design-time information. This
selection is driven only by the discrepant primary output lines, which are implied to be generated
by corrupt logic and/or interconnect resources.
The candidate heuristics to identify the set of suspect resources in a given corrupt FPGA
configuration are best explained through a motivating example. Consider the FPGA configuration
with six Primary Inputs (PIs) and four Primary Outputs (POs) shown in Figure 4.1. The PIs are
represented through the letters a, b, c, d, e and f, whereas the POs are represented by out1,
out2, out3 and out4. The LUTs are represented with capital letters {A, B, ..., O}, which is the
complete set of resources. Each LUT is tagged with the index(es) of the PO(s) on whose datapath(s)
it is used. This information can be obtained by back-tracing from each PO of the circuit and
tagging the LUT(s) which are in its datapath. The process is continued recursively from each LUT
input until a PI is reached. This process is analogous to traversing a tree structure starting
from the leaf node, which is the PO in this case, until the root is encountered. For example, LUT D
contributes to the generation of both POs out1 and out4, whereas LUT K is only responsible for
out4. To illustrate different scenarios in which the suspect resources are selected, two distinct
failure cases are presented as shown in Figure 4.1. This selection procedure will also be referred
to as the process of “Marking of LUTs”. The resources identified using one of the candidate
heuristics presented below are to be utilized during the execution of the evolutionary
refurbishment algorithm.
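
The back-tracing step can be sketched as a recursive traversal of the netlist. The netlist below is a small hypothetical circuit, not the one in Figure 4.1, and the data layout (a dictionary from each LUT to its fan-in signals) is an assumption made for illustration.

```python
# Hypothetical netlist: each LUT maps to the signals feeding its inputs.
# Signals that are not LUT names (a, b, c, d, e) are primary inputs.
netlist = {
    "A": ["a", "b"],
    "B": ["c", "d"],
    "C": ["A", "B"],
    "D": ["B", "e"],
}
# Primary output index -> LUT driving that output.
primary_outputs = {1: "C", 2: "D"}

def tag_output_sets(netlist, primary_outputs):
    """Tag every LUT with the set of PO indices whose datapath it lies on."""
    tags = {lut: set() for lut in netlist}

    def back_trace(signal, po):
        if signal not in netlist:   # reached a primary input
            return
        if po in tags[signal]:      # LUT already tagged for this PO
            return
        tags[signal].add(po)
        for source in netlist[signal]:
            back_trace(source, po)

    for po, driver in primary_outputs.items():
        back_trace(driver, po)
    return tags

output_sets = tag_output_sets(netlist, primary_outputs)
for lut in sorted(output_sets):
    print(lut, sorted(output_sets[lut]))
```

In this example LUT B feeds both C and D, so it is tagged with both outputs, in the same way that LUT D in the motivating example carries the tags of out1 and out4. The tagging uses design-time netlist information only and can be computed once per configuration.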

Figure 4.1: Motivating example for search space tradeoffs through multiple heuristics. (Diagram labels: Failure Case 1, Failure Case 2, LUT name, Output set.)

4.2.1 Aggressive Pruning Heuristic (HA)

An aggressive pruning heuristic is adopted in this case, where all LUTs whose output set contains
any of the non-corrupt outputs are not marked. Consider Failure Case 1 as shown in Figure 4.1,
where only a single output line of the circuit, {out4}, is identified as corrupt at runtime.
Then LUTs {K, L, M, N, O} are marked in the fault diagnosis phase by applying the aggressive
pruning heuristic of not marking LUTs which contribute to non-corrupt outputs. For example,
LUT I is not selected because it is responsible for POs {out3, out4} and out3 is not identified as
corrupt. In Failure Case 2 as shown in Figure 4.1, multiple circuit output lines are identified as
corrupt. This is an interesting scenario in which, initially, only the LUT(s) with output set
“equal” to the set of corrupt POs is (are) selected. For example, if {out2, out3, out4} forms the
set of corrupt POs, then only LUT {H} is marked, as it is the only LUT whose output set has the
maximal cardinality of 3 and matches the corrupt set. Based on the demand of the refurbishment
algorithm, if more LUTs need to be marked to potentially increase the size of the search space,
the next choice is to find all LUTs with output set equal to any of the subsets of cardinality 2
of the set of corrupt POs, i.e., {out2, out3} or {out2, out4} or {out3, out4}. Thus, the set of
marked LUTs at this iteration is {H, F, I}. Yet another iteration of marking is achieved by finding
all LUTs with output set equal to {out2} or {out3} or {out4}. Thus, the final set of marked LUTs
is {H, F, I, G, J, K, L, M, N, O}. The size of the set of marked LUTs is not increased further
under the selected pruning heuristic.

4.2.2 Exhaustive Pruning Heuristic (HE)

This heuristic is based on marking all LUTs whose output set contains any of the corrupt outputs.
For Failure Case 1, all LUTs with output set containing out4 are marked as shown in Table 4.1.
Similarly, for Failure Case 2, all LUTs with output set containing any of the outputs out2, out3,
or out4 are marked. The drawback of this intuitive heuristic is that more LUTs tend to be selected
than with the other presented heuristics. On the positive side, the algorithmic realization based
on this heuristic is less complex.

4.2.3 Hybrid Pruning Heuristic (HAE)

The selection based on this heuristic is an extension of HA. An additional iteration over HA is
used to mark all the remaining LUTs, not marked in the previous iterations, whose output set
contains any of the corrupt outputs. The LUTs marked in Failure Cases 1 and 2 for HAE are shown
in Table 4.1.
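
The three heuristics introduced so far can be sketched over the output sets of the motivating example. The output sets below are inferred from the text of this section and from the markings listed in Table 4.1, since the figure itself is not reproduced here; the function names are hypothetical and the code is an illustration rather than the implementation used in this work.

```python
from itertools import combinations

# Output set of each LUT in the motivating example (inferred, see lead-in).
OUTPUT_SETS = {
    "A": {1}, "B": {1}, "C": {1}, "D": {1, 4}, "E": {1, 2, 4},
    "F": {2, 4}, "G": {2}, "H": {2, 3, 4}, "I": {3, 4}, "J": {3},
    "K": {4}, "L": {4}, "M": {4}, "N": {4}, "O": {4},
}

def mark_aggressive(tags, corrupt):
    """HA: one iteration per subset size of the corrupt set, largest first.

    A LUT is marked only if its output set equals one of those subsets.
    """
    iterations = []
    for size in range(len(corrupt), 0, -1):
        subsets = [set(c) for c in combinations(sorted(corrupt), size)]
        marked = sorted(lut for lut, outs in tags.items() if outs in subsets)
        if marked:
            iterations.append(marked)
    return iterations

def mark_exhaustive(tags, corrupt):
    """HE: mark every LUT whose output set contains any corrupt output."""
    return sorted(lut for lut, outs in tags.items() if outs & corrupt)

def mark_hybrid(tags, corrupt):
    """HAE: the HA iterations, then one more for the remaining HE candidates."""
    iterations = mark_aggressive(tags, corrupt)
    already = {lut for step in iterations for lut in step}
    rest = [lut for lut in mark_exhaustive(tags, corrupt) if lut not in already]
    return iterations + ([rest] if rest else [])

def pruning_reduction(tags, marked):
    """Percentage of LUTs left unmarked, rounded to one decimal place."""
    return round(100 * (len(tags) - len(marked)) / len(tags), 1)

print(mark_aggressive(OUTPUT_SETS, {4}))        # Failure Case 1
print(mark_aggressive(OUTPUT_SETS, {2, 3, 4}))  # Failure Case 2
print(mark_hybrid(OUTPUT_SETS, {2, 3, 4}))
```

Under these assumptions the iterations reproduce the HA, HE and HAE markings listed in Table 4.1, and `pruning_reduction` applied to the cumulative marked set gives the corresponding percentage of non-marked LUTs.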

4.2.4 Dynamic Pruning Heuristic (HD)

The selection based on this heuristic is an extension of HAE. The additional iterations of marking
are based on observing the status of corrupt POs during the fault recovery process. New potential
markings only take place if the cardinality of the set of corrupt PO(s) increases from its initial
value. Thus, iterations 3, 4 and 5 shown for Failure Case 1 in Table 4.1 can take place in any
order. Note that only LUTs with a fan-out of one are marked in these additional iterations. For
instance, if {out2} is added to the set of corrupt POs, then LUT {G} will be added to the set of
marked LUTs. For this particular example, all the LUTs are selected by the end of all iterations
for both failure cases. However, this will not be the case for much bigger circuits with a large
number of outputs. The execution complexity of HD is higher than that of HAE and the others. In
summary, the heuristics in decreasing order of complexity are: HD, HAE, HA, HE.
Table 4.1: Listing of LUTs Marked through presented Heuristic approaches. The "% Non-Marked" columns give the pruning reduction, i.e., the percentage of non-marked LUTs.

            Failure Case 1                                  Failure Case 2
Heuristic   Iteration  Marked LUTs           % Non-Marked   Iteration  Marked LUTs               % Non-Marked
HA          1          K,L,M,N,O             66.7           1          H                         93.3
                                                            2          F,I                       80.0
                                                            3          G,J,K,L,M,N,O             33.3
HE          1          D,E,F,H,I,K,L,M,N,O   33.3           1          D,E,F,G,H,I,J,K,L,M,N,O   20.0
HAE         1          K,L,M,N,O             66.7           1          H                         93.3
            2          D,E,F,H,I             33.3           2          F,I                       80.0
                                                            3          G,J,K,L,M,N,O             33.3
                                                            4          D,E                       20.0
HD          1          K,L,M,N,O             66.7           1          H                         93.3
            2          D,E,F,H,I             33.3           2          F,I                       80.0
            3          A,B,C                 13.3           3          G,J,K,L,M,N,O             33.3
            4          G                     6.7            4          D,E                       20.0
            5          J                     0.0            5          A,B,C                     0.0

4.2.5 Evaluating the Diagnostic Performance

The efficacy of the presented heuristics for fault diagnosis is evaluated through experimentation
on circuits selected from the MCNC benchmark suite [182]. The experiments are set up with a single
stuck-at fault which is randomly injected at the input of the LUTs in the FPGA circuit. The aim is
to determine whether the LUTs marked by the heuristics presented above contain the LUT with
the fault. Table 4.2 lists the observed coverage of faults for multiple experimental runs using
the first iteration of Heuristic HA under the single-fault assumption. The number of experimental
runs in which the fault is actually articulated, given in the “# of Runs” column, varies for each
circuit because the results are sampled from a total of 700 runs that include cases where the
fault lies on a spare LUT. The percentage of runs with a corrupt set of cardinality greater than
one ranged from 5.8% to 66.9%, depending on the characteristics of the benchmark circuits. The
breakdown of marked LUTs in terms of the percentage having fan-out equal to one and fan-out
greater than one is also given, with more LUTs being marked with fan-out equal to one in most
benchmarks. The former markings take place when a single PO is identified as corrupt. In contrast,
LUTs having fan-out greater than one are marked as a result of multiple POs being corrupt. The
size of the set of marked LUTs can often be reduced in the latter case, although it can be changed
dynamically as required. The majority of experimental runs demonstrate successful inclusion of the
faulty LUT in the set of marked LUTs by Heuristic HA as early as the first iteration, as indicated
by the “Coverage Faulty LUT” column. The subsequent iterations of Heuristic HA or the subsequent
application of Heuristic HE provide 100% coverage for cases where the actual faulty LUT is not
marked in the first iteration. Thus, full coverage can be attained by using Heuristic HAE with a
good tradeoff in search space size as compared to HE, as concluded from the earlier discussion.
Up to 98.1% of the FPGA resources are not marked during the initial phase of fault diagnosis using
Heuristic HA across all the experiments.

Table 4.2: Percentage of LUTs marked and the probability of covering the faulty LUT with Heuristic HA.

            Total   # of   % Runs Corrupt   Marked LUTs on Avg.         Avg. Pruning Reduction   Coverage     Avg. Correctness /
Benchmark   LUTs    Runs   Outputs > 1      Fan-out = 1   Fan-out > 1   (Non-Marked LUTs)        Faulty LUT   Max. Fitness
alu2        185     483    66.9             2.54%         18.98%        78.5%                    100%         6086 / 6144
5xp1        49      462    6.5              16.20%        0.27%         83.6%                    100%         1265 / 1280
apex4       1252    553    7.4              4.84%         0.01%         95.2%                    99.1%        9713 / 9728
bw          68      501    5.8              2.94%         0.09%         97.0%                    100%         891 / 896
ex5         352     481    17.3             1.82%         0.07%         98.1%                    96.5%        16102 / 16128

4.3 Architecture Supporting NDER

Figure 4.2: RARS architecture as utilized in NDER. (Diagram blocks: Reconfiguration Port, GA Engine, Test Pattern Generator, Reconfiguration Manager, Input Selector, Functional Elements FE 1 to FE 3, Discrepancy Sensor, Voting Logic, Output Actuator.)

The Reconfigurable Adaptive Redundancy System (RARS) architecture [7] provides a flexible arrangement to investigate and improve EHW scalability for self-recoverable autonomous systems.
As shown in Figure 4.2, redundancy can be reconfigured dynamically in a RARS arrangement
according to changing mission requirements: the system can operate in uniplex mode for
non-critical portions of the mission, duplex mode for fault detection, or TMR mode for fault masking. The resources which are dynamically vacated by the adaptive redundancy mechanism can be
utilized by other application tasks via FPGA partial reconfiguration. Three Functional Elements
(FEs), which constitute the main computational part of the application, are shown in Figure 4.2. The
input to the individual FEs can be selected from the functional input or the test pattern generator’s
output. The test patterns are applied to the FEs during the refurbishment phase as described below.
The Discrepancy Sensor is used to detect a failure and identify the faulty FE, whereas the Voting
Logic performs majority voting to mask the faulty output and is enabled upon the activation of TMR
mode. The Reconfiguration Manager activates the different redundancy states of the system and is
used for reconfiguration by the GA Engine during the refurbishment phase. The Output Actuator
is used to select one of the outputs of the FEs based on reports from the Reconfiguration Manager
and Voting Logic.
Now, the fault handling scenario is presented assuming the system is running in duplex mode for
conservation of power. The following steps are taken when an error is detected by the Discrepancy
Sensor:

1. The third FE is also brought online (TMR) to identify the faulty FE.
2. The faulty FE and another known healthy (golden) FE are taken offline for recovery, whilst
the system continues to provide functional output in uniplex mode.

The fault-free FE serves to evaluate the fitness of the faulty FE. This alleviates the requirement
of a golden fitness evaluation method, as opposed to other schemes [171], and is referred to as
model-free fitness evaluation [7]. The fitness scores are reported to the GA Engine. Upon
a successful refurbishment operation, the system can return to duplex mode or stay in uniplex mode,
as required by mission objectives.
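
The fault-handling steps above can be illustrated with a minimal sketch. It is not the RARS implementation: the function names (`discrepancy`, `vote`, `handle_error`) and the dictionary returned are hypothetical, FE outputs are modeled as plain integers for a single input vector, and the choice of which healthy FE becomes the golden reference is an arbitrary assumption.

```python
from collections import Counter

def discrepancy(out_a, out_b):
    """Duplex mode: report an error when the two active FEs disagree."""
    return out_a != out_b

def vote(outputs):
    """TMR mode: return (majority_output, index of the disagreeing FE or None)."""
    majority, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among the three FEs")
    faulty = [i for i, o in enumerate(outputs) if o != majority]
    return majority, (faulty[0] if faulty else None)

def handle_error(fe_outputs):
    """Follow the two steps for one input vector; fe_outputs holds FE 1..3."""
    if not discrepancy(fe_outputs[0], fe_outputs[1]):
        return {"mode": "duplex", "faulty": None}
    _, faulty = vote(fe_outputs)                 # step 1: third FE brought online
    healthy = [i for i in range(3) if i != faulty]
    # step 2: one healthy FE acts as golden reference, the other keeps running
    return {"mode": "uniplex", "faulty": faulty,
            "golden": healthy[0], "active": healthy[1]}
```

In this sketch the faulty FE and the golden FE would be handed to the GA Engine, while the remaining FE keeps providing functional output in uniplex mode.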

4.4

NDER Technique

An Evolutionary Algorithm-based regeneration technique utilizing the proposed diagnosis methodology is described using a top-down approach. It is utilized after a fault has been detected and the
faulty module has been isolated for refurbishment as described in Section 4.3. The overall flow of
the algorithm, referred to as the Netlist Driven Evolutionary Refurbishment (NDER) technique,
is shown in Figure 4.3 with the nomenclature listed in Table 4.3. The corresponding algorithmic
formulation based on [17] is also presented in Algorithm 1.
Initially, multiple FPGA configurations are utilized to form a population for the GA. This population could be formed with the single under-refurbishment configuration replicated to form a
uniform pool, or with functionally equivalent though physically distinct configurations generated
at design-time to form a diverse pool as utilized in [52]. Various techniques can be utilized to
form diverse individuals at design-time as demonstrated earlier in Chapter 3, e.g., an adder can
be implemented using alternative approaches such as ripple-carry, carry-lookahead, carry-select,
etc. Furthermore, additional designs can be generated by modifying the placement and routing
constraints of each configuration. However, diverse designs do not necessarily avoid the faulty resources, and a technique to handle that situation is then required. Herein, a uniform population
seed is utilized as it requires less storage and reduced design-time effort compared to generating
fault-insensitive designs with sufficient coverage. Once synthesized, these configurations are suitable for storage in a fault-resilient non-volatile memory.
The FPGA configurations are mapped into Individuals, represented by $I_k(t)$, where k uniquely
identifies an individual in the population and t represents the time instance, e.g., 0 identifies the initial
formation time of the population. The fitness of the individuals in the population is evaluated
by using the fitness function $\Phi(I_k(t))$. During the initial fitness evaluation, the corrupt output
line(s) is (are) also identified and $\vec{\Omega}_k$ is updated with the relevant information for an individual $I_k$.

[Figure 4.3 (flow chart): FORM Population P(0) = {I1(0), ..., I|P|(0)}; EVALUATE Population P(0) and IDENTIFY corrupt output line(s) for all Individuals Ik, yielding {F(I1(0)), ..., F(I|P|(0))} and {Ω1, ..., Ω|P|}; set levelk = CountOnes(Ωk), MARK LUTs (including amorphous spares) and Select PIs for all Individuals Ik; GA OPERATIONS (crossover pc, mutation pm); EVALUATE Off-spring Population P''(t); for k = 1 while k <= |P|: if (F(I''k(t)) < F(Ik(t)) * levelThreshold) && level'k > 0, then decrement level'k, MARK LUTs (including amorphous spares) and Select PIs for Individual I''k; SELECTION phase producing P(t+1); t = t + 1; loop until the Exit Criteria is Satisfied; END.]

Figure 4.3: Algorithmic flow chart describing NDER.

Table 4.3: Nomenclature for NDER.

Symbol             Description
P                  Representation of the Population
l                  Number of LUTs in the FPGA Configuration
n                  Number of Primary Inputs of the Circuit
m                  Number of Primary Outputs of the Circuit
$p_c$              Crossover Rate
$p_m$              Mutation Rate
$I_k$              Representation of the k-th Individual in the Population
$a_{k,x}$          Logical content of a $LUT_x$ ∈ Individual k having a 16-bit unsigned binary value, ∀x ∈ {1, ..., l}
$c_{k,y}$          Connection of a $LUT_x$ ∈ Individual k having an integer value ∈ {1, ..., n, n+1, ..., n+l}, ∀x ∈ {1, ..., l} & y ∈ {1, ..., l·4}
$out_{k,i}$        Primary Output i ∈ Individual k having an integer value ∈ {1, ..., n, n+1, ..., n+l}, ∀i ∈ {1, ..., m}
$\Omega_{k,i}$     Corrupt Output Flag corresponding to Primary Output i ∈ Individual k having a binary value ∈ {1, 0}, ∀i ∈ {1, ..., m}
$\lambda_{k,i}$    Primary Input i ∈ Individual k having a binary value ∈ {1, 0}, ∀i ∈ {1, ..., n}
$M_{k,x}$          Marking of a $LUT_x$ ∈ Individual k having a binary value ∈ {1, 0}, ∀x ∈ {1, ..., l}
$level_k$          Iteration of Marking of LUTs for an Individual k
$X_{k,x}$          Datapath usage corresponding to the m Primary Outputs for a $LUT_x$ ∈ Individual k having an m-bit vector, ∀x ∈ {1, ..., l}

Algorithm 1 The Netlist Driven Evolutionary Refurbishment Algorithm
t := 0;
Initialize $P(0) = \{I_1(0), \ldots, I_{|P|}(0)\} \in I^{|P|}$;
Evaluate $P(0) : \{\{\Phi(I_1(0)), \ldots, \Phi(I_{|P|}(0))\}, \{\vec{\Omega}_1, \ldots, \vec{\Omega}_{|P|}\}\}$, where $\Phi(I_k(0))$ is the fitness of an individual $I_k$ with a value $\in \mathbb{Z}_{m \cdot 2^n}$;
$level_k := CountOnes(\vec{\Omega}_k), \forall k \in \{1, \ldots, |P|\}$;
Mark $\{\vec{M}_k(0), level'_k\} := Mark\_LUTs(\vec{\Omega}_k, \vec{X}_k, level_k), \forall k \in \{1, \ldots, |P|\}$;
$\vec{\lambda}_k := Select\_PIs(\vec{M}_k(0)), \forall k \in \{1, \ldots, |P|\}$;
while Exit_Criteria(P(t)) ≠ true do
    for k = 1 to |P| do
        Crossover $\{\vec{a}'_k(t), \vec{c}'_k(t), \vec{out}'_k(t)\} := r'_{\{p_c\}}(P(t))$;
        Mutate $\{\vec{a}''_k(t), \vec{c}''_k(t), \vec{out}''_k(t)\} := m'_{\{p_m\}}(\vec{a}'_k(t), \vec{c}'_k(t), \vec{out}'_k(t))$;
    end for
    Evaluate $P''(t) : \{\Phi(I''_1(t)), \ldots, \Phi(I''_{|P|}(t))\}$;
    for k = 1 to |P| do
        if $\Phi(I''_k(t)) < \Phi(I_k(t)) \cdot level\_threshold$ && $level'_k > 0$ then
            $level'_k = level'_k - 1$;
            Mark $\{\vec{M}''_k(t), level''_k\} := Mark\_LUTs(\vec{\Omega}_k, \vec{X}_k, level'_k)$;
            $\vec{\lambda}''_k := Select\_PIs(\vec{M}''_k(t))$;
        else
            $\vec{M}''_k(t) := \vec{M}_k(t)$;
            $level''_k = level'_k$;
        end if
    end for
    Select $P(t+1) := s_{\{\tau\}}(P''(t))$, where τ is the tournament size;
    t := t + 1;
end while
return: Refurbished Individual $I_{final} = (\vec{a}, \vec{c}, \vec{out})$ having $\Phi(I_{final}) == m \cdot 2^n$ OR $\max\{\Phi(I_1(t)), \ldots, \Phi(I_{|P|}(t))\}$;

This is done via bit-by-bit comparison of the POs of the under-refurbishment module with the
known healthy module (golden) by application of adequate test vectors. These modules had been
taken offline during the fault isolation phase without affecting the system throughput as described
earlier in Section 4.3. The under-refurbishment module is configured with the individuals from the
population as dictated by the algorithm.
The index(es) of the corrupt PO(s) and the design netlist information contained in $\vec{X}_k$ are used to mark
LUTs for the individuals of the population. Here, the netlist information corresponds to the contribution of each LUT to the datapath of PO(s) as described earlier in Section 4.2. Then, the
information about all the marked LUTs for an individual $I_k$ is updated in $\vec{M}_k$. The set of marked
LUTs is then utilized to form the subset of PIs whose corresponding test patterns need to be generated for fitness evaluation. The evolutionary loop begins with the application of the GA operations of
crossover and mutation to the individuals in the population based on user-defined rates $p_c$ and $p_m$,
respectively, followed by evaluation of the fitness of the resulting offspring population. Then, an offspring
$I''_k$ will undergo the process of marking of LUTs if and only if its fitness is less than a user-defined
percentage (level_threshold) of the fitness of its corresponding genetically unaltered individual $I_k$.
New LUTs may be marked for individuals according to the adopted heuristic during each invocation,
as dictated by its current iteration or level. Finally, the selection phase is used to form the population
which is to be utilized in the next generation. This work employs Tournament-based selection [17].
The loop is terminated when a user-defined criterion is met.

4.4.1

Representation

An individual I denoting an FPGA configuration is encoded by the following: $\vec{a}$, $\vec{c}$, $\vec{out}$, $\vec{\Omega}$, $\vec{\lambda}$, $\vec{M}$,
level and $\vec{X}$. The vector $\vec{a}$ is composed of l 16-bit binary numbers representing the logical contents
of l LUTs. The connection vector $\vec{c}$ has 4l integer elements, each ∈ {1, ..., n, n+1, ..., n+l}, such that
each represents the connection of a LUT input with either one of the n PIs or one of the outputs of
the l LUTs. It has a cardinality of 4l as a maximum of 4 inputs is assumed for each LUT. The output
vector $\vec{out}$ has m integer elements corresponding to the m POs, each ∈ {1, ..., n, n+1, ..., n+l}, with the
same definition as for $\vec{c}$. The health of each PO is maintained in vector $\vec{\Omega}$ containing m
single-bit values, with 1 denoting corruption. The subset of PIs used for fitness evaluation is updated
in $\vec{\lambda}$, which has a cardinality of n, and a 1 indicates that the corresponding PI is a part of the subset. The
markings of all l LUTs are maintained in vector $\vec{M}$ containing single-bit values, with a 1 indicating
a marked LUT. The datapath usage for all the LUTs is stored in vector $\vec{X}$, with each member
having an m-bit binary value to indicate the m POs. A bit is set for a particular PO if and only
if the corresponding LUT is part of its datapath. Figures 4.4 and 4.5 demonstrate the proposed
representation with an example of a simple FPGA circuit. The configuration for an FPGA is
constructed directly through $\vec{a}$, $\vec{c}$ and $\vec{out}$, which undergo GA operations such as mutation and
crossover. Other variables are not evolved and are modified deterministically. For instance, $\vec{\Omega}$, $\vec{M}$
and level are set at the time of the failure, whereas $\vec{X}$ is set at design-time.
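
As a rough illustration of this encoding, the sketch below stores the values printed in Figure 4.4 as plain Python lists and derives the datapath usage by back-tracing from each PO. The function name `datapath_usage` and the list-based layout are assumptions made for this example; the node numbering (PIs 1 to n, LUTs n+1 to n+l) follows the definition of $\vec{c}$ given above.

```python
def datapath_usage(n, l, c, out):
    """Return X: one m-element list per LUT, with a 1 wherever the LUT lies
    on the datapath of the corresponding primary output."""
    m = len(out)
    X = [[0] * m for _ in range(l)]
    for po, start in enumerate(out):
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node <= n or node in seen:    # a PI ends the back-trace
                continue
            seen.add(node)
            lut = node - n - 1               # 0-based LUT index
            X[lut][po] = 1
            stack.extend(c[4 * lut: 4 * lut + 4])
    return X

# Values as printed in Figure 4.4 (n = 4 PIs, l = 4 LUTs with node ids 5..8).
n, l = 4, 4
a = [0xAAAA, 0x21A3, 0x7777, 0xFEFE]                      # LUT contents
c = [1, 2, 1, 2,  5, 7, 1, 4,  3, 4, 3, 4,  5, 7, 1, 3]   # 4 inputs per LUT
out = [6, 8]                                              # nodes driving out1, out2
X = datapath_usage(n, l, c, out)
```

For these values the derived `X` agrees with the vector X = ({1,1}, {1,0}, {1,1}, {0,1}) listed in the figure.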
[Figure 4.4 (example circuit): primary inputs w(1), x(2), y(3), z(4); LUT 5 with a1 = 0xAAAA and c1 = 1, c2 = 2, c3 = 1, c4 = 2; LUT 6 (driving out1) with a2 = 0x21A3 and c5 = 5, c6 = 7, c7 = 1, c8 = 4; LUT 7 with a3 = 0x7777 and c9 = 3, c10 = 4, c11 = 3, c12 = 4; LUT 8 (driving out2) with a4 = 0xFEFE and c13 = 5, c14 = 7, c15 = 1, c16 = 3; out = (6, 8); X = ({1,1}, {1,0}, {1,1}, {0,1}).]

Figure 4.4: Example FPGA configuration with GA encoding.

Figure 4.5: GA phenotype in NDER.

4.4.2

Marking of LUTs

Heuristic HAE is selected for Marking of LUTs in this work primarily because it constrains
the search space for the evolutionary algorithm to meet RO-I. Results in Table 4.2 indicate that 100%
fault coverage can be obtained by augmenting Heuristic HA with HE. However, the evolutionary
refurbishment algorithm can be adapted to any of the heuristics presented in Section 4.2.
Figure 4.6 describes the algorithmic flow where Aggressive Pruning is done first (Heuristic HA)
and the search space for the GA is then enlarged via Exhaustive Pruning (Heuristic HE) only if
convergence is not achieved. Algorithm 2 [4] describes the algorithmic steps to achieve the desired
marking in more detail. The output set of each LUT is compared with the set of corrupt PO(s)
to evaluate a possible marking according to the adopted heuristic. According to HA, a LUT with
a non-corrupt PO in its output set is not marked. A marking in this case takes place only if the
output set of a LUT has matching elements from the set of corrupt PO(s) and its fan-out is equal
to level. The last iteration of marking takes place according to HE, where a LUT is marked if
its output set has any element from the set of corrupt PO(s). Afterwards, spare LUTs allocated at
design-time are also marked such that the feed-forward condition for the FPGA configuration is
not violated.
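
A minimal software sketch of this marking procedure, following the steps of Algorithm 2, is given here for illustration. The function name `mark_luts` and the use of Python lists for $\vec{\Omega}$, $\vec{X}$ and $\vec{M}$ are assumptions; the two branches of the inner loop are merged into one conditional, which is intended to behave identically to the two heuristic branches of the algorithm.

```python
def mark_luts(omega, X, level):
    """omega: corrupt-output flags (length m); X: datapath usage per LUT;
    level > 0 applies Heuristic HA, level == 0 applies Heuristic HE.
    Returns (M, level') with M[j] == 1 for every marked LUT."""
    l, m = len(X), len(omega)
    M = [0] * l
    last_marked = -1
    while last_marked == -1:
        for j in range(l):
            count = 0
            for k in range(m):
                temp = omega[k] - X[j][k]
                if temp == 0 and omega[k]:
                    count += 1                 # LUT j feeds corrupt PO k
                elif level > 0 and temp == -1:
                    count = 0                  # HA: LUT j feeds a healthy PO
                    break
            if (level > 0 and count == level) or (level == 0 and count > 0):
                M[j] = 1
                last_marked = j
        if last_marked != -1:
            for j in range(last_marked + 1):
                if sum(X[j]) == 0:             # spare LUT (drives no PO)
                    M[j] = 1
        else:
            level -= 1                         # relax the heuristic and retry
            if level < 0:
                level = 0
                break
    return M, level
```

Using the datapath usage of Figure 4.4 with only out1 corrupt, the HA pass marks the single LUT that feeds out1 exclusively, while the HE pass (level 0) marks every LUT on the datapath of out1.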

4.4.3

Fitness Evaluation

Genetic Algorithms need a method to assign positive real values to individuals of the population
for the selection mechanism to work. The elite individual with the highest fitness value in the
current population is retained for the following generation to ensure monotonically non-decreasing performance. A fitness value is assigned to an individual via bit-by-bit output comparison with the
known correct output, which is obtained from the isolated fault-free module as described earlier.
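
The interplay of elitism and tournament-based selection can be sketched as follows. This is only an illustration: the function name `tournament_select`, drawing contenders without replacement, and placing the elite individual first are assumptions, since the text specifies only that tournament selection is used and that the elite individual is retained.

```python
import random

def tournament_select(population, fitness, tau, rng=random):
    """Form the next generation: keep the elite individual, then fill the
    remaining slots with winners of size-tau tournaments."""
    indices = range(len(population))
    best = max(indices, key=fitness.__getitem__)
    next_gen = [population[best]]                  # elitism
    while len(next_gen) < len(population):
        contenders = rng.sample(indices, tau)      # tau distinct individuals
        winner = max(contenders, key=fitness.__getitem__)
        next_gen.append(population[winner])
    return next_gen
```

Because the elite individual always survives, the maximum fitness in the population cannot decrease from one generation to the next in this sketch.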

Algorithm 2 Marking of LUTs
Marking of LUTs is performed for a single individual I using design-time information $\vec{X}$ and runtime information $\vec{\Omega}$
$Mark\_LUTs : \{\vec{\Omega}, \vec{X}, level\} \rightarrow \{\vec{M}, level'\}$
count := 0; index_last_marked_LUT = −1;
while index_last_marked_LUT == −1 do
    for j = 1 to l do
        for k = 1 to m do
            temp = $\Omega_k - X_j[k]$;
            if level > 0 then
                // Heuristic HA
                if temp == 0 && $\Omega_k$ then
                    count++;
                else
                    if temp == −1 then
                        count = 0; break; // $LUT_j$ is not marked
                    end if
                end if
            else
                // Heuristic HE
                if temp == 0 && $\Omega_k$ then
                    count++;
                end if
            end if
        end for
        if level > 0 then
            if count == level then
                $M_j$ = 1; {Mark $LUT_j$}
                index_last_marked_LUT = j; {Used in Marking of Spare LUTs}
            end if
        else
            if count > 0 then
                $M_j$ = 1; index_last_marked_LUT = j;
            end if
        end if
        count = 0;
    end for
    if index_last_marked_LUT != −1 then
        for j = 1 to index_last_marked_LUT do
            if CountOnes($X_j$) == 0 then
                $M_j$ = 1; {Mark a spare $LUT_j$}
            end if
        end for
    else
        level = level − 1;
        if level < 0 then
            level = 0; break;
        end if
    end if
end while
level' = level;
return: $\{\vec{M}, level'\}$

[Figure 4.6 (flow chart): starting from j = 1, for each j <= l, evaluate whether LUTj can be marked according to HA (if level > 0) or according to HE (otherwise), and set Mj = 1 when LUTj is marked; j++; after the last LUT, if any LUT is marked, Mark Spare LUT(s) and END; otherwise level-- and repeat unless level < 0, in which case END.]

Figure 4.6: Flow chart describing the Marking of LUTs in NDER.

The same input test vectors are applied to both the fault-free module and the module under refurbishment. Bit-by-bit comparisons are done for all m POs to establish the fitness of the module under
refurbishment as well as the index(es) of the corrupt PO(s), i.e., ∀ index ∈ {1, ..., m} where the POs
do not match for a particular test vector, assign $\Omega_{k,index} = 1$ for individual $I_k$. The maximum value of fitness
for an individual, achieved via exhaustive evaluation, is $2^n \cdot m$, where n and m are the number of
PIs and POs of the circuit, respectively. The fitness value is incremented by one for each matching
output bit over the application of $2^n$ input test vectors, with each test vector contributing a maximum
score of m. The detailed algorithmic steps are listed in Algorithm 3. Other fitness measurement
criteria are also possible as an alternative to the adopted method, e.g., the normalized difference between the functional outputs of the under-refurbishment module and the fault-free module. However,
the bit-by-bit comparison fitness evaluation scheme is adopted due to its simplicity of hardware
implementation.
Algorithm 3 Initial Fitness Evaluation of Individuals in the population
All individuals in the population are assigned a fitness value and the set of corrupt Primary Outputs is formed
Evaluate : $P(t) \rightarrow \{\{\Phi(I_k)\}, \{\vec{\Omega}_k\}\}, \forall k \in \{1, \ldots, |P|\}$
for k = 1 to |P| do
    $\Phi(I_k) = 0$;
    for x = 0 to $2^n - 1$ do
        $O_x$ := Evaluate_Circuit($I_k$, $T_x$);
        // $O_x$ is the output-bit vector corresponding to input test vector $T_x$ for the circuit represented by individual $I_k$
        if $O_x \neq Golden\_Output_x$ then
            for index = 1 to m do
                if $O_x[index] \neq Golden\_Output_x[index]$ then
                    $\Omega_{k,index} = 1$;
                else
                    $\Phi(I_k)$++;
                end if
            end for
        else
            $\Phi(I_k) = \Phi(I_k) + m$;
        end if
    end for
end for
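
The exhaustive bit-by-bit evaluation of Algorithm 3 can be sketched in software as follows. Several details are assumptions made for illustration: individuals are dictionaries with keys `a`, `c` and `out`, input pin p of a LUT contributes bit p of the index into its 16-bit content, and the test circuit is a hypothetical half adder rather than one of the benchmarks. The whole-vector shortcut of the algorithm is folded into the per-bit loop, which yields the same score.

```python
def evaluate_circuit(a, c, out, n, x):
    """Outputs of a feed-forward netlist of 4-input LUTs for input vector x."""
    values = [(x >> i) & 1 for i in range(n)]        # node ids 1..n (PIs)
    for lut, content in enumerate(a):                # node ids n+1..n+l
        idx = 0
        for pin, src in enumerate(c[4 * lut: 4 * lut + 4]):
            idx |= values[src - 1] << pin
        values.append((content >> idx) & 1)
    return [values[node - 1] for node in out]

def initial_fitness(ind, golden, n):
    """Return (fitness, omega); the maximum fitness is m * 2**n."""
    m = len(ind["out"])
    fitness, omega = 0, [0] * m
    for x in range(2 ** n):
        o = evaluate_circuit(ind["a"], ind["c"], ind["out"], n, x)
        g = evaluate_circuit(golden["a"], golden["c"], golden["out"], n, x)
        for i in range(m):
            if o[i] != g[i]:
                omega[i] = 1                         # PO i is corrupt
            else:
                fitness += 1
    return fitness, omega

# Hypothetical half adder: LUT 3 = XOR(PI 1, PI 2), LUT 4 = AND(PI 1, PI 2).
golden = {"a": [0x6666, 0x8888], "c": [1, 2, 1, 2, 1, 2, 1, 2], "out": [3, 4]}
faulty = {"a": [0x0000, 0x8888], "c": [1, 2, 1, 2, 1, 2, 1, 2], "out": [3, 4]}
```

With n = 2 and m = 2 the maximum score is 8; forcing the XOR LUT to output 0 loses two matching bits and flags only the first PO as corrupt.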

A smaller subset of test patterns can be used for fitness evaluation from the set of all possible
input test combinations once the set of marked LUTs has been determined. This is possible when
some PI(s) are connected only to the non-marked LUTs and thus would not affect the corrupt POs
of the circuit. All the required PI(s) can be determined by back-tracing from each input of all the
marked LUTs. The final set is determined by taking the union of each of these input subsets, and this
information is updated in $\vec{\lambda}$ by assigning a 1 for the corresponding selected PI(s). This process is referred
to as Select_PIs in Algorithm 1 and needs to be repeated for every new addition to the pool of
marked LUTs. The test patterns can be generated by providing all possible combinations to only
these PI(s) while the other(s) can be considered to be Don’t Care conditions. The fitness score needs
to be adjusted to have a maximum value of $2^n \cdot m$ for checking the exit criteria as described later.
Alternatively, runtime inputs can also be utilized as proposed in [52], which demonstrates viable
and robust performance for self-adaptation using evaluation of actual inputs rather than exhaustive
test vectors. This utilizes a confidence interval during which the runtime inputs are monitored,
based on the premise that not all input combinations need to appear within a given interval.
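
The back-tracing step can be illustrated with a short sketch. The names `select_pis` and `reduced_test_patterns` are hypothetical, and holding the unselected PIs at 0 is just one arbitrary way of treating them as Don't Care conditions.

```python
from itertools import product

def select_pis(M, c, n):
    """Back-trace from every marked LUT and flag each PI that is reached."""
    lam = [0] * n
    stack = [n + 1 + j for j, marked in enumerate(M) if marked]
    seen = set()
    while stack:
        node = stack.pop()
        if node <= n:
            lam[node - 1] = 1                # reached a primary input
        elif node not in seen:
            seen.add(node)
            lut = node - n - 1
            stack.extend(c[4 * lut: 4 * lut + 4])
    return lam

def reduced_test_patterns(lam):
    """All combinations on the selected PIs; the others are held at 0."""
    selected = [i for i, bit in enumerate(lam) if bit]
    patterns = []
    for bits in product((0, 1), repeat=len(selected)):
        vector = [0] * len(lam)
        for i, b in zip(selected, bits):
            vector[i] = b
        patterns.append(vector)
    return patterns
```

With the connections of Figure 4.4, marking only LUT 7 selects PIs 3 and 4, so 4 test patterns suffice instead of 16, whereas marking LUT 6 reaches all four PIs through LUTs 5 and 7.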

Figure 4.7: Mutation operation in NDER.

4.4.4

Mutation Operation

Mutation is a unary operation, where an individual I from the population undergoes mutation
according to the probability of mutation $p_m \in [0, 1]$, which is usually small to ensure that the mutated individual does not differ significantly in behavior from its ancestor. The proposed approach
differs from the standard mutation operation in genetic algorithms due to the selective
nature of the mutation adopted, as described in Algorithm 4 and illustrated in Figure 4.7.
Notably, mutation is performed on the logical functions and connections of “marked” LUTs belonging to a selected individual, i.e., $\vec{a}$ and $\vec{c}$. It is worth noting that mutation does not take place
if a non-marked LUT is selected; thus the effective mutation probability could be lower than the
user-defined value $p_m$. Similarly, mutation is also applied to select the node(s) driving the corrupt
PO(s).

Algorithm 4 Mutation Operator
A single individual I is mutated
χ is a value sampled from the uniform random variable such that χ ∈ [0, 1]
$m'_{\{p_m\}} : I \rightarrow I'$
for i = 1 to l do
    if $M_i == 1$ then
        if $\chi_i \leq p_m$ then
            $a'_i = Function\_Mutation(a_i)$;
        else
            $a'_i = a_i$;
        end if
        for j = 4i − 3 to 4i do
            if $\chi_j \leq p_m$ then
                $c'_j = Connection\_Mutation(i)$;
                if $c'_j > n$ then
                    if $M_{c'_j - n} \neq 1$ then
                        $c'_j = c_j$; {Reverse Mutation as the LUT selected for connection is not Marked}
                    end if
                end if
            else
                $c'_j = c_j$;
            end if
        end for
    end if
end for
for i = 1 to m do
    if $\Omega_i == 1$ then
        if $\chi_i \leq p_m$ then
            $out'_i = OutputLine\_Mutation()$;
            if $out'_i > n$ then
                if $M_{out'_i - n} \neq 1$ then
                    $out'_i = out_i$;
                end if
            end if
        else
            $out'_i = out_i$;
        end if
    end if
end for
return $I' = (\vec{a}', \vec{c}', \vec{out}')$

As described in Algorithm 5, the function of a selected LUT is mutated (Function_Mutation)
by inverting a randomly selected bit of the 16-bit logical content, whereas its connection
is mutated (Connection_Mutation) by randomly assigning an integer $c'_j$ such that it belongs to
{1, ..., n + i − 1}, where i is used to ensure that only preceding LUTs ({n + 1, ..., n + i − 1})
in the circuit are connected to the i-th LUT and {1, ..., n} represents the n PIs, as highlighted in
Algorithm 6.
It is worth mentioning that a connection mutation takes place if and only if the connection is made to
a marked LUT OR a PI, i.e., ($c'_j > n$ & $M_{c'_j - n} == 1$) OR ($c'_j \leq n$); otherwise the connection
is left unaltered. Similarly, the corrupt PO, as determined by the condition ($\Omega_i == 1$) being true, is
selected for mutation (Algorithm 7: OutputLine_Mutation) and is randomly assigned an integer
∈ {1, ..., n + l}, where 1 through n represents a connection to one of the n PIs and n + 1 through
n + l represents a connection to one of the l LUTs. Again, it is ensured that a connection to a LUT is only
made if that LUT is marked.
Algorithm 5 Function Mutation
Method to mutate the 16-bit content of a given LUT
$Function\_Mutation : a_x \rightarrow a'_x$
Randomly pick an integer index such that index ∈ $\mathbb{Z}_{16}$
if $a_x[index] == 1$ then
    $a'_x[index] = 0$;
else
    $a'_x[index] = 1$;
end if
return $a'_x$

Algorithm 6 Connection Mutation
Method to mutate the interconnections
$Connection\_Mutation : \{c_x, i\} \rightarrow c'_x$
Randomly pick an integer z such that z ∈ {1, ..., n + i − 1}
$c'_x = z$;
return $c'_x$

Algorithm 7 OutputLine Mutation
Method to mutate the output lines
$OutputLine\_Mutation : out_x \rightarrow out'_x$
Randomly pick an integer z such that z ∈ {1, ..., n, n + 1, ..., n + l}
$out'_x = z$;
return $out'_x$
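
The selective mutation can be summarized in a compact sketch that combines the operator with its three helper methods. The function name `mutate`, the dictionary-based individual, and the injectable random generator `rng` are assumptions made for this illustration; LUT indices are 0-based in the code, so the LUT with index i has node id n + i + 1 and may only connect to nodes 1 to n + i.

```python
import random

def mutate(ind, M, omega, n, pm, rng=random):
    """Selective mutation: only marked LUTs and corrupt POs are altered."""
    a, c, out = list(ind["a"]), list(ind["c"]), list(ind["out"])
    l = len(a)

    def allowed(node):
        # A new connection must come from a PI or from a marked LUT.
        return node <= n or M[node - n - 1] == 1

    for i in range(l):
        if M[i] != 1:
            continue                             # non-marked LUTs are preserved
        if rng.random() <= pm:                   # function mutation
            a[i] ^= 1 << rng.randrange(16)       # invert one of the 16 bits
        for j in range(4 * i, 4 * i + 4):        # connection mutation
            if rng.random() <= pm:
                z = rng.randint(1, n + i)        # PIs and preceding LUTs only
                if allowed(z):
                    c[j] = z                     # otherwise the change is reversed
    for i in range(len(out)):                    # output-line mutation
        if omega[i] == 1 and rng.random() <= pm:
            z = rng.randint(1, n + l)
            if allowed(z):
                out[i] = z
    return {"a": a, "c": c, "out": out}
```

Because unmarked LUTs are skipped entirely and rejected connections are reverted, the effective mutation rate in this sketch is at most the user-defined rate pm.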

4.4.5

Single-point Crossover Operation

Crossover is a binary operation, where two individuals $I_\alpha$ and $I_\beta$ (parents) are selected from the
population with a probability $p_c$ and recombined to form a new individual $I'$. The detailed algorithmic steps are described in Algorithm 8. In Single-point Crossover, a point χ is chosen randomly
such that χ ∈ {1, ..., l − 1}, i.e., within the range of LUTs. Then, the offspring individual is formed
by the first χ LUTs, with their logical contents, interconnects and associated markings, from $I_\alpha$ and
the remaining l − χ LUTs from $I_\beta$.
Algorithm 8 Single-point Crossover Operator
Crossover operation is performed between two individuals
$r'_{\{p_c\}} : I^2 \rightarrow I'$
Choose two individuals $I_\alpha$ and $I_\beta$ from Population P(t) with probability $p_c$
Randomly pick a crossover point χ such that χ ∈ {1, ..., l − 1}
$\vec{a}' = (a_{\alpha,1}, \ldots, a_{\alpha,\chi-1}, a_{\alpha,\chi}, a_{\beta,\chi+1}, \ldots, a_{\beta,l})$;
$\vec{c}' = (c_{\alpha,1}, \ldots, c_{\alpha,\chi \cdot 4-1}, c_{\alpha,\chi \cdot 4}, c_{\beta,\chi \cdot 4+1}, \ldots, c_{\beta,l \cdot 4})$;
$\vec{M}' = (M_{\alpha,1}, \ldots, M_{\alpha,\chi-1}, M_{\alpha,\chi}, M_{\beta,\chi+1}, \ldots, M_{\beta,l})$;
return $I' = (\vec{a}', \vec{c}', \vec{M}')$

The individuals selected for the crossover operation might have distinct or similar markings of LUTs.
In the case of similar markings, the offspring individual will have no difference in markings from the
parents, as shown in Figure 4.8. Otherwise, it will have markings of LUTs which are distinct from
both parents, as shown in Figure 4.9.
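
The splice performed by single-point crossover can be sketched as below. The function name `single_point_crossover` and the dictionary layout are assumptions; the crossover point is passed in explicitly so the result is reproducible, and the parent values are placeholders chosen only to make the splice visible. As in Algorithm 8, only $\vec{a}$, $\vec{c}$ and $\vec{M}$ are recombined here.

```python
def single_point_crossover(parent_a, parent_b, chi):
    """Offspring takes LUTs 1..chi from parent_a and LUTs chi+1..l from
    parent_b, including contents, connections and markings."""
    l = len(parent_a["a"])
    if not 1 <= chi <= l - 1:
        raise ValueError("crossover point must lie in {1, ..., l - 1}")
    return {
        "a": parent_a["a"][:chi] + parent_b["a"][chi:],
        "c": parent_a["c"][:4 * chi] + parent_b["c"][4 * chi:],
        "M": parent_a["M"][:chi] + parent_b["M"][chi:],
    }

# Placeholder parents with l = 5 LUTs and dissimilar markings.
pa = {"a": [1, 2, 3, 4, 5], "c": list(range(20)), "M": [1, 1, 0, 0, 1]}
pb = {"a": [6, 7, 8, 9, 10], "c": list(range(100, 120)), "M": [1, 1, 1, 0, 0]}
child = single_point_crossover(pa, pb, 2)
```

Here the offspring inherits the markings of the first two LUTs from the first parent and those of the remaining three from the second.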

4.4.6

Exit Criteria

The criterion for recovery is checked at the start of every GA iteration using the algorithmic steps
in Algorithm 9. During this procedure, it is determined whether any Individual $I_k$ in the
population has a fitness value:
$$\Phi(I_k) \geq \text{recovery\_threshold} \cdot 2^n \cdot m$$

[Figure 4.8 (diagram): Parent A, Parent B and the OffSpring, each drawn as LUT 1 to LUT 5 with marked LUTs indicated by X and the crossover point indicated.]

Figure 4.8: Crossover operation in NDER with parents having similar Markings.

[Figure 4.9 (diagram): Parent A, Parent B and the OffSpring, each drawn as LUT 1 to LUT 5 with marked LUTs indicated by X and the crossover point indicated.]

Figure 4.9: Crossover operation in NDER with parents having dissimilar Markings.

The recovery_threshold is a user-defined value which indicates the required fitness level, e.g., a
system meeting mission requirements at 95% of maximum fitness will have recovery_threshold =
0.95. GAs usually show a significant gain in fitness within a few generations; thus this parameter can
also be set by the user according to the desired recovery rate. If an individual meeting the above
requirement is found, then a successful recovery is achieved and the refurbishment algorithm terminates. At the same time, it is checked whether the maximum allocated time $t_{max}$, in terms of the number
of generations, has elapsed. The determination of the maximum allocated time depends on
the use of the application circuit and on the recovery time constraints whose violation would breach mission requirements. This exit criterion is useful in case the evolutionary repair is not making progress in
terms of improving the fitness value, which may then be resolved by adjusting the GA parameter
settings, such as raising the mutation rate [156]. If the time constraint elapses, the refurbishment
algorithm returns the individual with the maximum fitness achieved thus far. It can subsequently be restarted utilizing this individual as a seed.
Algorithm 9 The exit criteria
The exit criterion is used to terminate the GA-based refurbishment
$t_{max}$ is the maximum number of generations
Threshold can be set according to throughput requirements
$Exit\_Criteria_{\{t_{max}\}} : \{\Phi(I_1), \ldots, \Phi(I_{|P|})\} \rightarrow \{true, false\}$
if $t \geq t_{max}$ then
    return true;
end if
for i = 1 to |P| do
    if $\Phi(I_i) \geq Threshold \cdot m \cdot 2^n$ then
        return true;
    end if
end for
return false;
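
For illustration, the exit test can be written as a small function. The name `exit_criteria` and the argument order are assumptions; the numbers in the usage note correspond to alu2, whose maximum fitness is 6 · 2^10 = 6144.

```python
def exit_criteria(fitness_values, t, t_max, threshold, n, m):
    """True when the generation budget is spent or when some individual
    reaches threshold * m * 2**n."""
    if t >= t_max:
        return True
    target = threshold * m * 2 ** n
    return any(f >= target for f in fitness_values)
```

With a threshold of 1.0 the loop for alu2 stops only once an individual scores 6144, whereas a threshold of 0.95 would already accept any fitness of at least 5836.8.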

4.5

Experiments and Results

Experiments were devised to gauge the efficacy of heuristic-based fault diagnosis in reducing the
search space for evolutionary refurbishment of FPGA-based EHW systems. Further, scalability
in terms of the size of circuits refurbished by NDER is assessed. Experiments are also conducted
with the Conventional GA refurbishment technique for a comparison in terms of the fitness and recovery times achieved. In Conventional Evolutionary Refurbishment (CER), the GA implementation
involves no marking of LUTs to limit the search space and uses standard GA operators. The goal of
the conducted experiments was to attain 100% fault recovery, as it represents the most demanding case.

Figure 4.10: Distribution of fan-out for LUTs in included benchmark circuits.

[Figure 4.11 (diagram): a 2-input LUT and a 3-input LUT whose unutilized inputs are re-mapped to valid nets for utilization by the genetic operators.]

Figure 4.11: Design-time allocation of redundancy.

Benchmark circuits with varying numbers of output lines and sizes, in terms of the number of LUTs,
are selected from the MCNC benchmark suite [182]. The largest circuit is apex4, which
has 1252 LUTs, whereas ex5 has the highest number of output lines, i.e., 63. The I/O characteristics of the circuits are listed in parentheses, with inputs first, as follows: alu2(10, 6),
5xp1(7, 10), apex4(9, 19), bw(5, 28) and ex5(8, 63). The circuits are synthesized and
mapped using the open source ABC tool [162]. Then, Xilinx ISE 9.2i is utilized for technology mapping, placement and routing on a Xilinx Virtex-4 FPGA device. The placement process
is constrained through the User Constraints File, with the post-place and route model generated to
randomly allocate spare LUTs (having a fan-out of zero), which are approximately 10% of the total
LUTs as shown in Figure 4.10. In addition, redundancy is also exploited at unutilized inputs of
LUTs by allocating valid nets as illustrated in Figure 4.11. This redundancy can also be exploited
by the GA to recycle a faulty LUT by appropriately altering its logical content.
In this work, the single Stuck-At fault model is adopted to model Local Permanent Damage [65]. It
covers most hard faults occurring in SRAM-based FPGAs due to the phenomena described earlier.
The fault is randomly injected by asserting a 1 or 0 at a randomly chosen input of a LUT
in the post-place and route model of the circuit. Afterwards, the GA individuals are encoded for
both the NDER and CER techniques. The GA refurbishment experiments are conducted using these
designs on a simulator able to evaluate the functionality of LUT-based FPGA circuits. Further,
both the conventional and proposed GAs are implemented in software.
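
One way to emulate such a fault in a software simulator is to rewrite the 16-bit content of the affected LUT so that it behaves as if one input pin were tied to a constant. The sketch below is an assumption about how this could be done, not a description of the simulator used in this work; the names `stuck_at` and `inject_random_fault` are hypothetical, and pin p is assumed to contribute bit p of the index into the LUT content.

```python
import random

def stuck_at(content, pin, value):
    """16-bit LUT content observed when input `pin` (0..3) is stuck at `value`."""
    faulty = 0
    for idx in range(16):
        forced = (idx | (1 << pin)) if value else (idx & ~(1 << pin))
        faulty |= ((content >> forced) & 1) << idx
    return faulty

def inject_random_fault(a, rng=random):
    """Pick a LUT, an input pin and a stuck value at random.
    Returns the faulty copy of `a` and the chosen (lut, pin, value)."""
    lut, pin, value = rng.randrange(len(a)), rng.randrange(4), rng.randrange(2)
    faulty = list(a)
    faulty[lut] = stuck_at(a[lut], pin, value)
    return faulty, (lut, pin, value)
```

For example, a two-input AND stored as 0x8888 degenerates to a pass-through of its second pin (0xCCCC) when its first pin is stuck at 1, and to constant 0 when that pin is stuck at 0.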
A finite-population standard GA is utilized for both NDER and CER. A population size of 10 is
maintained for all the experiments. The crossover rate and tournament size are fixed at 0.6 and
4, respectively. The NDER parameter level_threshold varies from 0.96 to 0.99. Mutation rates
vary from 0.07 to 0.09 for the experiments, with no mutation being performed on the selection
of LUTs or PIs which are responsible for the generation of POs, i.e., $\vec{out}$ is not altered in the
NDER setup. These GA settings are based on similar previous works [7], [123] and validated
through experiment. This parameter set produced the best tradeoff between GA performance and the memory
required to store the individuals of the population. Additionally, to achieve 100% recovery,
the recovery_threshold is set to 1.0.

4.5.1

NDER Recovery Performance

The results obtained for refurbishment of faulty FPGA configurations describing the selected
MCNC benchmark circuits via NDER as well as CER are presented in Table 4.4. The “Post-Fault
Fitness” column indicates the fitness of the FPGA configuration after the assertion of a random
fault at time 0. Similarly, the “Gen. for 100% recovery” and “Gen. for partial recovery” columns
indicate the number of generations required to attain an FPGA configuration with the fitness value
reported in the “Max. Fitness in Last Gen.” column. The “Avg. Fitness in Last Gen.” column indicates the average fitness of individuals in the last generation over the population size |P|. All results
are reported with the corresponding average, minimum, maximum, and standard deviation over the
indicated number of experimental runs.
The results in Table 4.4 indicate that NDER is able to evolve configurations with 100% correctness
for all circuits. In particular, the average value of maximum fitness in the last generation is equal
to the maximum possible fitness value for the respective circuits. Conventional evolution was unable
to realize useful recovery even when up to 100-fold more generations are allowed. In fact, the
performance of the population as a whole, as indicated by the last generation’s average fitness,
degrades by as much as 32.5% when compared to the average post-fault fitness value for experimental
runs of the apex4 circuit. Typical performance plots averaged over 15 runs showing fitness with
respect to time for both NDER and CER are illustrated in Figure 4.12. This overall behavior of the
conventional GA may be attributed to the I/O characteristics and the sizes of the chosen circuits,
as compared to the ones used in previous works [125, 171, 141, 13, 123, 136]. For instance, it is
shown in [13] that the recovery time increases with an increasing number of output lines.

Table 4.4: Fitness and generations for refurbishment via Conventional and Netlist-Driven approaches.

Conventional GA Refurbishment
                          Post-Fault       Gen. for Partial   Avg. Fitness     Max. Fitness
Benchmark  Statistic      Fitness          Recovery           in Last Gen.     in Last Gen.
alu2       # of runs      7
           Average        6084             25000              4902             6084
           [Min.,Max.]    [5896,6131]      [25000,25000]      [4677,5019]      [5896,6131]
           Std. Dev.      84               0                  109              84
5xp1       # of runs      20
           Average        1272             250000             1101             1273
           [Min.,Max.]    [1249,1279]      [250000,250000]    [997,1179]       [1249,1279]
           Std. Dev.      8                0                  55               7
apex4      # of runs      5
           Average        9620             25000              6493             9620
           [Min.,Max.]    [9271,9727]      [25000,25000]      [6327,6709]      [9271,9727]
           Std. Dev.      197              0                  169              197
bw         # of runs      45
           Average        891              250000             784              891
           [Min.,Max.]    [883,895]        [250000,250000]    [707,841]        [883,895]
           Std. Dev.      4                0                  34               4
ex5        # of runs      29
           Average        16091            20000              13714            16091
           [Min.,Max.]    [15966,16127]    [20000,20000]      [12743,14802]    [15966,16127]
           Std. Dev.      53               0                  468              53

Netlist Driven GA Refurbishment
                          Post-Fault       Gen. for 100%      Avg. Fitness     Max. Fitness
Benchmark  Statistic      Fitness          Recovery           in Last Gen.     in Last Gen.
alu2       # of runs      21
           Average        6041             1426               6073             6144
           [Min.,Max.]    [5632,6136]      [6,11778]          [5901,6134]      [6144,6144]
           Std. Dev.      143              2829               84               0
5xp1       # of runs      52
           Average        1265             1784               1269             1280
           [Min.,Max.]    [1216,1279]      [3,4939]           [1228,1279]      [1280,1280]
           Std. Dev.      19               1457               9                0
apex4      # of runs      20
           Average        9722             633                9724             9728
           [Min.,Max.]    [9712,9727]      [2,8250]           [9718,9727]      [9728,9728]
           Std. Dev.      4                1822               3                0
bw         # of runs      172
           Average        891              619                894              896
           [Min.,Max.]    [882,895]        [4,2498]           [887,895]        [896,896]
           Std. Dev.      3                669                1                0
ex5        # of runs      38
           Average        16114            784                16120            16128
           [Min.,Max.]    [16015,16127]    [10,3752]          [16085,16127]    [16128,16128]
           Std. Dev.      19               997                8                0

Figure 4.12: Fitness versus generations for 5xp1 benchmark circuit.

In addition, another set of CER experiments is conducted to quantify how often evolutionary
operators are applied to components which would not have been marked by NDER. It is revealed that
between 84.7% of mutations (for 5xp1) and 98.2% of mutations (for ex5) are undesirable. Thus, NDER achieves
Research Objective RO-II by design, since only marked LUTs are selected for application of evolutionary operators and the working components are preserved, resulting in a much more tractable
technique for online evolutionary recovery.
The recovery of the ex5 benchmark, which has a total of 352 LUTs and 63 output lines, is achieved in
784 generations on average over 38 experimental runs. Further, the recovery of the largest
benchmark circuit, i.e., apex4 with 19 POs, is achieved in 633 generations on
average over 20 experimental runs. Thus, the scalability of the proposed approach is established,
as successful refurbishment is achieved for circuits of various sizes and with varying numbers of
output lines.
Table 4.5: Speedup in recovery time achieved via NDER.

Benchmark   Speedup (tCER/tNDER)
alu2        ≥ 17.5
5xp1        ≥ 140.1
apex4       ≥ 39.5
bw          ≥ 404.1
ex5         ≥ 25.5

The speedups achieved in refurbishment times with NDER as compared to CER for the selected
benchmarks are reported in Table 4.5. Each speedup is the ratio of the time taken by CER (tCER) to the
time taken by NDER (tNDER). The reported values are lower bounds: CER did not achieve useful
recovery within the allotted generations, so it likely requires significantly more generations,
which would correspond to even greater speedups.
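
As a minimal sketch of how the entries of Table 4.5 relate to Table 4.4, the Python fragment below divides the average number of generations consumed by CER by the average number required by NDER. It assumes that both methods spend the same time per generation, so that the ratio of generations approximates tCER/tNDER. The function and variable names are illustrative only and are not part of the original tool flow.

```python
# Average generations taken from Table 4.4. For CER the value is the
# generation budget at which the run stopped without full recovery, so each
# ratio is a lower bound on the true speedup.
AVG_GENERATIONS = {
    # benchmark: (CER generations, NDER generations)
    "alu2": (25000, 1426),
    "5xp1": (250000, 1784),
    "apex4": (25000, 633),
    "bw": (250000, 619),
    "ex5": (20000, 784),
}


def speedup_lower_bound(cer_generations: int, nder_generations: int) -> float:
    """Ratio tCER / tNDER, assuming equal time per generation (an assumption)."""
    if nder_generations <= 0:
        raise ValueError("NDER generation count must be positive")
    return cer_generations / nder_generations


def speedup_table(generations=AVG_GENERATIONS) -> dict:
    """Map each benchmark name to its lower-bound speedup."""
    return {
        name: speedup_lower_bound(cer, nder)
        for name, (cer, nder) in generations.items()
    }


if __name__ == "__main__":
    for name, ratio in speedup_table().items():
        print(f"{name:6s} >= {ratio:.1f}")
```

Under these assumptions the ratios agree with Table 4.5 to within rounding of the tabulated averages.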

4.5.2 Discussion of NDER Results

Utilization of Spare LUTs and Power Overheads: An observation of the circuits recovered with
the NDER technique indicates that refurbishment is possible either with or without spare LUTs.
In the majority of cases, no spare LUTs were required. In such cases, refurbishment is achieved
by configuring the functionality of existing LUTs. This may also be possible by exploiting design-time redundancy at the LUT level as discussed earlier. In the remaining cases, the utilization of spare LUTs
depends on the benchmark circuit. For experimental runs of the alu2 benchmark, at most a
single spare LUT was utilized. For experimental runs of the 5xp1 benchmark, a maximum of 3
spare LUTs were utilized, while some runs utilized a single spare LUT.
For apex4, between 1 and 6 spare LUTs were utilized. For
the bw benchmark, between 1 and 2 spare LUTs were utilized.
Lastly, for the ex5 benchmark, between 1 and 4 spare LUTs
were utilized. Table 4.6 summarizes the maximum and minimum utilization of spare LUTs for all
the benchmark circuits.
Use of spare LUTs, however, does not increase the critical path delay by an amount significant
enough to impact the circuit’s operation. This can be demonstrated via contradiction. Namely, it
could not have violated timing constraints by virtue of the fact that the configuration was evaluated
as discrepancy-free during fitness evaluation. Meanwhile, the energy consumption of the repaired
circuit in such cases can also be increased slightly due to spare LUTs included in the datapath.
However, such an increase is negligible owing to the small percentage of spare LUTs as compared
to the overall original active circuit area as noted in Table 4.6. Further, this increase in component
count can be considered to be small when compared to full-time TMR approaches.
Another component of power consumption in NDER arises from the reconfiguration performed during the
refurbishment phase. However, this component is small as compared to the mission lifetime power
consumption of the application itself. Furthermore, as described in [100], partial reconfiguration
as opposed to full reconfiguration reduces this component further through utilization of an internal
reconfiguration port such as ICAP.
Table 4.6: Utilization of spare LUTs for 100% recovery via NDER.

Benchmark   Spare LUTs required (% utilization)   % Increase
            Minimum          Maximum              Active Logic
alu2        0 (0%)           1 (4.3%)             0.6%
5xp1        0 (0%)           3 (30%)              7.1%
apex4       0 (0%)           6 (4.8%)             0.5%
bw          0 (0%)           2 (20%)              3.4%
ex5         0 (0%)           4 (10%)              1.3%


Impact of Multiple Corrupt POs: The NDER results from Table 4.4 are broken down into the specific
cases of a single corrupt PO and multiple corrupt POs in Table 4.7. The recovery times reported
in Table 4.8 are calculated as follows, noting that fitness evaluation is the main time-consuming part of the
evolutionary refurbishment process as pointed out in [66]. Let the time to download a configuration
onto the FPGA for evaluating the fitness of an individual be denoted by td and the time to perform the
fitness evaluation itself be denoted by tf. Further, assume g generations are required to find a
refurbished FPGA configuration according to the user criterion. Then, the total reconfiguration
time Tr consumed by the evolutionary algorithm with a population size of |P| is given by Tr =
g|P|(td + tf), which can be approximately considered equal to the recovery time [66].
The population size |P | is usually fixed at design-time. The time to download a partial configuration onto the reconfigurable logic depends upon the size of the benchmark circuit (FE) and the
reconfiguration interface utilized. An evolvable hardware system developed in [136] utilizes an
on-board processor to implement the evolutionary algorithm and reconfigures the FPGA using the
Internal Configuration Access Port (ICAP) interface. The ICAP interface in Xilinx Virtex FPGA
devices can provide configuration bandwidths up to 3.2Gbps [3]. Thus, the times to download the
partial configurations for the selected circuits onto a Xilinx Virtex-4 FPGA device are utilized in
the calculation of reported recovery times and range from 5 µseconds to 130 µseconds. Additionally, the fitness evaluation time tf depends on a test pattern generator operating at 100 MHz and it
ranges from 0.5 µseconds to 10 µseconds for these experiments assuming the worst case.
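
The relation Tr = g|P|(td + tf) can be sketched in a few lines of Python. The example values below are assumptions chosen for illustration: the population size of 10 is hypothetical, and td and tf are set to the upper ends of the ranges quoted above (130 µs and 10 µs). The function name is likewise illustrative.

```python
def recovery_time_seconds(generations: int,
                          population_size: int,
                          t_download_s: float,
                          t_fitness_s: float) -> float:
    """Return Tr = g * |P| * (td + tf) in seconds."""
    if generations < 0 or population_size < 0:
        raise ValueError("generation count and population size must be non-negative")
    if t_download_s < 0 or t_fitness_s < 0:
        raise ValueError("times must be non-negative")
    return generations * population_size * (t_download_s + t_fitness_s)


if __name__ == "__main__":
    # Hypothetical scenario: 1000 generations, |P| = 10 (assumed),
    # td = 130 microseconds, tf = 10 microseconds.
    t_r = recovery_time_seconds(1000, 10, 130e-6, 10e-6)
    print(f"Estimated recovery time: {t_r * 1e3:.0f} ms")
```

Because Tr scales linearly with g, any reduction in the number of generations translates directly into a proportional reduction in recovery time for a fixed population size and circuit.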
In most cases, the recovery times are greater when the fault results in a single corrupt
PO, owing to the increased number of suspect resources. In contrast, when multiple POs are corrupted, the heuristic may limit
the number of marked LUTs to as low as one in some cases due to the unique signature of the failure.
Thus, the proposed scheme can be effectively utilized to reduce the search space of the genetic
algorithm to a tractable size when a fault results in multiple corrupt output lines.

Table 4.7: Performance for 100% recovery via NDER with single vs. multiple corrupt output lines.

Single Corrupt Primary Output
Circuit  Statistic    Post-Fault       Gen. for 100%      Avg. Fitness
                      Fitness          Recovery           in Last Gen.
alu2     # of runs    9
         Average      6074             3147               6099
         [Min.,Max.]  [5632,6136]      [73,11778]         [6027,6134]
         Std. Dev.    166              3762               35
apex4    # of runs    9
         Average      9725             1294               9726
         [Min.,Max.]  [9721,9727]      [4,8250]           [9721,9727]
         Std. Dev.    2                2642               2
bw       # of runs    158
         Average      891              668                894
         [Min.,Max.]  [882,895]        [4,2498]           [887,895]
         Std. Dev.    3                677                1
ex5      # of runs    23
         Average      16114            1008               16119
         [Min.,Max.]  [16015,16127]    [10,3752]          [16085,16127]
         Std. Dev.    24               1143               10

Multiple Corrupt Primary Outputs
Circuit  Statistic    Post-Fault       Gen. for 100%      Avg. Fitness
                      Fitness          Recovery           in Last Gen.
alu2     # of runs    12
         Average      6015             134                6054
         [Min.,Max.]  [5874,6128]      [6,373]            [5901,6131]
         Std. Dev.    125              131                104
apex4    # of runs    11
         Average      9720             92                 9723
         [Min.,Max.]  [9712,9725]      [2,306]            [9718,9727]
         Std. Dev.    4                104                4
bw       # of runs    14
         Average      891              65                 893
         [Min.,Max.]  [886,894]        [10,152]           [889,895]
         Std. Dev.    2                48                 2
ex5      # of runs    15
         Average      16114            441                16121
         [Min.,Max.]  [16082,16124]    [43,2379]          [16111,16126]
         Std. Dev.    10               604                5

Table 4.8: Multiple output discrepancies reduce recovery time.

Benchmark   Single Corrupt PO (msecs)   Multiple Corrupt POs (msecs)
alu2        983                         42
apex4       1765                        126
bw          46                          5
ex5         396                         173
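
The reduction in recovery time between the two fault cases can be read off Table 4.8 directly. The short Python sketch below computes the per-benchmark ratio of single-corrupt-PO to multiple-corrupt-PO recovery time from the tabulated values. The names are illustrative, and the ratios are simple quotients of the rounded millisecond figures.

```python
# Recovery times in milliseconds from Table 4.8:
# benchmark -> (single corrupt PO, multiple corrupt POs)
RECOVERY_MS = {
    "alu2": (983, 42),
    "apex4": (1765, 126),
    "bw": (46, 5),
    "ex5": (396, 173),
}


def reduction_factor(single_ms: float, multiple_ms: float) -> float:
    """How many times faster recovery is when multiple POs are corrupted."""
    return single_ms / multiple_ms


def reduction_factors(table=RECOVERY_MS) -> dict:
    """Map each benchmark to its single/multiple recovery-time ratio."""
    return {name: reduction_factor(s, m) for name, (s, m) in table.items()}


if __name__ == "__main__":
    for name, factor in reduction_factors().items():
        print(f"{name:6s} {factor:.1f}-fold")
```

The largest ratio is obtained for alu2 (983/42), and every benchmark in the table shows a ratio greater than one.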

Figure 4.13: Distributions of the size of subset of PI(s) required for fitness evaluation.

4.5.3 Scalability of Evaluation

Recovery time has a direct correspondence with the number of test patterns utilized during the
fitness evaluation of a candidate solution. NDER attempts to reduce the required number of test
patterns by pruning circuit elements to form its pool of marked LUTs. In particular, NDER assessment would at its worst perform identically to a conventional GA, whereby all PIs have been marked
for evaluation. The conventional GA approach requires repair evaluation having the maximal cardinality of 2^n tests, where n is the number of PIs. On the other hand, Figure 4.13 shows the expected value of this same metric
for the NDER technique over 500 random faults. Here the benchmark-specific probability distribution of the number of PIs expected for fitness evaluation is based on a combinatorial analysis of
each circuit output line. For example, the probability of NDER incurring exhaustive evaluation for
the alu2 benchmark is only 0.55, even if fault location is equiprobable among the resource pool.
Furthermore, the outcome that the cardinality of the test pattern set is 2^7 has probability 0.27. In other
words, Figure 4.13 quantifies the expectation of reduced evaluation time for NDER based on the
structure of the circuit at hand and its fanout. In particular, it shows the trend that the probability of
exhaustive evaluation decreases as the fanout of the circuit under test is increased, thus improving
refurbishment scalability for circuits with larger fanout.
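
To illustrate how such a distribution translates into an expected evaluation cost, the sketch below computes the expected test-set cardinality as the probability-weighted sum of 2^k over the number k of PIs required. The distribution used in the example is hypothetical: the input count of 10 and the probabilities 0.55 and 0.27 echo the alu2 discussion above, but the remaining probability mass (0.18 assigned to 5 PIs) is a placeholder and is not taken from Figure 4.13.

```python
def exhaustive_patterns(num_inputs: int) -> int:
    """Number of test patterns for exhaustive evaluation over num_inputs PIs."""
    return 2 ** num_inputs


def expected_patterns(pi_distribution: dict) -> float:
    """Expected test-set size for a map {number of PIs: probability}."""
    total = sum(pi_distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(prob * exhaustive_patterns(k) for k, prob in pi_distribution.items())


if __name__ == "__main__":
    # Hypothetical distribution for a circuit assumed to have 10 PIs.
    distribution = {10: 0.55, 7: 0.27, 5: 0.18}
    expected = expected_patterns(distribution)
    worst_case = exhaustive_patterns(10)
    print(f"expected {expected:.2f} patterns vs. {worst_case} exhaustive")
```

Any probability mass that falls on fewer than n PIs pulls the expected cardinality below the 2^n worst case of the conventional GA.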

4.6 Summary

Significant speedups in recovery time, exceeding 400-fold as compared to conventional evolutionary refurbishment, are readily achievable via NDER. This estimate is conservative considering that
conventional EHW was unable to achieve recovery in the benchmarks under test. Finally, high
fan-out of suspect resources has been shown to be useful in reducing the recovery times achieved via
NDER. This effect arises from the intersection of the failed component's effects on
the overall circuit correctness. In particular, failures resulting in multiple output disagreements were
shown to decrease recovery times by up to 23.4-fold.

CHAPTER 5: SELF-RECOVERY ENABLED LOGIC FOR ANTI-AGING
IN ASICs

The design space for circuit synthesis is explored and power-gating is utilized to realize a post-fabrication self-adapting circuit-level approach to mitigate timing degradations caused by transistor
aging effects in nanometer-scale CMOS logic. Power-gating is exploited to enable stress reduction,
which is known to mitigate transistor aging effects such as BTI and HCI. However, power-gating
may typically incur complexities such as throughput degradation and aging of the sleep transistor
itself. Thus, techniques are sought to maintain circuit throughput while providing rejuvenation
effects to just the aging-critical portions of the logic. In order to reduce area and leakage overheads,
only a selected set of paths identified using conventional EDA tools are re-instantiated into the
netlist as independent power-gated voltage islands. Proactive as well as reactive control policies
are devised to enable architectural-level management of the voltage islands. The need for predictive
design-time modeling of aging behavior is eliminated through use of online timing measurement
sensors which facilitate autonomous adaptation.

5.1 Autonomous Circuit-level Adaptation for Resilience and Lifetime Energy Reduction of Logic Paths

A continual push to decrease feature sizes in order to enable higher integration and performance
levels has given rise to new challenges such as reliable operation and the heat dissipation power
wall which restrains the active portion of the chip [144]. This has elevated the need to develop
techniques at various abstraction layers that consider the temporal performance degradation of
logic circuits while ensuring maximum performance within a chip's Thermal Design Power (TDP)
constraint. While power-gating has been effectively utilized to address TDP constraints, herein
a circuit-level technique, Self-Recovery Enabled Logic (SREL), is developed which leverages these
benefits to further provide reliable operation in the presence of transistor aging effects for aging-sensitive logic domains. Considerable research efforts have been made to promote wear-leveling
for aging mitigation in memory circuits, as discussed next; however, this work focuses on aging
mitigation in logic datapaths.

5.1.1 Proactive Aging Recovery Techniques for SRAM Arrays

Large on-chip SRAM capacities prevalent in contemporary designs facilitate exploration of proactive recovery techniques, which have been previously demonstrated for memory circuits in [147,
118]. Portions of the SRAM array which remain under-utilized provide a resource pool for such
wearout management schemes while incurring area and performance overheads. For instance, the
authors in [147] studied the effectiveness of multiple schemes for cache reliability such as proactive/reactive use of redundancy, Error Correcting Codes (ECC) and Graceful Performance Degradation (GPD). Among these, ECC tends to have very high area overheads, whereas
GPD has no area overhead as certain portions of the existing cache are reserved to accommodate
failed arrays. However, the performance loss with GPD is high for memory-intensive benchmarks.
Thus, spatial redundancy at various levels such as row, column and array is introduced to overcome
the performance loss. Redundancy is managed by proactively or reactively switching selected portions of the cache into recovery mode via cache-line migration mechanisms. The performance
impact of cache data migration is analyzed both with and without dedicated links for this purpose.
Depending on the cache design, the array selection logic which is utilized to invalidate or drain
cache data may or may not lie on the data access path of the cache and hence can affect overall
memory performance. In summary, the authors were able to show that proactive management of
resources using round-robin scheduling with one additional array for a set of cache arrays can
extend lifetime with better performance and area tradeoffs than reactive redundancy management, ECC and GPD schemes.

Other aging-resilience schemes realizing proactive recovery which
have been proposed for SRAM-based structures include a periodic bit flipping technique to balance
the duty cycle [90]. It is straightforward to implement such schemes for microarchitecture structures such as the physical Register File (RF) with minimal performance overheads, as it has high
idle times [5],[96]. On the other hand, for other structures such as the data cache, the idle times
are strongly dependent on workload and hence the performance overheads may be high due to the
requirement of forcible bit flipping via invalidations [5]. Another specialized technique to manage
the stress of SRAM cells in a cache structure is through controlling the uneven access pattern [68].
For instance, a dynamic indexing scheme is proposed in [32] to distribute the stress uniformly
over all data lines. In this work, rather than utilizing the inherent redundancy and structural symmetry in SRAM arrays, the novelty is to explore proactive and reactive wearout management for
aging-critical logic portions of the circuit by introducing redundancy at the logic gate-level. The
proposed approach also removes dependence on forcible generation of idle time for recovery of
resources. Furthermore, the need for precise predictive aging models is reduced by making the
circuit adaptable at the logic level.

5.1.2 Research Objectives of the Proposed Technique

To extend the above techniques for aging-mitigation of logic datapaths, it can be intriguing to
consider the influence of near-critical paths on the device lifetime. In this chapter, the tradeoffs of
partitioning a circuit to identify aging-critical logic domains are considered. Within these domains,
proactive or reactive aging-recovery is pursued. In practice, the desire for balancing timing
paths to increase performance leads to a skewed path delay distribution. Herein, the concentration
is on designs where the proportion of near-critical paths is manageable. Rather than surmounting this
timing wall [18], characteristics are identified where judicious leveraging of path delays is feasible.
Thus, only aging-critical components are protected to explicitly limit the area overheads. In previous work, the enhancement of logic reliability via indiscriminate Structural Duplication (SD), as
demonstrated in [154], can result in prohibitive area overheads. In particular, the approach identified functional elements at a module-level granularity for instantiation as standby components to
facilitate an increase in device lifetime. In case of a failure, the standby element can be activated
resulting in increased lifetime. Structural duplication is considered feasible only if 100% area cost
is acceptable, yet herein it is demonstrated that it is also possible to reduce energy consumption
using significantly less area. Other techniques which exploit the redundancy and/or idle time in
microarchitecture structures like the existence of multiple ALUs in an execution stage [120] or
multiple cores in a multi-core chip [77],[8] to manage wearout of logic paths are discussed in
Section 2.2.4. In the proposed approach, fine-grain gate-level redundancy is shown to provide
seamless throughput as compared to schemes which exploit inherent redundancy in time or space.
The circuit design space is explored at design-time using the proposed synthesis techniques with
objectives and challenges as mentioned in Figure 5.1. The proposed SREL techniques of Alternating Critical Path (ACP) and Competing Critical Path (CCP) enable autonomous circuit-level
adaptation with benefits as discussed below:

• Aging Mitigation: power-gating helps to reduce the delay degradation through rejuvenation
of timing-critical portions of the circuit.
• Energy Reduction: the devised technique enables the circuit to operate with small guardbands
for a longer time, as the rate of performance degradation is reduced, while minimizing cooling
demands.
• Post-Fabrication Adaptability: the need for accurate aging prediction is reduced as the sleep
interval can be adjusted based on run-time conditions.

Figure 5.1: Design objectives and challenges of the devised SREL techniques.

• Autonomous Anti-Aging: provides circuit adaptation according to real-time operating conditions based on intrinsic measurement of actual circuit behavior without requiring accurate
aging models.
• Area Multiplexing: may help to ease the demands of dynamic voltage/frequency management for some applications. It is favorable as the power efficiency of on-chip regulators
utilized for such Dynamic Voltage Scaling (DVS) schemes is often 85% or less, while occupying additional die area [38].
• Elimination of chip hot-spots: power-gating the standby critical paths helps to cool them off,
thus further reducing the aging effects. For instance, [120] reports that a power-gated circuit
experiences both less stress and lower temperature.
• Focused Redundancy: circuit synthesis flows utilizing top-path extraction and remodeling
automate the provision of redundancy without 100% area increase as only aging-sensitive
portions are replicated.

5.1.3 Power-gating Approaches to Mitigate Aging

Power-gating has been demonstrated as a prominent technique to mitigate transistor aging, as it
lowers the amount of stress incurred since the interval for which the electric field is applied is
decreased. This is due to the favorable annealing capability demonstrated by interfacial traps at
the Si/SiO2 interface when stress is removed. Similar recovery effects are also shown to be present
in High-K devices [131]. Thus, a significant portion of Vth shift responsible for delay degradation
can be recovered [176]. Power-gating can also accelerate the recovery process as the potential for
temperature extremes may be reduced [177]. This chapter focuses primarily on leveraging BTI
recovery mechanisms using circuit-level adaptation.
Although the use of power-gating to realize periodic activity of the datapath can facilitate a reduction in the amount of stress, doing so normally acts to suspend throughput [34]. For example, [38]
reports that to reduce the delay degradation by 5% over an operational period of 10 years, the
circuit must remain power-gated for over 60% of the device lifetime, which represents an unacceptable reduction in operational capacity. Thus, in the work herein, stress reduction is desired while
maintaining uninterrupted circuit operation, i.e., without throughput degradation. This is achieved
by identifying potential logic nodes for adaptive aging protection rather than the larger circuit as a
whole [154]. Next, the mechanics of implementing power-gating are discussed.
Sleep Transistor Insertion: Power gating creates a virtual power plane whereby a Sleep Transistor
(ST) is connected in series with VDD (header) or GND (footer). This directly affects the delay of
the logic circuit as the effective VDD in Eq. 2.6 is reduced by a small amount determined by the ST
width. However, the size of the sleep transistor can be selected such that the voltage drop across
it is around 1% of the supply voltage [148], which has been applied here to make the delay overhead
negligible.
A valid concern with using pMOS-based (header) STs is that they are also susceptible to NBTI
effects, which may consequently affect the performance of the logic circuit as noted in [181]. For
that reason, periods of recovery can be beneficial to compensate for ST degradation. SREL is
compatible with the recommendation of providing redundant alternative STs. SREL utilizes them
on a rotating basis to mitigate the aging effects of the STs themselves. In the proposed work,
similar effects are also pursued for the logic circuit.
Considerations with ST placement in the header or footer are discussed in the literature [33, 177,
83]. Furthermore, our experiments with multiple logic gates such as INV, NAND2 and NOR2 show
that use of a header is feasible to establish stress reduction if the aging of the sleep transistor itself
is mitigated. For example, Figure 5.2 shows that ∆Vth(t) under a header is the lowest among all
configurations, when test vectors are applied which promote recovery under NBTI effects. For
power-gated circuits in this example, the ST is alternately activated every 3 months to promote
self-recovery.
Use of ST to promote recovery: In the standby mode with the use of sleep transistors, the virtual
power lines ultimately charge to either VDD or GND depending upon the use of footer or header,
respectively. The advantage with the use of a header is that the Vgs for pMOS transistors in most logic
gates is greater than zero, which essentially drives the transistor into recovery mode irrespective of
the inputs applied. This is consistent with the findings in [170], which show that driving VDD
to zero under variable voltage settings can have huge gains in terms of recovery.
Figure 5.2: Effect of Vth degradation on logic gate with different ST configurations. [Plot of ∆Vth (volts) versus time (seconds) for Header ST, Footer ST, and No ST.]

5.1.4 Organization of the Chapter

The remainder of the chapter is organized as follows. Section 5.2 presents the proactive and reactive schemes utilized to control the amount of stress incurred in the field. Section 5.3 presents trade-off considerations with replication of timing-critical portions of the circuit. Section 5.5 presents
experiments and results of energy reduction.

5.2 Self-Recovery Enabled Logic (SREL)

Two distinctive active anti-aging strategies are developed: Alternating Critical Paths (ACP) is
based on proactive switching between redundant aging-critical domains, whereas Competing Critical Paths (CCP) is based on reactive switching. The proposed selective redundancy enables control
over the stress profile of critical resources while ensuring uninterrupted operation. The operation,
advantages and disadvantages of both schemes are discussed below.

5.2.1 Alternating Critical Paths (ACP)

Alternating Critical Paths is a form of proactive resource switching. Proactive switching has primarily been advocated for memory circuits as they provide intrinsic redundancy [147] with minimal re-design effort as discussed earlier. ACP proactively switches between the independently
power-gated aging-critical logic domains throughout the operational period of the circuit to keep
the overall delay degradation to a minimum and to enhance the chances for BTI-recovery. Only one
instance is active at any given time, which can be ensured by generation of mutually-exclusive
control signals for the sleep transistors used for power-gating the domains.
Effect of Switching Interval: The switching interval determines the amount of degradation incurred by a logic domain. Figure 5.3 shows the effect of varying the switching interval for the
frg2 benchmark circuit. A 2.84-fold reduction in delay degradation is achieved as compared to an uncompensated design by using a redundancy level of N = 2 and a switching interval of 3 months. Furthermore, by reducing the switching interval from 3 months to 1.5 months, the degradation can be
further reduced by about 1.15-fold. Primarily, the degradation is highly dependent on the stress
incurred during a given interval of operation, whereas only a limited amount of time is required to
achieve the desired recovery effect, as noted through various aging models discussed earlier. Thus,
by controlling the active time of a logic domain, it is possible to limit the overall performance
degradation over time, provided that the energy overhead associated with switching is
negligible as compared to the estimated lifetime energy consumption. The control signals for the
sleep transistors are generated by a low-frequency counter or a middleware-based software control
mechanism [89] tracking the cumulative utilization of individual logic domains.
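
The counter-based control described above can be sketched behaviorally as follows. This is a software model under stated assumptions, not the circuit used in this work: the class name, the abstract time step (for example, one month per tick), and the round-robin order are illustrative choices. The model keeps exactly one domain active, tracks cumulative utilization per domain, and produces mutually exclusive sleep signals.

```python
class ACPController:
    """Illustrative counter-based controller for N power-gated logic domains."""

    def __init__(self, num_domains: int = 2, switch_interval: int = 3):
        if num_domains < 2 or switch_interval < 1:
            raise ValueError("need at least two domains and a positive interval")
        self.num_domains = num_domains
        self.switch_interval = switch_interval  # active time before switching
        self.active = 0                         # index of the powered domain
        self.elapsed = 0                        # counter since the last switch
        self.usage = [0] * num_domains          # cumulative active time per domain

    def tick(self, steps: int = 1) -> None:
        """Advance time; rotate to the next domain when the counter expires."""
        for _ in range(steps):
            self.usage[self.active] += 1
            self.elapsed += 1
            if self.elapsed == self.switch_interval:
                self.elapsed = 0
                self.active = (self.active + 1) % self.num_domains

    def sleep_signals(self) -> list:
        """True means the domain is power-gated; exactly one entry is False."""
        return [index != self.active for index in range(self.num_domains)]


if __name__ == "__main__":
    controller = ACPController(num_domains=2, switch_interval=3)
    for month in range(1, 7):
        controller.tick()
        print(month, controller.sleep_signals(), controller.usage)
```

Shortening `switch_interval` in this model corresponds to reducing the sleep interval, which limits the uninterrupted stress time accumulated by any one domain.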
Figure 5.3: Effect of BTI recovery by changing the sleep interval using N = 2 instances. [Plot of propagation delay (psec) versus age (years) for sleep intervals of 3.0 months and 1.5 months, with annotations "First Instance Sleeps" and "Second Instance Activated".]

Along with these factors, the sleep interval can be difficult to determine at design-time
using aging models. However, this limitation is also applicable to other common aging-mitigation schemes, unless some runtime feedback and compensation technique is available. Indeed, aging models are utilized herein to determine the sleep interval, but the circuit can be deemed
adaptable as there is provision for altering the sleep interval at run-time, which is seen to be beneficial as illustrated in Figure 5.3.
Need for circuit-level feedback: Proactive switching has the following disadvantages, which
establish the need for CCPs as discussed afterwards:

1. Determination of sleep interval at design-time is based on assumptions and may not match
the actual circuit aging.
2. A timing violation may remain undetected using a proactive-only strategy without in-situ
timing assessment and correction.

5.2.2 Competing Critical Paths (CCP)

Fully autonomous reliable operation with optimal switching interval and minimal design effort
is realized with the self-adaptive Competing Critical Paths technique. Autonomous operation at
the circuit-level is developed using a feedback mechanism based on aging degradation during operation. This feedback can be in the form of timing errors incurred over time, which can help
distinguish whether the errors are from aging-induced failures, soft-errors, or other variation related failures. Thus, the sleep interval for CCPs is adapted autonomously based on aging-induced
failures.
CCPs are designed to autonomously control the aging of resources through the provision of redundant resources to enable recovery management. As opposed to adaptive techniques which
manage circuit operating conditions such as voltage and frequency, CCPs can offer reduced power
overheads. This is because changing the voltage for the complete circuit just to manage the delay of
timing-critical portions of the circuit can result in energy overheads if the circuit is not partitioned
into aging-critical domains. In addition, the energy inefficiencies and area overheads of the voltage
regulators required to achieve such management of operating conditions are generally ignored in
other research works. Herein, similar effects are sought by tuning the resource utilization, without
incurring the power overheads associated with schemes which manage the operating conditions.

Figure 5.4: Use of Timing Sensors in CCPs. [Diagram annotations: "Control is independent of the Main Datapath"; "Shadow latch operates on ADS% dilated clock to detect desired timing violation due to aging-effects"; "Independent ST allow for BTI-recovery for ST themselves"; "Shadow latches utilized for only portion of the design".]

Autonomous Aging-aware Resource Management: Shadow latches [59] can be utilized to detect
timing errors on aging-critical logic domains as shown in Figure 5.4. These operate by sampling
the output data at two different points in time. The earlier, speculative sample is stored in the main
latch, which is augmented by a shadow latch operating with a delayed clock signal. Consequently,
an aging-induced timing violation can be detected by comparing the two values using a simple XOR
logic gate. Other aging-detection timing sensors can also be utilized [50],[122],[92], whereby such
timing sensors provide various area/power/detection tradeoffs and granularity of coverage. Without loss of generality, a simple shadow-latch-based timing sensor is examined to demonstrate the
proposed SREL feedback mechanism. Its use is motivated by Razor [59], which utilizes a feedback mechanism to scale voltage below the nominal operating voltage based on timing speculation.
The augmented error signal generated by an active logic domain is used to control the STs for all
logic domains. A round-robin activation pattern for the STs is adopted as shown in Figure 5.4 for
N = 2. A latch is connected with the ST to retain its control signal. The switch signal is used to
clock the latch. It is worth mentioning here that the proposed control logic is not in the critical path
of the main circuit. By design, only one domain is active at any given time and this simple control
can be extended to N logic domains by the use of a log2(N)-to-N decoder circuit. At runtime,
the logic domains compete for activation based on their timing behavior, which may be variable due
to input-dependent signal activity. Initially, a single default instance from the pool is active while
other instances are power-gated to be cold-standby alternatives. When a timing violation occurs,
the straightforward control logic is used to switch to an alternative logic domain by removing the
current domain from operation. In summary, the reactive CCP technique utilizes intrinsic
aging behavior to manage resources. On the other hand, the proactive ACP technique uses counter-based control logic and thus does not require shadow latches.
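
A behavioral sketch of this reactive feedback loop is given below. It is an abstraction under stated assumptions rather than the implemented circuit: the main latch is assumed to sample at `main_sample_time`, the shadow latch at that time dilated by the fraction `ads`, and a sample is treated as correct when the path delay does not exceed the sampling time. The class and function names are hypothetical.

```python
def shadow_latch_error(path_delay: float, main_sample_time: float, ads: float) -> bool:
    """XOR of main and shadow samples, modeled as 'captured in time' flags.

    Assumption: the shadow latch samples at main_sample_time * (1 + ads).
    A delay beyond the shadow window is missed by both latches and is
    therefore not flagged by this simple model.
    """
    main_captured = path_delay <= main_sample_time
    shadow_captured = path_delay <= main_sample_time * (1.0 + ads)
    return main_captured != shadow_captured


class CCPController:
    """Round-robin selection of the active domain, advanced on timing errors."""

    def __init__(self, num_domains: int = 2):
        if num_domains < 2:
            raise ValueError("at least two redundant domains are required")
        self.num_domains = num_domains
        self.active = 0  # default instance is active at start-up

    def decoder_outputs(self) -> list:
        """One-hot enable lines, as a log2(N)-to-N decoder would produce."""
        return [index == self.active for index in range(self.num_domains)]

    def on_cycle(self, timing_error: bool) -> list:
        """Switch to the next domain when the active one reports an error."""
        if timing_error:
            self.active = (self.active + 1) % self.num_domains
        return self.decoder_outputs()


if __name__ == "__main__":
    controller = CCPController(num_domains=2)
    for delay in (0.99, 1.00, 1.01, 0.98):  # normalized path delays (made up)
        error = shadow_latch_error(delay, main_sample_time=1.0, ads=0.02)
        print(delay, error, controller.on_cycle(error))
```

In this model the domain that was switched out remains power-gated until the round-robin order returns to it, which is when it has had the opportunity to recover.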
Aging-Degradation Slack: At design-time, straightforward voltage assignment can be done such
that the delay of the circuit is ADS% below the timing specification Dspec , where ADS% is defined
to be the Aging Degradation Slack, normally one or two percent. Then, CCP autonomously adapts
the activation of redundant logic domains such that the rate of degradation incurred in the field is
always below ADS%. Thus, CCP eliminates the need to determine the sleep interval at design-time, as required by ACP, as well as the need for accurate aging estimation via predictive aging models. Results for
various values of ADS in comparison to proactive recovery will be provided in Section 5.5.
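As a small worked illustration of the slack definition, the sketch below computes the design-time delay target from Dspec and ADS and checks when a reactive switch would be needed. The function names and the normalized delay units are assumptions made for illustration only.

```python
def design_time_delay_target(d_spec: float, ads_percent: float) -> float:
    """Delay that lies ADS% below the timing specification d_spec."""
    return d_spec * (1.0 - ads_percent / 100.0)


def needs_switch(current_delay: float, d_spec: float) -> bool:
    """A reactive scheme switches domains once aging has used up the slack."""
    return current_delay > d_spec


# Example: a normalized specification of 1.0 with ADS = 2% gives a target of 0.98.
target = design_time_delay_target(1.0, 2.0)
```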

5.3 Aging-Sensitive Logic Domain

5.3.1 Identifying Paths for Aging Mitigation

SREL aims to mitigate delay degradation of aging-critical parts of a logic circuit. Within these
aging-sensitive domains, logical paths which have their delays close to the target delay specification Dspec can be considered to be susceptible to timing violation due to aging effects. In practice,
for a circuit to perform correctly, all paths should have their delays less than the specification
throughout the operational lifetime of the circuit. Generally, assuming the delay of the slowest
logical path is given by Dcritical(0) at t0, then it is implied that Dcritical(t) ≤ Dspec at all times
t ≤ tlifetime. Theoretically, replicating only this single longest-delay path may be sufficient to protect a circuit against aging-induced timing degradation. However, in practice, there
may exist some other near-critical path with delay Dnear-critical(t) which becomes critical after
some arbitrary time t′ of operation due to the varying level of stress incurred, i.e., Dnear-critical(t′) >
Dcritical(t′). For example, this phenomenon is observed for the seq benchmark circuit in our experiments. Thus, due to the existence of such potential critical paths, it may not be sufficient to replicate
only a single critical path. Therefore, SREL ensures coverage of paths whose delays are in the
closed interval [Dcritical(t) × (100% − P), Dcritical(t)], where P is the top-path parameter selected
at design-time to specify the subset of paths for aging mitigation.
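The bracket-based selection can be sketched as a simple filter over path delays. This is an illustrative sketch only: the dictionary of path names and delays is a hypothetical stand-in for a timing report, and the function name is not taken from this work.

```python
from typing import Dict, List


def select_aging_critical_paths(path_delays: Dict[str, float],
                                p_percent: float) -> List[str]:
    """Return paths whose delay lies in [D_critical * (1 - P), D_critical]."""
    d_critical = max(path_delays.values())
    lower_bound = d_critical * (1.0 - p_percent / 100.0)
    selected = [name for name, d in path_delays.items() if d >= lower_bound]
    # Report the slowest paths first.
    return sorted(selected, key=lambda name: -path_delays[name])


# Hypothetical delays in nanoseconds.
delays = {"a": 1.00, "b": 0.95, "c": 0.89, "d": 0.50}
top_paths = select_aging_critical_paths(delays, 10.0)  # paths "a" and "b"
```

With P = 0 only the single critical path is returned, which corresponds to the theoretical case of replicating one path.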
Often, paths in the top 10% delay bracket have been considered to be adequate for protection
against cumulative delay variations due to aging effects [120][183]. Other paths which are not in
this bracket are less likely to violate the timing constraints and are not considered for replication. This
bounds the energy overheads as compared to a scheme which simply replicates the entire module
to achieve the desired recovery effects. Here, it is worth mentioning that PV is not compensated
in the scope of this work, which may affect the selection of critical paths. Hence, only the reduction of
guardbands due to aging effects is considered.

[Figure 5.5 plot: Normalized Path Counts (vertical axis, 0 to 0.12) versus Normalized Path Lengths (horizontal axis, 0 to 1).]

Figure 5.5: Path distribution for Arithmetic Logic Unit of an OpenSPARC core.

5.3.2 Design-time Tradeoffs

The selection of P can significantly constrain the area overhead of the SREL scheme which is
also dependent on the path delay distribution of the circuit. For instance, the path distribution
of the OpenSPARC T1 as shown in Figure 5.5 [72] indicates a significant spread between the
length of the critical path and the majority of paths in this actual processor core. In particular,
roughly 90% of the logic paths are seen to be less than 75% of the length of the longest path. Such
unbalanced distributions of path length can facilitate a SREL approach of identifying the top P
paths to undergo replication, for small P .
On the other hand, designs which are balanced to achieve performance efficiency during the synthesis phase [18] may require an excessive number of paths to be replicated. However, this deterministic optimization may not be completely effective for timing yield in the presence of increased
PV, which can increase the likelihood that a path does not meet specification. Hence, complex
CAD-based simultaneous yield and power optimization techniques based on statistical methods
need to be devised [161]. The yield-enhancement techniques which focus on lowering the critical path density can be beneficial in reducing the area overhead of the scheme proposed herein. Thus,
designs with such characteristics appear to be most suitable for SREL or gate-sizing approaches.
Furthermore, in practice, P could be selected based on the amount of delay degradation expected
during device operation which is determined by factors such as temperature, supply voltage, input
characteristics, and expected lifetime. Moreover, SREL need only be applied to aging-sensitive
paths which are shown to comprise a small subset of a practical die. For instance, the ALU inside
of an execution stage in a microprocessor can be considered to be aging-sensitive in terms of signal
activity [120].

5.3.3 Gate-level Redundancy for Aging-Mitigation

The replication of logic paths is illustrated through an example circuit shown in Figure 5.6, which
depicts only the top three aging-critical paths; other connections to non-critical portions of
the circuit are omitted to illustrate the SREL scheme more clearly. The following notation is
utilized in Figure 5.6: CP_j^i denotes the i-th instance of the j-th critical path. Thus, the logic gates
marked with CP_1^i constitute the most critical path in the circuit, i.e., it has a delay of Dcritical(0) at
t0. However, another path, shown by gates marked CP_2^i, which bifurcates from this path and reconverges afterwards, may become critical in terms of cumulative delay degradation as described
earlier. Similarly, there is another path CP_3^i which leads to some other Primary Output (PO), apart
from the critical PO. As shown, this path differs by only one gate from the main critical
path CP_1^i and thus shares the majority of the stress profile with the CP_1^i gates. It turns out that all such
paths will have delays which lie in the top delay bracket of the circuit. Note that it is possible for
a circuit to have multiple such paths which are completely uncorrelated topologically. Thus, all
such paths are covered by targeting this delay bracket.

[Figure 5.6 diagram: gate instances of the replicated critical paths, terminating at Critical PO1, Critical PO2, Near Critical PO1, and Near Critical PO2.]

Figure 5.6: Influence of critical path replication on critical and near-critical POs.

As a consequence of this replication, paths which feed into the logic gates in the critical path(s)
incur a slight increase in their delay due to increased fan-out. For example, this is shown by
observing the INV gate in Figure 5.6. The increase in fan-out is directly proportional to the redundancy
level N. However, this increased delay should not affect the critical delay(s), because if it were to
make such a path critical, then that path should have been included in the top delay bracket and covered by SREL. In
practice, top delay bracket identification would include an iterative procedure considering physical
layout within the broader context of the timing closure process.

[Figure 5.7 diagram: replicated critical-path instances with enable (ENB) inputs controlled by SEL1 and SEL2, outputs Critical PO1 and Critical PO2, and a connection to other non-critical paths.]

Figure 5.7: Insertion of merging point to accommodate replication of critical paths.

5.3.4 Controlled Inclusion of Merging Points

The non-critical portions of the circuit having fan-in from replicated critical paths, as shown in
Figure 5.7, need only one signal line instead of N signals. Thus, some Merging Points (MPs) need
to be included in the replicated circuit as shown. The best signal-merging option in terms of energy
and delay overheads was found to be tri-state buffers (TBUFs), after evaluating different types of
gates for the MPs, such as transmission gates, N-to-1 multiplexers, etc., via simulation.
Thus, TBUF elements are instantiated throughout this work as illustrated in Figure 5.7, where
SELi denotes the sleep signal for the i-th instance. It is worthwhile to note that MPs may directly
add to the delay of non-critical paths, but in most cases, such delay increases do not push these paths into the
top delay bracket, as they lie outside this bracket. On the contrary, in some cases post-replication
Static Timing Analysis (STA) on selected benchmark circuits can even realize a favorable delay
perturbation over the nominally synthesized circuit.
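The role of a merging point can be sketched as a behavioral model of N tri-state buffers sharing one line. This is an assumed software abstraction, not a circuit netlist: the function name is hypothetical, a sleeping instance is treated as high-impedance, and contention is reported as an error rather than modeled electrically.

```python
from typing import List, Optional


def tbuf_merge(values: List[int], sel_sleep: List[bool]) -> Optional[int]:
    """Value on the shared line driven by the single awake instance.

    sel_sleep[i] is True when instance i is asleep, so its TBUF is
    high-impedance. Returns None when no instance drives the line.
    """
    drivers = [v for v, asleep in zip(values, sel_sleep) if not asleep]
    if len(drivers) > 1:
        raise ValueError("bus contention: more than one instance is awake")
    return drivers[0] if drivers else None


# With N = 2 and instance 2 asleep, the line carries instance 1's value.
merged = tbuf_merge([1, 0], [False, True])
```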
Depending on the synthesis parameters of the unprotected circuit, reduced delay may be noted due
to the improved drive capability of selected critical gates as a result of including merging buffers.
However, this is highly dependent on the topology of the circuit. For example, not all benchmarks
benefit as noted in Table 5.2. Additionally, some topologies, such as i5, require no merging
gates. In some cases, the delay perturbation achieved by merging can be employed to hide energy
overheads otherwise associated with inclusion of TBUFs as MPs.
Further, merging points may also be required at the terminating PO which is fed by replicated paths.
This will directly affect the delay of all critical paths terminating at a specific PO. An option to
avoid this delay overhead in the critical combinational stage is to hide it in the delay of the subsequent
non-critical combinational stage [47] in a typical pipeline circuit by placing the merging point after
the replicated registers.

5.3.5 Resource-Constrained Anti-Aging

The example in Figure 5.6 shows that there is a topological correlation among several paths, which
is common in typical circuit designs [20], i.e., the logic gate connected to the input is common
to all three paths. It may be possible to reduce the area overheads of replication by targeting
such design segments which have high correlation in the top delay bracket generated by STA tools.
However, the implementation complexity is high and thus straightforward selection of paths in the
bracket is proposed herein.
The path delay distribution of a circuit, which is dependent on its topology, determines the number
of paths selected for replication using SREL. For instance, as shown in Table 5.1, only 10 paths
(No. of Critical Paths) in the alu4 circuit need to be replicated when P = 10%. Here, CPs stands
for Critical Paths, CGs stands for Critical Gates, and MPs stands for Merging Points. Note that P
represents the proportion of path lengths protected, expressed as a percentage of the path length
distribution. In this circuit, P = 10% of the longest paths coincidentally results in a quantity of
ten paths being protected. This corresponds to only 8.9% of the total gates being replicated. The
results also show the overall area overhead with varying values of the parameter P.
Table 5.1: Design-time Area/Energy/Delay overheads with increasing P (N = 2) for alu4 benchmark at nominal voltage.

P (%)      No. of CPs   No. of CGs   No. of MPs   Area Overhead   Energy Overhead   Path Delay Reduction
0,1,2,3    1            19           10           6.85%           4.97%             -1.73%
4          2            32           17           11.64%          5.59%             0.86%
5,6        3            37           21           13.82%          6.91%             0.88%
7          4            46           29           18.19%          8.54%             0.88%
8,9        7            71           50           30.12%          11.45%            1.28%
10         10           79           54           32.96%          13.09%            5.95%

The area cost incurred for various small values of the parameter P is evaluated for the remaining
benchmark circuits and the results are shown in Figure 5.8. The area overheads, which are indicated with separate curves, depend on the topology of the circuits and are seen to be generally
increasing with P. For example, interesting tradeoffs can be investigated for the seq circuit. In the
case of seq, a range of 6% ≤ P ≤ 8% incurs 17.06% area overhead. Thus, based on the results of
simulation indicating 8% delay degradation in 3 years, the cost of achieving this amount of aging
protection is known. Once P is selected, further reduction in area overhead can be obtained by
using methods as proposed in [64], where a subset of critical paths is selected considering factors
such as process variation, workload-induced temperature variations and voltage droops, in addition
to aging effects.
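One way to use such results at design-time is to pick the largest P that fits a given area budget. The sketch below does this with the alu4 values from Table 5.1; the lookup function and the notion of a fixed area budget are illustrative assumptions, not a procedure stated in this work.

```python
from typing import Optional

# Rows of Table 5.1 (alu4, N = 2):
# (P values in %, CPs, CGs, MPs, area %, energy %, path delay reduction %)
TABLE_5_1 = [
    ((0, 1, 2, 3), 1, 19, 10, 6.85, 4.97, -1.73),
    ((4,), 2, 32, 17, 11.64, 5.59, 0.86),
    ((5, 6), 3, 37, 21, 13.82, 6.91, 0.88),
    ((7,), 4, 46, 29, 18.19, 8.54, 0.88),
    ((8, 9), 7, 71, 50, 30.12, 11.45, 1.28),
    ((10,), 10, 79, 54, 32.96, 13.09, 5.95),
]


def largest_p_within_area(budget_percent: float) -> Optional[int]:
    """Largest tabulated P whose area overhead does not exceed the budget."""
    best = None
    for p_values, _cps, _cgs, _mps, area, _energy, _delay in TABLE_5_1:
        if area <= budget_percent:
            candidate = max(p_values)
            best = candidate if best is None else max(best, candidate)
    return best


p_for_20_percent = largest_p_within_area(20.0)  # P = 7 for alu4
```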

[Figure 5.8 plot: % Area Overhead (vertical axis, 0 to 35) versus Top-Path Parameter P (%) (horizontal axis, 2 to 10), with one curve each for C880, i5, frg2, alu4, and seq. Annotation: varying quantity of near-critical paths protected.]

Figure 5.8: Parameter P can be traded off against area overhead incurred.

5.4 Scope and Applicability of Proposed SREL Approaches

At this point, it is worth emphasizing that most of the existing techniques can be complementary
to the SREL approaches proposed herein. Their operation can be viewed as orthogonal to the
spatial multiplexing of aging-critical logic domains. For example, gate-resizing and/or aging-aware logic synthesis can be provided as a core technique, and then the proposed aging-critical path
replication to facilitate SREL can further improve their operation. In this case, these two schemes
can complement each other by reducing or eliminating their associated overheads. Indeed, in this
work, voltage guardbanding is used to compensate for the effect of aging, yet significantly reduced
guardbands are demonstrated as compared to an uncompensated baseline circuit using only existing
techniques.
Furthermore, several better-than-worst-case design approaches provide power/performance benefits by targeting average-case conditions, while an error detection/correction mechanism deals with
errors in the worst-case [14]. Here aging is mitigated autonomously by increasing voltage when
timing violations occur. If such mechanisms are already in place, then the proposed schemes can
be easily leveraged to further increase the power savings.

5.5 Experiments and Results

The SREL methodology is evaluated by simulation of various circuits from the MCNC and ISCAS
benchmark suites as shown in Table 5.2. The open source NanGate Library based on the 45nm
Predictive Technology Model is utilized [187] in the circuit synthesis stage. The built-in aging
models provided by MOSRA are used which realize BTI effects in Eqs. 2.3-2.5 and HCI effects.
As the interconnect delay is not impacted by BTI and HCI, its degradation is not considered in the
HSPICE simulations. For the technology node utilized herein, NBTI is seen to be the dominant
aging-degradation mechanism.
Table 5.2: Initial time Area/Energy/Delay overheads with P = 10% (N = 2) at nominal voltage.

Benchmark   No. of CGs   No. of MPs   Area Overhead   Energy Overhead   Path Delay Reduction
C880        14           3            34.18%          8.90%             0.73%
i5          24           0            18.18%          0.59%             -0.27%
frg2        42           11           25.53%          7.15%             13.07%
alu4        79           54           32.96%          13.09%            5.95%
seq         110          63           29.87%          13.89%            9.17%

5.5.1 ACP Reduction of Delay Degradation

Firstly, the benefits of Self-Recovery Enabled Logic using the ACP scheme are quantified. Specifically, the delay reduction factor over the circuit’s lifetime is measured. The amount of degradation
compensated by recovery management can lower the guard-bands required for circuit operation.
Thus, a set of experiments is devised such that the delay of all the circuits is constrained to a timing
specification of 5% over the t0 delay of a nominally-designed baseline circuit, i.e., Dspec . The goal
of this set of experiments is to determine a switching interval for ACP such that the delay degradation across all benchmarks is within this timing specification throughout a lifetime of 10 years. A
value for the switching interval of 3 months was found to satisfy this constraint for all benchmarks.
The supply voltages for ACP designs are selected such that their delays are near the baseline designs’ delays at t0, with the exception of the i5 circuit as explained later. Delay normalization also
permits evaluation of the benefit of ACP in terms of applying this technique to nominally-designed
circuits with no or minimal observed delay perturbation from buffer insertion as described earlier.
The potential benefit of ACP is quantified in terms of total lifetime energy reduction as compared
to a baseline circuit which compensates the aging effects by exclusive use of voltage guardbands to
meet the desired specification over the lifetime. This can demonstrate ACP’s capability to mitigate
aging and thus quantify the reduced guardbands required for desired operation. This guardband
reduction achieved by ACP translates into energy savings as compared to VM.
The percentage delay degradations for selected benchmark circuits are shown in Figure 5.9, where
ACP is seen to mitigate aging with a switching interval of 3 months. Furthermore, the results
indicate that the baseline circuits operating at nominal voltage without aging-compensation do not
meet the timing specification of 5%, e.g., the delay degradation for C880 is 8.36%, which is greater
than 5%. On the other hand, to achieve compensation using VM, guardbands are added such that
the circuits meet the timing specification of 5% at end of lifetime. As a consequence, the delay for
VM at design-time (t0) needs to be less than that of the nominal design to compensate for increased
degradation.

[Figure 5.9 bar chart: delay degradation (0% to 20%) for C880, i5, frg2, alu4, and seq; series: Nominal, VM, ACP. Annotations: Nominal does not meet timing specification of 5%; VM has more degradation as compared to nominal, due to higher voltage; ACP sleep interval is set to 3 months for all benchmarks. Bar labels as extracted (association with individual bars not recoverable): 17.34%, 14.87%, 8.36%, 14.26%, 13.92%, 13.18%, 12.13%, 11.69%, 11.37%, 8.92%, 6.87%, 5.04%, 4.65%, 4.11%, 3.88%.]

Figure 5.9: Delay degradation for VM and ACP designs over a lifetime of 10 years.

Also, it is a non-trivial and iterative process to select the amount of compensation required for VM
as more guardband implies higher rate of degradation. Meanwhile, this process is strongly dependent on the aging model utilized, which can be avoided by enabling the circuit-level adaptation
proposed herein. A comparison of ACP to VM can be made as both meet the desired specification
throughout the lifetime. All ACP designs, with the exception of i5, start off from the same delays
as the baseline nominal circuit delays at t0 and remain near specified limits.

[Figure 5.10 plot: Propagation Delay (nsec, 0.86 to 1.08) versus Age (Years, 0 to 10); curves: Nominal Uncompensated Design, VM - Critical Path, ACP - Critical Path, Timing Specification. Annotations: Nominal: 12.13% increase; VM: 14.87% increase; ACP: 5.04% increase; ACP maintains delay in narrow range throughout lifetime.]

Figure 5.10: Delays of multiple critical paths over time for alu4 using ACP design.

Figure 5.10 shows the plot of worst-case path delay over time for alu4. It can be seen that ACP
maintains operation close to the Dspec with a switching interval of 3 months, whereas the uncompensated nominal design is seen to violate the Dspec at a point less than 3 months into operation.
Overall, ACP provides a reduction in delay degradation by a factor of 2.95X as compared to VM
approach in this case.
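The reported factor can be reproduced as a simple ratio of the lifetime delay increases annotated in Figure 5.10 for alu4 (VM: 14.87%, ACP: 5.04%). Treating the delay reduction factor as this ratio is an inference from the numbers, and the function name is hypothetical.

```python
def delay_reduction_factor(vm_degradation_pct: float,
                           acp_degradation_pct: float) -> float:
    """Ratio of VM delay degradation to ACP delay degradation."""
    return vm_degradation_pct / acp_degradation_pct


# alu4 over a 10-year lifetime, values taken from Figure 5.10.
factor = delay_reduction_factor(14.87, 5.04)  # about 2.95
```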
Moreover, the exception noted earlier for the i5 benchmark is due to its high rate of degradation, and
thus some guardband needs to be added for it to meet Dspec while using the switching interval of 3
months. Alternatively, this additional guardband could have been removed if a different switching
interval was utilized. However, the amount of guardband required is still less than that required
for VM. The results for ACP clearly show that the rate of degradation is reduced as compared to
VM as quantified in Figure 5.11. Thus, ACP helps to operate all benchmarks with smaller guardbands
as compared to VM throughout the operational period. This percentage reduction in guardband
achieved over VM and the resulting energy savings are shown in Figure 5.11.

[Figure 5.11 chart: percentage scale (0% to 30%) and factor scale (0 to 4) for C880, i5, frg2, alu4, and seq; series: Energy Saving, guardband reduction, delay reduction factor. Annotation: ACP enables reduction in delay degradation, which directly translates to energy savings over lifetime.]

Figure 5.11: Energy savings with ACP as compared to VM for a lifetime of 10 years (Dspec = 5%).

A direct correlation is seen between the delay reduction factor and energy savings achieved. The
highest energy savings are achieved for the frg2 circuit, which includes delay perturbation due to
MPs. To isolate the benefit of ACP in reducing delay degradation, the results for C880 are compared to i5, as the latter does not require MPs. Specifically, the energy savings for C880 are the
lowest, which is mainly due to the low rate of degradation for this circuit. On the other hand, i5
demonstrates a significant amount of delay degradation which is successfully mitigated by ACP
and hence results in significant energy savings compared to a VM scheme.

5.5.2 Benefit of ACP with Tighter Constraints

In the previous subsection, the fact that ACP realized Dspec with slack in some cases, notable in
Figure 5.9, motivated another set of experiments. Here the delay headroom is diminished towards
the t0 delay of a baseline circuit operating at nominal voltage. Thus, the specifications are made
more stringent, specifically the Dspec is set to 0% over the t0 delay of a baseline circuit operating at
nominal voltage. This reflects the case that a much tighter design margin can be met and does not
affect the results observed earlier. In addition, the experiments are also conducted for a lifetime of
3 years to quantify the benefits of the scheme with varying lifetimes.
Figure 5.12 shows lifetime energy savings as compared to VM for selected benchmark circuits to
meet 0% margin for operational lifetime periods of 3 and 10 years. Again, the guardband reduction
directly correlates to the energy savings achieved. A longer lifetime implies more energy savings;
however, most of the energy savings are due to early-life reduction in delay degradation. In this
case, the experiments for the 10-year lifetime are conducted with the same switching interval of 3
months which was used for earlier experiments, whereas experiments for the lifetime of 3 years
are conducted with a switching interval of 1 month for all benchmark circuits, which keeps the
number of alternations similar to that of the experiments with the 10-year lifetime. As described earlier,
increasing the switching interval reduces the delay degradation to a certain extent. The remaining
aging compensation can be obtained through variation of supply voltage. Furthermore, Figure 5.13
shows that there is an increase in energy savings for ACP when tighter specifications are used
as compared to the experiments with 5% margin. The results clearly indicate that ACP is adaptable
to accommodate various operating conditions and constraints.
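The statement that the two experiment settings use a similar number of alternations can be checked with simple arithmetic. The sketch below counts switching events as lifetime divided by switching interval, which is an assumed simplification; the function name is hypothetical.

```python
def alternations(lifetime_years: float, interval_months: float) -> float:
    """Number of switching intervals that fit in the operational lifetime."""
    return lifetime_years * 12.0 / interval_months


ten_year_count = alternations(10, 3)   # 40 alternations
three_year_count = alternations(3, 1)  # 36 alternations
```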

[Figure 5.12 chart: percentage scale (0% to 35%) for C880, i5, and frg2; series: 3 yr Energy Saving, 10 yr Energy Saving, 3 yr guardband reduction, 10 yr guardband reduction.]

Figure 5.12: Energy savings with ACP as compared to VM for lifetimes of 3 and 10 years (Dspec =
0%).

5.5.3 Benefits of Autonomous Operation with CCP

The experiments for CCP are conducted with no timing margin and circuit lifetimes of 3 and 10
years such that a comparison can be made to ACP results obtained earlier. As described earlier,
at design-time, the voltage guardband for CCP is adjusted such that its delay is ADS% less than
Dspec. Herein, two different values of ADS, 1% and 2%, are utilized.

[Figure 5.13 chart: energy savings (0% to 35%) and Relative Energy Saving Increase (0% to 400%) for C880, i5, and frg2; series: Timing Spec 5%, Timing Spec 0%, Relative Energy Saving Increase. Annotations: high energy savings are realized with tighter specs; ACP realizes tighter timing specs than related works.]

Figure 5.13: Comparison of energy savings for ACP as compared to VM realizing different timing
specifications with a lifetime of 10 years.

CCP is able to autonomously adjust its switching interval such that Dspec of 0% is never violated.
The minimum intervals required for this purpose are listed in Table 5.3. A reduced switching
interval is required to limit the degradation to 1% as compared to 2%. Furthermore, it is dependent
on the rate of degradation for a specific benchmark, e.g., i5 has the briefest switching interval
due to its highest degradation. The energy savings of CCP as compared to VM are shown in
Figure 5.14. Reduced guardbands due to CCP enable energy savings as high as 35.3% and 34.6%
for frg2 with ADS of 1% and 2%, respectively. The highest energy savings are obtained when a lower
ADS is utilized. A tradeoff is evident in that a lower ADS% implies more energy savings while
a reduced switching interval is required. Low energy operation throughout the lifetime implies
that the power constraints of the chip are relaxed. CCP allows autonomous operation with lower
power as compared to ACP and VM schemes for the chosen values of ADS. Generally, a significant
portion of the energy savings is obtained by the proactive ACP scheme. Thus, both proactive and
reactive resource management for aging-mitigation are shown to achieve substantial energy savings.
Table 5.3: Minimum switching intervals for CCP when ADS=1% and ADS=2%.

Benchmark   ADS=1%     ADS=2%
C880        3.61 hrs   192 hrs
i5          0.25 hrs   9.6 hrs
frg2        3 hrs      120 hrs

[Figure 5.14 chart: energy savings (0% to 40%) for C880, i5, and frg2, grouped into 3 yr Energy Savings and 10 yr Energy Savings; series: ACP, 1%-CCP, 2%-CCP. Annotations: CCP has more energy savings as compared to ACP; ADS=1% requires the least amount of guardband.]

Figure 5.14: Energy savings for ACP, CCP with ADS=1%, and CCP with ADS=2% as compared
to VM (Dspec = 0%).

5.6 Summary

SREL can provide a flexible continuum of tradeoffs for energy benefit in terms of area cost. Generally, SREL is most applicable when the product of aging-sensitive logic area and the degree
of replication is minor compared to the overall chip area. This provides an adaptive resource
management technique for anti-aging using a selective sleep interval to enable recovery which is
not possible with gate-sizing approaches. Finally, an extendable technique to extract, remodel,
and merge selectively replicated critical paths is demonstrated within existing EDA design flows.
SREL is shown to successfully reduce the guardband with delay reductions ranging from 1.92-fold
to 2.84-fold over nominal values using a 5% timing margin. Even more favorable energy savings
as high as 35.3% using CCP are obtained due to further reduction of operating voltage through
autonomous adaptation of switching interval.

CHAPTER 6: COSTS AND LIMITS OF SOFT ERROR MASKING AT
REDUCED SUPPLY VOLTAGE

Near Threshold Voltage operation offers reduced energy consumption with an acceptable increase
in delay for targeted applications. However, increased Soft Error Rate (SER) has been identified
as a significant concern at NTV. In this chapter, tradeoffs are evaluated regarding use of spatial redundancy to mask increased SER at NTV. It is shown that conventional spatial redundancy
techniques such as N-Modular Redundancy (NMR) can exhibit higher mean delays than simplex arrangements due to increased sensitivity to Process Variation exacerbated at NTV. The implications on
energy consumption and delay for benchmark circuits implemented using 45nm and 22nm Predictive Technology Models (PTM) are evaluated through simulation. Results indicate that the energy
overheads of NMR systems in the near-threshold region tend to be slightly higher than the expected
N-fold increase in cost for nominal voltage operation.

6.1 Introduction

While NTV offers an attractive approach to balance energy consumption versus delay for power-constrained applications [55], it may also introduce reliability implications. In particular, radiation-induced Single Event Upsets (SEUs) which cause soft errors are expected to increase in this operating region [79],[54]. These errors may manifest as a random bit flip in a memory element or
a transient charge within a logic path which is ultimately latched by a flip-flop. While soft errors
in memory elements are feasible to detect and correct using Error Correcting Codes [55], their
resolution in logic paths typically involves the use of spatial or temporal redundancy, which allows area versus performance tradeoffs [163]. In this chapter, the well-accepted spatial redundancy
technique of modular redundancy is utilized to mask soft errors in logic paths.
The contributions of this chapter are as follows:

• Impacts of Variability on NMR Systems at NTV: identifying the increased impact of PV at
NTV for 45nm and 22nm based NMR Systems, and
• Cost of Redundancy at NTV: expressing the delay cost of NMR systems at NTV for common
values of N within a given energy budget at nominal supply voltage.

6.1.1 Organization of the Chapter

The remainder of the chapter is organized as follows. Section 6.2 quantifies the effects of variability and provides relevant insights into the low-energy benefits of near-threshold operation for
NMR systems. Section 6.3 evaluates voltage guard-banding to achieve these benefits while accommodating variations in NMR arrangements. The experimental results for these conditions are
presented in Section 6.4 in terms of NMR energy and delay performance. Finally, a summary of
achievements is presented.

6.2 NMR Systems at Near-Threshold Voltage

Nanoscale devices are susceptible to process variations created by precision limitations of the
manufacturing process. Phenomena such as Random Dopant Fluctuations (RDF) and Line-Edge
Roughness (LER) are major causes of such variations in CMOS devices [184]. The increased occurrence of PV results in a distribution of threshold voltage Vth. As Vth increases, the increase
in switching time affects the delay performance of the circuit. Such variability is observed to become magnified by continued scaling of the process technology node [184]. For example, the effect of
RDF is magnified as the number of dopant atoms is smaller in scaled devices, such that the addition or
deletion of just a few dopant atoms significantly alters transistor properties. In addition, a large impact on circuit performance occurs as the transistor on-current is highly variable near the threshold
region [79]. Recent approaches for dealing with increased PV at NTV in multicore devices include
leveraging the application’s inherent tolerance for faults through performance-aware task-to-core
assignment based on problem size [76]. Variation impacts at NTV on cache reliability have also
been addressed by adaptive methods that dynamically adjust error control strength [94]. In
the remainder of the chapter, the discussion is restricted to show how these PV effects combine
in NMR systems of logic datapaths to exhibit a higher mean delay than simplex systems. Stated
alternatively, NMR arrangements require a more-than-linear increase in energy in order to obtain
a delay which is comparable to that of their component modules.
While operation at NTV can be seen to increase PV by approximately 5-fold as quantified in [55]
for simplex arrangements, the effect in NMR systems has not previously been investigated. In
the case of an NMR arrangement, it can be expected that the worst-case delay will exceed that of
any single module. For instance, the delay of the TMR system shown in Figure 6.1 is 2.8ns, which is
determined by the worst-case delay out of all instances plus the voter delay. Generally, if the worst-case delay of an
instance i of an NMR system is τi, then the overall delay of the NMR system τNMR is given by:

\[ \tau_{NMR} = \max_{1 \le i \le N} (\tau_i) + \delta \qquad (6.1) \]

where δ represents the delay of the voting logic, which contributes directly to the critical delay.
Furthermore, the chance of having an instance with higher than average delay increases with N,
which has been validated through experimental results quantified in Section 6.4. Overall, these
results are in agreement with distributions of 128-wide SIMD architectures demonstrated in [142],
whereby the speed of the overall architecture is also determined by the slowest SIMD lane. Herein,
the performance of NMR systems is compared to simplex systems, with 3 ≤ N ≤ 5.
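Equation 6.1 can be evaluated directly for the TMR example of Figure 6.1 (module delays of 2ns, 2.5ns, and 1.7ns with a 0.3ns voter). The sketch below is a minimal illustration; the function name is hypothetical.

```python
from typing import Sequence


def nmr_delay(module_delays_ns: Sequence[float], voter_delay_ns: float) -> float:
    """Overall NMR delay: slowest module instance plus the voter delay."""
    return max(module_delays_ns) + voter_delay_ns


# TMR example from Figure 6.1.
tmr_delay_ns = nmr_delay([2.0, 2.5, 1.7], 0.3)   # 2.8 ns
max_clock_ghz = 1.0 / tmr_delay_ns               # clock = 1 / 2.8 ns
```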

[Figure 6.1 diagram: Module #1 (delay 2ns), Module #2 (delay 2.5ns), and Module #3 (delay 1.7ns) feed a Majority Voter (0.3ns); Clock = 1/(2.8ns). Annotation: need to consider the largest delay out of all N modules.]

Figure 6.1: Propagation delay of TMR system under increased PV.

Intra-die variations for both 22nm and 45nm technology nodes are simulated using the Monte Carlo method in HSPICE. A viable alternative approach for simulating PV at NTV is also proposed
in [78], which captures both the systematic effects due to lithographic irregularities and the localized variations due to RDF. In this work, each module in an NMR arrangement can be anticipated
to exhibit comparable spatial variability due to the relative proximity of the module instances in the
physical layout. Thus, this work focuses on random variation impacts, while die-to-die
variations are left for future work.
The random effects are modeled through the variation in Vth caused by RDF and LER effects.
The standard deviation values σVth are adopted from [184] and are 25.9mV for the 45nm process
and 59.9mV for the 22nm process.

[Plot: Mean Delay (seconds, logarithmic axis from 1e-10 to 1e-08) versus Voltage (volts, 0.4 to 1.1) for N=1, N=3, and N=5 at the 45nm and 22nm technology nodes. Annotations: "Increasing N" and, next to the 22nm Technology Node label, "The effect is more prominent here".]

Figure 6.2: Mean delay of NMR systems increases with scaling voltage down to the near-threshold region.

Figure 6.2 shows the mean delay for an inverter chain for commonly-used values of N. Here, each
point is obtained by averaging at least 1000 samples. Results indicate that the performance impact
for the 45nm technology node is around 10.6X on average at 0.5V for a simplex system. However,
it reduces to 6.29X when the voltage is increased to 0.55V. The performance impact for NMR
systems with N = 3 and N = 5 tends to follow the same behavior. This is in agreement with
the near-threshold performance results noted in [55]. Results in Figure 6.2 indicate that the mean
delay is slightly higher for NMR systems and tends to increase with N. The spread in mean delays
(over 1000 samples) between simplex and NMR systems also increases with the increased variability
effects notable at NTV, as shown in Figure 6.3.

[Plot: Variation 3σ/µ (%, 0 to 90) versus Voltage (volts, 0.4 to 1.1) for N=1, N=3, and N=5 at the 45nm and 22nm technology nodes. Annotation: "Increasing N".]

Figure 6.3: Delay variations decrease with increasing N for NMR systems.

However, more values are clustered closer to the mean when N is increased. This is observed
in the delay distributions for 1000 samples shown in Figure 6.4 for VDD = 0.55V, whereby
the mean values are increased, causing right-shifted peaks for N = 3 and N = 5 compared to simplex, while the variances are decreased, causing narrower spreads. In addition, the performance difference between simplex and NMR systems at the 22nm technology node is magnified due to higher
values of the coefficient of variation for the 22nm technology node as compared to the 45nm technology node, as noted in Figure 6.3. For instance, the mean delays for the 22nm node with N = 3
and N = 5 are 1.16X and 1.24X the mean delay of a simplex system, respectively, at the same
voltage of 0.55V. In contrast, for the 45nm technology node the ratios are only 1.06X
and 1.09X for N = 3 and N = 5 systems, respectively. These numbers tend to be higher when
operating very close to the Vth of the transistors due to increased variations.

[Histogram: Frequency (0 to 160) versus Delay (seconds, 3e-09 to 4.65e-09) for N=1 (3σ/µ=20.88%), N=3 (3σ/µ=16.08%), and N=5 (3σ/µ=14.70%). Annotation: "Increasing N".]

Figure 6.4: Delay distributions of NMR systems at Near-threshold Voltage of 0.55V with 45nm
technology node.

Furthermore, the amount of variability is also dependent on the length of the logic datapath, i.e.,
the number of gates in the critical path. For instance, it is noted in [142] that the variability
decreases as the length of the inverter chain increases. However, as pointed out earlier, logic datapaths
operating at NTV may be structured to have relatively smaller depths. Herein, alternative synthesis
techniques are demonstrated which can lower the amount of variability at NTV. For instance, it is
observed that the variation is also dependent on the type of logic gate utilized. Figure 6.5 shows
that functionally-identical inverter chains built using NAND2 gates, with inputs tied together to
realize the same function as an INV gate, exhibit the least amount of variation.

[Plot: Variation 3σ/µ (%, 0 to 90) versus Voltage (volts, 0.4 to 1.1) for INV, NAND, and NOR gate chains at the 45nm and 22nm technology nodes.]

Figure 6.5: The variation of simplex systems composed of different types of logic gates.

In another set of experiments, various TMR arrangements utilizing diversity of
NAND2 and INV gates are considered in Figure 6.6. Again, all of these TMR arrangements are functionally
equivalent, yet exhibit different amounts of variability. For example, 22nm TMR arrangements
based on NAND2 gates exhibit about 13% less variation as compared to a TMR arrangement with
INV gates. However, in the experiments with the inverter chain, the mean delays for NAND-based
systems are higher than for INV-based systems, which outweighs any benefit of reduced variation.
Thus, a diversity-enabled NMR synthesis approach needs to be evaluated with more functionally-complex benchmark circuits, which will be pursued as future work. Here, the PV analysis is restricted to uniform logic gates which achieve minimal delay.

[Plot: Variation 3σ/µ (%, 0 to 80) versus Voltage (volts, 0.4 to 1.1) for the TMR arrangements INV-INV-INV, INV-INV-NAND, INV-NAND-NAND, and NAND-NAND-NAND at the 45nm and 22nm technology nodes.]

Figure 6.6: The variability of TMR systems composed of modules with design-diversity.

Since the spread of the delay distribution for NMR systems is narrower as compared to simplex systems, straightforward techniques to combat performance variability [55] can be used for NMR
systems. For instance, smaller guard-bands are required to maintain the same yield as compared to
simplex systems, as demonstrated in the results of this work.

6.3 Energy Cost of Mitigating Variability in NMR Arrangements

A fundamental approach to alleviate observed delay variations in the near-threshold region is to
add one-time timing guard-bands [142]. Guard-bands can be realized by operating at reduced
frequency via a longer clock period, or by operating at a slightly elevated voltage to compensate
for the increased delay variations. This work assumes the latter for comparison against simplex
systems. It is worth mentioning that the scope of this work is to deal with soft errors at NTV using
spatial redundancy, thus a straightforward voltage margining scheme is utilized to achieve the same
performance as a simplex system in presence of variations.
To achieve an expected yield of approximately 99% for NMR systems while achieving the same
worst-case performance as a simplex system at a fixed NTV, the voltage for the NMR system is increased such that the respective delay distributions have the following statistical characteristics
for N ≥ 3:

$$\mu_{NMR} + 3\sigma_{NMR} \le \mu_{Simplex} + 3\sigma_{Simplex} \qquad (6.2)$$

where $\mu_{NMR}$ and $\mu_{Simplex}$ represent the mean delays for NMR and simplex systems, respectively, and
$\sigma_{NMR}$ and $\sigma_{Simplex}$ represent the respective standard deviations. Under the three-sigma rule for a normal distribution,
nearly all (about 99.7%) of the instances lie within three standard deviations of the mean, and thus have a delay less than $\mu_{NMR} + 3\sigma_{NMR}$. Thus, this
results in a high expectation for an NMR system to have the same worst-case delay as a simplex system,
i.e., the same throughput performance.
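
The condition in Eq. 6.2 can be checked directly on sets of delay samples. The sketch below is a minimal illustration under stated assumptions: the helper names `three_sigma_point` and `meets_iso_performance` are hypothetical, and the delay values are made up for illustration rather than taken from the experiments.

```python
import statistics


def three_sigma_point(delays):
    """Return mu + 3*sigma for a list of delay samples."""
    return statistics.mean(delays) + 3.0 * statistics.stdev(delays)


def meets_iso_performance(nmr_delays, simplex_delays):
    """Eq. 6.2: NMR three-sigma point must not exceed the simplex one."""
    return three_sigma_point(nmr_delays) <= three_sigma_point(simplex_delays)


# Hypothetical delay samples in ns (illustrative values only).
simplex = [1.0, 1.2, 1.4, 1.6, 1.8]            # lower mean, wider spread
nmr = [1.5, 1.6, 1.7, 1.8, 1.9]                # higher mean, narrower spread
nmr_too_slow = [d + 0.3 for d in nmr]          # same spread, mean shifted right

print(meets_iso_performance(nmr, simplex))
print(meets_iso_performance(nmr_too_slow, simplex))
```

In the first case the narrower spread compensates for the higher mean, so no margin is needed. In the second case the condition fails, which corresponds to the situation where the NMR supply voltage would have to be raised until the delay distribution shifts left far enough.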

6.4 Experiments and Results

Experiments are conducted to quantify the energy overheads of NMR systems with increased variations due to operation in the near-threshold region. For this case study, the MCNC benchmark circuits
C880 and i5 are utilized. These circuits are synthesized using Synopsys Design Compiler based
on the 45nm PTM-based NanGate open source library [187]. Then, the synthesized netlist is imported into the Synopsys HSPICE tool for Monte Carlo simulations. These simulations vary the Vth of
the transistors in the netlist based on a Gaussian distribution having a mean equal to the nominal
value in the PTM model card and σVth as provided in [184]. Ideally, the σVth can be adapted to accommodate local and global variations, or their combined effects as considered in these experiments. The
simulation framework is illustrated in Figure 6.7.

[Flow diagram: 1000 Monte-Carlo (MC) samples are completed, with each iteration (1 through 1K) yielding the energy and delay of a uniplex module. For each of the 1K NMR arrangements, N samples are chosen randomly from the MC pool; the arrangement energy is the sum over the N samples and the arrangement delay is the maximum of the N samples.]

Figure 6.7: Simulation framework to estimate the delay and energy for NMR systems in the presence of PV.

If the overhead of the voter circuit is considered to be negligible, i.e., δ = 0, then direct comparisons to simplex systems are possible. The Monte Carlo simulations were conducted using at
least 1,000 experimental runs for a single module within the NMR circuit. Then, the NMR system
is obtained by choosing N random samples from the pool of module variants created by these runs.
The module instance with maximum delay determines the delay of the NMR system, as indicated by
Eq. 6.1. Similarly, the energy consumption is computed by accumulating the energy requirement
of the selected N samples operating at a frequency of $1/\tau_{NMR}$. This scenario is repeated 1,000
times to establish mean values for comparison to those obtained in an equivalent number of runs
for a simplex system, as illustrated in Figure 6.7.
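
The resampling step of this framework can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in this work: the HSPICE Monte Carlo runs are replaced by a hypothetical pool of Gaussian (delay, energy) pairs, the numeric parameters are invented, δ = 0 is assumed, and all function names are illustrative.

```python
import random
import statistics


def make_module_pool(size, mean_delay, sigma_delay, mean_energy, sigma_energy, rng):
    """Stand-in for the Monte Carlo runs: one (delay, energy) pair per module variant."""
    return [(rng.gauss(mean_delay, sigma_delay), rng.gauss(mean_energy, sigma_energy))
            for _ in range(size)]


def sample_nmr(pool, n, rng, voter_delay=0.0):
    """Pick N variants at random; delay is the maximum (Eq. 6.1), energy is the sum."""
    chosen = rng.sample(pool, n)
    delay = max(d for d, _ in chosen) + voter_delay
    energy = sum(e for _, e in chosen)
    return delay, energy


def summarize(pool, n, arrangements, rng):
    """Repeat the sampling and report mean delay, 3*sigma/mu (%), and mean energy."""
    results = [sample_nmr(pool, n, rng) for _ in range(arrangements)]
    delays = [d for d, _ in results]
    energies = [e for _, e in results]
    mu = statistics.mean(delays)
    return {
        "mean_delay": mu,
        "variation_pct": 300.0 * statistics.stdev(delays) / mu,
        "mean_energy": statistics.mean(energies),
    }


rng = random.Random(2015)  # fixed seed so the sketch is repeatable
# Hypothetical module statistics (seconds, joules); not measured values.
pool = make_module_pool(1000, 3.6e-9, 0.25e-9, 0.1e-12, 0.005e-12, rng)
stats = {n: summarize(pool, n, 1000, rng) for n in (1, 3, 5)}
for n, s in stats.items():
    print(n, s)
```

With these assumed inputs, the mean delay grows with N while the relative spread shrinks, which is the qualitative behavior reported in Figures 6.2 through 6.4. The energy here is a plain sum at a fixed voltage; the iso-performance voltage adjustment of Eq. 6.2 is not modeled.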
Table 6.1: Mean energy consumption for NMR systems with same performance at specified NTV
of simplex system.

N = 1, VDD (NTV)    C880              i5
                    N=3      N=5      N=3      N=5
0.55V               3.03X    5.06X    3.02X    5.05X
0.6V                3.03X    5.05X    3.02X    5.04X
0.65V               3.02X    5.04X    3.02X    5.04X
0.7V                3.01X    5.03X    3.01X    5.02X

6.4.1 Iso-Performance Energy Consumption for NMR Arrangements

Table 6.1 shows the mean energy consumption at multiple near-threshold voltages for TMR and
5MR systems such that they have the same worst-case performance as a simplex system with
yields of at least 99.7%. This is done by ensuring that the three-sigma point of the delay distribution
of the NMR system is less than or equal to that of the simplex system, as described earlier. For comparison
purposes, the operating voltage of the simplex system is used as a reference, with the values shown
in Table 6.1. Then, the operating voltage for the NMR system is elevated from the reference
Near-Threshold Voltage (NTV) assumed for the simplex system. Note that all the modules in this case
are assumed to be operated at a uniform voltage and are co-located within the same region of the
chip. The elevated supply voltage results in a left-shift of the delay distribution towards that of
the simplex system. It is seen that only a slight voltage increase is sufficient for this purpose. For
example, on average a 2mV increase is sufficient to operate a TMR arrangement based on the i5
circuit at comparable delay. This translates to a mean energy consumption of about 3.02X, which
is approximately 1% more than the 3X assumed at nominal conditions.

The NMR energy consumption values shown in Table 6.1 are not excessive because NMR systems
tend to have lower variation (σ) as compared to simplex systems, as demonstrated in Figure 6.3.
Hence, even though NMR systems exhibit higher mean delays (µ), their reduced variance necessitates only a slight increase in reference voltage to meet the delay target of the simplex system.
However, increasing the value of N to 5 increases the energy consumption slightly due to a larger
increase in mean delay as compared to a TMR system.
Table 6.2: Impact of increased PV due to technology node scaling on energy consumption for
NMR systems.

N = 1, VDD (NTV)    45nm              22nm
                    N=3      N=5      N=3      N=5
0.5V                3.04X    5.07X    3.17X    5.30X
0.55V               3.03X    5.05X    3.14X    5.27X
0.6V                3.03X    5.04X    3.13X    5.26X
0.65V               3.02X    5.03X    3.12X    5.23X
0.7V                3.01X    5.02X    3.10X    5.16X

6.4.2 Impact of Technology Scaling

To further analyze the impact of increased variability on the energy overheads of NMR systems, experiments were next conducted with the scaled technology node of the 22nm PTM HP model. These
experiments are conducted on inverter chains composed of 26 Fanout-of-4 inverters, which is a
setup similar to that adopted in [142]. The energy consumption values with the same goals as defined earlier
are listed in Table 6.2 and compared to the 45nm technology node. The results indicate that the
energy overheads are higher for both TMR and 5MR systems at more deeply scaled technology
nodes. For instance, the 22nm node-based 5MR system requires 3.94% (on average) more energy than a 45nm-based 5MR system. This is consistent with the trend of increased
variations as noted in Figure 6.3 beginning with the 22nm node.

6.4.3 Cost of Increased Reliability at NTV

Near-threshold operation allows improved energy efficiency. The energy savings can be utilized
for either reduced power operation or to increase resilience via NMR. Thus, operation of NMR
systems in the near-threshold region allows for consideration of interesting tradeoffs. For example,
increasing N from 1 to 3 can be evaluated as a means to increase reliability within the same energy
budget as a simplex system operating at nominal voltage. This is valid provided that the increase
in delay, and thus corresponding drop in performance, is acceptable. Note that this pursuit of
increased reliability is predicated upon the assumption that the source of variability in the near-threshold region is variation in Vth, to which this work is restricted. Further study needs to
be performed to determine the reliability levels provided by NMR systems in this operating region
under other noise sources such as variation in VDD [109], inductive noise and temperature [86].
Based on these assumptions, the tradeoffs are pointed out in Figure 6.8 for 45nm technology node,
where each point is obtained by averaging at least 1000 samples. For instance, results indicate the
feasibility of TMR operation at approximately 0.69V (operating point B on plot) on average given
an identical energy budget of a simplex system operating at nominal voltage of 1.1V (operating
point A on plot) while incurring a delay difference of 2.58X. Similarly, it is possible to achieve
5MR with approximately 0.545V (operating point C on plot) operation on average while incurring a performance impact of 7.15X. This represents a greater-than-linear increase in delay as a
function of N , as compared to TMR operation. Thus for mission-critical applications, this offers
insights into tuning the degree of redundancy facilitated by a near-threshold computing paradigm.

[Plot: Energy (J, 0 to 3e-12) versus Voltage (volts, 0.4 to 1.1) for N=1, N=3, and N=5, with a horizontal line marking the Nominal (N=1) Energy Budget. Marked operating points: A: 1.1V, B: 0.69V, C: 0.545V. Annotations: "3X", "5X", and "TMR is possible with same energy as of N=1 operating at nominal".]

Figure 6.8: Mean energy consumption of spatial redundancy systems with various operational
voltages.

To consider the feasibility of reducing energy consumption while simultaneously providing soft
error masking, Figure 6.8 can be observed again. The energy requirement of a simplex arrangement
at 1.1V is about 0.541 pJ, while the TMR curve at 550mV is only about 0.330 pJ. Thus, a TMR
arrangement at 550mV results in an energy savings of 38.4% as compared to simplex system
at 1.1V. This means compared to a simplex arrangement at nominal voltage, selecting a supply
voltage of 550mV allows for provision of TMR for soft error masking in the presence of technology
scaling while still reducing the energy requirement significantly.

6.5 Summary

Operation of NMR systems in the near-threshold region allows for consideration of energy and
delay tradeoffs in terms of N. Redundancy can be seen as a degree of freedom enabled by decreasing supply voltage. When doing so, it is essential to consider NMR arrangements’ increased
susceptibility to PV.
Further study is worthwhile to determine the reliability provided by NMR systems in this operating
region under other noise sources such as variation in supply voltage VDD, temperature, and aging-induced variations. For instance, it is pointed out that a further variation of 2X is expected due
to variation of VDD and temperature. However, lower voltage helps to lower transistor junction
temperatures and interconnect currents, which can have beneficial effects on aging-induced defects
such as Electromigration and Bias Temperature Instability. Furthermore, combining
temporal redundancy, in the form of timing-speculation circuits to reduce variation effects, with
spatial redundancy would also be worthwhile to investigate.

CHAPTER 7: UNDERSTANDING THE IMPACT OF TRANSIENT
ERRORS IN HPC APPLICATIONS FOR LARGE-SCALE SYSTEMS

This chapter investigates the resiliency of HPC applications in the exascale computing era. With the
adoption of NTV to meet the low-power goals of exascale computing, a research push is evolving
to investigate the effects of increased failure rates. HPC applications from the CORAL, ASCR and
DoE proxy benchmark suites are injected with random soft faults for these studies. Using accelerated fault injection, the impact of various software transformations on the application’s vulnerability is quantified. Then, a compiler-based fault-propagation framework is proposed and utilized to
derive models which can be used to determine the number of corrupted memory locations
at runtime. These assessments can be employed to determine the protection cost of fault-tolerance
schemes, which can consequently impact the energy consumption of future HPC systems.

7.1 Introduction

Exascale systems promise to deliver 1000-fold more computing capability than current petascale
supercomputers. However, new approaches for reliable operation under low-power and NTV technologies along with higher temperature tolerance will be required to achieve exascale performance
within the targeted 20MW power budget [87]. Without additional mitigation actions, these factors, combined with the sheer number of components, will considerably increase the number of
faults experienced during the execution of parallel applications, thereby reducing the MTTF and
the productivity of these systems. Moreover, power constraints will limit the amount of energy
and resources that can be dedicated to detect and correct errors [111], thus Silent Data Corruptions
(SDCs) are expected to become more common. Finally, transistor device aging effects will amplify
the significance of these conditions.

7.1.1 Implications of Compiler Optimizations on Application Vulnerability

Many factors affect the resilience of a system, including the application’s algorithm and characteristics, the scientific libraries used, the environmental conditions of the system, and the aging
of the hardware components [185, 165]. Among these factors, compiler optimizations have
usually been considered of secondary importance and their impact on resiliency has not been widely investigated, especially in the context of HPC systems. Such studies are critical for HPC applications
since compiler optimizations are one of the most successful ways to improve performance with
relatively low effort from the programmer. Contemporary compilers are capable of performing
very sophisticated code analysis and code transformation almost transparently to the user. The
most common compiler optimizations include loop unrolling, memory pre-fetching, instruction
reordering to hide memory latency and increase instruction-level parallelism, and elimination of
dead branches and unused global variables. Moreover, compilers can exploit hardware-specific
optimizations, such as special instructions and vector units without requiring the user to deeply
understand every different processor architecture. The impact of these code transformations and
optimizations on performance is substantial to the point that compiler optimizations are considered
essential to achieve high performance and efficiency.
Compiler optimizations, however, also impact the vulnerability of the application, which can affect
the overall resilience of the system [21]. For example, code optimization can increase the utilization and the throughput of out-of-order processors, which generally improves performance. On the
other hand, assuming that faults occur randomly in the processor, higher utilization implies higher
sensitivity to faults, as more processor components operate on the application’s instructions at the same time. Similarly, increasing data locality may reduce the number of cache misses,
thus errors occurring in a cache line will have a lower probability of propagating to main memory.
However, the same optimization may increase the time a certain value resides in a register before
the register is refreshed. A bit flip in the processor register will then stay in the processor longer and have
a high probability of propagating into the application’s data structures and incurring disruptive effects.
Thus, there is a need to study the impact of compiler optimizations on the vulnerability of HPC applications, as similar effects are noted for other classes of benchmarks [48, 53, 152, 137, 164, 172].
Compiler Optimizations and HPC Applications: The vulnerability analysis of HPC applications has been done in [95] and [36]. However, the vulnerability analysis performed is based on
the knowledge of the applications’ data structures. Thus, data structures are identified which can
help reduce the vulnerability of HPC applications, for instance, critical pointers in AMG can help
reduce the crashes due to segmentation faults. However, the vulnerability characteristics of HPC
applications with varying compiler optimizations have not been studied. In the multi-programmed
domain, authors in [48] assess the application sensitivity to errors for the SPEC CPU2000 integer benchmarks compiled using different optimizations. Instructions are then classified based on
the probability to derate the corrupted values. The effects of the occupancy of various microarchitecture structures such as the reorder buffer, the instruction fetch queue and load store queue
as a result of different compiler optimizations are studied in [53], where the authors also propose
the EF metric. The authors show that optimized code is more reliable according to their failure
metric which is proportional to execution time. However, if the unoptimized code was to run for
equivalent time, then it turns out to be more reliable than optimized code. Similarly, speculatively-scheduled loops are shown to be more vulnerable than unrolled loops [152].
LLFI is used to quantify the effects of compiler optimizations for soft computing applications
in [164]. However, it focuses only on egregious data corruptions, where the resulting loss in
signal-to-noise ratio for image processing applications is greater than 30 dB. These corruptions
are just 12% of the overall reported SDCs. Similarly, the effect of compiler optimizations on SDCs
is assessed for benchmark programs such as cyclic redundancy check, secure hash, and quicksort
using fault-injection [137]. For the selected benchmarks, it is shown that SDC for optimized
programs is only marginally higher as compared to unoptimized programs. Application-level fault-injections are utilized to characterize mobile embedded applications in [172], where it is shown for
one benchmark that the SDC rate increases dramatically from O0 to O2 and decreases gradually from
O3 to O5. This is correlated with the percentage of floating point instructions with each compiler
optimization. In summary, the net effects of these interacting characteristics for HPC applications
running on large-scale systems, as opposed to the simulation frameworks on which most of these
works are based, are noted to be significant.
Thus, the HPC-specific tradeoffs between performance improvement and resilience due to software
transformations and the implications of scaling parallel applications beyond a single multi-core
compute node are analyzed. Specifically, several important questions are addressed
as a result of this work, such as:

• “Is performance gain obtained by a given optimization worth the increased application vulnerability?”,
• “Are there optimizations that provide the same level of performance, but decrease vulnerability?”, and
• “Do memory operations modified by optimization levels have a significant impact on vulnerability?”

Results indicate that optimized code performs better than unoptimized code but also that optimized code is more vulnerable. Moreover, both performance and vulnerability generally follow
the same trend and increase with increasing level of code optimizations. In some cases, additional
compiler optimizations do not yield additional performance, but significantly increase application
vulnerability. In other cases, a small performance penalty caused by not using some compiler
optimization may decrease the vulnerability of the application. Finally, the causal relationships
between compiler optimizations and application vulnerability and the main reasons that induce
application crashes are also analyzed.

7.1.2 Importance of Fault-Propagation Analysis

Despite the importance and the risks associated with SDCs, there are limited studies that analyze
the impact of transient errors on HPC applications and how these errors propagate in the application data structure (application state) [36, 48, 60, 95, 103]. Generally, these studies rely on
statistical analysis of the application’s outputs performed by injecting faults in the application’s
state at random points during the execution. Although these works provide important information
on the vulnerability of parallel applications, this “black-box” approach does not provide insights
on the impact of the injected faults on the application’s internal state. Thus, it becomes difficult
to assess the extent a fault has contaminated the application’s data structures, at which speed it
propagates, or which resilience mechanism should be employed. In particular, the analysis of the
application’s output state does not distinguish cases in which the application’s state is corrupted
even if the final results are correct. This happens, for example, if the acceptable residual error of
a scientific simulation is large enough to accommodate for the variance introduced by a transient
error. The outcome of the same execution would be different with stricter error bounds.
The results of fault-injection analysis may be incomplete and inaccurate without a comprehensive
knowledge of how transient errors propagate in the application’s state during its execution. This
may lead to the wrong resilience mechanism being employed. In order to collect such comprehensive
knowledge, a novel fault propagation framework is proposed that accurately tracks the propagation
of faults in distributed MPI applications. The framework provides information on the application’s internal vulnerability beyond what is generally provided by statistical analysis based on output variation. Specifically, it provides information on the speed (i.e., how quickly a fault propagates into
a process state) and depth (i.e., how many processes are affected) with which a fault propagates
throughout the execution of the application. This information is essential to understand the impact
of transient errors in HPC applications and, thus, to select the most appropriate resiliency mechanism to employ. The deeper insight provided by the proposed framework can expose vulnerabilities
in the application’s algorithm and implementation that are hidden by a “black-box” analysis.
The proposed framework consists of an LLVM-based instrumentation component and a runtime
checker that tracks the propagation of faults into the application’s state. Although conceptually
straightforward, accurately tracking the propagation of a fault requires a comprehensive and thorough methodology along with properly-implemented tools. In fact, the general assumption that the
output of an instruction becomes corrupted, i.e., a fault propagates, if at least one of the inputs is
corrupted could lead to a large overestimation of the number of corrupted memory locations.
To avoid such overestimation and to precisely track faults in a generic operation, the stream of
instructions is replicated to compute both the potentially corrupted outputs and the pristine outputs. The former are the outputs of the instructions that may use input values corrupted by an
injected fault or contaminated by previous operands. The latter are the outputs of instructions that
only use non-contaminated operands, which are not impacted by the error. At store operations,
the potentially-corrupted and the pristine values are compared to determine if a fault propagates to
memory. This can be used to track the number of corrupted memory locations over time.
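
The idea of replicating the instruction stream and comparing at store operations can be illustrated with a small Python sketch. This is only a conceptual model under stated assumptions, not the LLVM-based instrumentation itself: the names `Dual`, `apply`, `inject_bit_flip`, and `TrackedMemory` are hypothetical, and values are restricted to integers.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value carried twice: the fault-free result and the possibly faulty one."""
    pristine: int
    faulty: int


def lift(value):
    """Wrap a plain integer so both copies start out identical."""
    return value if isinstance(value, Dual) else Dual(value, value)


def apply(op, a, b):
    """Replicated instruction: compute the pristine and the faulty output."""
    a, b = lift(a), lift(b)
    return Dual(op(a.pristine, b.pristine), op(a.faulty, b.faulty))


def inject_bit_flip(value, bit):
    """Flip one bit in the faulty copy only."""
    value = lift(value)
    return Dual(value.pristine, value.faulty ^ (1 << bit))


class TrackedMemory:
    """At each store, compare both copies to decide if the location is corrupted."""

    def __init__(self):
        self.cells = {}
        self.corrupted = set()

    def store(self, address, value):
        value = lift(value)
        self.cells[address] = value
        if value.pristine != value.faulty:
            self.corrupted.add(address)
        else:
            self.corrupted.discard(address)


memory = TrackedMemory()
x = inject_bit_flip(5, 1)                            # pristine 5, faulty 7
memory.store("parity", apply(operator.and_, x, 1))   # 1 in both copies: masked
memory.store("next", apply(operator.add, x, 1))      # 6 versus 8: propagated
print(sorted(memory.corrupted))
```

A taint-style rule that marks every output of a corrupted input would flag both locations here, whereas the comparison at the store marks only `next`, since the bitwise AND masks the flipped bit.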
The proposed framework is used to analyze the impact of soft faults in several commonly-used
parallel applications from different scientific domains, such as hydrodynamics and molecular dynamics, taken from various benchmark suites [1] and DOE proxy applications [6]. The experiments
are performed on a 32-node cluster with a total of 1,024 cores. First, it is shown that the outcomes
of statistical analysis based on output variation may lead to erroneous conclusions. For example,
“black-box” analysis would conclude that faults injected in LULESH are masked in over 90% of
the cases. However, a deeper analysis reveals that the faults often propagate and may corrupt up
to 25% of the application memory state. Second, given the iterative nature of most HPC applications, faults propagate linearly into the application’s states. Linear regression techniques are therefore
employed to derive application fault propagation models. Once a fault is detected, these models can be used to estimate the
number of corrupted memory locations and to understand if a roll-back to a previous checkpoint
should be triggered. From the fault propagation models, the fault propagation speed factor (FPS) is
extracted, which can indicate how quickly a transient error propagates into the application’s state.
The FPS metric can be used to express the vulnerability of HPC applications and can be combined
with architectural vulnerability metrics [117, 140] to assess the system resilience.
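
A linear fault propagation model of this kind can be sketched with an ordinary least-squares fit. The example below is a hedged illustration: it assumes that the FPS corresponds to the slope of the fitted line, the trace of corrupted-location counts is synthetic, and the names `fit_line` and `estimate_corrupted` are hypothetical.

```python
def fit_line(times, corrupted_counts):
    """Ordinary least-squares fit of corrupted locations versus time."""
    n = len(times)
    mean_t = sum(times) / n
    mean_c = sum(corrupted_counts) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (c - mean_c) for t, c in zip(times, corrupted_counts))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_t
    return slope, intercept


def estimate_corrupted(slope, intercept, elapsed):
    """Estimated number of corrupted locations after `elapsed` time units."""
    return max(0.0, slope * elapsed + intercept)


# Synthetic trace (illustrative only): time since injection, corrupted locations.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [2, 5, 8, 11, 14]

fps, offset = fit_line(times, counts)  # slope taken here as the propagation speed
print(fps, offset, estimate_corrupted(fps, offset, 10.0))
```

Given the time between fault occurrence and detection, such an estimate could be compared against a threshold to decide whether a roll-back is worthwhile; the threshold itself would be application-specific.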

7.1.3 Organization of the Chapter

The remainder of the chapter is organized as follows. Section 7.2 describes the fault model utilized
and application categorization done based on fault injection. Section 7.3 describes the experimental setup and the HPC benchmarks utilized in the experiments. Section 7.4 assesses the impact of
compiler optimizations on application vulnerability. Section 7.5 describes the proposed framework
to track the propagation of faults into application state. Section 7.6 presents the results of tracking
fault propagation in HPC benchmarks and compares the results with conventional fault-injection.
Section 7.7 describes the application-specific fault propagation models. Section 7.8 compares the
proposed work in this chapter with previous and related works.

7.2 Fault Model and Software-Implemented Fault Injection

The fault model utilized herein focuses on transient faults that escape hardware correction and
detection due to the infeasibility of complete fault coverage for millions of chips consisting of
billions of transistors each. Next, the categorization of application outcomes resulting from
fault injection experiments is described.

7.2.1 Categorization of Application Outcomes

Transient errors may occur any time during the execution of an application and can result in a
variety of outcomes. The classification proposed in [21, 45, 173, 174] for HPC systems is extended
herein. The experiment outcomes are classified in the following categories:

Vanished (V): Faults are masked at the processor-level and do not propagate to memory, thus
the application produces correct outputs and the entire internal memory’s state is correct.
Output Not Affected (ONA): Faults propagate to the application’s memory state, but the final results of the computation are still within the acceptable error margins and the application
terminates within the number of iterations executed in fault-free runs.
Wrong Output (WO): Faults that escape hardware detection and propagate through the
application’s state and corrupt the final output. The application may take longer to terminate.
Prolonged Execution (PEX): Some applications may be able to tolerate corrupted memory
states and still produce correct results by performing extra work to refine the current solution.
These applications provide some form of inherent fault tolerance in their algorithms, though
at the cost of delaying the output.
Crashed (C): Finally, faults can induce application crashes. Also, program “hangs”, i.e., the
cases in which the application does not terminate, are included in this class.
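The categories above can be expressed as a small decision procedure. The sketch below is only an illustration of the definitions: the `RunResult` fields and function names are hypothetical, and it assumes that the memory state and the iteration count of a fault-free run are available for comparison.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    crashed: bool           # abnormal termination
    hung: bool              # application did not terminate
    output_ok: bool         # final output within the acceptable error margins
    memory_corrupted: bool  # some memory location differs from the fault-free run
    iterations: int         # iterations executed by this run


def classify(run: RunResult, golden_iterations: int) -> str:
    """Map one fault-injection run to V, ONA, WO, PEX, or C."""
    if run.crashed or run.hung:
        return "C"    # crashes and hangs share one class
    if not run.output_ok:
        return "WO"   # corrupted final output
    if run.iterations > golden_iterations:
        return "PEX"  # correct output, but extra work was needed
    if run.memory_corrupted:
        return "ONA"  # memory state corrupted, output still acceptable
    return "V"        # fault never reached memory


def output_only_class(label: str) -> str:
    """Analysis based on output alone cannot separate V from ONA."""
    return "CO" if label in ("V", "ONA") else label
```

For example, a run that finishes in the fault-free number of iterations with an acceptable output but a corrupted memory state is labeled ONA, which an output-only analysis would report simply as correct.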

Compared to previous classifications, SDCs are identified as ONA, WO, and PEX, depending on
their effects on the application’s state and output. Analysis based on output variation, such as fault
injection analysis, cannot distinguish V from ONA; thus the class Correct Output (CO) is used to
indicate the sum of V and ONA when the application produces correct results within the expected
execution time. Correct results with longer executions are classified as PEX.
Given that exascale machines do not yet exist and that analyzing real faults on current systems
would require long periods of time, accelerated fault-injection is adopted herein, which is compatible with previous works [95, 178]. Using this approach, a large number of tests can be performed
in a relatively short time and a considerable part of the application code and result space can
be explored. Single-bit flips are randomly injected at the register-level during the execution of
an application. The faults are injected into the source register¹ of both arithmetic and load/store
operations, which is the most accurate high-level error injection model described in [45], besides
circuit-level fault injection. Since the aim is to understand the vulnerability of the HPC applications themselves,
faults are injected only into the application code and not into the MPI and system
libraries.
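The single-bit-flip model can be illustrated with a few lines of code. The sketch below is not the injector used in this work; it only shows what flipping one randomly chosen bit of a 64-bit register value means, for an integer and for a double-precision value. All function names are hypothetical.

```python
import random
import struct


def flip_bit(value: int, bit: int, width: int = 64) -> int:
    """Return value with one bit inverted, treated as a width-bit register."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def flip_bit_in_double(value: float, bit: int) -> float:
    """Flip one bit in the IEEE 754 binary64 encoding of a float."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (result,) = struct.unpack("<d", struct.pack("<Q", flip_bit(raw, bit)))
    return result


def inject_random_fault(value: int, rng: random.Random, width: int = 64):
    """Flip a uniformly chosen bit; return the corrupted value and the bit index."""
    bit = rng.randrange(width)
    return flip_bit(value, bit, width), bit
```

The effect of a flip depends strongly on the bit position: flipping bit 63 of a double only changes its sign, whereas a flip in the exponent field can change the magnitude by many orders.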
The primary goal is to analyze the vulnerability and sensitivity of HPC applications to transient
errors. It must be remarked that this information alone does not provide a comprehensive understanding of the resilience of the entire system. The analysis in this chapter focuses on what happens
after a fault, undetected by the hardware, contaminates the processor registers. Previous work can
be consulted to understand how transient errors occurring at the circuit level eventually propagate
to the architectural level [45, 140]. In order to assess the resilience of the entire system (hardware and
application), the user needs to combine the expected hardware Failure In Time (FIT) rate with the
vulnerability information provided herein. Thus, this work is complementary to the resiliency studies of hardware systems [19, 21, 97, 117, 140] and is essential to understand the resilience of the
entire system. As explained later, the proposed work is similar to the Program Vulnerability Factor
(PVF) [152] and the Data Vulnerability Factor (DVF) [185], in that it examines the application’s
sensitivity to faults. However, the assumed fault model considers both the architectural level, i.e., how
faults propagate in processor registers, which is not considered by DVF, and the MPI communication level,
i.e., how faults propagate among MPI processes, which is not considered by the PVF metric.

¹ Fault injection in the destination register produces similar effects.

7.2.2 Architecture-level Fault Injection

Since the objective of this study is to target large parallel applications running on a cluster of nodes,
fault injection at the circuit level is prohibitively slow and would limit the exploration space.
Thus, accelerated software fault injection [45] is adopted in this study. Specifically, a compiler-level
fault injection strategy is adopted because undetected transient errors propagate to processor
registers or functional units but may be masked at the processor level before contaminating memory
locations. Software fault injection tools based on binary instrumentation, such as [95], instead
inject faults directly into memory locations. Previous work showed that this form of fault injection
may inaccurately model transient errors occurring in the hardware [45, 111]. Moreover, injecting
faults directly into the application memory state complicates the assessment of system reliability:
the architectural FIT rate, which is usually known, cannot be used as is, and the user
needs to estimate an application-specific FIT rate that accounts for faults that are masked at the architectural level and never propagate to the application’s memory state. Previous work [21]
showed that the application-specific FIT rate is a dominating factor when assessing overall system
reliability and that it is not trivial to estimate. On the other hand, injecting faults at the register level makes
it feasible to use the micro-architectural FIT rate to evaluate the overall reliability of a system.
To inject faults at the processor level, LLFI [178], an LLVM-based fault injection tool that injects
faults into the LLVM Intermediate Representation (IR) of the application source code, is utilized
and extended. LLFI injects a single fault into a live register at every run of a sequential application
at specific program points, which allows the user to track the effects of the fault back to the source
code. LLFI is extended in two directions. First, the necessary support to inject faults in multiple
MPI processes at different times during the execution is provided. Second, LLFI is extended to
inject zero or more faults into each MPI process during each execution of the applications. This
means that some MPI processes may experience direct faults, while others may experience indirect
faults (errors) propagated from other MPI processes. In the rest of this chapter, the extended version
of LLFI for MPI applications is referred to as LLFI++. It should be mentioned that using LLFI
has its drawbacks. First, faults are only injected into live registers, which limits the fault model to
transient errors occurring during an instruction’s operation, e.g., flip-flop errors in execution units.
As shown in Section 7.6.2, this has an impact on the percentage of vanished faults. Second, like
other software fault injectors, LLFI does not inject faults into non-programmable registers.

7.2.3 Fault-Injection Coverage

Under ideal conditions, every instruction in a program would be targeted for fault injection to
understand the effects on the application outcome. However, this approach is impractical for large
applications such as the ones tested in this work. Because statistical fault injection is performed instead, it
is important to verify that faults are injected uniformly throughout the execution of the
application. Figure 7.1 indicates that this is indeed the case. The figure shows the results of 5,000 injections for LULESH²: the x-axis represents
time in cycles divided into 500 bins, the bars represent the number of faults injected in each
bin, and the red line represents an ideal uniform distribution. As evident in the plot, the actual
distribution of injected faults closely matches the ideal uniform distribution. The approximation of
an ideal uniform distribution was also verified through χ² tests using the MATLAB Statistics Toolbox.

² The plots for the other applications follow the same structure.

[Figure 7.1 plot: number of faults injected (y-axis, 0 to 25) versus execution time in cycles (x-axis, 0 to 5.2735e+10).]

Figure 7.1: Faults are injected uniformly throughout the execution of LULESH.
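The uniformity check can be sketched as follows. The sketch bins injection cycles into 500 intervals and computes Pearson's χ² statistic against a uniform expectation; it does not reproduce the MATLAB analysis used in this work. The total cycle count is read off the x-axis of Figure 7.1, while the synthetic samples, the seed, and the function names are assumptions for illustration.

```python
import random


def bin_counts(samples, total_cycles, n_bins):
    """Count how many injection cycles fall into each of n_bins equal intervals."""
    counts = [0] * n_bins
    for cycle in samples:
        index = min(cycle * n_bins // total_cycles, n_bins - 1)
        counts[index] += 1
    return counts


def chi_square_uniform(counts):
    """Pearson chi-square statistic and degrees of freedom for a uniform model."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples")
    expected = total / len(counts)
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, len(counts) - 1


# Synthetic stand-in for 5,000 injection cycles (assumption, not measured data).
TOTAL_CYCLES = 52_735_000_000
rng = random.Random(1)
cycles = [rng.randrange(TOTAL_CYCLES) for _ in range(5000)]
counts = bin_counts(cycles, TOTAL_CYCLES, 500)
statistic, dof = chi_square_uniform(counts)
```

With 5,000 injections and 500 bins, the expected count is 10 faults per bin. For uniformly distributed injections the statistic is expected to be close to the number of degrees of freedom (499), whereas a strongly skewed injection schedule yields a much larger value.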

7.3 Experimental Setup

Hardware and Software Platform: The experiments are performed on a cluster of dual-socket
AMD Opteron 6227 (Interlagos) [30] compute nodes. Each compute node comprises
32 cores (16 per socket) running at 2.2 GHz. Cores feature a 64 KB L1 data cache, while each pair of cores shares a 64
KB L1 instruction cache and a 256 KB integrated L2 cache. Compute nodes are equipped with
64 GB of DRAM divided into 4 NUMA domains. Each NUMA domain consists of 8 processor
cores that share a memory controller and an 8 MB L3 cache. The compute nodes are interconnected
through an Infiniband communication network. The system runs Linux 3.9.0, and all applications
are compiled with LLVM/clang version 3.4, which internally uses GNU gcc 4.8.2, and OpenMPI
1.7.4.
Applications: The impact of compiler optimizations on several important DOE applications and
benchmarks taken from various benchmark suites is analyzed herein. In particular,
LULESH2, LAMMPS, and MCB from the CORAL program³ and miniFE from the DOE proxy applications are chosen for this study. LULESH [75] is a shock hydrodynamics proxy application developed by the
ASCR ExMatEx Exascale Co-Design Center⁴ to model the numerical algorithms and data motion of a
scientific application that solves a Sedov blast problem with analytical answers. LAMMPS [127]
is a molecular dynamics code that models an ensemble of particles in a liquid, solid, or gaseous
state. The application computes Newton’s equations of motion for a system of interacting particles
and can model atomic, polymeric, biological, metallic, granular, and coarse-grained systems using a variety of force fields and boundary conditions. The Cu metallic solid problem with the embedded atom
method (EAM) potential, which involves the dynamics of 32,000 atoms for 20,000 time steps, is
solved herein. MCB⁵ models the solution of a heuristic transport equation using a Monte Carlo
technique. The application employs typical features of Monte Carlo algorithms such as particle
creation, particle tracking, tallying particle information, and particle destruction. The heuristic
transport equation models the behavior of particles that are created, travel with a constant velocity, scatter, and are absorbed. miniFE⁶ is a DOE proxy application that implements
several kernels resembling implicit finite-element applications. The application assembles a sparse
linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements. Next, miniFE solves the linear system using a simple un-preconditioned
conjugate-gradient algorithm.

³ https://asc.llnl.gov/CORAL/
⁴ http://science.energy.gov/ascr/research/scidac/co-design
⁵ https://codesign.llnl.gov/mcb.php
⁶ https://mantevo.org/

Table 7.1: Optimizations applied at each level by clang.

Opt-level   Optimizations applied, listed in order
O0          -targetlibinfo -datalayout -notti -basictti -x86tti -preverify -domtree -verify
O1          -no-aa -tbaa -basicaa -globalopt -ipsccp -deadargelim -instcombine -simplifycfg -basiccg -prune-eh -inline-cost -functionattrs
            -sroa -early-cse -lazy-value-info -jump-threading -correlated-propagation -tailcallelim -reassociate -loops -loop-simplify
            -lcssa -loop-rotate -licm -loop-unswitch -scalar-evolution -indvars -loop-idiom -loop-deletion -loop-unroll -memdep
            -memcpyopt -sccp -dse -adce -strip-dead-prototypes -always-inline
O2          -slp-vectorizer -globaldce -constmerge -barrier -loop-vectorize -gvn -inline (-always-inline)
O3          -argpromotion

7.4 Effect of Software Transformations on Application Vulnerability

The LLVM 3.4 framework, with the clang 3.4 C/C++ compiler as its front end, is used. Although modern compilers provide users with specific options to optimize their code, sets of optimizations
are typically grouped into higher-level options O0, O1, O2, O3, etc. Table 7.1 lists the clang optimizations
applied at each optimization level incrementally from O0 to O3. Here, each row shows the additional optimizations added to the previous level, whereas optimizations specified in parentheses
are removed. As shown, several optimizations are supported, including loop unrolling, rotation,
and simplification, function inlining, memory dependency analysis, etc. At level 2 (O2), the compiler performs store and loop vectorization, removes unreachable global variables, and eliminates
redundant instructions. At optimization level 3 (O3), the compiler promotes arguments passed by
reference to arguments passed by value.
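The incremental reading of Table 7.1 (each row adds passes to the previous level, and passes in parentheses are removed) can be made explicit with a short sketch. The pass names are copied from Table 7.1; the data layout and the function `passes_at` are hypothetical and say nothing about how clang assembles its pass pipeline internally.

```python
LEVELS = ["O0", "O1", "O2", "O3"]

# Passes added at each level, as listed in Table 7.1.
ADDED = {
    "O0": "-targetlibinfo -datalayout -notti -basictti -x86tti -preverify "
          "-domtree -verify".split(),
    "O1": "-no-aa -tbaa -basicaa -globalopt -ipsccp -deadargelim -instcombine "
          "-simplifycfg -basiccg -prune-eh -inline-cost -functionattrs -sroa "
          "-early-cse -lazy-value-info -jump-threading -correlated-propagation "
          "-tailcallelim -reassociate -loops -loop-simplify -lcssa -loop-rotate "
          "-licm -loop-unswitch -scalar-evolution -indvars -loop-idiom "
          "-loop-deletion -loop-unroll -memdep -memcpyopt -sccp -dse -adce "
          "-strip-dead-prototypes -always-inline".split(),
    "O2": "-slp-vectorizer -globaldce -constmerge -barrier -loop-vectorize "
          "-gvn -inline".split(),
    "O3": ["-argpromotion"],
}

# Passes shown in parentheses in Table 7.1, i.e., removed at that level.
REMOVED = {"O2": ["-always-inline"]}


def passes_at(level):
    """Cumulative pass list at a level, applying removals before additions."""
    if level not in LEVELS:
        raise ValueError("unknown optimization level: " + level)
    active = []
    for name in LEVELS:
        dropped = REMOVED.get(name, [])
        active = [p for p in active if p not in dropped]
        active.extend(ADDED[name])
        if name == level:
            break
    return active
```

For instance, `passes_at("O2")` contains `-inline` but no longer `-always-inline`, and `-argpromotion` only appears in `passes_at("O3")`.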
The most important compiler transformations that may affect the memory operations in program
code include:

• Loop Vectorization: transforms data dependence graphs that do not exhibit cycles between
iterations into single instructions on multiple data items consisting of a range of array indices,
• Memory to Register: promotes memory references to register references to increase register
utilization,
• Dead Load/Store Elimination: eliminates redundant stores through a local pass over each
function and enumerates global values to enable elimination of redundant loads,
• Register Renaming: reassigns variables to remove output dependencies or optimize
register use.

Most of the above transformations are applied during the various levels of optimization listed
in Table 7.1. For instance, global value numbering (-gvn) is used to eliminate redundant load instructions at O2. Similarly, -loop-vectorize is performed at the same optimization level.
The -mem2reg optimization is also considered, which promotes memory references to register
references, resulting in a reduced number of memory references and increased register pressure.
Experimental Results: The experiments on single and multiple nodes are performed using the
reference input set of each application. All available cores for all applications are utilized except
LULESH2, which requires a perfect cube number of MPI tasks, e.g., 27 on a single node and 512
on multiple nodes. To ensure reasonable statistical significance, 1,000 runs are conducted for
each application/set of compiler optimizations. The dynamic cycle at which to inject a fault and
the target MPI task are randomly selected. The random cycle and MPI task are extracted from
two separate random sequences obtained from the same uniform random number generator using
distinct seeds.
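The selection of injection sites described above can be sketched as follows. Two independent sequences are drawn from the same uniform generator type, one for the dynamic cycle and one for the target MPI task, each initialized with its own seed. The function name, the seed values, and the cycle count are hypothetical.

```python
import random


def make_injection_plan(n_runs, total_cycles, n_tasks, cycle_seed, task_seed):
    """Return one (cycle, MPI task) pair per run from two separate sequences."""
    if cycle_seed == task_seed:
        raise ValueError("use distinct seeds for the two sequences")
    cycle_rng = random.Random(cycle_seed)  # sequence for the injection cycle
    task_rng = random.Random(task_seed)    # sequence for the target MPI task
    return [
        (cycle_rng.randrange(total_cycles), task_rng.randrange(n_tasks))
        for _ in range(n_runs)
    ]


# Example: 1,000 runs of a 27-task job (seeds and cycle count are made up).
plan = make_injection_plan(1000, 10**9, 27, cycle_seed=1, task_seed=2)
```

Using separate sequences keeps the choice of cycle and the choice of task uncorrelated, and fixing the seeds makes an injection campaign reproducible.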
Performance Analysis: As most compiler optimizations primarily target increased performance,
the effects of applying different sets of compiler optimizations on the applications’ performance are analyzed first. The sets of optimizations reported in Table 7.1 are applied: increasing levels of optimization generally augment the previous set of optimizations with additional
ones. For example, optimization level O2 augments the set of optimizations used at level O1
with -slp-vectorizer, -globaldce, -constmerge, -barrier, -loop-vectorize,
-gvn, and -inline.

[Figure 7.2 plot: speedup (y-axis, 0 to 3) at optimization levels O0, O1, O2, and O3 for LULESH, MINIFE, LAMMPS, and MCB.]

Figure 7.2: Performance impact of compiler optimizations.
Figure 7.2 shows the speedup obtained with increasing compiler optimization levels with respect to
optimization level O0. The results in the chart are averages over 10 runs of native code, i.e., code not
instrumented for fault injection. As indicated by the results in Figure 7.2, compiler optimizations
can dramatically increase performance, with speedups of up to 2.7x and 2.8x for MCB and LULESH2, respectively. The performance improvements are much lower, but still considerable, for miniFE (up
to 1.52x) and limited for LAMMPS. An interesting observation is that optimization level O1 already
provides about half of the total performance improvement. As reported in Table 7.1, the clang
compiler already applies many important optimizations, including loop unrolling, rotation, and
deletion, function inlining, and memory-copy optimization, at optimization level O1. Moving to optimization
level O2 provides considerable performance improvements for LULESH2 and MCB, which benefit
from vectorization, while optimization level O3 does not change the applications’ performance.

To further investigate the reasons behind such large performance improvements, the dynamic characteristics of the applications are analyzed for each set of compiler optimizations using the perf
toolset. As noted in Table 7.2, the instructions per cycle (IPC) of most of the applications decreases
when applying higher levels of compiler optimizations. However, as reported in Figure 7.2, higher
levels of optimization provide significant performance improvements. The reason is that higher levels of compiler optimizations reduce the total number of
instructions (Table 7.2) but do not generally reduce the number of last-level cache misses; hence
the number of last-level cache misses per instruction (LLC misses/inst) increases and the IPC decreases. This
is an interesting observation, as it is generally expected that a higher IPC will be achieved when
applying increasing levels of optimization because of loop unrolling, removal of unnecessary data
movement, vectorization, and code elimination. These results are consistent with the findings
in [53], where the authors determine that the number of expected faults (EF) in an application
greatly decreases with increasing optimization level because of the reduced execution time. A
reduction in the execution time, however, does not imply that the intrinsic vulnerability of an application has changed, but rather that there is a shorter time in which a fault can affect the application’s
output, which is highly dependent on the workload. Next, the impact of code optimizations on the
intrinsic vulnerability of HPC applications is analyzed.
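The interplay between instruction count and IPC can be checked with a first-order estimate. Assuming that execution time is proportional to the number of instructions divided by the IPC at an unchanged clock frequency, the speedup over O0 is the inverse of the instruction ratio multiplied by the IPC ratio. The sketch below applies this to the O2 columns of Table 7.2; the function name is hypothetical, and the estimate ignores every effect not captured by these two counters, so it need not coincide with the measured speedups in Figure 7.2.

```python
def estimated_speedup(instr_ratio, ipc, ipc_o0):
    """Speedup over O0 assuming time = instructions / (IPC * frequency).

    instr_ratio is the instruction count relative to O0 (first row of Table 7.2).
    """
    if instr_ratio <= 0 or ipc_o0 <= 0:
        raise ValueError("ratios must be positive")
    return (1.0 / instr_ratio) * (ipc / ipc_o0)


# (instruction ratio at O2, IPC at O2, IPC at O0), values from Table 7.2.
TABLE_7_2_O2 = {
    "LULESH": (0.32, 0.72, 0.78),
    "MiniFE": (0.40, 0.46, 0.63),
    "MCB": (0.32, 0.52, 0.57),
    "LAMMPS": (0.82, 0.32, 0.34),
}

estimates = {app: estimated_speedup(*row) for app, row in TABLE_7_2_O2.items()}
```

For LULESH, the instruction count drops to 0.32 of its O0 value while the IPC falls only from 0.78 to 0.72, so the estimate is about 2.9x: the reduction in instructions far outweighs the loss in IPC.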
Vulnerability Analysis: In this set of experiments, the vulnerability of each application when
applying increasing levels of compiler optimizations is analyzed. A single fault is injected into one randomly selected MPI process, and the final application outcomes are classified according
to the criteria presented in Section 7.2. Here, the Masked case is the same as the Correct Output (CO)
case due to the inability of the fault-injection mechanism to distinguish the Vanished (V) and Output Not
Affected (ONA) classes. To understand whether an injected fault is masked or produces an SDC,
the applications’ outputs are compared against a “golden” (fault-free) output obtained by profiling
the applications under the same experimental conditions.
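A comparison against a golden output can be sketched as below. The tolerance-based check is an assumption: how each benchmark validates its output, and which error margins are acceptable, is application specific. The function name is hypothetical.

```python
import math


def matches_golden(observed, golden, rel_tol=1e-9, abs_tol=0.0):
    """True if every observed value is within tolerance of the golden value."""
    if len(observed) != len(golden):
        return False
    return all(
        math.isclose(o, g, rel_tol=rel_tol, abs_tol=abs_tol)
        for o, g in zip(observed, golden)
    )
```

A run whose output fails this check would be counted as an SDC, while a NaN in the output never compares as close and is therefore always reported as a mismatch.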

Table 7.2: Impact of compiler optimizations on application performance and other characteristics for single node.

LULESH                         O0        O1        O2        O3
Instruction decrease wrt O0    1.00      0.53      0.32      0.32
LLC-misses increase wrt O0     1.00      1.05      1.14      1.15
IPC                            0.78      0.81      0.72      0.72
LLC misses/inst                0.00043   0.00084   0.00155   0.00155
Loads/inst                     0.7988    0.5486    0.5414    0.5429
Stores/inst                    0.0101    0.0169    0.0215    0.0215

MCB                            O0        O1        O2        O3
Instruction decrease wrt O0    1.00      0.52      0.32      0.32
LLC-misses increase wrt O0     1.00      1.07      1.07      1.07
IPC                            0.57      0.53      0.52      0.51
LLC misses/inst                0.00005   0.00010   0.00017   0.00017
Loads/inst                     0.7298    0.6304    0.6260    0.6198
Stores/inst                    0.0065    0.0075    0.0087    0.0083

MiniFE                         O0        O1        O2        O3
Instruction decrease wrt O0    1.00      0.53      0.40      0.39
LLC-misses increase wrt O0     1.00      1.22      1.34      1.34
IPC                            0.63      0.53      0.46      0.46
LLC misses/inst                0.00054   0.00125   0.00182   0.00183
Loads/inst                     0.7722    0.5274    0.4936    0.4962
Stores/inst                    0.0069    0.0128    0.0171    0.0171

LAMMPS                         O0        O1        O2        O3
Instruction decrease wrt O0    1.00      0.83      0.82      0.82
LLC-misses increase wrt O0     1.00      1.00      1.00      1.01
IPC                            0.34      0.32      0.32      0.32
LLC misses/inst                0.00109   0.00130   0.00132   0.00134
Loads/inst                     0.7779    0.6693    0.6701    0.6733
Stores/inst                    0.0173    0.0207    0.0211    0.0211

[Figure 7.3 plots: percentage outcome (y-axis, 0 to 100) in the classes Masked, SDC, Prolonged execution, and Crashed at optimization levels O0, O1, O2, and O3 for LULESH, MINIFE, LAMMPS, and MCB. Panels: (a) 1 Node (32 cores); (b) 16 Nodes (512 cores).]

Figure 7.3: Statistical breakdown of application vulnerability.
Figure 7.3a shows the percentage of cases in each category for single-node experiments with 32
cores. As indicated by the results, each application shows a different vulnerability profile: for
LULESH2, most of the injected faults are masked (97.3%) and do not generally produce corrupted
outputs (< 3%) nor require additional iterations to converge (< 1%). However, the injected faults
may cause the application to crash. For the other applications, a much lower percentage of
masked faults (between 50 and 65%) and a predominance of the other outcomes are observed.
In particular, miniFE and MCB present a considerable number of crashes (up to 20% and 40%,
respectively), while LAMMPS experiences a large number of cases in which the injected faults
corrupt the final output. Note that LAMMPS and MCB run for a fixed number of iterations. For
example, in the case of LAMMPS, the total number of steps for which the molecular simulation needs
to be performed is specified as an input parameter; thus no prolonged execution cases are observed.
Additionally, the LAMMPS developers have provisioned built-in mechanisms to detect
extreme erroneous behavior, upon which a user-driven invocation of MPI_Abort() is performed, resulting in an application crash. Such cases are categorized as SDC in this set of experiments. Similar
mechanisms are enforced in LULESH2 and are treated in the same manner.
When increasing the level of optimization from O0 to O3, there is a noticeable increase in the
number of crashes, with the exception of MCB. Interestingly, this increase in crashes does not
necessarily reduce the percentage of masked faults, at least for miniFE and LAMMPS. Rather, fewer
cases that require additional iterations to converge for miniFE and fewer SDC cases for LAMMPS
are observed. Overall, the results indicate that compiler optimizations have a stronger impact on
LULESH2 and miniFE than on the other two applications, as the former show a number of crashes
that increases considerably with the optimization level.
The trends observed in the single-node experiments can also be identified in the multi-node experiments (Figure 7.3b): in both cases, the number of crashes for LULESH2 and miniFE increases with
the optimization level, but in the multi-node experiments the number of crashes for LAMMPS also
steadily increases with each optimization level. In more detail, LAMMPS has the lowest percentage of Masked cases, 50.7% and 36.2% with 32 and 512 MPI tasks, respectively. At the other
end of the spectrum, LULESH2 has the highest percentage of Masked cases, 97.3% and 96.5%
with 27 and 512 MPI tasks, respectively. Comparing Figures 7.3a and 7.3b, a general increase in
the applications’ sensitivity to faults is observed, which is due to a combination of factors. First,
a fault injected into a specific MPI process propagates faster in the application when the number of MPI processes increases. This is due to the higher number of MPI messages exchanged,
which increases the probability that a specific MPI process contaminates others. Second, assuming a constant per-process crash probability, using more MPI processes increases the probability
that the entire application crashes as the result of any of the MPI processes crashing. Notably,
for LAMMPS with optimization level O3, the Masked cases decrease from 54.1% to 37.5% when
varying the number of MPI tasks from 32 to 512. Comparing the two graphs in Figure 7.3, it
is also noticed that the vulnerability profiles of LULESH2 and MCB do not change considerably,
while LAMMPS and miniFE show an increased vulnerability to faults when scaling to 512 cores.
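The second factor can be illustrated numerically. If each MPI process crashes with a constant probability p and, as a simplifying assumption not stated in the text, processes crash independently of each other, the application survives only if no process crashes, so the application-level crash probability is 1 - (1 - p)^N for N processes. The value of p used below is hypothetical.

```python
def application_crash_probability(per_process_crash, n_processes):
    """P(at least one of n independent processes crashes)."""
    if not 0.0 <= per_process_crash <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if n_processes < 1:
        raise ValueError("need at least one process")
    return 1.0 - (1.0 - per_process_crash) ** n_processes


# Hypothetical per-process crash probability, for illustration only.
p = 0.001
single_node = application_crash_probability(p, 32)
multi_node = application_crash_probability(p, 512)
```

With p = 0.001, the application-level probability grows from about 3% with 32 processes to about 40% with 512 processes, even though nothing has changed in any individual process.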
Performance/Vulnerability Tradeoffs: As observed earlier, increasing levels of compiler optimization can dramatically increase application performance, with speedups
of up to 2.8x. However, it is also observed that compiler optimizations have an impact on application
vulnerability and that this impact is usually negative, i.e., the vulnerability of applications increases
with increasing levels of code optimization. In this section, the tradeoffs between performance and
vulnerability are discussed, along with the reasons why code optimizations impact fault masking. Combining the results reported in Figure 7.2 and Figure 7.3, it is observed that some applications, such
as LULESH2 and MCB, achieve considerable performance improvements with a limited impact on
their vulnerability. For example, LULESH2 achieves 1.9x and 2.8x speedups with optimization levels O1
and O2, respectively, with only a minimal impact on its vulnerability: the percentage of crashes
increases to 5% and 6% when applying optimization levels O1 and O2, respectively, compared to 3% at
optimization level O0. In this case, the performance benefit of a higher level of code optimization is
worth the increase in application vulnerability. For other applications, however, the performance
benefits obtained by optimizing the code correspond to a considerable increase in the application’s vulnerability. For miniFE, for example, using optimization levels O2 and O3 does not bring
extra performance with respect to O1, but the vulnerability keeps increasing with each optimization level. Finally, LAMMPS’s performance does not vary when applying different levels of code
optimization, while its vulnerability slightly increases.

[Figure 7.4 plots: percentage of crashes (y-axis) versus number of stores per instruction (x-axis) at O0, O1, O2, O3, and with mem2reg. Panels: (a) LULESH; (b) MCB; (c) MiniFE; (d) LAMMPS.]

Figure 7.4: Correlation between application vulnerability and stores/instruction.
Analysis of the Causal Relation between Code Optimization and Vulnerability: As reported
in Table 7.2, the performance improvement observed in Figure 7.2 is mostly achieved by applying optimizations
that decrease the number of instructions, despite the fact that the IPC of the tested
applications generally decreases when increasing the optimization level. In fact, LLVM applies aggressive dead code elimination (-adce) and combines redundant instructions (-instcombine)
across all optimization levels. By observing the results in Table 7.2 and the results in Figure 7.3,
it can be noted that IPC and application vulnerability seem inversely related, i.e., when the
IPC decreases the application vulnerability increases. Thus, the causal relation between compiler
optimizations, IPC, and application vulnerability is investigated in two directions: first, the relation
between stores and application vulnerability is analyzed, and then the relation between loads and application
vulnerability.
As explained earlier, the number of expected faults during an application run is greatly affected by
the application execution time. However, the fact that an application runs for a shorter amount of
time than another application does not imply that the former is more reliable than the latter. To
avoid the bias induced by the different number of instructions and execution time of each
application, loads and stores are analyzed with respect to the total number of instructions.
Figure 7.4 shows the percentage of application crashes as a function of the number of stores/instruction,
which increases with the level of code optimization. As indicated by the results, there is a significant positive correlation, albeit not perfect, between the percentage of crashes and the number of
stores per instruction, i.e., the application vulnerability increases with the number of stores/instruction.
This is due to the probability that a fault propagates to the application memory state and, eventually,
crashes the application. In fact, faults injected in the processors’ registers can propagate and corrupt
the application memory state through store instructions. However, there is a probability that a fault
injected into a register is masked before the next store. If there is a large number of instructions
between the instruction at which the fault is injected and the next store to memory, i.e., a low
stores/instruction ratio, the probability of fault masking is high, hence the probability of crashes
is low. Conversely, if there are only a few instructions between the fault injection and the store
to memory, i.e., a high number of stores/instruction, the probability of masking the injected fault
decreases, hence the probability that the fault propagates to memory and crashes the application
increases. Figure 7.4 confirms this observation for all applications, with the exception of MCB.
As reported in Figure 7.3, MCB does not follow any particular trend when increasing the level
of compiler optimization, though the performance trend reported in Table 7.2 follows the general
pattern.
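The strength of such a relation can be quantified with the Pearson correlation coefficient. The sketch below uses the three LULESH data points that can be read from the text: stores/instruction at O0, O1, and O2 from Table 7.2 and the crash percentages of 3%, 5%, and 6% quoted in the tradeoff discussion. Three points are far too few for a statistically meaningful result, so this only illustrates the computation; the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# LULESH at O0, O1, O2: stores/instruction (Table 7.2) and crash percentage.
lulesh_stores_per_instr = [0.0101, 0.0169, 0.0215]
lulesh_crash_percent = [3.0, 5.0, 6.0]

r = pearson(lulesh_stores_per_instr, lulesh_crash_percent)
```

For these three points the coefficient is close to +1, in line with the positive relation between stores/instruction and crashes reported for LULESH.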
The relation between the number of loads per instruction, which decreases with higher levels of
compiler optimization, and the application vulnerability is also analyzed. In this case, a decrease in
the number of loads/instruction decreases the probability that a memory load operation overwrites
a corrupted register before its value propagates to memory, which would mask the injected fault. Figure 7.5
shows that, as the number of loads/instruction decreases, the percentage of crashes increases.

[Figure 7.5 plots: percentage of crashes (y-axis) versus number of loads per instruction (x-axis) at O0, O1, O2, O3, and with mem2reg. Panels: (a) LULESH; (b) MCB; (c) MiniFE; (d) LAMMPS.]

Figure 7.5: Correlation between application vulnerability and loads/instruction.

The correlations between loads/instruction and stores/instruction and the number of crashes resulting from the fault injection experiments are an indication of the intrinsic vulnerability of each
application, i.e., the ability of an application to mask a fault that occurs at the register level.
All applications except MCB follow the same trend, i.e., a high stores/instruction ratio or a low
loads/instruction ratio indicates a high number of crashes. For MCB, it is noticed that the number of
stores/instruction (and similarly the number of stores/load) is one order of magnitude lower than for the
other applications. Moreover, the number of stores/instruction is significantly lower than for the
other applications even with optimization level O0. These observations point to the fact that this
application has completely different characteristics than the others tested in this work. In particular,
it appears that each store instruction has a higher probability of corrupting loop control variables,
as shown in the next section. This characteristic is attributed to the random sampling of the
Monte Carlo algorithm executed by MCB, which produces a less regular computation and memory
access pattern [119].
Crash Analysis: Finally, Crash cases are analyzed in more detail to gain insight into the different causes that may produce an abrupt termination of an application. The results of this analysis are presented in Figure 7.6. At compiler optimization level O0, most of the crashes are due to segmentation faults caused by attempts to access a restricted or unallocated portion of the application address space, probably as the result of corrupting a register that stores the value of a pointer. With increasing compiler optimization, other causes of Crash also arise, such as “MPI faults” caused by erroneous inputs to MPI routines (recall that no faults are injected into the MPI library). The “hangs” cases observed for MCB are due to the application timing out (the timeout is set to 4x the execution time of the fault-free version). Figure 7.6 shows that a considerable number of crashes are due to “other” errors: these are mainly raised by the memory controller when attempting to access erroneous physical addresses.

[Figure 7.6 plot: bars of percentage outcome (0 to 100) for LULESH, MINIFE, LAMMPS, and MCB at optimization levels O0, O1, O2, and O3, broken down into Seg-Fault, Hangs, MPI, and Others.]

Figure 7.6: Statistical breakdown of program crashes.

7.5 Framework to Quantify the Propagation of Faults into Distributed Application State

A compiler-level framework was developed to assess the corruption of application state at any
given time. Specifically, the total number of corrupt memory locations at any given time can be
determined after the fault has been injected into the distributed application.
Once an error occurs in a hardware register, it might propagate and contaminate other registers or
memory locations in the same process address space or, in case of distributed parallel applications,
to the address space of other processes. Understanding the speed (in terms of time) and depth (in
terms of number of contaminated processes) at which a fault propagates is essential to understand
the vulnerability of an application to faults.

[Figure 7.7 content, reconstructed as text. The constant matrix is A = [[1 2 3 4], [4 2 3 1], [2 4 3 3], [1 1 2 4]].]

(a) Without fault injected
Iteration 0: A × [1 2 2 3] = [23 17 25 19]
Iteration 1: A × [23 17 25 19] = [208 220 246 166]
Iteration 2: A × [208 220 246 166] = [2050 2176 2532 1584]

(b) With fault injected (the last row of A becomes [1 1 2 2] during iteration 0)
Iteration 0: A × [1 2 2 3] = [23 17 25 13]
Iteration 1: A × [23 17 25 13] = [184 214 228 116]
Iteration 2: A × [184 214 228 116] = [1760 1964 2256 1086]

Figure 7.7: Fault propagation in Matrix-Vector multiplication.

Figure 7.7 shows an example of how a fault injected into an architectural register may propagate and contaminate a large part of the memory state of an application. The example in Figure 7.7 is an iterative Matrix-Vector multiplication program that, at each iteration i, computes A x_i = b_i, where A is a constant input matrix, x_i = b_{i−1} is the input vector (x_0 = [1 2 2 3] is a program input), and b_i is the iteration output. Figure 7.7a shows three iterations of the program when no fault is injected. Now assume that, during the execution of iteration 0, a fault occurs and the third least significant bit of A[3, 3] flips from 1 to 0, inducing a change of value in A[3, 3] from 4 to 2. The corrupted value in A is then used to compute b_0[3] which, in turn, becomes corrupted. Since b_0 is used as the input vector in iteration 1, the fault keeps propagating and corrupting other values in the application’s state. As shown in Figure 7.7b, in three iterations a single bit flip contaminates 37.5% of the application’s memory state, 100% of the application’s output state b_2, and 100% of the read/write state x_2 and b_2. In fact, in just two iterations, 25% of the application’s state, 100% of the output state, and 62.5% of the read/write state have already been corrupted. Even this rudimentary example shows how quickly a fault can propagate and contaminate part of the application’s state and outputs.
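
The example of Figure 7.7 can be replayed in a few lines. The following Python sketch is illustrative only: the function names are hypothetical, the matrix and input vector are the values shown in the figure, and the faulty value of A[3, 3] is assigned directly rather than by modelling the bit flip itself. The contaminated fraction is computed over the 24 values of A, x_i, and b_i.

```python
# Illustrative replay of the iterative matrix-vector example (Figure 7.7).
# All names are hypothetical; the numeric values are taken from the figure.

A = [[1, 2, 3, 4],
     [4, 2, 3, 1],
     [2, 4, 3, 3],
     [1, 1, 2, 4]]
X0 = [1, 2, 2, 3]


def matvec(matrix, vector):
    """Return the product matrix * vector as a list."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def iterate(matrix, x, iterations):
    """Run b_i = A x_i with x_{i+1} = b_i; return the (x_i, b_i) pairs."""
    history = []
    for _ in range(iterations):
        b = matvec(matrix, x)
        history.append((x, b))
        x = b
    return history


def contaminated_fraction(golden_a, faulty_a, golden_step, faulty_step):
    """Fraction of the state (A, x_i, b_i) that differs from the golden run."""
    golden = [v for row in golden_a for v in row] + golden_step[0] + golden_step[1]
    faulty = [v for row in faulty_a for v in row] + faulty_step[0] + faulty_step[1]
    return sum(g != f for g, f in zip(golden, faulty)) / len(golden)


FAULTY_A = [row[:] for row in A]
FAULTY_A[3][3] = 2  # corrupted value of A[3, 3] used in Figure 7.7b

golden_run = iterate(A, X0, 3)
faulty_run = iterate(FAULTY_A, X0, 3)

for i, (g, f) in enumerate(zip(golden_run, faulty_run)):
    share = contaminated_fraction(A, FAULTY_A, g, f)
    print(f"iteration {i}: golden b={g[1]} faulty b={f[1]} contaminated={share:.1%}")
```

Comparing every value against a fault-free run is only feasible for a toy program; it is used here because it makes the growth of the contaminated state directly visible.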

The following instrumentation and modifications are performed using the LLVM compiler framework to obtain the fault-propagation information:

1. Whenever a memory operation is performed, the value is compared with the golden (pristine) value.
2. Whenever an access is made to an address that holds a potentially corrupt value, the pristine value is reproduced.
3. A parallel set of operations is performed on the pristine set of values.
4. Additionally, function definitions are modified to pass both pristine and potentially corrupt parameters, and to return both pristine and corrupt outcomes.

The fault propagation module (FPM) for MPI applications tracks the fault injected by LLFI++ into a register as it progressively contaminates the application’s memory state. The FPM consists of two components: a compiler-level translation/instrumentation and a runtime checker/tracker. This dual approach is necessary because accurately tracking faults in real applications running on distributed systems is not as simple as it may first appear: many false positives may produce inaccurate results and lead to erroneous conclusions. Table 7.3 shows several examples in which the propagation of a fault depends on the particular operation and the operands involved. The operations in the table use two registers, a and b. Assume that a and b are 8-bit values, that initially a = 19 (00010011) and b = 5 (00000101), and that the second least significant bit of a flips from 1 to 0. As Table 7.3 shows, whether or not the fault introduced in a propagates to b depends on the particular operation in which a and b are involved. In the first example, b becomes contaminated because the corrupted bit a1 is used to compute the final value. In example 4, instead, the corrupted bit a1 is shifted out before the resulting value is stored to b; thus, the fault does not propagate and b contains a clean value.

Table 7.3: Fault propagation depends on instruction type and its operands.

N   Op            Result (b)   Faulty Result (b’)   Contaminated?
1   b = a + 5     24           22                   Yes
2   b = 13        13           13                   No
3   b = a >> 1    9            8                    Yes
4   b = a >> 2    4            4                    No
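
The rows of Table 7.3 can be checked mechanically. The short Python sketch below is illustrative (the names are hypothetical): it evaluates each operation once with the clean value of a and once with the faulty value, and marks b as contaminated only when the two results differ.

```python
# Illustrative check of Table 7.3: a fault in `a` contaminates `b` only if the
# faulty result differs from the result computed with a clean input.

CLEAN_A = 0b00010011             # a = 19
FAULTY_A = CLEAN_A ^ (1 << 1)    # second least significant bit flipped: a = 17

# Each entry pairs the textual form of an operation with a function of `a`.
OPERATIONS = [
    ("b = a + 5", lambda a: (a + 5) & 0xFF),   # 8-bit addition
    ("b = 13", lambda a: 13),                  # `a` is not used at all
    ("b = a >> 1", lambda a: a >> 1),          # corrupted bit becomes bit 0
    ("b = a >> 2", lambda a: a >> 2),          # corrupted bit is shifted out
]


def evaluate(operations, clean, faulty):
    """Return (op, result, faulty_result, contaminated) for every operation."""
    rows = []
    for text, op in operations:
        result, faulty_result = op(clean), op(faulty)
        rows.append((text, result, faulty_result, result != faulty_result))
    return rows


TABLE = evaluate(OPERATIONS, CLEAN_A, FAULTY_A)
for text, result, faulty_result, contaminated in TABLE:
    print(f"{text:12} {result:3} {faulty_result:3} {'Yes' if contaminated else 'No'}")
```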

Understanding whether a fault propagates with a pure compile-time tool is complicated; thus, the tool needs to be integrated with a runtime checker/tracker. Table 7.3 shows that a fault propagates only if the output of an operation diverges from its equivalent output when all inputs are clean. This observation is used to understand, at runtime, how a fault propagates into the application memory state. Thus, both the potentially-corrupted results of an operation (b’ in Table 7.3), computed with inputs that might have been contaminated, and the pristine results of the same operation (b in Table 7.3), computed with inputs that have not been contaminated, need to be computed. The instructions that produce potentially-corrupted results are part of the Primary Chain of instructions, i.e., the original program instruction chain augmented with LLFI++ fault injection instrumentation, while the replicated instructions are part of the Secondary Chain, as shown in Figure 7.8 for the statement c = 2*a + b. The pristine values associated with corrupted memory locations are stored in a hash-table structure in the FPM runtime.
The FPM translation and instrumentation of the statement c = 2*a + b is depicted in Figure 7.9. Figure 7.9a shows an LLVM-like intermediate representation of the code. The program loads the two input values from addresses a and b and stores the final result at address c. Figure 7.9b shows the first step (fault injection): the code is instrumented with fault injection functions (fim_inj(x) at lines 3, 5, and 6).⁷ At runtime, the fim_inj(x) function checks whether a fault should be injected and, if so, flips a random bit in register x (hence the term “potentially-corrupted”). The result is a potentially-corrupted value stored in register xf, which will then be used in the primary chain of instructions. The second step (fault propagation) produces the code shown in Figure 7.9c. All arithmetic instructions are replicated by the source-to-source translator (lines 7 and 11). The original instructions in the primary chain use potentially-corrupted registers (r1f, r2f, r3, r3f, and r4), whereas the replicated instructions in the secondary chain use pristine registers (r1p, r2p, r3p, and r4p). Load and store operations are instrumented with runtime functions: at each load operation, the runtime system checks whether the target memory location has been previously contaminated and, if so, also fetches the pristine value for that memory location (fpm_fetch() at lines 2 and 4 in Figure 7.9c and also the red dashed lines in Figure 7.8). If the target memory location has not been contaminated, fpm_fetch() returns the same value for both the potentially-corrupted and the pristine registers. Store instructions are instrumented to compare the potentially-corrupted value that has to be stored in memory with the corresponding pristine value computed by the secondary chain of instructions (fpm_store() at line 13). If the two values differ, the runtime environment adds the address of the memory location to the list of contaminated memory locations and stores its pristine value (this value will be fetched by subsequent load instructions that access the same memory address).

⁷ In this example, LLFI++ instruments only arithmetic operations, but other classes of instructions can be considered.
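
The load and store instrumentation described above can be modelled in a few lines. The following Python sketch is a simplified illustration, not the FPM implementation: memory is a dictionary, the class `FpmRuntime` and the function `run_statement` are hypothetical names, and the decision taken by fim_inj() is replaced by an explicit `inject` argument. Clearing a contamination record after a clean store is an assumption that the text does not state.

```python
# Hypothetical, simplified model of the FPM runtime for c = 2*a + b.

class FpmRuntime:
    def __init__(self, memory):
        self.memory = dict(memory)   # address -> current (possibly corrupt) value
        self.pristine = {}           # address -> pristine value (FPM hash table)

    def fpm_fetch(self, address):
        """Return the pristine value of `address` (the stored value if clean)."""
        return self.pristine.get(address, self.memory[address])

    def fpm_store(self, value, pristine_value, address):
        """Compare the stored value with its pristine twin and track the result."""
        if value != pristine_value:
            self.pristine[address] = pristine_value
        else:
            # Assumption (not stated in the text): a clean store clears any
            # earlier contamination record for this address.
            self.pristine.pop(address, None)


def run_statement(runtime, inject=None):
    """Execute c = 2*a + b with a primary and a secondary chain.

    `inject` is an optional (register_name, bit) pair that stands in for the
    runtime decision of fim_inj().
    """
    def fim_inj(name, value):
        if inject is not None and inject[0] == name:
            return value ^ (1 << inject[1])
        return value

    mem = runtime.memory
    r1, r1p = mem["a"], runtime.fpm_fetch("a")
    r2, r2p = mem["b"], runtime.fpm_fetch("b")
    r1f = fim_inj("r1", r1)
    r3, r3p = r1f * 2, r1p * 2            # primary and secondary mul
    r2f = fim_inj("r2", r2)
    r3f = fim_inj("r3", r3)
    r4, r4p = r2f + r3f, r2p + r3p        # primary and secondary add
    mem["c"] = r4                         # st r4, c
    runtime.fpm_store(r4, r4p, "c")
    return r4, r4p


runtime = FpmRuntime({"a": 3, "b": 4, "c": 0})
print(run_statement(runtime))                     # fault-free execution
print(run_statement(runtime, inject=("r1", 0)))   # flip bit 0 of r1
print(runtime.pristine)
```

In this model a later load of c through fpm_fetch() returns the pristine value, so the secondary chain stays clean while the primary chain keeps computing with the corrupted value.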
Below, the challenges of the fault-propagation framework are analyzed:
Store addresses: In the example in Figure 7.9, transient errors are only injected into registers that contain the values of variables. However, instructions can also manipulate addresses and use registers to access memory indirectly. If a fault propagates to a register that contains a memory address and that register is subsequently used by a store, the effect is twofold. First, the memory location that is actually modified becomes corrupted because the address used by the store was not supposed to be written. Second, the memory location that was supposed to be written is not modified and hence contains a corrupted value.

[Figure 7.8 diagram, reconstructed as text.]
Primary chain:    r1 = ld a;  r2 = ld b;  r3 = mul r1,2;  r4 = add r2,r3;  st r4,c
Secondary chain:  r3p = mul r1p,2;  r4p = add r2p,r3p

Figure 7.8: Primary and secondary chain of instructions for instrumentation in fault-propagation
framework.

Consider the following instrumented code:
r1f = fim_inj(r1)
st 5, (r1f)
If a fault is injected in r1 (thus, r1 != r1f), the value 5 is written to the memory location at address r1f, which was not supposed to be written and thus becomes corrupted. To record this contamination, the FPM runtime adds the pair <r1f,x> to the runtime hash-table, where x is the value stored at address r1f before the store. The memory location at address r1 was supposed to contain the value 5 after the store, but it has not been overwritten; thus, FPM also adds the pair <r1,5> to the hash-table.
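
The bookkeeping for a store through a corrupted address can be sketched as follows. This Python fragment is illustrative only: `record_store` is a hypothetical name, memory and the hash-table are plain dictionaries, and keeping an older pristine value when the overwritten location is already contaminated is an assumption.

```python
# Hypothetical sketch of the two hash-table updates caused by `st value, (r1f)`
# when the address register was corrupted (r1 != r1f).

def record_store(memory, pristine, intended, actual, value):
    """Store `value` at `actual`; `intended` is the address a clean run uses."""
    if actual != intended:
        # <r1f, x>: the location written by mistake keeps its previous value as
        # the pristine one (assumption: an older pristine record wins).
        pristine.setdefault(actual, memory[actual])
        # <r1, value>: the location that should have been written was skipped.
        pristine[intended] = value
    memory[actual] = value


memory = {0x10: 7, 0x18: 9}
pristine = {}
record_store(memory, pristine, intended=0x10, actual=0x18, value=5)
print(memory, pristine)
```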

(a) LLVM IR
1: r1 = ld a
2: r2 = ld b
3: r3 = mul r1, 2
4: r4 = add r2, r3
5: st r4, c

(b) LLFI++ code
1: r1 = ld a
2: r2 = ld b
3: r1f = fim_inj(r1)
4: r3 = mul r1f, 2
5: r2f = fim_inj(r2)
6: r3f = fim_inj(r3)
7: r4 = add r2f, r3f
8: st r4, c

(c) FPM code
1: r1 = ld a
2: r1p = fpm_fetch(a)
3: r2 = ld b
4: r2p = fpm_fetch(b)
5: r1f = fim_inj(r1)
6: r3 = mul r1f, 2
7: r3p = mul r1p, 2
8: r2f = fim_inj(r2)
9: r3f = fim_inj(r3)
10: r4 = add r2f, r3f
11: r4p = add r2p, r3p
12: st r4, c
13: fpm_store(r4,r4p,c)

Figure 7.9: FPM transformation and instrumentation using sample LLVM-IR program.

Function Calls: The input parameters passed to a function may be contaminated by a previously injected fault and affect the result computed and returned by the function. For pure functions, it would be enough to execute the function twice, once with the potentially-corrupted input parameters and once with the corresponding pristine values. The former case would produce a potentially-corrupted output, the latter a pristine output. This approach is used for library function calls (such as sin() from the math library). However, in general, a function may also access global variables during its execution. This means that a generic function could contaminate a much larger portion of the application state than the returned value. To address this issue, the dual instruction-chain approach is followed. Additionally, the definition of each function is modified to accommodate one extra input parameter (the pristine value) for each original input parameter. Moreover, the exit point of these functions is modified to return a struct that consists of two values: the potentially-corrupted value computed by the primary chain and the pristine value computed by the secondary chain of instructions. Finally, additional code is inserted to properly retrieve the pristine values associated with each input parameter and to store the pristine result.
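
The two treatments of function calls can be illustrated with a small Python sketch. It is not the code generated by the framework: `Dual`, `dual_pure_call`, and `scale_and_shift` are hypothetical names, and a named tuple stands in for the returned struct.

```python
# Illustrative sketch of function-call handling in a dual-chain scheme.
import math
from typing import NamedTuple


class Dual(NamedTuple):
    corrupted: float   # potentially-corrupted value (primary chain)
    pristine: float    # pristine value (secondary chain)


def dual_pure_call(func, arg):
    """Pure library function: call it twice, once per chain."""
    return Dual(func(arg.corrupted), func(arg.pristine))


def scale_and_shift(x, x_pristine, offset, offset_pristine):
    """Transformed form of a hypothetical scale_and_shift(x, offset).

    Every original parameter gains a pristine twin, both chains are executed
    in the body, and the result is returned as a pair.
    """
    result = 2.0 * x + offset                              # primary chain
    result_pristine = 2.0 * x_pristine + offset_pristine   # secondary chain
    return Dual(result, result_pristine)


print(dual_pure_call(math.sqrt, Dual(16.0, 9.0)))
print(scale_and_shift(3.0, 4.0, 1.0, 1.0))
```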

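[Figure 7.10 diagram: the address spaces of MPI processes P1 and P2, each with its FPM hash table (HT P1 with entries <α1, PVal1> and <α2, PVal2>; HT P2 with entries <β1, PVal1> and <β2, PVal2>). The original message, stored at α in P1 and at β in P2, is transferred through MPI communication with a header that holds the number of contaminated locations and the records <d1, PVal1> and <d2, PVal2>.]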

Figure 7.10: MPI message handling within the fault-propagation framework.

Some functions, such as memory management or I/O operations, impact the address space structure
or the interaction with the external world. Such functions are not replicated to avoid side effects,
such as output values printed twice.
MPI communications: A fault can propagate from the address space of an MPI process P1 to the address space of another MPI process P2 if P1 sends a message to P2 containing contaminated data. Neither the sender nor the receiver process has enough information to accurately track the propagation of faults through inter-process communication. The main problem is that a contaminated memory location in the virtual address space of the sender may be stored in a completely different memory location in the virtual address space of the receiver. Since neither the sender nor the receiver has access to the other’s address space, extra information is embedded in the message together with the message itself. Figure 7.10 illustrates the approach in detail. Assume that an MPI process P1 sends a message msg to a destination process P2. In the address space of the sender process, msg is stored at address α, while in the address space of the destination process the message will be copied to address β which, in general, is different from α. Also assume that N memory locations in msg are contaminated (N = 2 in Figure 7.10) and that their pristine values are stored in the FPM hash-table in the address space of P1, HT1. Given that α ≠ β, the addresses of the two contaminated memory locations in the address space of P1 (α1 and α2) cannot be used on the receiver side to derive the addresses of the memory locations in the receiver address space (β1 and β2). Instead, the displacements with respect to the beginning of msg, which remain constant regardless of the initial address at which msg is stored, are used to communicate to the receiver which memory locations in the message are contaminated. Before the message is sent, the FPM runtime routine intercepts the MPI communication function and analyzes the message. For each contaminated memory location, FPM computes the displacement with respect to the initial address α of the original message and retrieves the corresponding pristine value from the hash table. In the example in Figure 7.10, the two contaminated memory locations in the message are stored at α1 and α2; thus, d1 = α1 − α and d2 = α2 − α. FPM then adds an extra header to the original msg, containing the number of contaminated memory locations in the message and one record <displacement,pristine value> for each contaminated memory location. On the receiver side, the destination process extracts the header and uses the first entry to determine how many contaminated memory locations are present in the message. The receiver then extracts the original message msg and stores it at address β. At this point, the displacements in the header records are used to compute the addresses of the contaminated memory locations in the destination address space. Finally, the FPM runtime on the receiver side adds the addresses of the contaminated memory locations and their corresponding pristine values to HT2 in its address space. MPI process P2 can then use these pristine values in its load operations.
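
The header mechanism can be sketched without an MPI library. In the following illustrative Python fragment, each address space is a dictionary from address to value, one message element occupies one address, and `fpm_send` and `fpm_receive` are hypothetical names rather than functions of the FPM runtime.

```python
# Hypothetical sketch of the header built by the sender and consumed by the
# receiver. Addresses differ between the two processes; displacements do not.

def fpm_send(memory, pristine, alpha, length):
    """Return (header, msg) for the message stored at alpha .. alpha+length-1."""
    msg = [memory[alpha + i] for i in range(length)]
    records = [(address - alpha, value)               # <displacement, pristine>
               for address, value in sorted(pristine.items())
               if alpha <= address < alpha + length]
    header = [len(records)] + records                 # first entry: record count
    return header, msg


def fpm_receive(memory, pristine, beta, header, msg):
    """Copy msg to address beta and import the contamination records."""
    count, records = header[0], header[1:]
    for i, value in enumerate(msg):
        memory[beta + i] = value
    for displacement, pristine_value in records[:count]:
        pristine[beta + displacement] = pristine_value


# Process P1: msg at alpha = 100, two contaminated locations inside it and one
# unrelated contaminated location elsewhere.
p1_memory = {100: 1, 101: 99, 102: 3, 103: 77}
p1_table = {101: 2, 103: 4, 500: 0}
header, msg = fpm_send(p1_memory, p1_table, alpha=100, length=4)

# Process P2: the same message lands at beta = 40.
p2_memory, p2_table = {}, {}
fpm_receive(p2_memory, p2_table, beta=40, header=header, msg=msg)
print(header, p2_table)
```

Sending displacements rather than addresses keeps the records valid on the receiver, where β + d identifies the same message element that α + d identified on the sender.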

7.6 Experimental Results for Fault-Propagation Analysis

In this section, the impact of faults on important HPC applications is analyzed by examining how faults propagate into the applications’ state. In addition to the set of benchmarks utilized earlier, the AMG2013 benchmark from the CORAL program is considered in the fault-propagation analysis. All applications are compiled with default optimizations in this set of experiments and use their default input set.

7.6.1 The Black-Box Analysis

Fault-injection experiments are conducted similarly to the “black box” approaches followed in previous work [95, 146, 178] and are based on the analysis of the application’s output variation. Each application is run 5,000 times, and a single fault is injected in each execution into a randomly selected MPI process. The application output is considered corrupted (WO) if it differs significantly from that of the fault-free execution (a 5% tolerance is utilized herein) or if the application itself reports results outside of the acceptable error metrics.
The results are reported in Figure 7.11. LULESH appears to be a robust application, with over 90% of cases resulting in correct results and no performance penalties. Fewer than 10% of the executions result in crashes and fewer than 5% of the experiments produce wrong results. At the other extreme, LAMMPS appears to be the most vulnerable application: about 20% of the experiments result in crashes and in 40% of cases the result is corrupted by a single-bit fault (WO). The final results are correct in only 40% of the experiments. MCB shows a behavior similar to LAMMPS, though 60% of experiments show correct results. For miniFE, a considerable number of cases produce a correct result but take more time to converge to an acceptable solution (PEX). These are interesting cases, as they expose a particular characteristic of scientific applications that is not necessarily present in other domains: the user could trade off the accuracy and correctness of the computed solution for the performance of the application. These kinds of tradeoffs will be more important in the exascale era, when SDCs will be more common. In the majority of cases, crashes are due to bit flips in pointers that cause the applications to access a part of the address space that has not been allocated. However, some application-level provisions may also cause a crash. For instance, LULESH has internal checks that are run on the partial result to look for obvious deviations from the norm. For example, if the energy computed at time step i is outside of the acceptable boundary, the application aborts the execution by calling the MPI_Abort() routine.

[Figure 7.11 plot: bars of percentage outcome (0 to 100) for LULESH, AMG2013, miniFE, LAMMPS, and MCB, broken down into CO, WO, PEX, and Crashed.]

Figure 7.11: Outcome of fault injection with single fault into a single MPI process.

7.6.2 Fault Propagation Analysis

Statistical analysis of the vulnerability of parallel applications based on output variation provides useful high-level information but does not provide insight into the application’s memory state. This set of experiments examines how injected faults propagate through the application. In particular, the objective is to discover how fast and with which profile faults propagate in the memory space of an MPI process (propagation speed) and how many MPI processes are contaminated (propagation depth). Since it is not possible to show all the graphs, a representative set of fault propagation profiles for each application is shown in Figure 7.12. The application of these profiles is discussed later. Even the few cases plotted already highlight the importance of the applications’ structure and algorithm in the propagation of faults. As reported in the previous section, crashes generally occur immediately after or close to the injected fault; thus, such cases are not reported in Figure 7.12. The plots in Figure 7.12 report two cases for each of the three remaining classes (CO, WO, and PEX), whenever possible. The maximum percentage of contaminated memory state for each application is also reported separately in Figure 7.12f.
Generally, the iterative nature of these scientific applications produces a deep contamination of their memory state. When faults contaminate the velocity or position of a particle P, the interaction of P with other particles and the forces induced on the latter by P will produce wrong movement or energy charges. The particles affected by P will also be contaminated, and the process will repeat exponentially in the next time steps. Eventually, given enough iterations, all of the memory state will become contaminated. Below, each application is discussed separately.

[Figure 7.12 plots: panels (a) LULESH, (b) AMG, (c) miniFE, (d) LAMMPS, and (e) MCB show the number of corrupted memory locations (y-axis) over time in cycles (x-axis) for representative runs labeled Masked, Wrong, or Prolonged Execution; panel (f) shows the percentage of contaminated memory state for LULESH, AMG, MINIFE, LAMMPS, and MCB.]

Figure 7.12: Fault propagation plots demonstrating the number of corrupted memory locations over time.

Figure 7.12a shows how faults propagate in LULESH [75], a shock hydrodynamics proxy application developed by the ASCR ExMatEx Exascale Co-Design Center to model the numerical algorithms, data motion, and programming style of typical scientific applications. The application solves a simple Sedov blast problem with analytical answers. As depicted in the plots, injected faults progressively propagate into the application’s state. This is the result of the iterative structure of the application, which uses the results of a time step i (speed and position of the fluid) as input to time step i + 1. A closer look at the graph makes it possible to identify the time steps. Within each time step, the number of contaminated memory locations remains roughly constant, while between one time step and the next the number of contaminated memory locations increases. The plot in Figure 7.12a also shows that the propagation of faults follows the same trend in all cases, regardless of the correctness of the final output or of whether the application takes longer to converge.
LAMMPS [127] is a molecular dynamics code that models an ensemble of particles in a liquid, solid, or gaseous state. The application computes Newton’s equations of motion for a system of interacting particles and can model atomic, polymeric, biological, metallic, granular, and coarse-grained systems using a variety of force fields and boundary conditions. Herein, the Cu metallic solid with embedded atom method (EAM) potential, which involves the dynamics of 32,000 atoms, is solved for 100 time steps. Figure 7.12d shows that faults injected in the application progressively propagate through the memory state at every time step. A fault that corrupts the velocity or the position of a molecule at time step i will induce wrong forces on the adjacent molecules at time step i + 1. Within 100 time steps, more than half of the memory state becomes contaminated (see Figure 7.12f), which results in more than half of the cases producing corrupted results in Figure 7.11. An interesting case is represented by the lower profile in Figure 7.12d. In this case, the injected fault corrupted a static data structure that is not used during the computation; thus, the fault does not propagate to the rest of the application’s memory state. Note that this case was not identified in the previous experiments based on output variation analysis.

miniFE is a DOE proxy application that implements several kernels representative of implicit finite-element applications. In particular, the application assembles a sparse linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements. Next, miniFE solves the linear system using a simple unpreconditioned conjugate-gradient (CG) algorithm and compares the computed solution to an analytical model for steady-state temperature in a cube. Figure 7.12c presents fault propagation profiles for miniFE. In the graph, the assembly of the linear system in the first part (which mainly consists of scattering element-operators into the sparse matrix and vector) can be distinguished from the CG solving phase (sparse matrix-vector products). Faults injected in the initialization phase quickly propagate and contaminate the sparse matrix and vector (as in the dense example in Figure 7.7) and reach a steady state that is maintained in the solving phase. Faults injected in the solving phase quickly reach a steady state. As indicated by the profiles, the two cases with wrong results exhibit different behaviors: in the left-most case, the internal check on the sparse matrix and vector fails and the application aborts before starting the solving phase. In the right-most case, the application does not converge and terminates after reaching the maximum number of iterations. Given the sparsity of the matrix and vector, even a small percentage of contaminated memory locations (see Figure 7.12f) can lead to corrupted results or prolonged executions.
AMG2013 [71] is a parallel algebraic multi-grid solver for linear systems arising from 3-D problems on unstructured grids. The communication and computation patterns exhibit the surface-to-volume relationship. The default problem, a 3-D Laplace type problem on an unstructured domain with an anisotropy in one part, is used herein. The fault propagation results for AMG2013 are shown in Figure 7.12b. The application performs three different phases that can be identified in the figure, especially when the fault is injected early during the execution of the application: Initialization, Setup of the conjugate gradient pre-conditioner, and the Solving phase. A close look at the data reveals that faults injected in the early initialization phase propagate slowly at first and then ramp up when the setup phase starts. During the setup, the number of contaminated memory locations remains roughly constant, which indicates that the unstructured grid becomes quickly and completely contaminated. Finally, in the solving phase, AMG2013 allocates the data structures required to solve the Laplace problem: as seen from the graph, faults quickly propagate in the memory state of the application, contaminating more memory locations at every iteration of the solver. In two cases, the injected fault contaminates data structures not involved in the solving phase. In these cases, the number of contaminated memory locations remains stable at the value reached during the setup phase.
MCB models the solution of a simple heuristic transport equation using a Monte Carlo technique. The application employs typical features of Monte Carlo algorithms such as particle creation, particle tracking, tallying particle information, and particle destruction. The heuristic transport equation models the behavior of particles that are born, travel with a constant velocity, scatter, and are absorbed. MCB achieves parallelism through domain decomposition of the physical space and threading. When particles hit the boundary of a domain, they are buffered and then sent using non-blocking MPI calls to the processor simulating the domain on the other side of the boundary. Figure 7.12e shows the typical fault propagation profile already observed for the other iterative applications. Faults propagate from one particle to another during their movement and from one MPI process to another when a particle moves across domains. Interestingly, even late-injected faults can still corrupt the output.
Propagation through MPI processes: The following experiments analyze how faults propagate across different MPI processes. As explained earlier, only a single fault is injected into a randomly selected MPI process. However, that process may send contaminated data to other MPI processes and, thus, corrupt their address spaces.

Figure 7.13 shows two examples, LULESH and miniFE, in which an initial fault injected at a certain time X into MPI process 4 and MPI process 6, respectively, propagates and contaminates all other MPI processes. For LULESH, the fault propagates immediately to the other MPI processes and spreads very quickly, as the MPI processes exchange data at the end of each iteration. For miniFE, instead, the fault does not propagate until very late in the execution, but then spreads quickly to all other MPI processes.

[Figure 7.13 plot: number of corrupted MPI ranks (y-axis, up to 1000) over time in seconds (x-axis, from X to X+200) for LULESH and MINIFE.]

Figure 7.13: Fault propagation across different MPI processes.

Categorization Based on Fault Propagation: The results presented in Figures 7.11 and 7.12 are somewhat contradictory. Figure 7.11 shows that the tested applications can tolerate the presence of faults during their execution and still produce correct results. For example, in 90% of the cases LULESH produces a correct result in the presence of a randomly-injected fault. Following this data, the user could decide to employ a light-weight resilience mechanism to protect LULESH, given the relatively robust nature of the application. In reality, however, the application is quite sensitive to transient errors, as LULESH’s memory state might be corrupted significantly even when the final output is correct (Figure 7.12). This observation holds for the other applications as well. This means that relying only on the results of fault injection experiments may lead to incorrect conclusions and to the deployment of resilience mechanisms that are not adequate.
Using the proposed fault propagation framework, it is possible to distinguish cases in which transient errors propagate through the application’s memory state from the cases in which a fault is
masked at processor level before contaminating any memory location. As explained earlier, the
CO cases in Figure 7.11 can be decomposed into two further categories: Vanished and ONA. A
deeper analysis of the internal propagation of faults reveals that most cases (over 98%) identified
as CO in Section 7.6.1 present corrupted memory states. The number of cases in which faults are
masked at processor level before propagating to memory is surprisingly low. This may be due to
the fact that LLFI injects faults into live registers and that these faults have a higher probability of
propagating to memory than faults injected into dormant registers. Previous work has also identified similar discrepancies between circuit- and register-level fault injection [45]. Nevertheless,
these results show that it would be dangerous to assume that the tested applications can tolerate
the presence of faults while, in reality, they may produce incorrect results in a slightly different
execution context.

7.7 Fault Propagation Modeling

In Section 7.6.2, it is noted that faults generally propagate linearly in the application’s state during
the execution. Here, these observations are utilized to build a fault propagation model that can be
used to estimate the number of Corrupted Memory Locations (CMLs) at a time t, once a fault is
detected at time tf . From the graphs in Figure 7.12, it is evident that each fault propagation profile
can be expressed as a function of the execution time with a piece-wise equation that is linear in the
first sub-domain and constant in the second. The linear part of the profile is the most interesting
because the different profiles characterize the sensitivity of the applications to faults. Thus, linear
regression techniques can be employed to derive a generic closed form of the fault propagation
profile. For each experiment, the specific fault propagation profile can be expressed as:

\[ \mathrm{CML}(t) = a \cdot t + b \tag{7.1} \]

where t is the time during the execution, a expresses how quickly a fault propagates in terms
of memory locations corrupted per second, and b is the intercept, which is fixed by the time tf at which the fault occurs.
Standard validation techniques are employed to verify the accuracy of each model. Our results
show that the errors are within 0.5% of the actual CML values. The value of b in a particular
execution can be derived from the time tf at which the fault occurs:⁸

\[ b = -a \cdot t_f \tag{7.2} \]
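
As an illustration of how the linear part of a profile could be fitted, the following sketch applies ordinary least squares to (t, CML) samples and recovers the fault time from Equation (7.2) as tf = −b/a. The function names and the synthetic samples are assumptions made for this example; the samples are generated from a = 2 and tf = 5 and are not measurements.

```python
def fit_linear_profile(times, cmls):
    """Least-squares fit of CML(t) = a*t + b over the linear sub-domain."""
    n = len(times)
    if n < 2 or n != len(cmls):
        raise ValueError("need at least two (t, CML) samples of equal length")
    mean_t = sum(times) / n
    mean_c = sum(cmls) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all samples share the same time stamp")
    sxy = sum((t - mean_t) * (c - mean_c) for t, c in zip(times, cmls))
    a = sxy / sxx              # propagation rate in CML per second
    b = mean_c - a * mean_t    # intercept of the fitted line
    return a, b


def fault_time_from_fit(a, b):
    """Invert b = -a * tf to recover the time at which the fault occurred."""
    if a == 0:
        raise ValueError("a flat profile carries no fault-time information")
    return -b / a


# Synthetic samples drawn from a = 2 CML/s and tf = 5 s (illustrative only).
sample_times = [6.0, 7.0, 8.0, 9.0, 10.0]
sample_cmls = [2.0, 4.0, 6.0, 8.0, 10.0]
slope, intercept = fit_linear_profile(sample_times, sample_cmls)
estimated_tf = fault_time_from_fit(slope, intercept)
print(slope, intercept, estimated_tf)
```

Only samples from the linear sub-domain should be passed to the fit; including samples from the constant sub-domain would bias the slope downwards.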

By applying linear regression to each fault propagation experiment, a family of linear functions is
obtained. Using these linear models, the fault propagation speed factor (FPS) for each application
can be computed as the average of the a factors. The FPS expresses the rate at which transient
errors propagate into the application’s state. The metric can be used operatively to estimate the
number of CMLs within a time interval (t1 , t2 ), even if the exact time at which the transient error
occurred, tf , is not known. For example, assume that no fault was detected at time t1 and that a
fault is detected at time t2 . The application FPS can then be used to estimate the maximum number
of CMLs, as:
\[ \max(\mathrm{CML}(t_1, t_2)) = \mathrm{FPS} \cdot (t_2 - t_1) \tag{7.3} \]

⁸ We assume that the fault is detected when it occurs. In reality, there might be a delay ∆t between the occurrence and
the detection of the fault, which needs to be taken into account in the computation of b.

The above formula is an upper bound on the number of CMLs, since it assumes that
the time at which the fault occurs, tf , is close to the lower extreme of the interval, t1 . On average, tf falls at the
midpoint of the interval, tf = (t1 + t2 )/2, hence the average number of CMLs in the time interval (t1 , t2 ) is avg(CML(t1 , t2 )) =
max(CML(t1 , t2 ))/2, as expected.⁹ The estimation provided by the model can be used to decide,
at runtime, if a roll-back should be triggered. For applications with low FPS, i.e., relatively robust
applications, the fault-tolerance system could decide to keep the application running if the CML at
the end of the application is predicted to be below a safe threshold.
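
The use of the FPS at runtime can be sketched as follows. The helper names, the sample slopes, and the roll-back rule are assumptions for illustration only: the rule extrapolates the linear profile from the worst-case fault time t1 to the end of the run and ignores the constant sub-domain, which makes the prediction conservative.

```python
def fault_propagation_speed(slopes):
    """FPS of an application: the average of the fitted a factors."""
    if not slopes:
        raise ValueError("at least one fitted slope is required")
    return sum(slopes) / len(slopes)


def max_cml(fps, t1, t2):
    """Upper bound of Equation (7.3): the fault is assumed to occur at t1."""
    if t2 < t1:
        raise ValueError("t2 must not precede t1")
    return fps * (t2 - t1)


def avg_cml(fps, t1, t2):
    """Average case: the fault occurs, on average, at the interval midpoint."""
    return max_cml(fps, t1, t2) / 2.0


def should_roll_back(fps, t1, t_end, safe_threshold):
    """Hypothetical policy: roll back if the worst-case CML predicted at the
    end of the run (fault at t1, linear growth until t_end) exceeds the
    threshold."""
    return max_cml(fps, t1, t_end) > safe_threshold


# Illustrative slopes (not measured) whose mean equals the LULESH FPS.
example_fps = fault_propagation_speed([0.0146, 0.0148, 0.0147])
worst_case = max_cml(example_fps, 100.0, 160.0)
average_case = avg_cml(example_fps, 100.0, 160.0)
print(worst_case, average_case)
print(should_roll_back(example_fps, 100.0, 1000.0, safe_threshold=10.0))
```

The threshold itself is application-specific; the sketch only shows where the FPS enters the decision.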
Table 7.4: Fault propagation speed factors.
App.       FPS       SDev
LULESH     0.0147    1.48E-4
LAMMPS     0.0025    0.96E-4
MCB        0.0562    26.7E-4
AMG2013    0.0144    6.82E-4
miniFE     0.0035    2.89E-4

Table 7.4 reports the FPS computed for each application. Contrary to what is indicated in
Figure 7.11, Table 7.4 shows that, when considering the CML in the application’s state independently of the final output, LULESH is much more vulnerable than LAMMPS, as faults propagate
at a rate of 0.0147 CML/sec in the former and 0.0023 CML/sec in the latter. MCB is the most vulnerable application among the ones tested. For this application faults propagate at a rate of 0.0531
CML/sec. This may be a property of the Monte Carlo method used in the application, which is
almost embarrassingly parallel. Interestingly, LAMMPS and miniFE, which are the applications
with the largest percentage of wrong results in Figure 7.11, are the applications with the lowest
FPS. This indicates that, compared to the other applications, faults propagate at a much lower rate
in these two applications but that the error margins used to accept the final solutions are stricter.
Since the error margins are parameters set by the user and do not depend on the application’s characteristics, the FPS metric may be a more precise way to assess the intrinsic vulnerability of an
application.
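
The ranking implied by the FPS metric can be reproduced directly from the values printed in Table 7.4. The sketch below is only a convenience for comparing applications; the helper names are hypothetical.

```python
# FPS values as printed in Table 7.4 (CML per second).
FPS_TABLE = {
    "LULESH": 0.0147,
    "LAMMPS": 0.0025,
    "MCB": 0.0562,
    "AMG2013": 0.0144,
    "miniFE": 0.0035,
}


def rank_by_fps(table):
    """Applications ordered from fastest to slowest fault propagation."""
    return sorted(table, key=table.get, reverse=True)


def relative_to_lowest(table):
    """Each FPS expressed as a multiple of the lowest FPS in the table."""
    lowest = min(table.values())
    return {app: fps / lowest for app, fps in table.items()}


ranking = rank_by_fps(FPS_TABLE)
ratios = relative_to_lowest(FPS_TABLE)
print(ranking)
print({app: round(ratio, 2) for app, ratio in ratios.items()})
```

Because the FPS does not depend on user-set error margins, this ordering can differ from the one suggested by the fraction of wrong results in Figure 7.11.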
⁹ min(CML(t1 , t2 )) = 0, assuming tf = t2 .

7.8 Related Works on HPC Fault Tolerance

Fault Injection: Fault injection can be performed at various abstraction layers, from circuit to
application level. Cho et al. [45] present a study of the accuracy of fault injection at higher abstraction layers. The authors report that single bit-flips at the circuit-level such as those in flip-flops
are difficult to model at the register-file or architecture level. Fault injection at the circuit level is
considered the most accurate method, but it requires sophisticated infrastructure, such as radiation exposure of processor chips [43, 44] or processor RTL simulations [45, 107]. Next, cycle-accurate
microarchitecture-level simulators [97] and architecture-level simulators [60, 24] have been proven
reliable, although both might limit the size and scale of the applications and systems under study.
Software-Implemented Fault Injection (SWIFI) [31, 49, 95, 106, 146, 150, 178] can be used to perform accelerated fault injection at application level. Li et al. [95] propose a fault injection system
based on Pintool [105], a binary instrumentation tool for x86 systems. In [95], a single fault is randomly injected into data-structures of parallel scientific applications and the correlation between
the fault outcome and the location of the injected faults is analyzed. Such fault injection is done directly in the application state, by corrupting the global, heap and stack memory space and does not
consider faults that may be masked at the architecture level, before reaching memory. Compiler-level SWIFI tools such as LLFI [178] and KULFI [146] provide the ability to inject faults at register
level, which provides a chance for the faults to be masked at the register level before being committed to memory. Other related works include [60, 103, 138]. In [178], the accuracy of LLFI is also
quantified in assessing application resiliency as compared to dynamic-instrumentation-based fault
injection. LLFI is found to provide adequate information about application vulnerability to soft
errors. Other techniques, like Xception [106] and GOOFI2 [150], inject faults during exception
handling routines or through breakpoints, respectively.

Application vulnerability studies are instrumental in providing important insights into application
behavior, such that mechanisms can be enforced to provide the desired level of application performance and/or resiliency. Critical application points identified using such studies can then be
hardened against soft errors by techniques such as instruction duplication, to ensure that the correct application state is maintained during execution or, otherwise, that the error is detectable [138]. Other
software-only techniques such as Software Implemented Fault-Tolerance (SWIFT) reduce the overhead of instruction duplication through control-flow checking and exploiting unused instruction-level parallelism [133]. Hardware-only solutions, such as Dynamic Implementation Verification
Architecture (DIVA), ensure functional correctness at runtime by augmenting the commit phase of
an out-of-order core with a functional checker [15]. Energy utilization versus resilience tradeoffs have been explored for reconfigurable parallel architectures [74]. A detailed comparison of
compiler-based software-only and hybrid hardware/software techniques for soft-error protection
of embedded processors is presented in [108]. In this work, it was investigated whether compiler-based software-only techniques can affect the resiliency of distributed applications and the relative
performance/resilience tradeoffs.
Vulnerability Metrics: Different metrics at various abstraction layers have been presented in the literature with distinct goals. Figure 7.14 summarizes some of the resilience and vulnerability metrics
commonly used. At the lowest level of abstraction, circuit-level masking is quantified using the
Timing Vulnerability Factor (TVF) [140] which is used to determine the probability of a fault occurring within the setup and hold time windows of the Flip-Flops in the circuit. Moving upwards to
assess the vulnerability of macro hardware structures inside of a processor like register files, arithmetic logic units, re-order buffer, the Architectural Vulnerability Factor (AVF) was proposed [117]
to determine the probability of a fault in each of these structures causing an error in the final application outcome. In a subsequent work, the Program Vulnerability Factor (PVF) [152] was proposed to evaluate the software reliability independent of the underlying microarchitecture so as to
propose changes at the software or application-level. Thus, AVF can be used to determine both the
architecture and microarchitecture level masking effects on the application outcome, whereas PVF
focuses on architecture-level masking effects. Both AVF and PVF are computationally-intensive,
especially if applied to large parallel, distributed-memory applications. In [48], the authors categorize instructions based on derating of a single bit flip in architectural registers and propose the
Application Derating metric [21]. PVF or AVF analysis can be done using Architecturally Correct
Execution (ACE) [117] analysis or using fault injection. On a side note, previous works [173] have
pointed out inaccuracies of ACE analysis as compared to fault injection studies, which can lead to
overestimation of protection mechanisms.

[Figure: diagram of the abstraction levels Circuit, Microarchitecture, Architecture, Software/App, and MPI Layer, annotated with the metrics TVF, AVF, PVF, DVF, and FPS.]
Legend: TVF = Timing Vulnerability Factor; AVF = Architectural Vulnerability Factor; PVF = Program Vulnerability Factor; DVF = Data Vulnerability Factor; FPS = Fault Propagation Speed.
Figure 7.14: Fault-Masking levels investigated by related works.
In a complementary work [62], fault propagation was studied across MPI boundaries. However,
as the faults are directly injected into data structures in distributed memories, the architecture-level masking effects cannot be quantified. Similarly, the Data Vulnerability Factor (DVF) [185]
is used to capture the vulnerability of data structures inside of an application by estimating the
memory access patterns. While AVF, PVF and DVF are scalar metrics, the methodology and FPS
metric proposed herein provide detailed information about the internal application memory state
within a single process and across multiple processes in distributed applications, while providing
information about architecture- and software-level masking similar to the PVF metric.
Fault Propagation: Li et al. [95] use application knowledge to analyze the results and qualitatively
correlate the outcomes of injected faults to the locations where the fault was injected. They assess that
a fault has propagated if the final result of the application has been corrupted or the execution has
been prolonged. However, this work shows that faults may propagate into the application’s memory state even if they do not corrupt the final state of the application; additionally,
the speed and depth with which faults propagate are quantified. This quantitative analysis is important to understand the underlying vulnerability characteristics of an application and to select the
most effective fault tolerance system. Similar application-specific studies have been performed for
multi-grid solvers in [36] and iterative linear algebra methods in [28]. The authors in [36] study
the effect of fault propagation across various phases of AMG by observing the deviation of known
data structures from fault-free values with the final goal of protecting critical pointers. In contrast,
a generic methodology that allows the user to study a larger set of applications is feasible through
the proposed work.
Fault Tolerance based on Fault Injection Studies: In the past, researchers have been mainly
concerned about permanent (or hard) errors that occur when a specific component does not work
properly and needs replacement [58]. Permanent errors are more consistently observable, though
modern system components, such as network links or memory modules, may appear to function,
though not with the expected performance [102]. Fault detection is orthogonal to this work, yet important for the design of resiliency systems. SoftWare Anomaly Treatment (SWAT) [135] utilizes
a low-overhead, symptom-based fault detection mechanism to provide fault-tolerance. Application
symptoms include triggers like high OS activity, hardware traps and program invariants. Compile-time
analysis is also employed to analyze applications and identify instructions that have an increased chance of causing SDCs [138]. Similar studies have been presented in [60, 84, 103]. In
Shoestring [60], the authors find that instructions that modify the system state and/or serve as input
to function calls are potential candidates for SDC hardening. The authors then propose selective
instruction duplication and comparison of results before committing final outcomes to memory, to increase tolerance to SDCs. These studies are mostly done at small scale, while large-scale
systems can be targeted for similar analysis through the work proposed in this chapter.

7.9 Summary

Power constraints of exascale systems will encourage the use of techniques like massive parallelism
and NTV, which may result in higher error rates and impact the correctness of scientific applications.
Despite its importance, the vulnerability of HPC applications has not received enough attention,
primarily because of the lack of tools that allow researchers to perform accelerated fault injection
and analyze how injected faults propagate in the application’s state. Thus, a novel fault propagation
framework is developed that is capable of accurately tracking the propagation of faults into parallel
MPI applications. The framework provides programmers with new insights about the vulnerability
of applications to transient errors and the correlation with the application structure. Results indicate that even a single fault introduced at the processor-level can contaminate a considerable part
of the application’s state. Compiler optimizations are shown to strongly impact application vulnerability because they affect application structure and algorithm. Some applications are inherently
more robust than others and can tolerate partially-corrupted memory states, possibly at the cost of
increased execution time. Generally, faults are observed to propagate linearly and progressively
corrupt the distributed address space through message passing communication. This observation is
used to derive fault propagation models for each application that can be used at runtime to estimate
the number of corrupted memory locations once a fault is detected. It is also shown that analyzing
the vulnerability of parallel applications with a “black-box” approach based on output variability
analysis resulting from statistical fault injection may lead to incorrect conclusions. In particular,
the developed tool is capable of differentiating the cases in which a fault is completely masked
at processor-level and does not propagate to memory from those cases in which the application’s
memory state becomes contaminated, even though the final results appear correct.

CHAPTER 8: CONCLUSIONS AND FUTURE WORK

This final chapter presents a summary of the achievements of this dissertation and their integration into
the state of the art. In addition, limitations of the proposed techniques are discussed, with recommendations for improvement. The final section presents a list of possible future work.

8.1 Summary of Developed Techniques and Tools

Future computing architectures will need to continuously adapt in order to effectively utilize the
underlying components or devices, which are made unreliable by high integration density, performance,
and energy-efficiency requirements. Operating reliably in these environments while meeting user specifications will require computer architects to provision novel techniques at various design
abstraction levels, which are applied autonomously and are oblivious to the application behavior.
Multiple abstraction layers can interact to provide a coherent computational ecosystem as demonstrated herein. Specifically, in this dissertation, adaptability is demonstrated from the circuit level up to
the application level. The technical achievements of the developed circuit-level fault-tolerance techniques are summarized in Table 8.1.
Netlist-Driven Evolutionary Refurbishment & Evolvable Hardware: In some cases, adaptability is possible due to the ability of the computing platform to support such amorphism. For instance,
FPGAs support this trait at runtime through their dynamic reconfiguration capability, which makes
them an ideal platform for applications which require autonomous operation in addition to a short
time-to-market or design-cycle. Thus, intelligent techniques can easily be adopted which can
meet multiple objectives such that the functional requirements of the application are maintained
while providing reliable and energy-efficient operation. The EHW technique is an example of
providing functional reliability, where faults which manifest at the application-level are handled
autonomously without requiring functional models to verify correctness. The genetic-algorithm is
utilized to realize FPGA configurations, which can be downloaded onto the hardware at runtime to
quantify functional correctness; other diagnoses can be performed similarly. Herein, the changes
to the genetic algorithm proposed through the NDER technique are shown to reduce its search space
significantly to make the refurbishment of relatively large-sized circuits feasible.
Table 8.1: Summary of technical achievements for circuit-level fault-tolerance techniques.
Technique                                 Goals met
DD (Chapter 3)                            CMF Mitigation in TMR Systems for FPGAs
NDER (Chapter 4)                          Pruning in EHW reduces Recovery Time
SREL (Chapter 5)                          Relaxed Guardbands reduces Energy Consumption
TMR for Soft-Error Masking (Chapter 6)    100% Fault Coverage at NTV in Energy Budget

Technique                                 Limitations
DD                                        DD metric does not scale well for FPGA circuits
NDER                                      Proprietary Configuration Bitstream for FPGAs
SREL                                      Area Overheads and Timing Wall
TMR for Soft-Error Masking                200% Area Overhead

Technique                                 Remaining Challenges
SREL                                      Processor Implementation on a Fabricated Chip
TMR for Soft-Error Masking                Temporal Redundancy for Area Reduction

The self-healing capability of the conventional EHW technique is extended by NDER through exploitation of circuit characteristics. NDER takes into account the output discrepancy conditions,
formulated as spanning from exhaustive down through aggressive pruning heuristics, to reduce the
search space and maintain the functionality of viable components in the subsystems which use
them. For instance, the failure syndrome expressed as the bitwise output discrepancy is utilized to
form a concise set of suspect resources at failure-time. Thus, the individuals comprising the population of the GA are dynamically determined, which balances the metrics of fault coverage, size
of the search space, and implementation complexity. The proposed hybrid heuristic HAE is shown
to achieve significant pruning while achieving complete fault recovery. Its capability to preserve
pristine LUTs promotes gainful search for the GA. As a result of the proposed work, the benefits
of EHW are feasible, such as the exploration of the design-space independent of the underlying
device technology characteristics and failure mechanisms.
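
The idea of forming a suspect set from the bitwise output discrepancy can be sketched as follows. This is a simplified illustration and not the NDER heuristics themselves: it assumes that each LUT is annotated with the set of primary outputs (POs) it can influence, with PO 1 as the LSB, and the two marking rules stand in for the exhaustive and aggressive ends of the pruning spectrum. All names and the sample netlist are hypothetical.

```python
def discrepant_outputs(expected, observed, num_outputs):
    """POs (1 = LSB) on which the observed output word differs from the
    expected one."""
    diff = expected ^ observed
    return {po for po in range(1, num_outputs + 1) if (diff >> (po - 1)) & 1}


def mark_suspects(output_sets, bad_outputs, aggressive=False):
    """Select the LUTs that may explain the observed discrepancy.

    Exhaustive rule (assumed): any LUT feeding at least one corrupt PO.
    Aggressive rule (assumed): only LUTs feeding every corrupt PO.
    """
    if not bad_outputs:
        return set()
    if aggressive:
        return {lut for lut, pos in output_sets.items() if bad_outputs <= pos}
    return {lut for lut, pos in output_sets.items() if pos & bad_outputs}


# Hypothetical five-LUT circuit with five primary outputs.
example_output_sets = {
    "L0": {1},
    "L1": {2, 5},
    "L2": {2},
    "L3": {5},
    "L4": {3, 4},
}
bad = discrepant_outputs(expected=0b10010, observed=0b00000, num_outputs=5)
exhaustive_pool = mark_suspects(example_output_sets, bad)
aggressive_pool = mark_suspects(example_output_sets, bad, aggressive=True)
print(sorted(bad), sorted(exhaustive_pool), sorted(aggressive_pool))
```

LUTs left unmarked are kept out of the GA population, which is how pruning shrinks the search space while preserving pristine resources.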
Self-Recovery Enabled Logic: On the other end of the spectrum, hardware reconfiguration may
not be possible for ASICs, unless special provisions are accommodated at the design-time. Thus,
ensuring reliable operation is a design challenge in such computing platforms. For example,
general-purpose processors are ASICs, where software developers write programs which assume
that the underlying circuit performs a sequence of instructions reliably. At the circuit-level, performance variations due to reliability concerns such as transistor-aging can be compensated via
worst-case design-margins or voltage/frequency scaling at runtime. Although conceptually simple, voltage or frequency variation at runtime requires the use of on-chip sensors which can provide
feedback to compensate these circuit parameters accordingly. Additionally, voltage has a quadratic
relationship with dynamic power consumption and frequency can affect application throughput.
Thus, techniques developed herein can ensure reliable operation while minimizing these overheads.
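
The quadratic dependence mentioned above follows from the standard CMOS dynamic power relation P = α · C · V² · f. The sketch below evaluates the relative dynamic power after scaling voltage and frequency; the function name and the 10% guardband reduction are assumptions chosen for illustration and do not correspond to the results reported in this dissertation.

```python
def relative_dynamic_power(voltage_scale, frequency_scale=1.0):
    """Dynamic power relative to the baseline, from P = alpha * C * V^2 * f.

    Activity factor and switched capacitance are assumed unchanged, so they
    cancel in the ratio.
    """
    if voltage_scale <= 0 or frequency_scale <= 0:
        raise ValueError("scale factors must be positive")
    return voltage_scale ** 2 * frequency_scale


# Hypothetical example: the supply voltage guardband is reduced by 10%.
scaled_power = relative_dynamic_power(0.9)
savings = 1.0 - scaled_power
print(f"relative power: {scaled_power:.2f}, savings: {savings:.0%}")
```

Scaling frequency affects power only linearly, but it also changes throughput, which is why the two knobs are not interchangeable.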
To this effect, the proposed SREL technique leverages the silicon area on the chip which is ‘dark’.
Recall, dark silicon is due to high integration densities and thermal constraints which make it
infeasible to utilize all the on-chip transistors. This area is utilized by the SREL technique to implement a redundant set of aging-critical logic elements of the circuit, which are space multiplexed
with the active elements in the circuit. By alternating the activation of these logic elements through
power-gating, it is possible to reduce stress which results in significant mitigation of aging-induced
performance degradation in addition to reduced leakage energy overheads. The activation of these
elements can be controlled autonomously by using on-chip timing sensors as described for the
competing critical paths control strategy developed in this dissertation. This circuit-level adaptation
is carried out independent of the application-level, and in effect provides a unique technique
where architecture-level control knobs can provide post-fabrication flexibility.
Furthermore, the proposed circuit remodeling is compatible with other feedback control mechanisms such as Razor, and can be extended to enable self-selection and runtime competition among
logic domains in the presence of other noise sources such as process, voltage and temperature
variations. Insertion of shadow latches for detecting timing violations is shown to enable self-selection among replicated paths. This autonomous selection behavior alleviates the need for any
accurate aging modeling as actual circuit degradation is determined using runtime inputs and operating conditions.
Thus, SREL provides an example of active management of on-chip resources, which can be easily
integrated with voltage/frequency scaling techniques as described earlier. This causes a substantial
reduction in design margins, without affecting the overall performance of the application, in turn
reducing energy consumption over the device lifetime. Results for ISCAS benchmark circuits
show around 35% savings in supplemental lifetime energy due to reduction in voltage guardbands.
Impact of circuit-level reliability on large-scale systems: Due to thermal constraints and integration challenges of CMOS devices, energy-efficiency through near-threshold voltage operation
is seen as an optimal operating point in terms of performance and energy tradeoffs. Operation
at this voltage requires circuit-level modifications which can accommodate increased susceptibility
to performance variation as demonstrated in the dissertation. Additionally, the probability of soft
errors increases. As domains such as HPC look to adopt this technology for future exascale systems, the vulnerability of HPC applications in the presence of increased soft-errors is analyzed in
this dissertation. Thus, in the later part of the dissertation, the reliability concerns for large-scale
systems are assessed successfully by using the developed fault-injection and tracking tools. A
summary of achievements for large-scale system vulnerability characterization to soft-errors and
limitations of developed tools are listed in Table 8.2.
Table 8.2: Summary of technical achievements for vulnerability characterization of large-scale
systems (Chapter 7).
Tool                        Goals met
LLFI++                      HPC Application Vulnerability on Large-Scale Systems
Fault-Propagation Module    Determination of Rate of Corrupted Memory Locations

Tool                        Limitations
LLFI++                      Fault-Injection only plausible in live-registers
Fault-Propagation Module    Runtime Performance Overhead

Tool                        Remaining Challenges
LLFI++                      Extend Fault model to include Multi-Bit Upsets (MBUs)
Fault-Propagation Module    Trace-back to identify Critical Application Functions

Circuit-level transient errors can manifest as architecture-level faults in processors. In distributed
MPI-based applications, a single fault in one processor can propagate to the application memory state,
which can then corrupt the memory state of all other processors in the system. Through the proposed LLVM-based fault-tracking instrumentation for distributed applications, it is shown that the
speed of this corruption varies depending upon the application. For example, it is shown that all
the processing nodes in the LULESH benchmark are corrupted relatively quickly as compared with
the miniFE benchmark. This shows that faults originating at the circuit level have implications for
large-scale systems. Consequently, adequate protection mechanisms need to be developed at the
circuit level or the application level. For instance, as demonstrated in this dissertation, application-level mechanisms can include the use of ‘safe’ compiler optimizations. On the other hand, there may be
costs associated with protection at the circuit level, such as the use of Error Correcting Codes, which
have energy and performance overheads.

8.2 Lessons Learned and Limitations

Based on experiments and analyses conducted as part of this dissertation research, limitations of
proposed techniques and possible recommendations for improvement are presented below with
reference to benchmarks utilized.
NDER Difficult-to-Recover FPGA Circuits: While NDER is able to demonstrate autonomous
fault-tolerance against permanent faults in FPGAs, results indicate that circuits with some characteristics
may be difficult to refurbish using NDER, as listed below:

(a) Circuits with a small number of Primary Outputs: This results in a large percentage of the overall
LUTs being selected for evolution in most cases. For example, the alu2 and 5xp1 benchmarks
take a relatively large average number of generations to refurbish.
(b) Circuits synthesized with large resource sharing: This can result in observation of common
failure signatures at the primary outputs. Hence, it is difficult to limit the pool of marked
LUTs or the search space of the GA. For example, this scenario is observed with the alu2
benchmark, which has a large number of LUTs with output set containing POs 2 and 5 (1 is
the LSB). However, this may be addressed by adequate circuit synthesis at design-time.
(c) Circuits with large number of LUTs with fanout equal to one for some or all Primary Outputs:
This will result in a large pool of marked LUTs whenever a single primary output is diagnosed
as corrupt. This can also be improved by adequate circuit synthesis at design-time. On a small
scale, this was observed for the 5xp1 benchmark as it has the highest percentage of LUTs
with unary fanout. Similarly, on a large scale, this was observed for the apex4 benchmark,
which has a large number of LUTs with unary fanout for some POs.

SREL Area-Energy Tradeoffs: The scope of the logic region to which SREL is applied can be
selected at design-time to trade off performance objectives against area cost. For SREL experiments
herein, the use of a heuristic quantile which selected the top P = 10% of critical paths for aging
mitigation was investigated. This resulted in a logic overhead of around 20% as compared to the unenhanced baseline design for the selected benchmarks. However, these area overheads should be
evaluated in the context of two considerations. First, the proposed technique allows the parameter
P to be traded off. Here P can be selected based on the amount of delay degradation expected
in the field which is determined by operational factors such as temperature, supply voltage, input
characteristics, and expected lifetime. Second, SREL need only be applied to aging-sensitive paths
which are shown below to comprise a small subset of a practical die.
It can be relevant to consider the above area overheads in perspective, as logic is typically only a
proportion of the die area in many implementations. For instance, [98] reports that 46.1% of total
die area in Intel’s 45nm Penryn multicore processor is considered to be core area. The core
area is composed of Instruction Fetch Unit, Renaming Unit, Load-Store Unit, Local caches, and
an Execution Unit. The remainder of the die area, termed the uncore area, is composed of Network-on-Chip, On-chip L2 and L3 caches, Memory Controller, and Clocking circuitry. In that design,
only 39.03% of a single core area is occupied by the execution unit, of which just 65.49% can
be considered to be an aging-critical portion as far as logic is concerned [120]. This aging-sensitive
logic domain is composed of high-activity arithmetic units. Thus, the aging-sensitive portion is a
modest 11.77% of the total die area for this chip.
In accordance with above observations, a rudimentary area cost model is developed. Here the
impact of area overhead incurred by SREL is considered from the perspective of overall die area.
This can be expressed as the product of the Aging-Sensitive Logic Proportion of the total die area
and the SREL Area Overhead for the selected values of the parameters N and P. For instance,
the net increase in area for the ALU benchmark given the parameters in Figure 5.8 where N = 2
and P = 7% would nominally incur 11.77% × 18.19% = 2.14% die area overhead. Here, 11.77%
represents the proportion of aging-sensitive logic taking the broader value mentioned above as a
guide, and where 18.19% represents the SREL Area Overhead in terms of the additional logic and
merging gates obtained after processing through proposed circuit synthesis.
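
The arithmetic behind this rudimentary area cost model can be checked with a short sketch. The variable and function names are hypothetical; the percentages are those quoted above for Penryn, using the 46.06% core-area figure that yields the stated 11.77% aging-sensitive proportion.

```python
def aging_sensitive_proportion(core_share, execution_unit_share, aging_critical_share):
    """Fraction of the total die occupied by aging-sensitive logic."""
    return core_share * execution_unit_share * aging_critical_share


def perspective_area_overhead(sensitive_proportion, technique_overhead):
    """Die-level overhead: sensitive logic proportion times the technique's
    area overhead on that logic."""
    return sensitive_proportion * technique_overhead


# Penryn figures quoted in the text: core area, execution unit, aging-critical part.
penryn_sensitive = aging_sensitive_proportion(0.4606, 0.3903, 0.6549)
# SREL Area Overhead of 18.19% for the ALU benchmark with N = 2 and P = 7%.
srel_die_overhead = perspective_area_overhead(penryn_sensitive, 0.1819)
print(f"aging-sensitive share of die: {penryn_sensitive:.2%}")
print(f"SREL die area overhead: {srel_die_overhead:.2%}")
```

The same two-factor product applies to any technique whose overhead is expressed relative to the protected logic, which is what makes a die-level comparison across schemes possible.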

[Figure: chart of % perspective area overhead (axis from 0% to 25%) for UltraSPARC-T2 (65nm), Penryn (45nm), and T1-DC-64 (22nm), comparing BubbleWrap, SD, and SREL.]
Figure 8.1: Perspective area overhead of SREL as compared to related works.

Results for other multicore architectures (cores of T1-DC-64 (64-cores), Intel Penryn (2-cores) and
UltraSPARC-T2 (8-cores) occupy 27.56%, 46.06% and 44.20% of their die areas respectively) are
shown in Figure 8.1 using similar assumptions applied to other recently-proposed aging-mitigation
schemes. Here, the alternative technique of Structural Duplication (SD) [154] assumes that aging-sensitive logic is replicated in its entirety and upon failure the redundant unit is utilized, i.e., the
Area Overhead factor is 100%. Comparison is also provided for the BubbleWrap scheme [77] proposed for many-core architectures, whereby half of the cores on the chip are designated expendable
cores. These cores are aged in a controlled manner to contribute to the portion of the chip which
cannot be utilized during the full lifetime. Taken all together, Figure 8.1 depicts an equivalent utilization model with P = 7% of long paths in the arithmetic units for the SREL technique. Here the
area costs of the SD and BubbleWrap approaches range from 4.36% to 23.03% while SREL can be
seen to compare favorably by incurring an equivalent die area cost of 0.79% to 2.7%. Furthermore,
aging-aware resynthesis [56] prior to top path identification P may result in further reduction in
area overheads associated with SREL.
Software-transformations & HPC Applications Vulnerability: Just as the performance impact of compiler optimization depends on the application, results show that the impact on application vulnerability also depends on the particular application and algorithm. Table 8.3 summarizes the characteristics of each application together with a qualitative assessment of the general vulnerability of the parallel applications tested in this dissertation, as observed in the performed experiments. Most of the HPC applications are iterative and arithmetically intensive in nature, so a fault may either be masked or propagate to later iterations. Moreover, the convergence properties of each application play an important role in the detection and masking of SDC.
Table 8.3: Summary of HPC applications’ vulnerability characteristics.

Benchmark   Class                         Prolonged Execution   Vulnerability   Convergence
LULESH      Sedov Blast Problem           Yes                   Low             High
MiniFE      Finite-Element method         Yes                   Medium          Low
LAMMPS      Molecular Dynamics            No                    Medium          N/A
MCB         Transport Equation            No                    High            Low
AMG         Algebraic Multi-Grid Solver   Yes                   Medium          High

Results indicate that highly-optimized code is more vulnerable than less-optimized code, but also that there is a good chance that a fault may be masked before propagating to memory, either by a register refresh or by other arithmetic operations, provided that there is “enough time” before the store. More importantly, blindly increasing the level of code optimization without considering the effects on application vulnerability might not be a wise choice in exascale environments. In fact, in some cases, further optimizing the code does not provide performance improvement but may negatively impact the application vulnerability. This observation suggests that the compiler cost function should account for vulnerability, as well as code size and performance, and identify optimizations that apply only ‘safe’ code transformations. Moreover, intelligent runtime systems could leverage automatic techniques, such as genetic algorithms, to traverse the performance-vulnerability optimization search space [73].
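One way to read the suggestion above is as a selection over candidate builds that weighs vulnerability alongside runtime and code size. The following is a minimal Python sketch, assuming each candidate optimization setting has already been profiled for runtime, code size, and a vulnerability score (for example, the fraction of injected faults that end in SDC). The `Candidate` type, the weights, and every number below are hypothetical placeholders, not measurements from this dissertation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    runtime: float        # normalized to the unoptimized build (lower is better)
    code_size: float      # normalized to the unoptimized build (lower is better)
    vulnerability: float  # hypothetical fraction of faults ending in SDC


def cost(c, w_perf=1.0, w_size=0.1, w_vuln=1.0):
    """Weighted cost that accounts for vulnerability as well as speed and size."""
    return w_perf * c.runtime + w_size * c.code_size + w_vuln * c.vulnerability


def dominates(a, b):
    """True if a is no worse than b in runtime and vulnerability, and better in one."""
    return (a.runtime <= b.runtime and a.vulnerability <= b.vulnerability
            and (a.runtime < b.runtime or a.vulnerability < b.vulnerability))


def pareto_front(candidates):
    """Candidates not dominated by any other candidate."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical profile data: the most aggressive setting is no faster than
# the middle one but is more vulnerable.
candidates = [
    Candidate("O0", runtime=1.00, code_size=1.00, vulnerability=0.10),
    Candidate("O2", runtime=0.60, code_size=0.90, vulnerability=0.20),
    Candidate("O3", runtime=0.60, code_size=1.10, vulnerability=0.30),
]

best = min(candidates, key=cost)
front = pareto_front(candidates)
print("lowest weighted cost:", best.name)
print("performance-vulnerability Pareto front:", [c.name for c in front])
```

With these placeholder values, the most aggressive setting drops off the Pareto front because it adds vulnerability without adding performance. A genetic algorithm would search the same space when the number of candidate transformation sequences is too large to enumerate.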
Compiler-level Fault-Injection for HPC Application Vulnerability Analysis: Generally, the
goal of fault-injection studies is to determine the applications’ vulnerability to faults at various
abstraction layers ranging from circuit to application level such that appropriate fault-tolerance
mechanisms can be employed at those layers. The most intuitive level to conduct such studies is at
the hardware layer, e.g., soft-errors originate at the circuit-level and propagate to the architectural
or application level. Indeed, hardware monitoring requires sophisticated assessment mechanisms,
such as radiation exposure to processor chips to accelerate the study of soft error effects on application performance [43, 44].
However, conducting such studies at large scale for distributed applications is challenging, which required the development of a software-based mechanism to conduct fault-injection studies using actual hardware platforms. This approach allows experiments to be run on actual large-scale systems with real workloads and makes it possible to identify critical data structures/functions in the applications. Nonetheless, the proposed compiler-level fault-injection tool based on LLFI has its drawbacks. The primary one is the bias created in the fault-injection distribution, because faults are injected only into user-programmable live registers inside the processors. Thus, dormant registers which are not being used in the application will not incur faults. Additionally, it is difficult to inject faults into OS-visible registers, such as the program counter or stack pointer, using a compiler-level tool. Finally, the runtime overheads associated with compiler-level fault injection can be high, but as this analysis is only utilized for fault characterization and not for actual production systems, this may be acceptable.
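The bias described above can be illustrated in a few lines. This Python sketch assumes a single-bit-flip fault model applied to a 64-bit value chosen only among live registers. The function names and the dictionary standing in for a register file are hypothetical and do not reproduce the interface of the LLFI-based tool.

```python
import random
import struct


def flip_bit(value, bit, width=64):
    """Return value with one bit inverted, treated as an unsigned width-bit word."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def flip_float_bit(x, bit):
    """Flip one bit in the IEEE-754 double-precision encoding of x."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (out,) = struct.unpack("<d", struct.pack("<Q", flip_bit(raw, bit)))
    return out


def inject(registers, live, rng):
    """Flip one random bit in one randomly chosen live register.

    Registers outside `live` are never selected, which is the source of the
    sampling bias: dormant registers cannot incur faults.
    """
    target = rng.choice(sorted(live))
    bit = rng.randrange(64)
    faulty = dict(registers)
    faulty[target] = flip_bit(registers[target], bit)
    return faulty, target, bit


# Hypothetical register file in which r1 is dormant.
regs = {"r0": 10, "r1": 0, "r2": 7}
faulty, target, bit = inject(regs, live={"r0", "r2"}, rng=random.Random(0))
print(f"flipped bit {bit} of {target}: {regs[target]} -> {faulty[target]}")
```

Flipping the sign bit of a double, for instance, turns 1.0 into -1.0, whereas a flip in a low mantissa bit may be masked by later arithmetic.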

8.3

Future Work

Based on the work presented in this dissertation, possible directions for future work which extend its themes and scope are highlighted below.
Extensions to SREL: An interesting extension would be to investigate increasing the level of redundancy in critical-logic domains to simultaneously mitigate aging and process variation. Here, multiple instances of the critical paths can encourage realization of at least one path having parameters near the mean. Extensions to provide automated insertion of design diversity among paths may also be possible. Finally, the SREL technique is orthogonal to other techniques for aging mitigation. For example, DVS can be applied alongside SREL to further reduce operating voltage within each stress interval prior to recovery. Moreover, the technique can benefit considerably in some cases, as the area overhead can potentially be reduced significantly.
Adaptive and Energy-Efficient HPC Systems: In the face of increasing resiliency challenges in the HPC domain, there is great potential to research techniques motivated by the embedded systems domain. Low-power budget requirements at the system level will require the development of hardware-software co-design methodologies and tools. An initial research goal on this topic can be to achieve a computational efficiency of 75 GFLOPS/W.


As indicated by results in this dissertation, different levels of performance and application resiliency are achieved by choosing different compiler optimizations. For future work, it is proposed to have multiple definitions of critical functions which can be triggered based on user-defined criteria at runtime. This is analogous to the design diversity proposed for FPGA circuits in this dissertation. For software applications, the diverse functions can be implemented by choosing different compiler optimizations or by using fine-grain instruction-level redundancy. Thus, providing different levels of performance, resiliency, and power is feasible.
In this regard, the developed vulnerability analysis tools can identify critical functions/kernels inside the distributed application which impact resilience and energy/performance to an extent that is visible at the overall system level. For example, some functions corrupt application state more than others and thus affect resiliency more. Similarly, some functions may take up a larger part of the execution time, thus having a greater effect on the overall energy consumption.
A runtime system can be implemented to monitor various indicators of the system. These can include metrics such as performance (Floating-point Operations Per Second, FLOPS) and resiliency. Circuit-level feedback, such as power and/or timing errors obtained using on-chip sensors, can also be utilized. A user-defined objective can be used to guide the adaptive control system to choose the appropriate function definition. The adaptive system can be hooked to the application using APIs such that the runtime system can trigger such selection. Thus, it can intrinsically adapt to actual failure rates while utilizing extra resources in terms of energy/performance only when required. Additional energy benefits are also possible, as the overheads of checkpoint/restart are reduced because the appropriate resiliency level can be triggered based on the observed failure rate. Overall, the goal of the adaptive system can be to provide user-defined performance and energy levels with a nearly 100% reliability level.
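A minimal Python sketch of such a control loop follows. It assumes two definitions of one critical kernel, a plain one and one that emulates instruction-level redundancy by computing twice and comparing, plus a single user-defined threshold on the observed failure rate. The class, the function names, and the threshold value are all hypothetical.

```python
from typing import Callable, Dict, Sequence

Kernel = Callable[[Sequence[float], Sequence[float]], float]


def dot_fast(a, b):
    """Plain definition: lowest energy/runtime, no error detection."""
    return sum(x * y for x, y in zip(a, b))


def dot_redundant(a, b):
    """Redundant definition: compute twice and compare to detect a mismatch."""
    first = dot_fast(a, b)
    second = dot_fast(a, b)
    if first != second:
        raise RuntimeError("mismatch detected; caller should recompute or roll back")
    return first


class AdaptiveSelector:
    """Chooses a function definition based on the observed failure rate."""

    def __init__(self, variants: Dict[str, Kernel], threshold: float):
        self.variants = variants
        self.threshold = threshold  # user-defined criterion
        self.calls = 0
        self.failures = 0

    def report(self, failed: bool) -> None:
        """Feedback hook, e.g. fed by timing-error sensors or detected SDCs."""
        self.calls += 1
        self.failures += int(failed)

    def failure_rate(self) -> float:
        return self.failures / self.calls if self.calls else 0.0

    def select(self) -> Kernel:
        level = "resilient" if self.failure_rate() > self.threshold else "fast"
        return self.variants[level]


selector = AdaptiveSelector(
    {"fast": dot_fast, "resilient": dot_redundant}, threshold=0.01
)
kernel = selector.select()  # no failures observed yet, so the plain definition
print(kernel.__name__, kernel([1.0, 2.0], [3.0, 4.0]))
```

The application calls `select()` at each invocation site and `report()` whenever a monitor flags an event, so the extra cost of the redundant definition is paid only while the observed failure rate exceeds the threshold.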


Use of Alternative Technologies for HPC: The use of FPGA-based accelerators can help reduce
data movement energy cost and achieve high throughput. Integration of application-specific accelerators with general purpose processors can help achieve the goals of exascale computing by
ensuring scalability and low power operation. Additionally, the use of novel memory technologies
like NVRAM for storing critical data-structures and checkpoint states can be investigated due to
their lower energy consumption.
Adaptable Computing with Emerging Devices: As conventional electronic devices reach their physical limits, novel devices are being actively explored, and these present a new set of reliability challenges. Mimicking brain-like intelligence using memristive systems has been at the forefront of many such efforts [85]. The feasibility of energy-efficient and reliable use of emerging devices to implement the paradigms of Neuromorphic Computing and Approximate Computing can be explored.
The road ahead: The unifying theme that connects and extends all of the above future work is that the HPC domain has great potential to adopt techniques from the embedded systems domain, as NTV operation and technology scaling effects permeate both domains and drive them to converge towards similar research goals. Intelligent techniques need to be envisioned which can enforce system-wide adaptability based on feedback of vital symptoms from multiple abstraction layers. Such techniques with a closed-loop mechanism can deal with PVT, Operating Conditions, and Resiliency Demands while providing energy-aware operation. Thus, reliable and energy-efficient operation will continue to be a major design challenge in extending Moore’s Law to sustain future computing systems.


LIST OF REFERENCES

[1] Collaboration Oak Ridge, Argonne and Livermore. https://asc.llnl.gov/CORAL/.
[2] The top500 supercomputers list. http://www.top500.org.
[3] Partial reconfiguration user guide. Technical report, Xilinx, 2010.
[4] The algorithms of NDER. Technical report, University of Central Florida, 2012.
[5] J. Abella, X. Vera, and A. Gonzalez. Penelope: The NBTI-aware processor. In Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on, pages
85–96, Dec 2007.
[6] Advanced Scientific Computing Research (ASCR). Scientific discovery through advanced
computing (SciDAC) Co-Design. http://science.energy.gov/ascr/research/scidac/co-design/.
[7] R. Al-Haddad, R. Oreifej, R. A. Ashraf, and R. F. Demara. Sustainable modular adaptive
redundancy technique emphasizing partial reconfiguration for reduced power consumption.
International Journal of Reconfigurable Computing, 2011, 2011.
[8] J.M. Allred, S. Roy, and K. Chakraborty. Dark silicon aware multicore systems: Employing
design automation with architectural insight. Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, 22(5):1192–1196, May 2014.
[9] R. A. Ashraf, A. Alzahrani, N. Khoshavi, R. Zand, S. Salehi, A. Roohi, M. Lin, and R. F.
DeMara. Reactive rejuvenation of CMOS logic paths using self-activating voltage domains.
In Circuits and Systems (ISCAS), Proceedings of 2015 IEEE International Symposium on,
May 2015.


[10] R. A. Ashraf and R. F. DeMara. Scalable FPGA refurbishment using netlist-driven evolutionary algorithms. Computers, IEEE Transactions on, 62(8):1526–1541, Aug 2013.
[11] R. A. Ashraf, R. Gioiosa, G. Kestor, R. F. DeMara, C.-Y. Cher, and P. Bose. Understanding
the propagation of transient errors in HPC applications. In Proceedings of the International
Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15,
2015.
[12] R. A. Ashraf, O. Mouri, R. Jadaa, and R. F. DeMara. Design-for-diversity for improved
fault-tolerance of TMR systems on FPGAs. In Reconfigurable Computing and FPGAs (ReConFig), 2011 International Conference on, pages 99–104, Nov 2011.
[13] R. A. Ashraf, R. Oreifej, and R. F. DeMara. Scalability of sustainable self-repair to mitigate aging induced degradation in SRAM-based FPGA devices. In Presentations at the
ReSpace/MAPLD 2011 Conference, Aug 22-Aug 25 2011.
[14] T. Austin, Valeria Bertacco, David Blaauw, and Trevor Mudge. Opportunities and challenges for better than worst-case design. In Proceedings of the 2005 Asia and South Pacific
Design Automation Conference, ASP-DAC ’05, 2005.
[15] T. M. Austin. Diva: a reliable substrate for deep submicron microarchitecture design. In
Microarchitecture, 1999. MICRO-32. Proceedings. 32nd Annual International Symposium
on, pages 196–207, 1999.
[16] A. Avizienis and J.P.J. Kelly. Fault tolerance by design diversity: Concepts and experiments.
Computer, 17(8):67 –80, aug. 1984.
[17] Thomas Back. Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press, Oxford, New York,
1996.

[18] Xiaoliang Bai, C. Visweswariah, P.N. Strenski, and D.J. Hathaway. Uncertainty-aware circuit optimization. In Design Automation Conference, 2002. Proceedings. 39th, pages 58–63,
2002.
[19] Raghuraman Balasubramanian, Zachary York, Matthew Dorran, Aritra Biswas, Timur Girgin, and Karthikeyan Sankaralingam. Understanding the impact of gate-level physical reliability effects on whole program execution. In the IEEE 20th International Symposium on
High Performance Computer Architecture (HPCA), February 2014.
[20] T. Baumann, D. Schmitt-Landsiedel, and C. Pacha. Architectural assessment of design
techniques to improve speed and robustness in embedded microprocessors. In Design Automation Conference, 2009. DAC ’09. 46th ACM/IEEE, pages 947–950, July 2009.
[21] C. Bender, P.N. Sanda, P. Kudva, R. Mata, V. Pokala, R. Haraden, and M. Schallhorn. Soft-error resilience of the IBM POWER6 processor input/output subsystem. IBM Journal of
Research and Development, 52(3):285–292, May 2008.
[22] Swarup Bhunia and Saibal Mukhopadhyay. Low-power variation-tolerant design in nanometer silicon. Springer, New York, 2011.
[23] J. Blome, Shuguang Feng, S. Gupta, and S. Mahlke. Self-calibrating online wearout detection. In Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International
Symposium on, pages 109–122, Dec 2007.
[24] S. Bohm and C. Engelmann. xSim: The extreme-scale simulator. In The Int. Conference on
High Performance Computing and Simulation (HPCS), July 2011.
[25] C. Bolchini and C. Sandionigi. Fault classification for SRAM-based FPGAs in the space
environment for fault mitigation. Embedded Systems Letters, IEEE, 2(4):107–110, 2010.


[26] Gabriel de Morais Borges, Luiz Fernando Gonçalves, Tiago Roberto Balen, and
Marcelo Soares Lubaszewski. Evaluating the effectiveness of a mixed-signal tmr scheme
based on design diversity. In Proceedings of the 23rd symposium on Integrated circuits and
system design, SBCCI ’10, pages 134–139, New York, NY, USA, 2010. ACM.
[27] S. Borkar. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. Micro, IEEE, 25(6):10–16, 2005.
[28] Greg Bronevetsky and Bronis de Supinski. Soft error vulnerability of iterative linear algebra
methods. In Proceedings of the 22Nd Annual International Conference on Supercomputing,
ICS ’08, pages 155–164, New York, NY, USA, 2008. ACM.
[29] A. A. M. Bsoul, N. Manjikian, and Li Shang. Reliability- and process variation-aware
placement for FPGAs. In Design, Automation & Test in Europe Conference & Exhibition
(DATE), pages 1809–1814, 2010.
[30] Michael Butler, Leslie Barnes, Debjit Das Sarma, and Bob Gelinas. Bulldozer: An approach
to multithreaded compute performance. IEEE Micro, 31(2):6–15, March 2011.
[31] Jon Calhoun, Luke Olson, and Marc Snir. Flipit: An llvm based fault injector for hpc. In
Luís Lopes, Julius Žilinskas, Alexandru Costan, Roberto G. Cascella, Gabor Kecskemeti, Emmanuel Jeannot, Mario Cannataro, Laura Ricci, Siegfried Benkner, Salvador Petit, Vittorio
Scarano, José Gracia, Sascha Hunold, Stephen L. Scott, Stefan Lankes, Christian Lengauer,
Jesus Carretero, Jens Breitbart, and Michael Alexander, editors, Euro-Par 2014: Parallel
Processing Workshops, volume 8805 of Lecture Notes in Computer Science, pages 547–
558. Springer International Publishing, 2014.
[32] A. Calimera, M. Loghi, E. Macii, and M. Poncino. Dynamic indexing: Concurrent leakage
and aging optimization for caches. In Low-Power Electronics and Design (ISLPED), 2010
ACM/IEEE International Symposium on, pages 343–348, Aug 2010.

[33] A. Calimera, E. Macii, and M. Poncino. NBTI-aware power gating for concurrent leakage
and aging optimization. Low Power Electronics and Design, International Symposium on,
0:127–132, 2009.
[34] A. Calimera, E. Macii, and M. Poncino. NBTI-aware clustered power gating. ACM Trans.
Des. Autom. Electron. Syst., 16(1):3:1–3:25, November 2010.
[35] F. Cancare, S. Bhandari, D. B. Bartolini, M. Carminati, and M. D. Santambrogio. A bird’s
eye view of FPGA-based evolvable hardware. In Adaptive Hardware and Systems (AHS),
2011 NASA/ESA Conference on, pages 169–175, 2011.
[36] Marc Casas, Bronis R. de Supinski, Greg Bronevetsky, and Martin Schulz. Fault resilience
of the algebraic multi-grid solver. In Proceedings of the 26th ACM International Conference
on Supercomputing, ICS ’12, New York, NY, 2012.
[37] J. A. P. Celis, S. De La Rosa Nieves, C.R. Fuentes, S.D.S. Gutierrez, and A. Saenz-Otero.
Methodology for designing highly reliable fault tolerance space systems based on COTS
devices. In Systems Conference (SysCon), 2013 IEEE International, pages 591–594, 2013.
[38] Tuck-Boon Chan, J. Sartori, P. Gupta, and R. Kumar. On the efficacy of NBTI mitigation
techniques. In Design, Automation Test in Europe Conference Exhibition (DATE), pages
1–6, 2011.
[39] Jifeng Chen, Shuo Wang, and Mohammad Tehranipoor. Efficient selection and analysis of
critical-reliability paths and gates. In Proceedings of the great lakes symposium on VLSI,
GLSVLSI ’12, pages 45–50, 2012.
[40] Xiaoming Chen, Yu Wang, Yun Liang, Yuan Xie, and Huazhong Yang. Run-time technique for simultaneous aging and power optimization in GPGPUs. In Proceedings of the 51st Annual Design Automation Conference, DAC ’14, pages 168:1–168:6, New York, NY, USA, 2014. ACM.
[41] Xiaoming Chen, Yu Wang, Huazhong Yang, Yuan Xie, and Yu Cao. Assessment of circuit
optimization techniques under NBTI. Design Test, IEEE, 30(6):40–49, Dec 2013.
[42] Yibo Chen, Yuan Xie, Yu Wang, and A. Takach. Minimizing leakage power in aging-bounded high-level synthesis with design time multi-vth assignment. In Design Automation
Conference (ASP-DAC), 2010 15th Asia and South Pacific, pages 689–694, Jan 2010.
[43] Chen-Yong Cher, Meeta S. Gupta, P. Bose, and K. Paul Muller. Understanding soft error
resiliency of BlueGene/Q compute chip through hardware proton irradiation and software
fault injection. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC, Los Alamitos, CA, USA, 2014.
[44] Chen-Yong Cher, K. P. Muller, R. A. Haring, D. L. Satterfield, T. E. Musta, T. M. Gooding,
K. D. Davis, M. B. Dombrowa, G. V. Kopcsay, R. M. Senger, Y. Sugawara, and K. Sugavanam. Soft error resiliency characterization on IBM BlueGene/Q processor. In the 19th
Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2014.
[45] Hyungmin Cho, S. Mirkhani, Chen-Yong Cher, J.A. Abraham, and S. Mitra. Quantitative evaluation of soft error injection techniques for robust system design. In the 50th
ACM/EDAC/IEEE Design Automation Conference (DAC), May 2013.
[46] Y. Choi, S. Yoo, S. Lee, J. H. Ahn, and K. Lee. MAEPER: Matching access and error
patterns with error-free resource for low vcc L1 cache. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 21(6):1013–1026, June 2013.


[47] M. R. Choudhury, V. Chandra, R.C. Aitken, and K. Mohanram. Time-borrowing circuit designs and hardware prototyping for timing error resilience. Computers, IEEE Transactions
on, 63(2):497–509, Feb 2014.
[48] J. J. Cook and C. B. Zilles. A characterization of instruction-level error derating and its
implications for error detection. In the 38th IEEE/IFIP Int. Conf. on Dependable Systems
and Networks (DSN), Anchorage, AK, 2008.
[49] Charng da Lu and D.A. Reed. Assessing fault sensitivity in mpi applications. In Supercomputing, 2004. Proceedings of the ACM/IEEE SC2004 Conference, pages 37–37, Nov
2004.
[50] H. Dadgour and K. Banerjee. Aging-resilient design of pipelined architectures using novel
detection and correction circuits. In Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pages 244–249, March 2010.
[51] R. F. DeMara and K. Zhang. Autonomous FPGA fault handling through competitive runtime
reconfiguration. In Evolvable Hardware. Proceedings. 2005 NASA/DoD Conference on,
pages 109–116. IEEE, 2005.
[52] R. F. DeMara, K. Zhang, and C. A. Sharma. Autonomic fault-handling and refurbishment
using throughput-driven assessment. Applied Soft Computing, 11(2):1588 – 1599, 2011.
The Impact of Soft Computing for the Progress of Artificial Intelligence.
[53] M. Demertzi, M. Annavaram, and M. Hall. Analyzing the effects of compiler optimizations
on application reliability. In Workload Characterization (IISWC), 2011 IEEE International
Symposium on, pages 184–193, Nov 2011.
[54] A. Dixit and Alan Wood. The impact of new technology on soft error rates. In IEEE
International Reliability Physics Symposium (IRPS), pages 5B.4.1–5B.4.7, April 2011.

[55] R.G. Dreslinski, M. Wieckowski, D. Blaauw, D Sylvester, and T. Mudge. Near-threshold
computing: Reclaiming moore’s law through energy efficient integrated circuits. Proceedings of the IEEE, 98(2):253–266, 2010.
[56] M. Ebrahimi, F. Oboril, S. Kiamehr, and M. B. Tahoori. Aging-aware logic synthesis. In
Proceedings of the International Conference on Computer-Aided Design, ICCAD ’13, pages
61–68, 2013.
[57] J. M. Emmert, C. E. Stroud, and M. Abramovici. Online fault tolerance for FPGA logic
blocks. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 15(2):216–
226, 2007.
[58] C. Engelmann, H. H. Ong, and S. L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 27th IASTED Int. Conference on Parallel and Distributed Computing and Networks (PDCN), Innsbruck, Austria,
February 2009.
[59] D. Ernst, Nam Sung Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin,
K. Flautner, and T. Mudge. Razor: a low-power pipeline based on circuit-level timing
speculation. In Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM
International Symposium on, pages 7–18, Dec 2003.
[60] Shuguang Feng, Shantanu Gupta, Amin Ansari, and Scott Mahlke. Shoestring: Probabilistic
soft error reliability on the cheap. In the 15th ACM Architectural Support for Programming
Languages and Operating Systems, ASPLOS, Pittsburgh, PA, 2010.
[61] P. R. Fernando, S. Katkoori, D. Keymeulen, R. Zebulum, and A. Stoica. Customizable
FPGA IP core implementation of a general-purpose genetic algorithm engine. Evolutionary
Computation, IEEE Transactions on, 14(1):133–149, 2010.


[62] David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pages 78:1–78:12, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.
[63] F. Firouzi, S. Kiamehr, and M. B. Tahoori. Power-aware minimum NBTI vector selection
using a linear programming approach. Computer-Aided Design of Integrated Circuits and
Systems, IEEE Transactions on, 32(1):100–110, Jan 2013.
[64] F. Firouzi, Fangming Ye, K. Chakrabarty, and M. B. Tahoori. Representative critical-path
selection for aging-induced delay monitoring. In Test Conference (ITC), 2013 IEEE International, pages 1–10, Sept 2013.
[65] M. Garvie and A. Thompson. Scrubbing away transients and jiggling around the permanent:
Long survival of FPGA systems through evolutionary self-repair. In On-Line Testing Symposium, 2004. IOLTS 2004. Proceedings. 10th IEEE International, pages 155–160. IEEE,
2004.
[66] G. W. Greenwood. On the practicality of using intrinsic reconfiguration for fault recovery.
Evolutionary Computation, IEEE Transactions on, 9(4):398–405, 2005.
[67] G. W. Greenwood. Attaining fault tolerance through self-adaption: the strengths and weaknesses of evolvable hardware approaches. In Proceedings of the 2008 IEEE world conference on Computational intelligence: research frontiers, WCCI’08, pages 368–387, Berlin,
Heidelberg, 2008. Springer-Verlag.
[68] E. Gunadi, A.A. Sinkar, Nam Sung Kim, and M.H. Lipasti. Combating aging with the
colt duty cycle equalizer. In Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM
International Symposium on, pages 103–114, Dec 2010.

[69] Saket Gupta and Sachin Sapatnekar. Employing circadian rhythms to enhance power and
reliability. ACM Trans. Des. Autom. Electron. Syst., 18(3):38:1–38:23, July 2013.
[70] P. C. Haddow and A. M. Tyrrell. Challenges of evolvable hardware: past, present and the
path to a promising future. Genetic Programming and Evolvable Machines, pages 1–33,
2011.
[71] Van Emden Henson and Ulrike Meier Yang. Boomeramg: A parallel algebraic multigrid
solver and preconditioner. Appl. Numer. Math., 41(1):155–177, April 2002.
[72] Giang Hoang, Robby Bruce Findler, and Russ Joseph. Exploring circuit timing-aware language and compilation. SIGPLAN Not., 47(4):345–356, March 2011.
[73] N. Imran, R. A. Ashraf, and R. F. DeMara. Power and quality-aware image processing soft-resilience using online multi-objective GAs. Int. J. Comput. Vision Robot., 5(1):72–98, Jan.
2015.
[74] N. Imran, R. A. Ashraf, J. Lee, and R. F. DeMara. Activity-based resource allocation for
motion estimation engines. J. of Circuits, Systems and Computers, 24(01), 2015.
[75] Ian Karlin, Abhinav Bhatele, Jeff Keasler, Bradford L. Chamberlain, Jonathan Cohen,
Zachary DeVito, Riyaz Haque, Dan Laney, Edward Luke, Felix Wang, David Richards,
Martin Schulz, and Charles Still. Exploring traditional and emerging parallel programming
models using a proxy application. In 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Boston, May 2013.
[76] Ulya R. Karpuzcu, Ismail Akturk, and Nam Sung Kim. Accordion: Toward Soft Near-Threshold Computing. In Proceedings of the 20th IEEE International Symposium on High
Performance Computer Architecture (HPCA), February 2014.


[77] Ulya R. Karpuzcu, Brian Greskamp, and Josep Torrellas. The bubblewrap many-core: Popping cores for sequential acceleration. In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 447–458, New York, NY,
USA, 2009. ACM.
[78] Ulya R. Karpuzcu, Krishna B. Kolluru, Nam Sung Kim, and Josep Torrellas. VARIUS-NTV: A microarchitectural model to capture the increased sensitivity of manycores to process variations at near-threshold voltages. In Proceedings of the 42nd Annual IEEE/IFIP
International Conference on Dependable Systems and Networks (DSN), pages 1–11, June
2012.
[79] Himanshu Kaul, Mark Anders, Steven Hsu, Amit Agarwal, Ram Krishnamurthy, and
Shekhar Borkar. Near-threshold voltage (NTV) design: Opportunities and challenges. In
Proceedings of the 49th Annual Design Automation Conference, DAC ’12, pages 1153–
1158, 2012.
[80] D. Keymeulen, R.S. Zebulum, Y. Jin, and A. Stoica. Fault-tolerant evolvable hardware
using field-programmable transistor arrays. Reliability, IEEE Transactions on, 49(3):305
–316, sep 2000.
[81] O. Khan and S. Kundu. A self-adaptive system architecture to address transistor aging. In
Design, Automation Test in Europe Conference Exhibition, 2009. DATE ’09., pages 81–86,
April 2009.
[82] S. Khan and S. Hamdioui. Modeling and mitigating NBTI in nanoscale circuits. In On-Line
Testing Symposium (IOLTS), 2011 IEEE 17th International, pages 1–6, July 2011.
[83] N. Khoshavi, R. A. Ashraf, and R. F. DeMara. Applicability of power-gating strategies for
aging mitigation of cmos logic paths. In to appear in IEEE 57th International Midwest
Symposium on Circuits and Systems (MWSCAS), August 2014.

[84] Daya Shanker Khudia, Griffin Wright, and Scott Mahlke. Efficient soft error protection
for commodity embedded microprocessors using profile information. In the 13th ACM Int.
Conf. on Languages, Compilers, Tools and Theory for Embedded Systems, LCTES, Beijing,
China, 2012.
[85] Kuk-Hwan Kim, Siddharth Gaba, Dana Wheeler, Jose M. Cruz-Albrecht, Tahir Hussain,
Narayan Srinivasa, and Wei Lu. A functional hybrid memristor crossbar-array/cmos system
for data storage and neuromorphic applications. Nano Letters, 12(1):389–395, 2012.
[86] Youngtaek Kim, Lizy Kurian John, Indrani Paul, Srilatha Manne, and Michael Schulte.
Performance boosting under reliability and power constraints. In Proceedings of the International Conference on Computer-Aided Design (ICCAD), pages 334–341, Nov 2013.
[87] P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards,
A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. A. Yelick. Exascale
computing study: Technology challenges in achieving exascale systems. Technical Report
DARPA-2008-13, DARPA IPTO, September 2008.
[88] S. Kothawade, D.M. Ancajas, K. Chakraborty, and S. Roy. Mitigating NBTI in the physical register file through stress prediction. In Computer Design (ICCD), 2012 IEEE 30th
International Conference on, pages 345–351, Sept 2012.
[89] S. V. Kumar, C.H. Kim, and S.S. Sapatnekar. Adaptive techniques for overcoming performance degradation due to aging in CMOS circuits. Very Large Scale Integration (VLSI)
Systems, IEEE Transactions on, 19(4):603–614, 2011.
[90] S. V. Kumar, Chris H. Kim, and Sachin S. Sapatnekar. Impact of NBTI on SRAM read stability and design for reliability. In Proceedings of the 7th International Symposium on Quality Electronic Design, ISQED ’06, pages 210–218, Washington, DC, USA, 2006. IEEE Computer Society.
[91] J. Lach, W. H. Mangione-Smith, and M. Potkonjak. Enhanced FPGA reliability through
efficient run-time fault reconfiguration. Reliability, IEEE Transactions on, 49(3):296–304,
2000.
[92] Liangzhen Lai, Vikas Chandra, Robert Aitken, and Puneet Gupta. Slackprobe: A low overhead in situ on-line timing slack monitoring methodology. In Proceedings of the Conference
on Design, Automation and Test in Europe, DATE ’13, pages 282–287, San Jose, CA, USA,
2013. EDA Consortium.
[93] V. Lakamraju and R. Tessier. Tolerating operational faults in cluster-based FPGAs. In Proceedings of the 2000 ACM/SIGDA eighth international symposium on Field programmable
gate arrays, pages 187–194. ACM, 2000.
[94] Cheng Li, Meilin Zhang, and P. Ampadu. Reliable ultra-low voltage cache with variation-tolerance. In Proceedings of the IEEE 56th International Midwest Symposium on Circuits
and Systems (MWSCAS), pages 121–124, Aug 2013.
[95] Dong Li, Jeffrey S. Vetter, and Weikuan Yu. Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool. In the International Conference on High Performance Computing, Networking, Storage and Analysis, SC, Salt Lake
City, Utah, 2012.
[96] Lin Li, Youtao Zhang, Jun Yang, and Jianhua Zhao. Proactive NBTI mitigation for busy
functional units in out-of-order microprocessors. In Design, Automation Test in Europe
Conference Exhibition (DATE), 2010, pages 411–416, March 2010.


[97] Man-Lap Li, P. Ramachandran, U. R. Karpuzcu, S. K. S. Hari, and S. V. Adve. Accurate
microarchitecture-level fault modeling for studying hardware faults. In the IEEE 15th Int.
Symp. on High Performance Computer Architecture (HPCA), Feb 2009.
[98] Sheng Li, Jung-Ho Ahn, R.D. Strong, J. B. Brockman, D.M. Tullsen, and N.P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pages 469–480, Dec 2009.
[99] Y. Li, S. Mitra, D. Gardner, Y. Kim, and E. Mintarno. Overcoming early-life failure and
aging challenges for robust system design. Design & Test of Computers, IEEE, 26(6):28–
39, 2009.
[100] S. Liu, R.N. Pittman, A. Forin, and J.L. Gaudiot. Minimizing the runtime partial reconfiguration overheads in reconfigurable systems. The Journal of Supercomputing, pages 1–18,
2011.
[101] J. Lohn, G. Larchev, and R. F. DeMara. A genetic representation for evolutionary fault
recovery in virtex FPGAs. In Proceedings of the 5th international conference on Evolvable
systems: from biology to hardware, pages 47–56. Springer-Verlag, 2003.
[102] Charng-Da Lu. Failure data analysis of HPC systems. CoRR, abs/1302.4779, 2013.
[103] Qining Lu, Karthik Pattabiraman, Meeta Gupta, and Jude A. Rivers. SDCTune: A model
for predicting the SDC-proneness of an application for configurable protection. In The Int.
Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2014.
[104] Yinghai Lu, Li Shang, Hai Zhou, Hengliang Zhu, Fan Yang, and Xuan Zeng. Statistical
reliability analysis under process variation and aging effects. In Design Automation Conference, 2009. DAC ’09. 46th ACM/IEEE, pages 514–519, 2009.

[105] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney,
Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM Conf. on
Programming Language Design and Implementation, PLDI, Chicago, IL, USA, 2005.
[106] R. Maia, L. Henriques, D. Costa, and H. Madeira. Xception™ - enhanced automated
fault-injection environment. In Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on, pages 547–, 2002.
[107] M. Maniatakos, N. Karimi, C. Tirumurti, A. Jas, and Y. Makris. Instruction-level impact
analysis of low-level faults in a modern microprocessor controller. Computers, IEEE Transactions on, 60(9):1260–1273, Sept 2011.
[108] A. Martinez-Alvarez, S. Cuenca-Asensi, F. Restrepo-Calle, F.R.P. Pinto, H. Guzman-Miranda, and M.A. Aguirre. Compiler-directed soft error mitigation for embedded systems.
Dependable and Secure Computing, IEEE Transactions on, 9(2):159–172, March 2012.
[109] T. N. Miller, R. Thomas, Xiang Pan, and R. Teodorescu. VRSync: Characterizing and
eliminating synchronization-induced voltage emergencies in many-core processors. In 39th
Annual International Symposium on Computer Architecture (ISCA), pages 249–260, June
2012.
[110] E. Mintarno, J. Skaf, Rui Zheng, J. B. Velamala, Yu Cao, S. Boyd, R. W. Dutton, and
S. Mitra. Self-tuning for maximized lifetime energy-efficiency in the presence of circuit
aging. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on,
30(5):760–773, 2011.
[111] S. Mitra, P. Bose, Eric Cheng, Chen-Yong Cher, Hyungmin Cho, Rajiv Joshi, Young Moon
Kim, Charles R Lefurgy, Yanjing Li, Kenneth P Rodbell, et al. The resilience wall: Cross-layer solution strategies. In VLSI Technology, Systems and Application (VLSI-TSA), Proceedings of Technical Program-2014 International Symposium on, pages 1–11. IEEE, 2014.
[112] S. Mitra, W. J Huang, N. R. Saxena, S. Y Yu, and E. J. McCluskey. Reconfigurable architecture for autonomous self-repair. Design & Test of Computers, IEEE, 21(3):228–240,
2004.
[113] S. Mitra and E. J. McCluskey. Combinational logic synthesis for diversity in duplex systems.
In Test Conference, 2000. Proceedings. International, pages 179–188, 2000.
[114] S. Mitra and E. J. McCluskey. Word-voter: a new voter design for triple modular redundant
systems. In Proceedings of the 18th IEEE VLSI Test Symposium (VTS), pages 465–470,
May 2000.
[115] S. Mitra, N. R. Saxena, and E. J. McCluskey. A design diversity metric and reliability
analysis for redundant systems. In Test Conference, 1999. Proceedings. International, pages
662–671, 1999.
[116] S. Mitra, N. R. Saxena, and E. J. McCluskey. Efficient design diversity estimation for
combinational circuits. Computers, IEEE Transactions on, 53(11):1483–1492, Nov. 2004.
[117] Shubhendu S. Mukherjee, Christopher Weaver, Joel Emer, Steven K. Reinhardt, and
T. Austin. A systematic methodology to compute the architectural vulnerability factors
for a high-performance microprocessor. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 36, pages 29–, 2003.
[118] Majid Namaki-Shoushtari, Abbas Rahimi, Nikil Dutt, Puneet Gupta, and Rajesh K.
Gupta. Argo: Aging-aware GPGPU register file allocation. In Proceedings of the Ninth
IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System
Synthesis, CODES+ISSS ’13, pages 30:1–30:9, Piscataway, NJ, USA, 2013. IEEE Press.

[119] Rob Neely. Proxy applications: Vehicles for co-design and collaboration, Dec. 2013. Presented at Predictive Science Academic Alliance Program (PSAAP) II Meeting.
[120] F. Oboril and M. B. Tahoori. Extratime: Modeling and analysis of wearout due to transistor
aging at microarchitecture-level. In Dependable Systems and Networks (DSN), 2012 42nd
Annual IEEE/IFIP International Conference on, pages 1–12, June 2012.
[121] F. Oboril and M. B. Tahoori. Aging-aware design of microprocessor instruction pipelines.
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on,
33(5):704–716, May 2014.
[122] M. Omaña, D. Rossi, N. Bosio, and C. Metra. Low cost NBTI degradation detection and
masking approaches. Computers, IEEE Transactions on, 62(3):496–509, March 2013.
[123] R. S. Oreifej, R. N. Al-Haddad, Heng Tan, and R. F. DeMara. Layered approach to intrinsic evolvable hardware using direct bitstream manipulation of Virtex II Pro devices. In
Field Programmable Logic and Applications, 2007. FPL 2007. International Conference
on, pages 299–304, 2007.
[124] P. S. Ostler, M. P. Caffrey, D. S. Gibelyou, P. S. Graham, K. S. Morgan, B. H. Pratt, H. M.
Quinn, and M. J. Wirthlin. SRAM FPGA reliability analysis for harsh radiation environments. Nuclear Science, IEEE Transactions on, 56(6):3519–3526, 2009.
[125] M. G. Parris, C. A. Sharma, and R. F. DeMara. Progress in autonomous fault recovery of
field programmable gate arrays. ACM Computing Surveys (CSUR), 43(4):31, 2011.
[126] M. Pignol. How to cope with SEU/SET at system level? In On-Line Testing Symposium, 2005.
IOLTS 2005. 11th IEEE International, pages 315–318, 2005.
[127] Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics. Journal of
Computational Physics, 117(1):1–19, 1995.

[128] Zhenyu Qi and Mircea R. Stan. NBTI resilient circuits using adaptive body biasing. In
Proceedings of the 18th ACM Great Lakes symposium on VLSI, GLSVLSI ’08, pages 285–
290, 2008.
[129] A. Rahman, M. Agostinelli, P. Bai, G. Curello, H. Deshpande, W. Hafez, C-H Jan,
K. Komeyli, J. Park, K. Phoa, C. Tsai, J. Y Yeh, and J. Xu. Reliability studies of a 32nm
system-on-chip (SoC) platform technology with 2nd generation high-k/metal gate transistors.
In Reliability Physics Symposium (IRPS), 2011 IEEE International, pages 5D.3.1–5D.3.6,
April 2011.
[130] S. Ramey, A. Ashutosh, C. Auth, J. Clifford, M. Hattendorf, J. Hicks, R. James, A. Rahman, V. Sharma, A. St.Amour, and C. Wiegand. Intrinsic transistor reliability improvements
from 22nm tri-gate technology. In Reliability Physics Symposium (IRPS), 2013 IEEE International, pages 4C.5.1–4C.5.5, April 2013.
[131] S. Ramey, C. Prasad, M. Agostinelli, Sangwoo Pae, S. Walstra, S. Gupta, and J. Hicks. Frequency and recovery effects in high-K BTI degradation. In Reliability Physics Symposium,
2009 IEEE International, pages 1023–1027, April 2009.
[132] Daniel A. Reed and Jack Dongarra. Exascale computing and big data. Commun. ACM,
58(7):56–68, June 2015.
[133] G.A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D.I. August. Swift: software implemented fault tolerance. In Code Generation and Optimization, 2005. CGO 2005. International Symposium on, pages 243–254, March 2005.
[134] W. Robinett, G. S. Snider, P. J. Kuekes, and R. S. Williams. Computing with a trillion
crummy components. Communications of the ACM, 50(9):35–39, 2007.


[135] S. K. Sahoo, Man-Lap Li, P. Ramachandran, S. V. Adve, V.S. Adve, and Yuanyuan Zhou.
Using likely program invariants to detect hardware errors. In the IEEE International Conference on Dependable Systems and Networks (DSN), June 2008.
[136] R. Salvador, A. Otero, J. Mora, E. de la Torre, L. Sekanina, and T. Riesgo. Fault tolerance
analysis and self-healing strategy of autonomous, evolvable hardware systems. In Reconfigurable Computing and FPGAs (ReConFig), 2011 International Conference on, pages
164–169, 2011.
[137] B. Sangchoolie, F. Ayatolahi, R. Johansson, and J. Karlsson. A study of the impact of bit-flip
errors on programs compiled with different optimization levels. In Dependable Computing
Conference (EDCC), 2014 Tenth European, pages 146–157, May 2014.
[138] S. K. Sastry Hari, S. V. Adve, H. Naeimi, and P. Ramachandran. Relyzer: Application
resiliency analyzer for transient faults. IEEE Micro, 33(3):58–66, May 2013.
[139] N. Seifert, B. Gill, S. Jahinuzzaman, J. Basile, V. Ambrose, Quan Shi, R. Allmon, and
A. Bramnik. Soft error susceptibilities of 22 nm tri-gate devices. IEEE Transactions on
Nuclear Science, 59(6):2666–2673, Dec 2012.
[140] N. Seifert and N. Tam. Timing vulnerability factors of sequentials. Device and Materials
Reliability, IEEE Transactions on, 4(3):516–522, Sept 2004.
[141] L. Sekanina. Evolutionary functional recovery in virtual reconfigurable circuits. ACM Journal on Emerging Technologies in Computing Systems (JETC), 3(2):8, 2007.
[142] Sangwon Seo, R.G. Dreslinski, M. Woh, Yongjun Park, C. Chakrabarti, S. Mahlke,
D. Blaauw, and T. Mudge. Process variation in near-threshold wide SIMD architectures.
In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, pages 980–987,
2012.

[143] Azam Seyedi, Gulay Yalcin, Osman S. Unsal, and Adrian Cristal. Circuit design of a novel
adaptable and reliable L1 data cache. In Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI, GLSVLSI ’13, pages 333–334, New York, NY,
USA, 2013. ACM.
[144] Muhammad Shafique, Siddharth Garg, Jörg Henkel, and Diana Marculescu. The EDA challenges in the dark silicon era: Temperature, reliability, and variability perspectives. In
Proceedings of the 51st Annual Design Automation Conference,
DAC ’14, pages 185:1–185:6, New York, NY, USA, 2014. ACM.
[145] C. A. Sharma. Sustainable Fault-Handling of Reconfigurable Logic Using Throughput-Driven Assessment. PhD thesis, University of Central Florida, FL, USA, Aug 2008.
[146] Vishal Chandra Sharma, Arvind Haran, Zvonimir Rakamarić, and Ganesh Gopalakrishnan.
Towards formal approaches to system resilience. In the 19th IEEE Pacific Rim International
Symposium on Dependable Computing (PRDC), 2013.
[147] J. Shin, V. Zyuban, P. Bose, and T.M. Pinkston. A proactive wearout recovery approach
for exploiting microarchitectural redundancy to extend cache SRAM lifetime. In Computer
Architecture, 2008. ISCA ’08. 35th International Symposium on, pages 353–362, June 2008.
[148] Youngsoo Shin, Jun Seomun, Kyu-Myung Choi, and Takayasu Sakurai. Power gating: Circuits, design methodologies, and best practice for standard-cell VLSI designs. ACM Trans.
Des. Autom. Electron. Syst., 15(4):28:1–28:37, October 2010.
[149] P. Shivakumar, M. Kistler, S.W. Keckler, D. Burger, and L. Alvisi. Modeling the effect
of technology trends on the soft error rate of combinational logic. In Dependable Systems
and Networks, 2002. DSN 2002. Proceedings. International Conference on, pages 389–398,
2002.


[150] D. Skarin, R. Barbosa, and J. Karlsson. GOOFI-2: A tool for experimental dependability
assessment. In the IEEE Int. Conf. on Dependable Systems and Networks (DSN), June 2010.
[151] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, Pavan Balaji, J. Belak,
P. Bose, F. Cappello, B. Carlson, Andrew A. Chien, P. Coteus, N. A. Debardeleben, P. Diniz,
C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, Sriram Krishnamoorthy, Sven Leyffer, D. Liberty, S. Mitra, T. S. Munson, R. Schreiber, J. Stearley, and E. V.
Hensbergen. Addressing failures in exascale computing. International Journal of High
Performance Computing, 2013.
[152] V. Sridharan and D.R. Kaeli. Eliminating microarchitectural dependency from architectural
vulnerability. In High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th
International Symposium on, pages 117–128, Feb 2009.
[153] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers. The impact of technology scaling on
lifetime reliability. In Dependable Systems and Networks, 2004 International Conference
on, pages 177–186, 2004.
[154] J. Srinivasan, S. V. Adve, P. Bose, and J.A. Rivers. Exploiting structural duplication for
lifetime reliability enhancement. In Computer Architecture, 2005. ISCA ’05. Proceedings.
32nd International Symposium on, pages 520–531, June 2005.
[155] S. Srinivasan, R. Krishnan, P. Mangalagiri, Y. Xie, V. Narayanan, M. J. Irwin, and K. Sarpatwari. Toward increasing FPGA lifetime. Dependable and Secure Computing, IEEE Transactions on, 5(2):115–127, 2008.
[156] E. Stomeo, T. Kalganova, and C. Lambert. Generalized disjunction decomposition for evolvable hardware. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on,
36(5):1024–1043, 2006.


[157] E. A. Stott and P. Y. K. Cheung. Improving FPGA reliability with wear-levelling. In Field
Programmable Logic and Applications (FPL), 2011 International Conference on, pages
323–328, 2011.
[158] E. A. Stott, J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung. Degradation in FPGAs: measurement and modelling. In Proceedings of the 18th annual ACM/SIGDA international
symposium on Field programmable gate arrays, pages 229–238. ACM, 2010.
[159] Jin Sun, Roman Lysecky, Karthik Shankar, Avinash Kodi, Ahmed Louri, and Janet Roveda.
Workload assignment considering NBTI degradation in multicore systems. J. Emerg. Technol. Comput. Syst., 10(1):4:1–4:22, January 2014.
[160] Ketul Sutaria, Athul Ramkumar, Rongjun Zhu, Renju Rajveev, Yao Ma, and Yu Cao. BTI-induced aging under random stress waveforms: Modeling, simulation and silicon validation.
In Proceedings of the 51st Annual Design Automation Conference, DAC ’14, pages 203:1–203:6, New York, NY, USA, 2014. ACM.
[161] D. Sylvester and A. Srivastava. Computer-aided design for low-power robust computing in
nanoscale CMOS. Proceedings of the IEEE, 95(3):507–529, March 2007.
[162] Berkeley Logic Synthesis and Verification Group. ABC: A system for sequential synthesis
and verification, release 10216.
[163] L.G. Szafaryn, B.H. Meyer, and K. Skadron. Evaluating overheads of multibit soft-error
protection in the processor core. IEEE Micro, 33(4):56–65, July 2013.
[164] Anna Thomas, Jacques Clapauch, and Karthik Pattabiraman. Effect of compiler optimizations on the error resilience of soft computing applications. In Workshop on Algorithmic
and Application Error Resilience (AER) held in conjunction with International Conference
on Supercomputing (ICS 2013), June 2013.

[165] Abhishek Tiwari and Josep Torrellas. Facelift: Hiding and slowing down aging in multicores. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pages 129–140, 2008.
[166] Y. Tohma and S. Aoyagi. Failure-tolerant sequential machines with past information. Computers, IEEE Transactions on, C-20(4):392–396, April 1971.
[167] Bogdan Tudor, Joddy Wang, Weidong Liu, and Hany Elhak. MOS Device Aging Analysis
with HSPICE and CustomSim. Technical report, Synopsys, Aug 2011.
[168] Balaji Vaidyanathan and Anthony S Oates. Technology scaling effect on the relative impact
of NBTI and process variation on the reliability of digital circuits. Device and Materials
Reliability, IEEE Transactions on, 12(2):428–436, 2012.
[169] J. B. Velamala, K. Sutaria, T. Sato, and Yu Cao. Physics matters: Statistical aging prediction under trapping/detrapping. In Design Automation Conference (DAC), 2012 49th
ACM/EDAC/IEEE, pages 139–144, June 2012.
[170] J. B. Velamala, K.B. Sutaria, H. Shimizu, H. Awano, T. Sato, G. Wirth, and Yu Cao.
Logarithmic modeling of BTI under dynamic circuit operation: Static, dynamic and long-term prediction. In Reliability Physics Symposium (IRPS), 2013 IEEE International, pages
CM.3.1–CM.3.5, April 2013.
[171] S. Vigander. Evolutionary Fault repair of Electronics in Space Applications. PhD thesis,
Norwegian University of Science and Technology, Trondheim, Norway, Feb 2001.
[172] Liang Wang, R. Bertran, A. Buyuktosunoglu, P. Bose, and K. Skadron. Characterization of
transient error tolerance for a class of mobile embedded applications. In Workload Characterization (IISWC), 2014 IEEE International Symposium on, pages 74–75, Oct 2014.


[173] Nicholas J. Wang, Aqeel Mahesri, and S. J. Patel. Examining ACE analysis reliability
estimates using fault-injection. In Proceedings of the 34th Annual International Symposium
on Computer Architecture, ISCA ’07, pages 460–469, New York, NY, USA, 2007. ACM.
[174] Nicholas J. Wang, Justin Quek, Todd M. Rafacz, and S. J. Patel. Characterizing the effects of transient faults on a high-performance processor pipeline. In Proceedings of the
2004 International Conference on Dependable Systems and Networks, DSN ’04, pages 61–,
Washington, DC, USA, 2004. IEEE Computer Society.
[175] Shuo Wang, Jifeng Chen, and Mohammad Tehranipoor. Representative critical reliability
paths for low-cost and accurate on-chip aging evaluation. In Proceedings of the International Conference on Computer-Aided Design, ICCAD ’12, pages 736–741, 2012.
[176] Wenping Wang, Shengqi Yang, S. Bhardwaj, S. Vrudhula, T. Liu, and Yu Cao. The impact
of NBTI effect on combinational circuit: Modeling, simulation, and analysis. Very Large
Scale Integration (VLSI) Systems, IEEE Transactions on, 18(2):173–183, 2010.
[177] Yu Wang, Hong Luo, Ku He, Rong Luo, Huazhong Yang, and Yuan Xie. Temperature-aware
NBTI modeling and the impact of standby leakage reduction techniques on circuit performance degradation. IEEE Transactions on Dependable and Secure Computing, 8(5):756–
769, 2011.
[178] Jiesheng Wei, A. Thomas, Guanpeng Li, and K. Pattabiraman. Quantifying the accuracy
of high-level fault injection techniques for hardware faults. In the IEEE/IFIP Int. Conf. on
Dependable Systems and Networks (DSN), June 2014.
[179] Chris Wilkerson, Alaa Alameldeen, and Zeshan Chishti. Scaling the memory reliability
wall. Intel Technology Journal, 17(1):18–34, May 2013.


[180] Kai-Chiang Wu and D. Marculescu. Power-aware soft error hardening via selective voltage
scaling. In Computer Design, 2008. ICCD 2008. IEEE International Conference on, pages
301–306, Oct 2008.
[181] Kai-Chiang Wu, D. Marculescu, Ming-Chao Lee, and Shih-Chieh Chang. Analysis and mitigation of NBTI-induced performance degradation for power-gated circuits. In Low Power
Electronics and Design (ISLPED) 2011 International Symposium on, pages 139–144, 2011.
[182] S. Yang. Logic synthesis and optimization benchmarks version 3. Technical report, Microelectronics Center of North Carolina, 1991.
[183] Xiangning Yang and K. Saluja. Combating NBTI degradation via gate sizing. In Quality
Electronic Design, 8th International Symposium on, pages 47–52, 2007.
[184] Yun Ye, T. Liu, Min Chen, S. Nassif, and Yu Cao. Statistical modeling and simulation of
threshold variation under random dopant fluctuations and line-edge roughness. Very Large
Scale Integration (VLSI) Systems, IEEE Transactions on, 19(6):987–996, 2011.
[185] Li Yu, Dong Li, Sparsh Mittal, and Jeffrey S. Vetter. Quantitatively modeling application
resilience with the data vulnerability factor. In Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis, SC ’14, pages 695–
706, Piscataway, NJ, USA, 2014. IEEE Press.
[186] Lide Zhang and Robert P. Dick. Scheduled voltage scaling for increasing lifetime in the
presence of NBTI. In Proceedings of Asia and South Pacific Design Automation Conference,
pages 492–497, 2009.
[187] Wei Zhao and Yu Cao. New generation of predictive technology model for sub-45nm design exploration. In Proceedings of the 7th International Symposium on Quality Electronic
Design, pages 585–590, 2006.

[188] Quming Zhou and K. Mohanram. Gate sizing to radiation harden combinational logic.
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on,
25(1):155–166, Jan 2006.


