Detection, diagnosis and modeling of ESD-induced soft failures - a gate-level and mixed-signal approach by Jiang, Rui
© 2017 Rui Jiang
DETECTION, DIAGNOSIS AND MODELING OF ESD-INDUCED SOFT




Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Adviser:
Associate Professor Shobha Vasudevan
ABSTRACT
Electronic systems are an indispensable part of people’s lives today. However,
the reliability of electronic systems can be threatened by external stimuli such
as electrostatic discharges (ESDs). An ESD can either physically damage an
electronic system or cause it to malfunction without damaging it. Therefore,
manufacturers need extensive design work and qualification testing to
improve robustness against the negative effects of ESDs. Implementing
solutions by trial and error has incurred huge costs to companies
in terms of labor and time. Despite the ever-increasing effort being devoted
to solving ESD-related problems, field returns still happen, and a
significant portion can be attributed to soft failures induced by system-level
ESD. Although ESD-induced permanent failures are well studied and
protection mechanisms have proven to work, existing studies of ESD-induced
soft failures are all at the physical and transistor levels.
In this thesis, we studied ESD-induced soft failures by first conducting case
studies in which ESDs were injected into physical devices and the
application-level symptoms of the failures were observed, and then performing
simulation-based ESD injections on a well-known instruction set architecture.
For the first time, we correlated the physical-level ESD event with high-level
system behavior. We implemented a mixed-signal-simulation-based fault
injection environment and device models that allow ESDs to be injected into
target systems. By injecting different types of ESDs into the target system,
we, for the first time, identified gate-level bit-flip patterns resulting from
a SPICE-level high-voltage event. Our experimental results show that the
extent of register value corruption can be single-bit or widespread, and that
the resulting bit flips can affect the system in multiple ways. We also
demonstrated low-cost protection measures for some of the resulting failures.
To my parents, for their love and support.
ACKNOWLEDGMENTS
I would like to use this opportunity to express my gratitude to people who
helped me throughout my master’s degree study. I hereby express my most
sincere gratefulness to all professors, mentors, family members and friends
who provided me with inspiration, advice, and illumination.
Foremost, I would like to express my most sincere acknowledgment to my
advisor, Professor Shobha Vasudevan, for her trust and guidance over the
past four years. I was fortunate to be an undergraduate researcher in
Professor Vasudevan’s group. Although I was an undergraduate student, I received
consistent trust from her and joined her post-silicon validation project, and
published the paper “Can’t see the forest for the trees: State restoration’s
limitations in post-silicon trace signal selection”, which received a best
paper nomination at the 2015 International Conference on Computer Aided
Design (ICCAD). After formally becoming her master’s student, I started to
work on the ESD-induced soft failures project. She exhibits extreme
professionalism and consistently motivates me with brilliant ideas, which
eventually led to the publication of “Application level investigation of
system-level ESD-induced soft failures”. She has guided me to overcome
adversities and serves as an indefatigable mentor in both life and academia.
I would like to particularly thank Professor Elyse Rosenbaum, our
collaborator working on physical-level circuit design and ESD-related areas, for
being extremely knowledgeable, kind and helpful, Sandeep Vora and Prajwal
Mysore Vijayaraj for collaborating on this project and establishing our
understanding of this topic, Debjit Pal for introducing me to and teaching me
about various available resources, Dr. Zbigniew Kalbarczyk for his constant
help with fault injection concepts and techniques, Dr. Janak Patel for
inspiration on error detection techniques and observability enhancement
techniques, Janice L. Progen for her unwearying effort in revising the language
of this thesis,
Yang Xiu for the construction of the circuit-level models, as well as Sujan




CHAPTER 1 INTRODUCTION
1.1 Reliability as a Concern
1.2 Soft Failures and Hard Failures
1.3 Motivation
1.4 Methodology
1.5 Contributions
1.6 Outline
CHAPTER 2 PRELIMINARIES
2.1 IEC 61000-4-2 Standard
2.2 Fault Injection
2.3 Error Masking, Detection and Correction
2.4 Verilog-AMS and Mixed-Signal Simulation
CHAPTER 3 SOFT ERRORS IN HARDWARE INJECTION
3.1 DUTs
3.2 Microprocessor
3.3 FPGA
3.4 Micro-Controller
3.5 Observations and Lessons Learned
CHAPTER 4 TARGET SYSTEM – OPENMSP430
4.1 Overview
4.2 Tape-out Preparations and Module Descriptions
CHAPTER 5 SIMULATION-BASED FAULT INJECTION
5.1 Automated Gate-Level Fault Injector
5.2 Injecting ESD on Clock and Input Pins
5.3 Injecting ESD on Supply Net
CHAPTER 6 EXPERIMENTAL RESULTS
6.1 Overview
6.2 Results for Gate-Level Injections
6.3 Results for Clock Injections
6.4 Results for Data Pin Injections
6.5 Injections on the Supply Net
6.6 Suggested Recovery Strategies
CHAPTER 7 CONCLUSION
7.1 Conclusion
7.2 Future Work




1.1 Reliability as a Concern
Complementary metal-oxide-semiconductor (CMOS) technology has advanced
along with new and unprecedented reliability concerns. Transistor aging
violation, multi-interface quality, higher local heat, small-dimension scatter-
ing and material concerns are the major challenges for CMOS technology
reliability [1] in electronic devices. As a result, electronic devices, such as
personal computers and mobile phones, sometimes fail to function correctly.
Researchers have been examining the causes for the failures, and one research
area that receives a lot of attention is electrostatic discharge (ESD). ESD is
the sudden flow of electricity between two electrically charged objects caused
by contact, an electrical short, or a dielectric breakdown, and it typically has
a high current and a short duration. ESD can harm an electronic system in
two ways: the device may have permanent physical damage or the system
may malfunction without any permanent physical damage. The former is
commonly referred to as hard failures and the latter is referred to as soft
failures.
ESD reliability is a major concern in the electronics industry, and the In-
dustry Council on ESD Target Levels has written two publications to describe
this problem [2, 3]. Also, researchers have proposed a design methodology
called system-efficient ESD design (SEED) to protect devices from hard fail-
ures. The proposed protections are primarily provided by on-chip and on-
board ESD protection devices. This thesis, on the other hand, will mainly
focus on soft failures induced by system-level ESDs. The term system-level
ESDs is used in contrast to component-level ESDs, where the latter refers
to ESDs that occur during the manufacturing stage of different components
and the assembly and handling stage of an integrated circuit (IC), and the
former refers to ESDs that occur when a charged object discharges into an
electronic system or into an object nearby the electronic system.
Qualification tests are proposed and enforced for products with undue
susceptibility to system-level ESDs. However, solutions for ESD susceptibil-
ity will only be implemented after the detection and diagnosis of the failures.
Therefore, this methodology has significant costs in terms of time-to-market.
In one case, multiple man-months were spent to resolve an ESD-induced
problem, delaying the product’s release for an extended period of time [2].
Moreover, the field returns are not completely eliminated by qualification
testing. Among the 58 different product returns studied by the Industry
Council, a quarter of them are attributed to system-level ESDs. Among the
failures caused by ESDs, almost half are due to soft failures. Furthermore,
the study underestimates the likelihood of soft failure in electronic products,
because the soft failures are not reported as frequently. For example, in the
case of an unexpected phone program termination caused by a system-level
ESD, most customers will regard this as a software issue and re-launch the
application. As a result, the soft failure will be fixed since soft failures are
characterized as transient failures in nature. Therefore the customers will
not be aware that the failure is attributed to ESD susceptibility, and in such
cases soft errors will not be reported.
Solving problems caused by ESD-induced soft failures is becoming
increasingly important to the electronics industry for three reasons. First, the
likelihood of ESD-induced soft failures increases due to the decrease of the
supply voltage. One focus of the very-large-scale integration (VLSI) design
is on reducing the power consumption of an electronic device. This is often
achieved by having smaller-sized transistors, and hence the supply voltage of
a processor core can be reduced. However, the reduced supply voltage causes
reduction of the noise margin of the transistors. As a result, the electronic
devices become more susceptible to ESDs. Second, user experience is valued
by companies. Although most soft failures can be fixed by user interventions,
the malfunctions of the applications will adversely affect the user experiences,
and it is important to alleviate the negative impacts of soft failures. Last but
not least, it is crucial to guarantee correct operations of electronic devices in
safety-related areas. Many application-specific integrated circuits (ASICs)
need to work in places with high likelihood for ESDs, such as in automobiles
and airplanes. In these cases, the charges will build up around cars’ engines
and generators, and the ASICs have to be resilient against the potential ESD
hazards.
1.2 Soft Failures and Hard Failures
In an electronic system, a failure is said to occur if at least one external state
of the system deviates from the correct service state, and the deviation is
called an error. The hypothesized cause of an error is called a fault. Note
that this definition of the term fault comes from the dependable and secure
computing community [4], and differs from the one used by the circuit
community. In the former definition, a fault can refer to the cause of either
a soft failure or a hard failure, whereas in the latter, a fault usually refers to
the permanent damage of a component. In this thesis, we will use the former
definition for a fault.
We define hard failures as the failures of a system due to physical damage
to a system component, which can only be repaired by the physical repair or
replacement of the damaged components. Typical examples of hard failures
caused by an ESD event can be oxide breakdown of any MOSFET [5, 6, 7,
8, 9], metal melt [10], or silicon melt [11, 12]. A body of studies on SEED
tackles different causes of ESD-induced hard failures [13, 14, 15, 16, 17], and
this topic is well-established.
In contrast, relatively little research has been done on soft failures. The term
soft failures, or soft errors, refers to failures that are not due to physical
damage: the system can return to a functional state without the repair or
replacement of a component. The existing studies on ESD-induced soft failures
mainly address their detection and diagnosis at the transistor level. Studies show
that the typical soft failures in an electronic device include input glitch [18],
clock noise [19], bit-flips in registers [18], bit-flips in combinational logic [20],
ground bounce [21], and de-biasing of analog circuits [22]. Figure 1.1 is an
example of a soft failure obtained during the experiments. In this figure, the
left-hand side and the right-hand side of the display are interchanged, as a
result of an ESD event.
Figure 1.1: ESD-induced soft failure example
1.3 Motivation
A class of soft failures which receives a lot of attention includes those induced
by energetic particle strikes [23, 24, 25, 26, 27, 28, 29, 30, 31, 32]. The
particles will inject energy into logic units of an electronic device, and cause
undesired flow of electrons. As a result, the content of a register can flip from
a logic 0 to logic 1, or vice versa. Since particle strikes are random in nature
and the probability of concurrent particle strikes is very low, such a fault is
modeled as a single event upset (SEU). Then, fault injectors are implemented
using such a fault model to evaluate a system’s robustness against failures
induced by particle strikes. SEUs are also injected at different levels of a
system to study the effects of the faults at different levels. The corresponding
detection techniques of this class of failures range from circuit level to design
level, micro-architecture level and even application level. Detailed literature
studies on fault injections and detections will be presented in chapter 2.
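The SEU fault model described above can be illustrated with a minimal Python sketch. It flips exactly one randomly chosen bit of a register value. The 16-bit register width and the name `inject_seu` are illustrative assumptions and are not taken from the fault injectors used in this thesis.

```python
import random
from typing import Optional, Tuple


def inject_seu(value: int, width: int = 16,
               rng: Optional[random.Random] = None) -> Tuple[int, int]:
    """Flip one randomly chosen bit of a `width`-bit register value.

    Returns the corrupted value and the index of the flipped bit.
    """
    rng = rng or random.Random()
    bit = rng.randrange(width)        # position of the upset
    corrupted = value ^ (1 << bit)    # XOR toggles 0 -> 1 or 1 -> 0
    return corrupted & ((1 << width) - 1), bit


golden = 0x1234
faulty, flipped_bit = inject_seu(golden, rng=random.Random(0))
# Exactly one bit differs between the golden and the faulty value.
assert bin(golden ^ faulty).count("1") == 1
```

Because a single call corrupts a single bit, a model of this kind cannot by itself represent corruption that is spread over many registers.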
ESD-induced soft failures are much more serious because their effects are
spread out across the system and across time. Meanwhile, ESD-induced clock
noise can cause timing violations for a large number of flip-flops in a
system. Therefore, such failures
will result in errors in computations as well as in the control logic. Figure
1.2 shows how an ESD-induced fault propagates up the stack from the transistor
level to the computational level. Suppose an ESD event is drawing current
away from the output nodes of some NAND gates. The transistor-level char-
acterization is reduced node voltages caused by current flowing out of the
nodes. At logic level, instead of outputting a logic 1 with inputs 1 and 0, the
NAND gates output logic 0. At the computational level, the manifestation
of the fault will be incorrect computation results for one or more bits in the
result. The objective of this research is to establish a correlation between
the transistor level ESD event and its manifestations at higher levels.
Figure 1.2: Example of an ESD-induced fault propagating up the stack; the
red color highlights the corrupted bits
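The propagation from the logic level to the computational level can be sketched in Python. The sketch below builds a 4-bit ripple-carry adder out of NAND gates and forces the output of a chosen NAND gate to logic 0, mimicking an output node whose voltage is pulled down by the ESD current. The adder structure, the gate names, and the `faulty_gates` parameter are assumptions made for illustration; they do not describe the circuit in figure 1.2 or the openMSP430 netlist.

```python
def nand(a, b, pulled_low=False):
    """2-input NAND; a pulled-low output node reads as logic 0."""
    return 0 if pulled_low else 1 - (a & b)


def full_adder(a, b, cin, faulty=frozenset()):
    """Nine-NAND full adder; gates named in `faulty` output 0."""
    def g(name, x, y):
        return nand(x, y, name in faulty)
    n1 = g("n1", a, b)
    n2 = g("n2", a, n1)
    n3 = g("n3", b, n1)
    x = g("x", n2, n3)        # a XOR b
    n4 = g("n4", x, cin)
    n5 = g("n5", x, n4)
    n6 = g("n6", cin, n4)
    s = g("s", n5, n6)        # sum bit
    cout = g("cout", n1, n4)  # carry out
    return s, cout


def add(x, y, width=4, faulty_gates=None):
    """Ripple-carry add; `faulty_gates` maps bit index -> set of gate names."""
    faulty_gates = faulty_gates or {}
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry,
                              faulty_gates.get(i, frozenset()))
        result |= s << i
    return result  # final carry is dropped (modulo 2**width)


golden = add(5, 2)                            # 7
faulty = add(5, 2, faulty_gates={1: {"s"}})   # 5
corrupted_bits = golden ^ faulty              # 0b0010
```

In this example the faulty gate sees inputs 0 and 1 and should output logic 1, but outputs logic 0 instead, so bit 1 of the result is corrupted.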
There is very little research that studies the higher-level manifestation
of ESD-induced soft failures [18, 33, 34]. In [18] and [33], the relationship
between an ESD event and its logic-level symptoms is studied without having
a complete digital system or any real workload. In [34], the application-level
failures are observed but no further diagnosis work is done to show how the
faults may have propagated. Therefore, no existing research has presented a
complete study of how ESD-induced faults manifest themselves at higher
levels and consequently lead to application-level failures. To the best of our
knowledge, we, for the first time, correlated the application-level failures
with logic-level corruptions and physical-level ESD events. Also, for the
first time, ESDs are injected in transistor-level simulations and their effects
are observed at the gate level and the application level. Meanwhile, no
existing high-level fault injectors can inject faults that accurately resemble
the characteristics of ESD-induced faults because of the lack of a fault model.
In this work, different fault models will be presented to capture the different
faulty behaviors when ESDs are injected to different parts of a system.
1.4 Methodology
The overview of this project is shown in figure 1.3. We performed hardware-
based ESD injections onto commercial chips according to the qualification
testing standard (IEC 61000-4-2). We chose three different commercial chips
and designed test programs to identify different types of soft failures at the
application level. We then came up with hypotheses for our observations and
findings based on the injection results. In order to perform a more in-depth
diagnosis of such failures and characterize them, we designed a custom test
chip using the openMSP430 core [35]. We modified the openMSP430 core by
adding high-level fault detectors in the core. The corresponding IO ring was
also designed, and primary and secondary ESD protection units were designed
to provide the same level of protection as those provided in commercial chips.
Verifications were performed to guarantee the functional correctness of the
modified core and the IO ring. Having the models for the core, the IO ring
and the ESD stimulus, we implemented automatic fault injectors to inject
ESDs in gate-level simulation and mixed-signal simulation. While performing
simulation-based fault injections, a tape-out of a complete processor core with
its IO ring will be made. Then, we will again perform hardware-based ESD
injection onto the custom test chip. The different fault injection experiments
include (1) gate-level injections on all the registers, (2) the clock signal pin,
(3) the data signal pins, and (4) the power supply net. Experimental results
are collected in the simulation-based fault injections and will be collected in
the hardware-based ESD injections. Finally, the results will be compared
and solid conclusions will be drawn if the results in the simulation-based
injections match those in the hardware-based injections. The taping-out and
the corresponding further experiments will be conducted after the completion
of this thesis, and this project will be continued by new researchers and our
collaborators. The primary focus of this work is on simulation-based ESD
injections at the gate level and in mixed-signal simulations.
Figure 1.3: Project overview
1.5 Contributions
In all of our simulation-based fault injection experiments, we found that most
of the fault injections on the registers, the clock signal, the GPIO data pins,
and the power supply net do not result in failures. On average, the testbenches
successfully completed their expected tasks in 94.6%, 92.5%, 98.9%,
and 100% of the trials in the four injection experiments, respectively. We found that
on average 0.66% of failures in the register injections and 7.3% of failures
in the injections on the clock signal occur concurrently with the failure of
the system clock to toggle. In the clock injections, we also observed that
on average, 10.1% of the passed simulations have their clock signals phase
delayed by (2n + 1)π.
We also observed the number of corrupted registers in different experi-
ments. We found that in most passed trials of the injection on registers,
the number of corrupted registers is either 0 (the fault is masked) or 1 (the
fault never propagates). In very rare cases, the number of corrupted
registers could be more than 100 in passed simulations. In failed trials of the
injections on registers, the number of corrupted registers ranges from 1 to
180. Meanwhile, injections on the clock signal are more likely to cause timing
issues for a wide range of registers, and more than 1000 registers could have
indeterminate values. In all trials of the injections on the data signal pins,
the numbers of corrupted registers are small.
This research makes the following contributions to the state-of-the-art re-
search:
1. Characterization of ESD-induced soft failures at the application
level. In this work, for the first time, soft failures at the application level
are observed while performing system-level ESD testing on different test
chips, and the correlation between an ESD event and its application-level
response is studied.
2. Demonstration of the limitation of the current ESD qualification
testing practice. While in the current ESD qualification testing
practice ESDs are injected only a few times, we demonstrated that some
ESD-induced soft failures are low-probability events that require
hundreds of tests to identify.
3. Construction of a voltage-sensitive flip-flop model with accurate
inter-module and intra-module delays. This work proposes
an algorithm for constructing voltage-sensitive flip-flop models for the
flip-flops in a netlist. The delay information in the Standard Delay Format
(.sdf) file is preserved.
4. Implementation of a mixed-signal fault simulation framework.
To the best of our knowledge, no prior work has been done to simulate
the injection of ESDs at the SPICE level and observe the failures at the
gate level in a mixed-signal simulation. We implemented a mixed-signal
simulation framework. The proposed models used in the mixed-signal
fault simulations are scalable. Further experiments and injections can
be applied to different systems to examine gate-level fault manifesta-
tions caused by a physical-level ESD event.
5. Fault modeling of ESD-induced soft failures at the gate level. The
nature of ESD-induced soft failures has been studied at the device level.
In this project, for the first time, their manifestations as computational
errors are analyzed and modeled. Fault modeling of ESD-induced soft
failures will foster further results on failure analysis, diagnosis, mitiga-
tion, correction and recovery at the gate level or a higher level, thereby
enabling further research to be done without lower-level hardware ex-
pertise. The modeling will address faults induced by ESD events in
digital chips.
6. A new class of transient errors. This work initiates a study of a new
class of soft errors that can adversely affect computation. The proposed
characterization and analysis of ESD-induced failures will trigger in-
depth academic and practical analyses at different levels of a design.
1.6 Outline
The outline of this thesis is as follows:
• Chapter 2 describes the preliminaries and literature studies.
• Chapter 3 presents the hardware-based fault injection results.
• Chapter 4 introduces the target design of the simulation-based fault
injection experiments.
• Chapter 5 describes the different fault injectors implemented and the
injection methodology.
• Chapter 6 presents the experimental results of different simulation-
based fault injections.





2.1 IEC 61000-4-2 Standard
The testing described in IEC 61000-4-2 is a legal requirement in many parts
of the world. The setup of the 61000-4-2 ESD test is shown in figure 2.1. The
equipment under test (EUT) is placed on top of a 1.6 m x 0.8 m horizontal
coupling plane (HCP) and is separated from the HCP by an insulator that
is 0.5 mm thick. The ground plane measures 2 m x 2 m
and is connected to the HCP via two 470 kΩ resistors. The ESD generator
is pre-charged to a test voltage, and when the discharge is triggered,
the discharge current flows into the EUT through the ESD gun. When
discharging, the ESD gun forms a 45-degree angle with respect to the HCP.
The ESD ground pin of the ESD gun is connected to the ground plane.
Figure 2.1: The IEC 61000-4-2 test setup
2.1.1 Discharge Waveform
IEC 61000-4-2 Table 3 specifies the following contact discharge current
waveform parameters at 2 kV, measured on a 2 Ω resistor:
• an initial peak of 7.5 A (±15%) with a rise time of 0.8 ns (±25%);
• a current of 4 A (±30%) at 30 ns;
• a current of 2 A (±30%) at 60 ns.
Figure 2.2: Contact discharge current waveform parameters at 2 kV,
measured on a 2 Ω resistor
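As a worked illustration of the tolerances listed above, the following Python sketch computes the allowed window for each 2 kV parameter and reports which measured values fall outside it. The dictionary keys, the function names, and the measured values in the example are hypothetical; only the nominal values and tolerances come from the list above.

```python
# Nominal 2 kV contact-discharge parameters: (nominal value, relative tolerance)
IEC_2KV = {
    "peak_current_A": (7.5, 0.15),
    "rise_time_ns": (0.8, 0.25),
    "current_at_30ns_A": (4.0, 0.30),
    "current_at_60ns_A": (2.0, 0.30),
}


def tolerance_window(nominal, rel_tol):
    """Return the (low, high) bounds for nominal * (1 +/- rel_tol)."""
    return nominal * (1 - rel_tol), nominal * (1 + rel_tol)


def out_of_spec(measured):
    """Return the names of the parameters outside their tolerance windows."""
    bad = []
    for name, (nominal, rel_tol) in IEC_2KV.items():
        low, high = tolerance_window(nominal, rel_tol)
        if not low <= measured[name] <= high:
            bad.append(name)
    return bad


# Hypothetical measurement of a generator discharge (illustrative numbers only).
measured = {
    "peak_current_A": 7.9,
    "rise_time_ns": 0.7,
    "current_at_30ns_A": 5.5,   # above the 2.8 A to 5.2 A window
    "current_at_60ns_A": 2.1,
}
violations = out_of_spec(measured)   # ["current_at_30ns_A"]
```

For example, the peak current window at 2 kV is 6.375 A to 8.625 A, and the rise time window is 0.6 ns to 1.0 ns.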
2.1.2 Failure Criteria
Under an ESD event, the failure criteria are classified as the following types:
1. The device continues to function normally
2. Temporary loss of functionality but the device will recover without
operator’s intervention
3. Temporary loss of functionality and operator’s intervention is required
4. Permanent loss of functionality
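The four result types can be encoded as a small decision procedure, sketched below in Python. The enum, the function names, and the three boolean observations are illustrative assumptions. The mapping of types 2 and 3 to soft failures and type 4 to a hard failure follows the definitions given in section 1.2.

```python
from enum import IntEnum


class EsdResult(IntEnum):
    NORMAL = 1               # device continues to function normally
    SELF_RECOVERED = 2       # temporary loss, recovers without intervention
    OPERATOR_RECOVERED = 3   # temporary loss, operator intervention required
    PERMANENT = 4            # permanent loss of functionality


def classify(lost_function, recovers_alone, recovers_after_intervention):
    """Map three observations made after a discharge to a result type."""
    if not lost_function:
        return EsdResult.NORMAL
    if recovers_alone:
        return EsdResult.SELF_RECOVERED
    if recovers_after_intervention:
        return EsdResult.OPERATOR_RECOVERED
    return EsdResult.PERMANENT


def is_soft_failure(result):
    """Types 2 and 3: function is lost but no physical repair is needed."""
    return result in (EsdResult.SELF_RECOVERED, EsdResult.OPERATOR_RECOVERED)


# A device that hangs and needs a manual reset is a type 3 soft failure.
result = classify(lost_function=True, recovers_alone=False,
                  recovers_after_intervention=True)
```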
For type 1, no failures are observed either because no fault is manifested
due to the ESD event, or the fault is masked or corrected by the device. Type




2.2 Fault Injection
Reliability researchers have focused attention on soft errors induced by
high-energy particle strikes. Techniques for detection range from the circuit
level, logic transfer level, micro-architecture level and operating system
level to the application level. Due to the localized nature of particle
strikes, such errors were modeled as SEUs at the beginning of the 2000s. As
CMOS technology advanced, cells came to have thinner wells and lower noise
margins, and the fault model for particle strikes was extended to localized
multiple bit-flips. To study the
effects of different faults at different levels and validate the dependability of
a system, fault injection techniques are necessary. Most of the developed
techniques can be categorized as follows:
• Hardware-based fault injection: This is achieved by physically ap-
plying the stimulus onto the design under test (DUT). Examples for
such injections include applying heavy ion radiation, EMI, or ESD on
the DUT. While hardware injection resembles the most realistic char-
acteristics of the real-world stimulus and is suitable for low-level fault
models, the risk of damaging the DUT is high, and low observability
limits our ability to thoroughly examine what happened inside a sys-
tem. The IEC 61000-4-2 ESD test is an example of hardware-based
injection.
• Software-based fault injection: The objective of software-based
fault injection is to reproduce failures at the software level. Errors are
often represented as bit-flips in software variables [36]. While software-
based injections are orders of magnitude faster than any other level, the
abstraction omits many circuit details. Therefore, existing high-level
error injection techniques are in general inaccurate in describing the
actual behavior of a system [37].
• Simulation-based fault injection: Simulation-based injection
allows faults to be injected at different levels of abstraction. The levels
of abstraction can include: architecture level [38], micro-architecture
level [39], behavioral and register-transfer level (RTL) [40], gate
level [41], SPICE level [42], and transistor level [43]. Simulation-based
fault injection techniques can include appropriate details of the design
and provide the maximum amount of observability and controllabil-
ity. However, depending on the level of abstraction, simulation-based
injectors can be extremely time-consuming.
• Emulation-based fault injection: This technique combines the mer-
its from the hardware injection and the simulation-based injection. In
emulation-based fault injection, the DUT is synthesized onto a Field
Programmable Gate Array (FPGA), thereby enabling the designer to
study the actual behavior of the circuit. An example of an emulation-
based fault injector can be found in [44].
2.3 Error Masking, Detection and Correction
Once a fault is injected into the target system, it may be masked. A fault
is said to be masked if the corrupted contents of a register or registers
are corrected with the expected contents. For example, if the content of a
memory location is corrupted and that piece of data is never used before it
is overwritten with some new data, the fault will not result in any failure.
Studies show that many faults can be masked at the lower level [45, 46, 47].
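The memory example above can be reproduced with a short Python sketch. It replays a trace of reads and writes on a small memory, flips a bit at a chosen point, and compares the values read against a fault-free run. The trace format and the names `run` and `is_masked` are assumptions made for illustration.

```python
def run(trace, fault_addr=None, fault_step=None, flip_mask=0x1):
    """Replay ('w', addr, value) and ('r', addr) operations.

    Bits of `fault_addr` selected by `flip_mask` are flipped just before
    operation number `fault_step`. Returns the list of values read.
    """
    mem, reads = {}, []
    for step, op in enumerate(trace):
        if step == fault_step:
            mem[fault_addr] = mem.get(fault_addr, 0) ^ flip_mask
        if op[0] == "w":
            mem[op[1]] = op[2]
        else:
            reads.append(mem.get(op[1], 0))
    return reads


def is_masked(trace, fault_addr, fault_step):
    """A fault is masked if every read returns the fault-free value."""
    return run(trace) == run(trace, fault_addr, fault_step)


# Corrupted data is overwritten before it is read: the fault is masked.
overwrite_first = [("w", 0, 5), ("w", 0, 9), ("r", 0)]
# Corrupted data is read before it is overwritten: the fault propagates.
read_first = [("w", 0, 5), ("r", 0), ("w", 0, 9)]

masked = is_masked(overwrite_first, fault_addr=0, fault_step=1)    # True
propagated = not is_masked(read_first, fault_addr=0, fault_step=1)  # True
```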
If a fault is not masked, appropriate methods must be adopted to detect
the error. An error is detected if its presence is indicated by an error message
or error signal. Errors that are present but not detected are latent errors.
The detection techniques can be classified into two major categories:
1. Online error detection: online error detection techniques detect er-
rors while the device is running. The advantage of online error detection
is that once an error is detected, the appropriate measures can be taken
to recover the device from the faulty status.
• Hardware redundancy refers to the technique where a number
of hardware units are executing the same task, and the result
produced by each hardware unit will be compared. If any of the
hardware units suffer from a soft error hazard, other units will
still produce the same results, and therefore the failure can be
identified and corrected.
• Information redundancy is the addition of redundant informa-
tion to data to allow error detection and correction. This topic
has long been established to protect communication protocols,
memories and GPUs. The different implementations of informa-
tion redundancy-based error detection that have been studied in-
clude: parity [48, 49], cyclic redundancy checks [50, 51], check-
sums [52, 53, 54], error correction codes (ECC) [55, 56], arith-
metic codes [57], residue-inverse residue codes [58], Reed-Solomon
codes [59], two-level integrated interleaved coding [48, 60] and
Redundant Array of Inexpensive Disks (RAID) [61, 62]. Among
these, ECC is one of the most prevalently used low-cost detection
and recovery structures and is implemented in many processors,
especially in memories and in level 1 and level 2 caches.
• Time redundancy is the repetition of computations in order to
compare for discrepancies. Computations will be made either with
exactly the same set of instructions [63] and hardware resources,
or a different set of instructions and hardware resources [64, 65].
If any discrepancy of the results is detected, a third execution will
be invoked and a majority voting based mechanism will correct
the error and stop the fault from propagating.
• Dedicated hardware units such as watchdog timers (WDTs) are
used to detect infinite loops in the program. WDTs are very com-
monly used in microprocessors and micro-controllers due to their
low cost and high effectiveness. Other hardware units, such as
memory monitors and fatal trap detectors are designed particu-
larly to identify certain kinds of faults.
• High-level detectors are based on the observation that faults
that are masked at different system levels need not be detected [45,
46, 47, 66, 67]. Indeed, from the user’s perspective, if a fault does
not propagate up the stack and eventually result in a failure,
the user will not notice it, nor does the fault threaten the
reliability of the system. Implementations of high-level detectors
include SoftWare Anomaly Treatment (SWAT) [68, 69].
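Two of the online techniques listed above, information redundancy and time redundancy, can be sketched in a few lines of Python. The sketch uses a single even-parity bit and a repeat-then-vote scheme in which a third execution is invoked only when the first two disagree. The function names are hypothetical, and the sketch is not drawn from any of the cited implementations.

```python
def parity_bit(word):
    """Even-parity bit: 1 if `word` contains an odd number of 1s."""
    return bin(word).count("1") & 1


def parity_error(word, stored_parity):
    """Detects any odd number of flipped bits; even counts go unnoticed."""
    return parity_bit(word) != stored_parity


def run_with_time_redundancy(compute):
    """Execute twice; on a mismatch, execute a third time and vote."""
    first, second = compute(), compute()
    if first == second:
        return first
    third = compute()
    if third == first or third == second:
        return third
    raise RuntimeError("no two executions agree")


# Information redundancy: a single bit-flip changes the parity.
stored = 0b1011
p = parity_bit(stored)                               # 1
single_flip_detected = parity_error(stored ^ 0b0100, p)   # True
double_flip_detected = parity_error(stored ^ 0b0110, p)   # False

# Time redundancy: the second execution is corrupted, the vote recovers 42.
outcomes = iter([42, 43, 42])
voted = run_with_time_redundancy(lambda: next(outcomes))  # 42
```

The double bit-flip case shows why a single parity bit is not sufficient when corruption is widespread.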
2. Offline error detection: offline error detection techniques identify
errors when the device is not running. Most of the offline error detection
facilities will stop the system clock, read information from the device,
and analyze the information to see whether an error has occurred. The
advantage of offline error detection is that a large amount of information
can be read and compared against the expected values. We list two of the more
relevant offline error detection techniques here:
• Scan chain is a design-for-testability (DFT) facility that has become
common in recent years; it increases controllability and observability at
the manufacturing test stage [70]. Most synthesis tools, such as
Design Compiler from Synopsys, are capable of inserting a scan
chain into a given design. The structure of a scan chain is shown
in figure 2.3.
Figure 2.3: Scan chain structure
In the scan chain structure, a MUX is inserted in each
scan flop. The functional inputs of the flip-flops are connected
to port 0 of the MUX, and the scan inputs of the flip-flops are
connected to port 1 of the MUX. Note that the output of one
scan flop goes to the scan input of the next flop, thereby forming
a chain. The selection between functional and scan inputs is made
by the scan-enable signal, scan_en. When scan_en is not set, the
flops work as if the scan structure were not present, whereas
when scan_en is set, the data of each scan flop can be shifted out
one by one.
The scan chain can be inserted for most of the registers inside
a design, as long as their clock signal is externally controllable.
In reaction to an event of interest, the content of the registers
in the scan chain will be compared against the expected values
and failure analysis can be conducted. If there is a hard failure,
some register content in the scan chain will be different from the
expectation, and the discrepancy is reproducible at the previous
breakpoint. Meanwhile, if there is a soft failure, the difference in
register values will still be observed, but the observation cannot
be reproduced. The scan chain is a very effective error detection
technique and there have been studies over the years to examine
the usage of the scan chain for error detection and recovery [71,
72, 73, 74, 75, 76, 77].
• Embedded logic analyzers (ELAs) have four components: a
control unit, a trigger unit, a sample unit and an offload unit. The
control unit can be programmed through serial interfaces such as
JTAG, through which it receives the instructions that control
the ELA. The trigger unit detects the occurrence of the
event of interest and triggers the sample unit. The sample unit
contains a trace buffer, which tracks signals of interest
for a user-defined number of clock cycles. However, due to
area and bandwidth constraints, the capacity of the trace buffer
is very limited, and therefore much research has been done on
selecting the most powerful set of signals for debugging purposes
[78, 79, 80, 81, 82, 83, 84]. Note that most of the research on
the trace buffer focuses on post-silicon debugging of logical bugs.
Utilizing ELAs in soft failure diagnosis introduces additional
difficulties, which will be discussed in later chapters.
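The scan chain mechanism described above can be illustrated with a small behavioral model. This is a minimal sketch and not a model of any particular synthesized design: the class and function names (ScanFlop, clock_chain, dump_chain, mismatches) are hypothetical, and the injected upset is purely illustrative.

```python
class ScanFlop:
    """A flip-flop with a 2:1 MUX at its input (port 0: functional, port 1: scan)."""

    def __init__(self):
        self.q = 0


def clock_chain(flops, functional_inputs, scan_in, scan_en):
    """Apply one clock edge to all flops and return the bit leaving the chain."""
    scan_out = flops[-1].q
    # Sample every input before updating any flop, as a real clock edge would.
    if scan_en:
        next_values = [scan_in] + [f.q for f in flops[:-1]]
    else:
        next_values = list(functional_inputs)
    for flop, value in zip(flops, next_values):
        flop.q = value
    return scan_out


def dump_chain(flops):
    """Shift the whole chain out with scan_en set; returns the bits in flop order."""
    bits = [clock_chain(flops, None, 0, scan_en=1) for _ in range(len(flops))]
    return bits[::-1]  # the last flop leaves the chain first


def mismatches(observed, expected):
    """Indices of the flops whose content differs from the expected value."""
    return [i for i, (o, e) in enumerate(zip(observed, expected)) if o != e]


flops = [ScanFlop() for _ in range(4)]
expected = [1, 0, 1, 1]
clock_chain(flops, expected, scan_in=0, scan_en=0)  # functional capture
flops[2].q ^= 1                                     # hypothetical upset in the third flop
observed = dump_chain(flops)
print(mismatches(observed, expected))
```

Repeating the capture and dump at the same breakpoint and checking whether the same mismatch reappears corresponds to the hard-versus-soft distinction made above.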
2.4 Verilog-AMS and Mixed-Signal Simulation
Verilog-AMS is a modeling language for mixed-signal systems. AMS here
stands for analog mixed-signal. The term “mixed-signal” indicates that part
of the system processes digital signals and part of the system processes
analog signals. Accordingly, Verilog-AMS merges and extends Verilog-HDL and
Verilog-A. Verilog-HDL is commonly used in today’s computer-aided design
process, while Verilog-A allows the description of the analog parts.
Digital signals are discrete-time signals, and there are only four possible
values for any signal: logic 1 (or true), logic 0 (or false), high impedance
Z, or unknown value X. Analog signals are different in the sense that their
values reside in a continuous space. Signal values can change with respect
to time continuously or change with discrete events. Signals of the former
type are known as continuous-time signals, and signals of the latter type
are known as analog discrete-event signals.
Due to the different nature of digital and analog signals, their simulators are
designed differently. Most digital simulators are event-driven: signal values
are updated in a sequence of events. Analog simulators, in contrast, update
signal values by putting all signal values into a set of equations and then
solving it until the values of the different signals converge.
Hence, mixed-signal simulators, or mixed-mode simulators, combine
the two methods of simulation and therefore have a discrete-event
kernel and a continuous-time kernel. A detailed description of
mixed-signal simulation can be found in [85].
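The interplay of the two kernels can be sketched with a toy loop. This is only an illustration under simplifying assumptions: the continuous-time part uses a fixed-step forward-Euler update of a single RC node rather than the iterative equation solving used by real analog simulators, and the component values, stimulus times and function name are all hypothetical.

```python
import heapq


def simulate(t_end=5e-6, dt=1e-9, r=1e3, c=1e-10, vdd=1.5, vth=0.75):
    """Toy mixed-signal loop: a digital driver feeds an RC node, and a
    threshold detector turns the analog node voltage back into a logic value."""
    events = [(0.0, 1), (2e-6, 0)]   # (time, new driver value): hypothetical stimulus
    heapq.heapify(events)
    driver, v, logic_out = 0, 0.0, 0
    transitions = []                 # (time, logic value) seen at the detector output
    for n in range(int(round(t_end / dt))):
        t = n * dt
        # Discrete-event kernel: apply every digital event that is due.
        while events and events[0][0] <= t:
            _, driver = heapq.heappop(events)
        # Continuous-time kernel: one forward-Euler step of dv/dt = (vin - v) / (RC).
        vin = vdd if driver else 0.0
        v += dt * (vin - v) / (r * c)
        # Analog-to-digital boundary: a threshold crossing becomes a digital transition.
        new_logic = 1 if v > vth else 0
        if new_logic != logic_out:
            logic_out = new_logic
            transitions.append((t, logic_out))
    return transitions


transitions = simulate()
print(transitions)
```

With these values the RC time constant is 100 ns, so each digital edge of the driver shows up at the detector output roughly 70 ns later.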
CHAPTER 3
SOFT ERRORS IN HARDWARE
INJECTION
In this part of the research, a variety of applications were run on commercial
single-board computers, and system-level ESD tests were performed while
the applications were running. The experiments aim to observe the different types
of failures that travel up the software stack. The tests were performed per
the IEC 61000-4-2 standard. Contact discharges were applied to ground shields,
power pins, and exposed signal pins. Indirect discharges were applied to the
horizontal and vertical coupling planes, as well as to the heatsinks of the
packages. The findings of this chapter are also published in [86].
3.1 DUTs
Figure 3.1: DUT2: Microprocessor
Figure 3.1 shows a single-board microprocessor. The processor has a 32-bit
ARMv6 architecture and runs a Unix-based operating system. The
processor chip has an embedded CPU/GPU packaged with HDMI I/O circuitry.
A separate chip is used to control the Ethernet and USB ports. Most
of the components have a supply voltage of 3.3 V, 1.8 V or 2.5 V, and the
power is supplied through the USB port, whose voltage is 5 V. ESD tests
were performed while the processor was doing one of the following: (1) running
games, (2) running a GPU-intensive task, (3) running a file compression program,
(4) streaming I/O to the disk, or (5) idling.
Figure 3.2: DUT1: Micro-controller
Figure 3.2 shows a micro-controller. The board can output serial data to
a laptop via a USB port. Another USB interface chip is used to handle the
serial data transmission. The main processor chip and most of its peripherals
run on 5 V, while the USB interface chips run on 3.3 V. The chip also offers
a few digital output pins which are able to output either 5 V or 0 V, and
a number of analog pins which are able to produce pulse-width modulation
(PWM).
The field-programmable gate array (FPGA) evaluation board is shown in
figure 3.3. It is connected to a computer to receive commands via its serial
port. The output of the FPGA is shown on an embedded display, and for
some programs, the outputs are displayed on a monitor through DVI.
ESD was applied while the board was running one of the following applications:
(1) a slideshow, (2) a ringtone player, (3) a Simon Says game, (4) self-diagnostic
programs testing SRAM/DRAM, the LEDs and the ports or the user input




Two types of display corruption were observed when injecting low-amplitude
discharges onto the heatsink of the microprocessor. The first type
is referred to as display flicker, where some of the pixels in one or a few
rows display an unexpected color. This type of failure lasts for a
noticeable but very limited amount of time. The second type is referred to as
display blanking, where the display goes dark for about one to two seconds
and then recovers. Figure 3.4 shows how the likelihood of the two types
of display failures depends on the type of program running. As shown
on the left side of the figure, the probability of getting a display flicker
after an ESD injection is significantly higher for one of the programs than
for any other.
Figure 3.4: Application dependence of display failure likelihood versus type
of program running; error bars show a 95% confidence interval
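For reference, a 95% confidence interval on a failure probability estimated from repeated injections can be computed as follows. This is a sketch only: the text does not state which interval method was used for figure 3.4, so the normal (Wald) approximation is an assumption here, and the counts in the example are hypothetical rather than measured values.

```python
import math


def failure_rate_ci(failures, injections, z=1.96):
    """Estimated failure probability with a normal-approximation confidence interval.

    z = 1.96 corresponds to a 95% interval; the Wald method is an assumption,
    not necessarily the method used for the figure.
    """
    p = failures / injections
    half_width = z * math.sqrt(p * (1.0 - p) / injections)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical example: 20 failures observed in 200 injections.
p, low, high = failure_rate_ci(20, 200)
print(f"p = {p:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

For small failure counts the normal approximation becomes inaccurate, and an interval such as the Wilson score interval would be a more robust choice.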
3.2.2 Application-Level Failures
Besides display failures, application-level failures also occurred.
Application-level failures here refer to situations where a program does not complete
its expected tasks correctly. The failures can be further classified into the
following types: (1) system hanging, where the user loses control over the
system, (2) program hanging, where the program does not proceed to complete
the task, (3) unexpected program termination, and (4) incorrect output.
The failures listed in table 3.1 were observed in our injections
on the heatsink of the microprocessor. Similar results were obtained
when injecting on the HCP and VCP.
3.2.3 Analysis and Diagnosis
Process termination and hanging are well-studied types of failures [68, 69].
Process termination can result from the protection mechanism
of an operating system. When a process tries to access a memory
location that does not belong to its memory space, the kernel may decide to
kill the program. Hanging of a program
or of a system can be due to infinite loops, deadlocks, or livelocks. In this
microprocessor, because the observation point is at a very high level, we are
not able to further locate the fault.
Table 3.1: 1.8 kV discharge onto single-board computer’s CPU heatsink
during file compression task
Observed Failures                     Percentage of Failures
Process termination                   8.6
Process hangs (system operational)    2.9
Display flicker                       0
Display blanking                      11
Meanwhile, the display flickers can be due to noise picked up by the HDMI
lines or by the RTL within the processor chip package. The former, however,
would not explain the application-dependent phenomenon shown in figure 3.4.
Hence, the fault has to originate from the RTL. We are able to explain
neither the exact location of the fault nor the reason that application
dependency is observed.
We provide the following possible explanations for the display blanking failure:
• Severe corruption of the display buffer. When the operating system
identifies this failure, a flush is needed to reset the display buffer.
This is the least probable explanation because flushing the display
buffer is not expected to take on the order of seconds.
• GPU is not responding. The operating system detects the anomalous
behavior of the GPU and resets it. The reset process may take a few
seconds.
• Control line corruption. This is caused by noise picked up by the HDMI
line, causing either a false reset to be issued to the monitor or the monitor
not to be detected by the operating system. Reconnection or resetting
requires re-initialization of the monitor’s controller chip, and henceforth




Figure 3.5: Display error type I; the displayed image loses its interlacing
Two kinds of display failures were observed during the ESD injection on
the heatsink of the FPGA chip. In the first type, the displayed image loses
its interlacing, as shown in figure 3.5. This failure does not correct itself, in
contrast to the display failures of the microprocessor; it is corrected only
when the reset signal is asserted. The second type of display failure is shown
in figure 1.1, where the left-hand side and right-hand side of the display are
exchanged; it also requires a reset to correct. Both types of failures were
observed during the slideshow, where an image is displayed on the monitor
through the DVI port.
3.3.2 Application-Level Failures
Table 3.2 shows the different types of failures observed when injecting 3 kV
while running a slideshow. Few types of failures were observed in
this experiment besides the display failures. There are two possible reasons
for this: first, unlike the microprocessor, the FPGA board
does not have an operating system. Thus, the programs are standalone and
will not be killed by any supervisor program or by the kernel, and
failures such as unexpected program termination will not be observed as such.
If there is an error that causes the program to be unable to proceed, the
user will observe a system hang instead. Second, constant program resets
may mask some failures. Because a program reset clears the
content of the internal registers and restarts the program from the beginning,
other soft errors that manifest simultaneously will not be observed.
Table 3.2: 3.0 kV discharge onto FPGA heatsink during slideshow task
Observed Failures                                                            Percentage of Failures
Program resets to its beginning                                              6.7
System hangs but can be restarted using the reset button                     3.3
The displayed image loses its interlacing, as shown in figure 3.5            1.7
The display has its left and right halves flipped, as shown in figure 1.1    3.3
Figure 3.6: Probability of an error vs. discharge voltage across all
applications
Program resets are caused by noise on the signal pins. To validate this
point, another experiment was conducted to examine the correlation between
the voltage of the ESD injection and the probability of getting a failure.
Figure 3.6 shows that the probability of getting a soft error, namely a program
reset, is almost linearly dependent on the gun voltage.
3.3.3 Analysis and Diagnosis
These display failures were never observed in other experiments. Both of the
display failures are believed to be caused by corruption of pointers. This
is because, while the display is interlaced/flipped, the display data remains
unaffected; it is the orientation of the data that is corrupted. Since
the failure does not recover when the slide changes, we believe the pointer
to the image file is not corrupted, and hence the corruption must have
occurred in some display control registers. One plausible situation is that
the start pointer of the display buffer is corrupted.
3.4 Micro-Controller
3.4.1 Application-Level Failures
Since this device is not connected to any display, no display failure is observed
on this device. However, the micro-controller chip outputs data to a
laptop’s serial monitor, and we did observe corruption of the output
data. In a test program, the micro-controller was performing
arithmetic tasks repeatedly, and in every new iteration, “new round” was
displayed to indicate the newly started iteration. After some ESD injections,
the serial message was corrupted and we obtained messages such as those in
figure 3.7.
We performed more than a thousand injections while the micro-controller
was outputting data from the serial interface, with and without a simultaneous
PWM, and with and without a watchdog timer enabled. The following are the
different types of faults we observed:
1. Serial output hangs, reset possible with reset button
2. Serial output hangs, power cycle needed for recovery
Figure 3.7: Two examples of sustained serial output corruption
3. Serial output hangs, PWM output operational, the system can be reset
using reset button
4. Serial output hangs, PWM operational, power cycle required for recov-
ery
5. Serial output hangs, and PWM duty cycle becomes constant
6. Serial output hangs, watchdog timer triggered to reset the system
7. Transient serial output corruption
8. Sustained serial output corruption
9. Program reset
10. SRAM data corruption
Failure types 1 and 3, as well as types 2 and 4, are counterparts: the same
failure observed with the PWM disabled and enabled, respectively. Table 3.3
was obtained while the micro-controller was performing reads and writes to
different memory locations with the PWM and watchdog both disabled.
Table 3.3: 3.25 kV discharge onto micro-controller’s heatsink during a
Type    Observed Failures                                             Percentage of Failures
1       Serial output hangs, system can be reset using reset button   2.0
2       Serial output hangs, power cycle needed to reset              7.0
8       Sustained serial output corruption                            5.0
9       Program reset                                                 1.0
10      SRAM data corruption                                          4.0
Notice that when the primary output of the micro-controller is the serial
port, any failure on the serial output will mask some other failures, such
as the SRAM data corruption. The next set of experiments is to find out
whether the core is still functional when the serial output fails. The core
outputs a PWM whose duty cycle varies between 10 and 90 percent on a
human-noticeable timescale. The results are shown in table 3.4.
Table 3.4: 3.25 kV discharge onto micro-controller’s heatsink during a
Type    Observed Failures                                                                     Percentage of Failures
3       Serial output hangs, PWM still operational, system can be reset using reset button    3.0
4       Serial output hangs, PWM still operational, power cycle needed to reset               8.0
5
6       Serial output hangs, and watchdog timer resets the system                             1.0
7       Transient serial output corruption                                                    1.0
8       Sustained serial output corruption                                                    3.0
3.4.2 Analysis and Diagnosis
For failure types 1 and 2, we localized the failure to the serial chip by
measuring the TX (transmit) port of the micro-controller chip. The
measured waveform shows that the micro-controller chip is still functional.
For failure types 3 and 4, the micro-controller chip is evidently functional
because it is still outputting a PWM with varying duty cycle. Therefore, in
these cases, output hanging is due to failures in the serial communication
chip.
For failure type 5, the micro-controller core is no longer working, because
it is no longer updating the duty cycle of the PWM. A PWM can still be
outputted when the core hangs, because it is generated by a peripheral of
the system, which may work independently from the core. This
failure is most likely caused by a failure in the clock module that leaves the
system clock unable to toggle. Explanations such as infinite loops and a
corrupted PC value would not explain the fact that the watchdog timer is not
triggered to reset the system. Thus, the only remaining explanation is that the
watchdog timer is no longer counting down, due to its failed clock.
In failure type 6, the watchdog timer is triggered and the system is reset.
The watchdog timeout threshold, in this case, is set to 8 seconds: if the
watchdog timer does not receive an update from the core for more than 8
seconds, it issues a command to reset the core. After the reset, the core
recovers and becomes functional again. In this case, the cause of the error can
be an infinite loop or an invalid program counter value. Notice that this failure
occurred in only about 1 percent of the cases, possibly because the serial chip
is much more fragile than the micro-controller chip.
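The reasoning that separates failure types 5 and 6 can be sketched with a small behavioral model of a watchdog. This is an illustration under stated assumptions, not a model of the actual micro-controller: the class, the time step, the hang time and the clock_fails flag are hypothetical; only the 8-second timeout comes from the text.

```python
import math


class Watchdog:
    """Counts time since the last update from the core and resets on timeout."""

    def __init__(self, timeout_s=8.0):
        self.timeout_s = timeout_s
        self.elapsed_s = 0.0
        self.resets = 0

    def kick(self):
        """Update from a healthy core: restart the count."""
        self.elapsed_s = 0.0

    def tick(self, dt_s, clock_running=True):
        """Advance time; returns True when a reset is issued."""
        if not clock_running:
            return False  # a stopped clock freezes the counter (type 5 hypothesis)
        self.elapsed_s += dt_s
        if self.elapsed_s > self.timeout_s:
            self.resets += 1
            self.elapsed_s = 0.0
            return True
        return False


def run(hang_at_s, clock_fails=False, duration_s=20.0, dt_s=0.5):
    """Simulate a core that hangs at hang_at_s; returns the number of watchdog resets."""
    wdt = Watchdog()
    hung = False
    t = 0.0
    while t < duration_s:
        if t >= hang_at_s:
            hung = True
        if not hung:
            wdt.kick()
        clock_running = not (hung and clock_fails)
        if wdt.tick(dt_s, clock_running):
            hung = False          # the reset recovers the core
            hang_at_s = math.inf  # assume no second fault in this run
        t += dt_s
    return wdt.resets


print(run(2.0))                    # hang with a running clock (type 6 scenario)
print(run(2.0, clock_fails=True))  # hang with a failed clock (type 5 scenario)
```

In this sketch a hang caused by an infinite loop is recovered by the watchdog, whereas a hang accompanied by a stopped clock is never detected, which matches the argument made for failure type 5.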
Transient corruption of the serial output data is due to noise picked up
by the clock signal of either the DUT or the laptop, which causes the serial
output to be off by one or more clock cycles. Thus, in some cases we see
some characters turned into random symbols, and in other cases we observe
that an entire message becomes a string of random symbols whose length
is approximately the same as that of the intended output
message. The sustained corruption of the serial output is most likely due to
pointer corruption in the micro-controller chip or in the serial chip. This is
because, although the messages are corrupted, all characters in the corrupted
message, including spaces and line breaks, were present in the original message.
This indicates that there is no data corruption; otherwise
there would be some new characters or symbols, such as those observed in
transient serial corruptions. However, we are not able to further locate the
faults or explain how such a pattern is formed.
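The character-set argument above can be expressed as a simple check. This is a heuristic sketch that mirrors the reasoning in the text, not a tool used in the experiments; the function name, the labels and the corrupted example strings are hypothetical.

```python
def classify_corruption(expected, observed):
    """Heuristic: if every observed character already occurs in the expected
    message, the data is likely intact and only its ordering (pointer) is wrong."""
    if observed == expected:
        return "none"
    if set(observed) <= set(expected):
        return "pointer"  # characters are reused, so reordering is suspected
    return "data"         # new symbols appeared, as in transient corruption


expected = "new round\n"
print(classify_corruption(expected, "nd new rou\n"))  # hypothetical reordered message
print(classify_corruption(expected, "n#w r?und\n"))   # hypothetical message with new symbols
```

The check cannot prove pointer corruption, since a data error may by chance produce a character that already occurs in the message; it only formalizes the evidence cited above.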
3.4.3 Further Diagnosis
One problem with further diagnosing the microprocessor chip and the FPGA
chip is their long detection latency. After a fault manifests, it propagates
through different hardware units before it eventually shows up as a failure.
Thus, by the time a failure is observed, it is already too late to locate the fault.
The micro-controller chip, meanwhile, does not run any operating system;
it simply executes the instructions written into its flash memory.
This allows us to further isolate and locate the faults, which would be impossible
if an operating system were running.
There are two main ideas behind the further diagnosis techniques:
1. The device is programmed using assembly language to have
full controllability of the instructions used. We will therefore have
control over the hardware units being used. For example, when defining
a variable, most compilers will allocate memory space to that variable,
and the interaction with the memory will introduce extra hardware
units, i.e. the memory, and extra detection latency. In contrast, when
programming the device using assembly, we can use general-purpose
registers (GPRs) to store a variable, and therefore minimize the detection
latency.
2. The primary output of the chip should be used for error detection.
Using the memory-mapped I/Os of the micro-controller chip
will minimize the error detection latency. It takes a few hundred instructions
to output anything through the serial output port, but output
through the memory-mapped I/Os takes only a few instructions. Furthermore,
outputting through the serial port involves additional hardware
units, which are also susceptible to ESD-induced faults.
A program is designed in order to count the number of bit-flips in a single
GPR or in one memory location of the SRAM. The pseudo-code in algo-
rithm 1 explains the error detection mechanism.
Algorithm 1 Algorithm for detecting bit-flips in a single register
setup()
R17 ← val
while 1 do
    delay()
    R16 ← val
    R16 ← R16 xor R17
    IOreg ← R16
end while
As shown in algorithm 1, setup() clears the value of all GPRs, configures
the ports as outputs and loads a pre-defined binary string, val, into
the GPR being monitored. A delay is added to ensure that the output of the
micro-controller can be observed by a human before being flushed by new
content. An XOR operation is performed using GPRs R17 and R16 as its
operands. R17 and R16 will have the same content if no ESD injection
is performed. The content of GPR R17 is programmed during the setup
phase, whereas the content of GPR R16 is loaded from the instruction
memory in each iteration. The output of the XOR operation should be
all zeros if its two operands are identical, and the output is then written into
the IO register. The datapath of this program is shown in figure 3.8.
Figure 3.8: Datapath of the test program
We performed five hundred injections while running each program. Table
3.5 shows the results of the experiments.
Table 3.5: 5.2 kV discharge onto micro-controller’s heatsink during a
program detecting bit-flips
Observed Failures    Percentage of Failures
Sustained error      1.0
Transient error      2.8
In the table, sustained error refers to observed failures that are
recovered only by power cycling, whereas transient error refers to observed
failures that recover without any user intervention. We observed that
pressing the reset button always corrects a failure. Thus, there is no corruption
of the embedded flash memory; otherwise, the same type of failure would
reappear whenever the corrupted instruction is executed. Note that the
content of GPR R17 is loaded at the beginning of the program and is not
updated after that. In contrast, the content of the instruction register (IR),
the value of GPR R16, and the IO register are updated in every iteration of the
program. Hence, if GPR R17 is corrupted, the corruption will be indicated by
the program in each iteration, causing a sustained error. Meanwhile, if any
of the IR, GPR R16 and the IO register is corrupted, the corruption will
be flushed in the next iteration, thereby resulting in a transient error.
In both sustained and transient errors, single and multiple bit-flips are
detected, and the flipped bits are not necessarily adjacent. For example, the
content of GPR R17 changed from 00000000 to 00100110 in one trial.
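The difference between sustained and transient errors can be reproduced with a small software model of algorithm 1. This is a sketch under simplifying assumptions: the model ignores the instruction register and the IO register, the flips dictionary and the function name are hypothetical, and the injected masks are illustrative (the first one reuses the 00100110 pattern mentioned above).

```python
VAL = 0b00000000  # pre-defined pattern loaded into the monitored register


def run_monitor(iterations, flips):
    """Model of algorithm 1.

    flips maps an iteration index to (register_name, xor_mask); the flip is
    applied right after R16 is reloaded, i.e. before the XOR of that iteration.
    Returns the value written to the IO register in each iteration.
    """
    r17 = VAL                  # setup(): loaded once and never refreshed
    outputs = []
    for i in range(iterations):
        r16 = VAL              # reloaded from instruction memory every iteration
        if i in flips:
            reg, mask = flips[i]
            if reg == "R17":
                r17 ^= mask
            elif reg == "R16":
                r16 ^= mask
        r16 ^= r17             # all zeros when both registers agree
        outputs.append(r16)    # value written to the IO register
    return outputs


sustained = run_monitor(5, {2: ("R17", 0b00100110)})
transient = run_monitor(5, {2: ("R16", 0b00000100)})
print([format(x, "08b") for x in sustained])
print([format(x, "08b") for x in transient])
```

A flip in R17 keeps showing up in every later iteration, while a flip in R16 is visible for one iteration only and is then overwritten by the reload.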
A further program is designed to identify the extent of corruptions in the
register file. In this program, the output of each GPR-under-test will be
shown one by one via the GPIO ports. Algorithm 2 shows the test program.
Algorithm 2 Algorithm for detecting bit-flips in multiple registers
setup()
R25 ← val
while 1 do
    for i = 16 to 24 do
        delay()
        Ri ← val
        Ri ← Ri xor R25
        IOreg ← Ri
    end for
end while
The results of the experiment are shown in table 3.6. The ESD-induced
failures are most typically localized, but in some cases the corruption is
widespread.
Table 3.6: 4.6 kV discharge onto micro-controller’s heatsink during a
program detecting bit-flips in one or more registers
3.5 Observations and Lessons Learned
A cross-DUT comparison is made of the common failures. The common
failures observed are display failures, system hanging, and unexpected resets.
Among these, the display failures in the microprocessor and some unexpected
resets are believed to be caused by noise picked up on signal lines. The
system hanging and register content corruption may be attributed to
ESD-induced supply noise, and the transient serial output corruption is due to a
clock glitch. Therefore, we list three major types of ESD-induced physical-level
events for further investigation: (1) input glitch, (2) clock glitch, and
(3) supply noise.
To further validate the cause of ESD-induced failures and investigate the
effects of the aforementioned ESD-induced physical-level events at the
gate level, a custom test chip is designed for tape-out. The test chip will have
approximately the same level of ESD protection as a commercial chip, but
with customized parts for specific goals of this research. Other motivations
for having a custom test chip are as follows:
1. Better observability is needed for diagnosis purposes. Our ability
to diagnose different failures is limited by many factors. First of all, the
observability of the commercial chips is very limited. Although registers
accessible to instructions can be outputted through the GPIO ports,
most of the registers in the control path are not available. Thus, in
our experiments, we can only perform an in-depth diagnosis of the data errors
described in section 3.4. Further information is needed to diagnose
failures such as system hanging and program termination.
2. The source code of the core is needed to better understand
and to simulate the chip. Our understanding of the
commercial chips is very limited. Although architecture information
is available in the official manuals of the different commercial chips,
the implementation details cannot be understood without the source
code of the chip. Furthermore, simulating the core in software
provides a more comprehensive understanding of the datapath and the
control flow of the core, and hence helps us understand the detailed
implementation.
3. Custom ESD detectors are needed. Not every ESD
event results in a failure. Indeed, based on our experimental results,
the probability of an ESD event causing a failure is very low. Given
that our test locations and voltages are sometimes not required
by the IEC 61000-4-2 standard, the likelihood of failures due to an
ESD event will be lower. Thus, another motivation for the custom test
chip is that we can implement custom ESD event detectors in the test
chip. The detectors will be sensitive to a transient high-voltage or
high-current event and will work as indicators of ESD events of different
amplitudes. The custom test chip enables us to find the correlation
between an observed failure and an ESD flag raised by the custom
detectors.
4. Fault injection can be performed in software. Note that designing
a custom test chip is not only about taping out; models also have to
be designed and verified in CAD tools. With the models of the custom-designed
circuitry, the custom detectors and the source code of the core,
we can perform simulation-based ESD injection and explore the effect of
an ESD stimulus at the physical level as well as at the gate level. Then,
the results obtained from simulation-based fault injection experiments
on the custom test chip will be compared with, and validated against,
hardware-based fault injection experiments.
CHAPTER 4
TARGET SYSTEM – OPENMSP430
4.1 Overview
The target system is openMSP430, an open-source micro-controller core that
implements the same instruction set architecture as the MSP430 micro-controller
from Texas Instruments. The block diagram of openMSP430 is shown in
figure 4.1.
Figure 4.1: Block diagram of openMSP430; figure obtained from
openMSP430’s user manual
• The frontend module contains the execution state machine and is in
charge of instruction fetch and decode.
• The execution unit executes decoded instructions based on the execution
state machine. The arithmetic logic unit and the register file
are inside the execution unit.
• The memory backbone controls the interactions that occur between
the frontend, execution unit, serial debug interface, data memory, and
program memory.
• The serial debug interface communicates with the host
via a standard two-signal interface following the UART serial protocol.
The program memory can be initialized through the serial debug
interface.
• The basic clock module processes the input clock and generates different
clocks for different purposes under different power modes.
• The special function registers (SFRs) store configuration registers
such as the non-maskable interrupt (NMI), watchdog, etc.
Figure 4.2: The memory organization of openMSP430; figure obtained from
openMSP430’s user manual
• The program memory has a configurable size between 1 kB and 59
kB. The configuration used in this project is 4 kB. The data memory
has a configurable size between 128 B and 32 kB. The configuration used
in this project is 1 kB. The peripheral memory has a configurable
size between 512 B and 32 kB. The configuration used in this project is
512 B. The memory space organization is shown in figure 4.2.
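The address ranges implied by the chosen sizes can be worked out with a short calculation. This is a sketch that assumes a particular layout, namely peripherals at the bottom of a 64 kB address space, data memory directly above them, and program memory at the top; this layout is an assumption that should be checked against figure 4.2 and the openMSP430 user manual. The function and key names are hypothetical.

```python
def memory_map(per_size=512, dmem_size=1024, pmem_size=4096, addr_space=0x10000):
    """Compute (first, last) byte addresses for each region.

    Assumed layout (not confirmed by the text): peripherals start at address 0,
    data memory follows, and program memory ends at the top of the address space.
    """
    peripheral = (0, per_size - 1)
    data = (per_size, per_size + dmem_size - 1)
    program = (addr_space - pmem_size, addr_space - 1)
    if data[1] >= program[0]:
        raise ValueError("data and program memory overlap")
    return {"peripheral": peripheral, "data": data, "program": program}


regions = memory_map()
for name, (first, last) in regions.items():
    print(f"{name:10s} 0x{first:04X} - 0x{last:04X}")
```

The defaults correspond to the configuration stated above: 512 B of peripheral space, 1 kB of data memory and 4 kB of program memory.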
4.2 Tape-out Preparations and Module Descriptions
Besides having the source code of the processor core, a few further steps
are taken as preparations for the tape-out. Some of the steps that are more
relevant to the scope of this thesis are listed below:
• The core is modified to include high-level error detectors
and a standalone I2C slave module. Some minor modifications are also
made to suit the objectives of this project.
• The core is synthesized using Synopsys Design Compiler version
G-2012.06-SP5-5. The library used is the 130 nm ibm13rfrvt package.
The scan chain is inserted using the DFT Compiler integrated with the
Design Compiler shell.
• Place-and-route (PNR) is done using Design Compiler and Encounter.
• The analog parts, such as the IO ring, are designed and routed to the
core.
The block diagram of the SoC is shown in figure 4.3. In figure 4.3, the
processor core communicates with other systems via the IO ring. The
red components are analog modules; the white modules are wrappers
that serve no further functional purpose; the light blue modules
are digital modules; and the green modules are connect modules. Note
that the digital modules are already synthesized, and their corresponding
transistor-level schematics are also available. However, for verification and
simulation purposes, in this thesis, the digital modules are treated as gate-level
models or functional models. Meanwhile, the connect modules do not
physically exist in the actual SoC; they exist only in simulations. A
detailed explanation of each module is given in the next few sections.
Figure 4.3: Block diagram of the openMSP430 SoC
4.2.1 IO Ring
As shown in figure 4.4, the IO ring is a ring structure of electrical pads that
serve as a connection between the pins of the IC and the die. The pins
are electrically connected by ultrasonic welding of copper wires (bond wires)
onto the top-level metal used. ESD protective elements are placed under the
bond pads and prevent the discharge from entering the core. These include
dual diodes (which serve as primary protection) which route the ESD current
away from the core and into the power rails. Secondary protection elements
include snapback devices such as the GGNMOS and smaller diodes. They
remain inactive during normal operation. They also serve to protect the
output buffers from exceeding their breakdown voltage. They must be placed
on every pin of the IC and pass qualification levels of the human body model
(HBM) or charged-device model (CDM) and sometimes surge protection as
well. All the cells have a 3.3 V operating voltage. The ports outside the ring
are the pins of the test chip package, and the input range of those pins is
0 to 3.3 V. The ports inside the ring connect to the processor core and other
uncore components, and the operating voltage of the processor core is 1.5 V.
We adopted different cells from the earlier research of our group [18].
Figure 4.4: IO ring schematic
4.2.2 ESD Event Detectors
The ESD event detectors are designed to capture a quantitative measurement
of the voltage deviation, above and below the nominal supply voltage, for
both the core power domains and the high-voltage domains. Sensitive logic
elements will again be included within the core so that soft failures can be
directly linked to supply noise levels.
A prominent mechanism suspected of causing soft failures is supply noise.
However, it is virtually impossible to monitor the on-chip supply voltages
with external equipment, without introducing invasive measurement tech-
niques. To meet this need, internal monitors were developed. These moni-
tors are intended to sample the maximum and minimum potential differences
between a supply net and its reference during ESD, and provide a readout of
those values following the ESD event. This is a difficult task because, during
ESD, it cannot be assumed that the power supply is adequate to power active
devices within the core. Further, since various supplies are used, monitors
have to be tailored to each supply domain.
The detectors fall into two categories: under-voltage and over-voltage.
Under-voltage detectors are intended to sample and hold voltages that are
lower than the nominal supply voltage. Over-voltage monitors are intended
to sample and hold voltages that are greater than the supply voltage. In
both cases a capacitor is used to store an analog voltage during an ESD
event and upon power-supply recovery, a flash analog-to-digital converter
(ADC) is used to store that value digitally until readout can occur at a later
time. We customized our detector based on earlier research [18].
4.2.3 Core
The core is synthesized and placed-and-routed using the Design Compiler.
Table 4.1 summarizes the counts of combinational and sequential logic
units. All simulations use the netlist produced after place-and-route,
together with the delay information specified in a Standard Delay Format
file (.sdf file). The .sdf file specifies both inter-module and intra-module
delays, which makes the simulations more realistic.
Table 4.1: Synthesis results
Logic unit type              Unit count   Total area (µm²)
Combinational logic units    1579         60384
Sequential logic units       9401         84782
Total                        10980        145166
4.2.4 High-Level Fault Detectors and Indicators
There are four types of higher-level detectors and indicators added to the
processor’s source code:
1. Hanging detector: The hanging detector is implemented as a counter
whose most significant bit is used as an indicator. At each rising clock
edge, the counter increments if the CPU state machine is in the fetch
state or if the clock is toggling. The counter increments at two
different rates, so the most significant bit of the counter changes in
two distinct patterns. Note that the counter can only be incremented
while the clock is toggling. The size of the counter is chosen so that,
while the test chip is executing a task, the changing pattern in either
case is slow enough to be observed by a human. The hanging detector
also serves as a timer that determines when the other programmable
detectors’ values are loaded.
2. End-of-program detector: The end-of-program detector relies on the
fact that we have full control over the test program. For every test
program, we can therefore append instructions at the end of the program
that load particular values into selected registers. When those
registers hold the expected values, the end-of-program register is set,
and it is cleared only when a system reset is issued. In our case, we
load GPRs r13, r14, and r15 with the ASCII values of the string
“Done! ”. The end-of-program detector thus indicates that a program has
run to its end.
3. Slow-down detector: When a test program is designed, simulation can
provide the precise number of clock cycles needed to finish the
program. This number can be programmed into a down-counter in the core.
When the down-counter reaches 0, the program is expected to have
finished. If the end-of-program indicator register is not yet set, we
know the program is taking longer than expected. The down-counter is
32 bits wide. A 32-bit down-counter takes about 4.3 × 10^9 clock cycles
to count from its maximum value down to zero. Given a 100 MHz clock,
this takes at least 40 seconds, which is enough for test purposes. In
practice, due to possible interruptions, the program may take longer
than the simulated cycle count. We therefore allow a 20% tolerance by
programming the expected number of clock cycles multiplied by 1.2 into
the down-counter. The process of programming the down-counter is
explained under the next detector.
4. Invalid memory access detector: Another benefit of having full
control over the test program is that we know the range of memory that
will be accessed. We can therefore program certain registers to hold
the upper and lower bounds of the program memory and the data memory.
If any memory access falls outside the programmed range, an error is
indicated. The hanging detector counter is used to program the
registers with the memory bound information: when the value of the
hanging detector counter is within a certain range, the bounds are
loaded from designated registers. Table 4.2 lists the source register
for each programmable value:
Table 4.2: Programmable registers data sources
Registers to be programmed      Data loaded from
Program memory upper bound      r4
Program memory lower bound      r5
Data memory upper bound         r6
Data memory lower bound         r7
Down-counter lower bits         r8
Down-counter upper bits         r9
Since the hanging detector counter wraps to zero when it overflows, its
value will satisfy the range requirement again as the program keeps
running. Therefore, a lock is placed on the programmable registers
after their first load, and it is not released until a system reset.
This prevents the programmed information from being overwritten.
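The arithmetic behind the slow-down detector can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, and the split into two 16-bit halves is an assumption based on table 4.2 (lower bits from r8, upper bits from r9) and on the 16-bit register width of the openMSP430.

```python
CLOCK_HZ = 100_000_000      # 100 MHz clock, as stated in the text
COUNTER_BITS = 32           # width of the down-counter
TOLERANCE_PERCENT = 20      # 20% margin for possible interruptions


def down_counter_value(expected_cycles, tolerance_percent=TOLERANCE_PERCENT):
    """Return the cycle budget to program: expected cycles plus the margin."""
    # Integer ceiling of expected_cycles * (1 + tolerance); avoids float rounding.
    value = (expected_cycles * (100 + tolerance_percent) + 99) // 100
    if value >= 2 ** COUNTER_BITS:
        raise ValueError("cycle budget does not fit in the down-counter")
    return value


def split_for_registers(value):
    """Split the budget into (lower, upper) 16-bit halves (assumed r8, r9)."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF


def max_timeout_seconds(clock_hz=CLOCK_HZ):
    """Longest interval a full-range count can cover at the given clock."""
    return 2 ** COUNTER_BITS / clock_hz
```

For example, a program simulated at 1,000,000 cycles would be given a budget of `down_counter_value(1_000_000)` cycles, and `split_for_registers` yields the two values that the test program would place in r8 and r9 before the load window.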
4.2.5 Program Memory and Data Memory
The program and data memory are compiled by ARM 130 nm SRAM com-
pilers. The size of the program memory is 4 kB. Each word is 16 bits, and
there will be 2048 words. The write mask has a mask width of 8 bits and
a mask size of 2 bits. Single port memory will be used in this design. The
data memory is 1 kB with 512 16-bit words. The memory-enable signal of
both memories is active low.
4.2.6 I2C Slave Protocol
The inter-integrated circuit (I2C) protocol is intended to allow
communication between multiple master and slave integrated circuits. It
requires only two signal wires, namely the clock signal and the data
signal, to exchange information.
Figure 4.5 shows how the I2C protocol works. The communication is
initiated by a master device with a start bit, followed by the device
address and the operation. The device with the called address then replies
with an acknowledgment. Next, the master device sends the address that it
intends to access, and the callee replies with another acknowledgment once
the address has been successfully received. A similar procedure repeats for
the data bits. Finally, the master device sends a stop signal to terminate
the transaction. Note that both of the blue sequences are 8 bits long. If an
error occurs during transmission, for example if the callee receives 9 data
bits instead of 8, a NACK is returned, informing the master device that a
failure has occurred and that the transmission has to be re-issued.
Figure 4.5: I2C data transmission waveform; figure obtained from the
openMSP430 manual
In the openMSP430, an I2C slave device is used. There are several
reasons not to add a master device to the design. First, the control
sequence of the I2C master has to be carefully designed, which means that
additional datapath logic and CPU instructions must be added to the core
and then verified. The compiler would also have to be changed; otherwise
the newly added instructions could not be compiled. Including an I2C master
in the design would therefore require a considerable amount of effort.
Meanwhile, we observed that if noise is picked up by the I2C bus, both the
master and the slave device will be aware of it. Since the aim of the
experiments is to find how ESD-induced faults manifest, an I2C slave device
is sufficient.
The I2C slave has eight 8-bit storage units. In the future hardware
injection, data will be written into the storage units while the processor
core is running other tasks. A failure is detected if the read data does
not match the written data, or if the received address is outside the
defined address space.
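The failure check described above can be illustrated with a small behavioral sketch. It is an assumption-laden model rather than the actual slave logic: the class name, the `verify` helper, and the failure strings are hypothetical, and the bus protocol itself is not modeled.

```python
class I2CSlaveStorage:
    """Behavioral stand-in for the slave's eight 8-bit storage units."""

    NUM_UNITS = 8

    def __init__(self):
        self.units = [0] * self.NUM_UNITS

    def in_range(self, address):
        return 0 <= address < self.NUM_UNITS

    def write(self, address, value):
        if not self.in_range(address):
            raise IndexError("address outside the defined address space")
        self.units[address] = value & 0xFF  # storage units are 8 bits wide

    def read(self, address):
        if not self.in_range(address):
            raise IndexError("address outside the defined address space")
        return self.units[address]


def verify(slave, address, expected):
    """Return None if the read-back matches, else a short failure reason."""
    if not slave.in_range(address):
        return "address out of range"
    if slave.read(address) != (expected & 0xFF):
        return "data mismatch"
    return None
```

A test harness would write a known byte, inject the disturbance, and then call `verify`; flipping one stored bit between the write and the check is enough to produce a `"data mismatch"` result.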
4.2.7 Scan Chain
We chose the scan chain as the observability enhancement infrastructure
on the test chip for the following reasons:
• It is important that we eliminate all electrical faults, such as
stuck-at faults, during manufacturing, because electrical bugs lead to
hard failures and make it harder to pinpoint soft failures. The scan
chain therefore serves not only as an observability enhancement
measure but also as design-for-testability support.
• The scan chain allows the content of almost all registers to be
programmed. ESD can then be injected while the system is offline, and
the new register contents can be checked against the programmed values.
• When trying to find a correlation between the ESD event detector’s
flag and an actual fault, it is necessary to check all register
contents. Only a scan chain can provide this level of observability.
• The main drawbacks of scan chains are their low speed and high area
cost. Within the scope of this project, speed is not a significant
factor because, unlike built-in self-tests (BISTs), where many test
patterns have to be applied, we only have to scan the test data out
once after an ESD event. The marginal area cost of the scan chain does
not significantly increase the cost of the test chip either.
Meanwhile, ELAs will not be used because of the limited size of the trace
buffers.
4.2.8 Connect Modules
The connect modules connect the digital signals and the analog signals.
Both analog-to-digital (A-D) connects and digital-to-analog (D-A) connects
have two ports: one defined as analog electrical and one as digital IO.
The A-D connect is relatively simple. If the input voltage is above the
threshold voltage, then, due to positive feedback, the output converges to
logic 1. Otherwise, the output converges to logic 0. The threshold voltage
is typically defined as the midpoint between the supply and the ground
voltage. Therefore, the threshold voltage used from the input stimulus to
the IO ring is 1.65 V, and the threshold voltage used from the IO ring to
the core is 0.75 V.
The D-A connect has more cases because, as mentioned, a digital signal has
four possible values: logic 1, logic 0, unknown (X), and high impedance
(Z). Each of them corresponds to an output voltage and an output
impedance. When the input is logic 1, the output voltage is either 3.3 V
or 1.5 V, depending on the supply voltage of the particular circuit. When
the input is logic 0, the output voltage is 0 V. We define the output
voltage as the threshold voltage when the input value is unknown or high
impedance. When the input is logic 1, logic 0, or unknown, the output
resistance is on the order of ohms to tens of ohms. When the input is high
impedance, the output resistance is 50 kΩ. Table 4.3 summarizes the input
and output characteristics.
Table 4.3: Input and output characteristics of a D-A connect module
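The behavior of the two connect modules can be sketched in a few lines. This is only an illustrative model, not the Verilog-AMS connect modules used in the simulations. It assumes the non-inverting polarity used in chapter 5 (a voltage above the threshold is read as logic 1), places the threshold at the midpoint between supply and ground (1.65 V for 3.3 V, 0.75 V for 1.5 V), and uses a placeholder value of 10 Ω for the "ohms to tens of ohms" driven case. All names are hypothetical.

```python
from dataclasses import dataclass

LOW_RESISTANCE_OHMS = 10.0      # assumed placeholder for "ohms to tens of ohms"
HIGH_Z_RESISTANCE_OHMS = 50e3   # 50 kOhm for a high-impedance input


@dataclass(frozen=True)
class AnalogDrive:
    voltage: float      # output voltage in volts
    resistance: float   # output resistance in ohms


def ad_connect(v_in, v_supply, v_ground=0.0):
    """Analog-to-digital: compare the input against the midpoint threshold."""
    threshold = (v_supply + v_ground) / 2.0
    return 1 if v_in > threshold else 0


def da_connect(logic, v_supply):
    """Digital-to-analog: map '1', '0', 'x', 'z' to a voltage and resistance."""
    threshold = v_supply / 2.0
    if logic == "1":
        return AnalogDrive(v_supply, LOW_RESISTANCE_OHMS)
    if logic == "0":
        return AnalogDrive(0.0, LOW_RESISTANCE_OHMS)
    if logic == "x":
        return AnalogDrive(threshold, LOW_RESISTANCE_OHMS)
    if logic == "z":
        return AnalogDrive(threshold, HIGH_Z_RESISTANCE_OHMS)
    raise ValueError(f"unsupported logic value: {logic!r}")
```

With these definitions, `da_connect("z", 3.3)` yields 1.65 V behind 50 kΩ, and `ad_connect(0.7, 1.5)` reads as logic 0 because 0.7 V lies below the 0.75 V core threshold.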






The input stimulus block is not shown in the block diagram of the
openMSP430 SoC. The purpose of this block is to supply the primary inputs
to the system. The clock generator block is instantiated in the input
stimulus. Other primary control inputs, such as the cpu enable and scan
enable, are also provided here. Besides control inputs, there are also data
inputs, such as the scan input for the scan chain and the inputs for the
GPIOs. All input stimuli are written in Verilog, so they represent clean
inputs without any noise. All digital inputs are then converted to
continuous-discipline signals using digital-to-analog connect modules, as
shown in figure 4.6. As in figure 4.3, the white, blue, and green modules
denote wrapper modules, digital modules, and connect modules, respectively.
Figure 4.6: Block diagram of the input stimulus module
4.2.10 ESD Gun
The ESD gun model is shown in figure 4.7. The simulated discharge
waveforms for different loads were compared against measurement results,
and the comparison validates that the SPICE-level ESD gun model resembles
the physical ESD gun described in the IEC 61000-4-2 standard.




There are two key things that we want to explore on the openMSP430 system.
First, we want to understand how bit-flips at the register level affect the
system at the computational and application levels; second, we want to
understand how ESD events occurring at the physical and transistor level
manifest as bit-flips at the register level.
The first objective can be achieved with RTL-level or gate-level fault
injection. This is conventionally done by injecting bit-flips into the
registers during simulation and observing whether any failures result. To
make the gate-level injection more accurate with respect to ESD-induced
errors, we need a model that captures the realistic circuit behavior in
response to an ESD event. In other words, we need to understand how ESD
events occurring at the physical level manifest themselves as bit-flips. To
explore that, we need another fault injector that models the current and
voltage characteristics of an ESD event. A simulator at the SPICE or
transistor level is therefore preferred. However, because of the scale of
the simulation, it is not affordable to simulate the core at the transistor
level, and indeed, it does not have to be.
We implemented three different fault injectors to serve different injection
purposes:
1. Automated gate-level fault injector
2. Mixed-signal fault injector on clock and input pins
3. Mixed-signal fault injector on the power supply net
5.1 Automated Gate-Level Fault Injector
The automated gate-level fault injector is developed to inject different types
of bit-flips into the injection target. Note that this fault injector is simulated
without any analog components. The modules used in the simulation are
the gate-level core, the program memory, and the data memory, as well as
the Verilog testbench module, which will supply the input stimulus to the
core. In the Verilog testbench module, two DUTs will be instantiated and
simulated simultaneously. One DUT will serve as the control group, where
there will be no fault injected, and the other DUT will serve as the experi-
mental group, where a fault is injected during simulation. The workloads of
the simulations are the different testbenches in the openMSP430 project. A
testbench is said to pass if the stimulus completes and simulation produces
a result that matches the expected result of the simulation. Otherwise, a
testbench is said to fail. The pseudo-code of the automated injector is de-
scribed in algorithm 3.




while i < maximum_injection_count do
    total_inj ← total_inj + 1
    (start, duration, target) ← generateInjectionParameters()
    generateInjectionFile(start, duration, target)
    simulate()
    if simulation passes then
        total_pass ← total_pass + 1
    else
        total_fail ← total_fail + 1
        injectionLog ← parse(simulationLog)
    end if
    i ← i + 1
end while
stats ← post_process(injectionLog)
All the registers’ values in the two DUTs are compared at runtime.
A parser script is used to identify all registers in the synthesized
gate-level netlist. Each register in the netlist is assigned an index. At
each positive clock edge, every pair of registers is compared. If any pair
has mismatched values, a fault is detected, and the index of the register
is recorded. With this method, the fault is first identified at the clock
cycle of the injection. As the error propagates through the system, the
number of corrupted registers increases. In the end, some faults cause the
testbench to fail, or in other words, lead to a failure. In other cases, no
failure results, and the testbench passes. In the latter case, the fault is
said to be masked.
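The pairwise comparison of the two DUTs can be sketched as below. This is a simplified illustration in Python rather than the Verilog testbench itself. It assumes each DUT's register state is available as a list indexed by the register index assigned by the parser script, and the function names are hypothetical.

```python
def mismatched_registers(control, experimental):
    """Return the indices of registers whose values differ between the DUTs."""
    return [i for i, (c, e) in enumerate(zip(control, experimental)) if c != e]


def trace_propagation(control_trace, experimental_trace):
    """Compare the two DUTs at every clock edge.

    Each trace is a list of register snapshots, one per positive clock edge.
    The result lists, per edge, the indices of the corrupted registers.
    """
    return [
        mismatched_registers(control, experimental)
        for control, experimental in zip(control_trace, experimental_trace)
    ]


def classify_fault(testbench_passed):
    """A fault that does not make the testbench fail is said to be masked."""
    return "masked" if testbench_passed else "failure"
```

The length of each per-edge list shows how the number of corrupted registers grows as the error propagates from the injected register.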
The generateInjectionParameters() function returns the start time of
the injection, the duration of the injection, and the target signal of the
injection. We assume the fault can occur at a random time. Therefore, we
generate the start time of the injection randomly, following a uniform
distribution. The duration can be generated in different ways. A normally
distributed duration resembles the distribution of the duration of actual
ESD events. However, the mean and variance of the duration of an actual ESD
event depend largely on the material, humidity, and temperature, and are
not discussed within the scope of this project. Another way of generating
the duration is to generate a sequence of uniformly spaced values. A set of
injection experiments is then performed to find the threshold of injection
duration below which no failures are observed. The injection targets are
flip-flops in the system, and they can be selected randomly or as a
sequence.
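A possible shape for the parameter generator is sketched below. It is not the thesis code: the function signatures are hypothetical, and the mean and standard deviation of the duration are caller-supplied assumptions, since the text deliberately leaves those values out of scope. Negative draws from the normal distribution are clamped to zero, which is also an assumption.

```python
import random


def generate_injection_parameters(rng, sim_length_ns, targets,
                                  duration_mean_ns, duration_sigma_ns):
    """Draw one (start, duration, target) triple for a single injection."""
    start = rng.uniform(0.0, sim_length_ns)            # uniform start time
    duration = max(0.0, rng.gauss(duration_mean_ns,    # normally distributed,
                                  duration_sigma_ns))  # clamped at zero
    target = rng.choice(targets)                       # random flip-flop
    return start, duration, target


def duration_sweep(min_ns, max_ns, steps):
    """Uniformly spaced durations for locating the failure threshold."""
    if steps < 2:
        return [min_ns]
    step = (max_ns - min_ns) / (steps - 1)
    return [min_ns + k * step for k in range(steps)]
```

Passing a seeded `random.Random` instance as `rng` keeps an injection campaign reproducible, which helps when a particular failing injection has to be re-run.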
With the injection parameters generated, the algorithm generates a
Verilog file with the injection information, and the Verilog file is
included in the testbench files. Algorithm 4 shows the pseudo-code of the
Verilog file. The target signal is flipped for the length of the duration
using the force and release statements of Verilog.




if (target == 1'bx) then
    error("Fault not injected!")
    force(target ← target)
else if (target == 1'b1) then
    force(target ← 1'b0)
else if (target == 1'b0) then





Note that the high-impedance case is not considered because it is not
a possible value of a register. Meanwhile, a fault is not injected if a
register’s value is not yet initialized. This is because the register’s
value will later be initialized, which overwrites the injected fault. Thus,
most such cases are trivial and are not considered.
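One way to generate such an injection file is sketched below. This is an illustrative guess at the file's shape, not the generated file used in the thesis: the function name, the `initial` block layout, the use of `#` delays in the testbench time unit, and the example target path are all assumptions. Unlike the pseudo-code, the sketch simply skips the force when the target is uninitialized.

```python
def generate_injection_file(start, duration, target):
    """Return Verilog text that flips `target` for `duration` time units.

    `target` is a hierarchical path to a scalar register (hypothetical here).
    The case-equality operator (===) is used so that an X value is detected.
    """
    return f"""// Auto-generated fault injection (illustrative sketch)
initial begin
    #{start};
    if ({target} === 1'bx) begin
        $display("Fault not injected!");
    end else begin
        if ({target} === 1'b1) force {target} = 1'b0;
        else                   force {target} = 1'b1;
        #{duration};
        release {target};
    end
end
"""


def write_injection_file(path, start, duration, target):
    """Write the injection file so the testbench can include it."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(generate_injection_file(start, duration, target))
```

The testbench would then pull the file in with an `include` directive, so that regenerating the file changes the injection without editing the testbench.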
After the Verilog injection file is generated, a simulation is performed.
The simulate() function in algorithm 3 consists of four steps:
1. A new terminal is created. This prevents the main function from
being killed by simulation-induced system errors.
2. A delay of a few seconds is introduced to prevent read-after-write
issues. The newly updated Verilog injection file is used in
compilation, and the delay guarantees that the updates to the
injection file are made before the compilation starts.
3. The simulation script is executed. We created scripts for three
different simulation engines: ncsim 15.20-s008 from Cadence,
VCS mx-H-2013.06-SP1 from Synopsys, and ModelSim 10.1d from Mentor
Graphics.
4. Another delay is introduced to avoid read-after-write issues with
the simulationLog file.
If the testbench passes in a simulation trial, the pass is recorded in
the total_pass counter. Otherwise, the simulationLog is parsed to identify
more information about the failure. The possible failures of a simulation
are a time-out, a stimulus that did not complete, and a stimulus that
completed with some wrong outputs. This information is appended to an
injectionLog file. After all the injections are done, the injectionLog file
is parsed to obtain statistics for the injection campaign.
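The overall loop of algorithm 3 can be paraphrased as the following sketch. It is a simplified stand-in: the simulator is abstracted as a caller-supplied function, the outcome strings are hypothetical labels for the failure types listed above, and the terminal handling and delays are omitted.

```python
from collections import Counter


def run_campaign(max_injections, generate_parameters, simulate):
    """Run repeated injections and collect pass/fail totals.

    `generate_parameters()` returns the injection parameters, and
    `simulate(params)` returns "pass" or a failure label such as
    "timeout", "stimulus_incomplete", or "wrong_output" (assumed names).
    """
    totals = Counter()
    injection_log = []
    for _ in range(max_injections):
        params = generate_parameters()
        outcome = simulate(params)
        totals["injected"] += 1
        if outcome == "pass":
            totals["pass"] += 1
        else:
            totals["fail"] += 1
            injection_log.append((params, outcome))  # keep failure details
    return totals, injection_log


def post_process(injection_log):
    """Count how often each failure type occurred."""
    return Counter(outcome for _, outcome in injection_log)
```

Keeping the parameters next to each failure label in the log makes it possible to relate failure types to injection targets and durations afterwards.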
5.2 Injecting ESD on Clock and Input Pins
To study ESD-induced clock glitches and input glitches, faults need to be
injected at a level that is as close to the physical level as possible [37].
However, simulating the entire processor core at the SPICE or transistor
level takes an enormous amount of time and is unnecessary. Therefore, all
digital circuits are simulated at the gate level or behavioral level, and
the rest of the modules are simulated at the SPICE or transistor level.
Table 5.1 summarizes the different modules in the injection:
Table 5.1: Modules and level of simulation




Program memory Verilog (Behavioral level)
Data memory Verilog (Behavioral level)
Processor core Verilog (Gate-level)
Our point of injection is on the clock IO pin of the system, between
the clock generator chip and the custom chip’s package. A contact discharge
using the charged device model is performed. There are four signals to
observe as the ESD-induced high voltage propagates. The propagation path is
illustrated in figure 5.1. The clock gen signal is generated by an off-chip
clock generator, which is mimicked by a module inside our input stimulus
block. The output of the clock generator is then connected to the input pin
of the openMSP430 system on its IO ring. The clock signal then propagates
through the IO ring to arrive at the input of the openMSP430 core.
Meanwhile, the control signals from the core also reach the clock generator
module.
Figure 5.1: ESD injection propagation path
Note that when injecting on the clock signal, the only signal being
distorted is the clock. Thus, one way to improve the speed of the
simulation is to characterize the distorted waveform at the logic level and
apply the distorted waveform to the automated gate-level fault injector.
Injections were performed at 300 ns at different voltages, and two typical
waveforms are shown in figure 5.2. In all experimental trials, we observed
that the voltage at the injection point was pulled up within 1 ns. This is
because $V = L\,\frac{dI}{dt}$: when a large amount of current flows into a
node in a short amount of time, the voltage is pulled up. Then, the voltage
gradually drops to its normal operating voltage as current exits node B.
The duration for which the node voltage is distorted from its normal
operating voltage is positively related to the pre-charge voltage. The
voltage at node C roughly follows the voltage trend of node B, but at a
lower magnitude. Finally, the voltage at node C is converted by the A-D
connect module, which has a threshold of 0.75 V, so the value at node D is
a discrete logic value.
Notice that when the voltage at node B is pulled up, the value at node D is
characterized by a prolonged period of logic 1, as observed in both of the
injection trials. When current is exiting the node, however, the
$L\,\frac{dI}{dt}$ effect may temporarily pull the voltage below the
threshold, as seen in figure 5.2b. In this case, although the clock
generator is supplying a logic 1, the actual clock value at the core is
seen as a logic 0. At other times, although the voltage at node B is
dropping, the resulting voltage at node C still remains above the
threshold. This causes the clock to have an extra prolonged period of
logic 1. Notice that the clock generator was stuck at 1 for a few clock
cycles. This is because the clock glitch to the openMSP430 core caused the
clock generator control signal to malfunction for a period of time; at the
next detected clock edge, the functionality is restored. For trial 3,
however, the clock control signals are distorted while the clock generator
is outputting a logic 0. As a result, the clock generator is stuck at
logic 0, and after the ESD current flows away, the clock voltage at node B
is no longer pulled up by the clock generator. Therefore, no further
positive edge is detected, and the system does not recover functionally.
(a) Trial 1: at pre-charge voltage of 2000 V
(b) Trial 2: at pre-charge voltage of 3000 V
(c) Trial 3: at pre-charge voltage of 5000 V
Figure 5.2: Waveforms of nodes A, B, C, and D when injecting ESD at 300 ns
and at different pre-charge voltages
We performed ESD injection at different pre-charge voltages and exam-
ined the logic value at node D. There are four types of characterizations
of distortion at node D. Table 5.2 lists the different characterizations. The
four types of characterized clock glitches will serve as fault models in the
gate-level simulation.







A prolonged logic 1 will overwrite
the current clock value
Positive ≤2500 Trial 1
A prolonged logic 1 will overwrite
the current clock value, followed by
a prolonged logic 0 overwriting
clock value
Positive ≥3000 Trials 2, 3
A prolonged logic 1 will overwrite
the current clock value
Negative ≤2500 Not shown
A prolonged logic 0 will overwrite
the current clock value, followed by
a prolonged logic 1 overwriting
clock value
Negative ≥3000 Not shown
ESD injections on GPIO pins are done in a similar way. We again inject ESD
at different pre-charge voltages, observe the distorted waveform in the
mixed-signal simulation, and characterize the fault model at the gate
level. One difference between the injections on the clock and on the GPIO
pins is that the IO cells are different, which causes somewhat different
responses at the SPICE level. Overall, however, the characteristics are the
same for both the GPIO and the clock.
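The step from a mixed-signal waveform to a gate-level fault model can be illustrated with a short sketch. This is one possible characterization procedure, not necessarily the one used to build table 5.2. It assumes the node C voltage is available as uniformly spaced samples, thresholds it at 0.75 V as the A-D connect module does, and flags any run of identical logic values that lasts longer than the nominal half clock period. All function names are hypothetical.

```python
def digitize(samples, threshold=0.75):
    """Convert node C voltage samples into node D logic values."""
    return [1 if v > threshold else 0 for v in samples]


def run_lengths(bits):
    """Run-length encode a bit sequence as (value, length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [(value, length) for value, length in runs]


def prolonged_runs(bits, nominal_half_period):
    """Return runs longer than a half period, i.e. overwritten clock phases."""
    return [(v, n) for v, n in run_lengths(bits) if n > nominal_half_period]
```

A result such as `[(1, 5), (0, 3)]` would correspond to a prolonged logic 1 followed by a prolonged logic 0, which is the second type of characterization in table 5.2.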
5.3 Injecting ESD on Supply Net
When an ESD event occurs on the supply or the ground pins of a chip, the
resulting supply noise may cause the registers’ contents to flip. Figure
5.3 shows the supply voltages (Vdd and Vss), the input voltage (Vin), and
the threshold voltage (Vth) of a flip-flop during an ESD event. Initially,
a logic 1 is stored in the flip-flop. At about 15 ns, an ESD event occurs
and discharge current flows into the supply net, causing both Vdd and Vss
to rise. The threshold voltage, being the average of Vdd and Vss, rises as
a result. However, the input remains at a stable level and, as a result, is
less than the threshold voltage for a short amount of time. When Vin is
less than Vth, the content of the flip-flop will flip.
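The upset condition described above reduces to a single comparison, sketched below. The sample values in the usage note are invented for illustration only, and the function names are hypothetical.

```python
def threshold_voltage(vdd, vss):
    """Vth is the average of the local Vdd and Vss."""
    return (vdd + vss) / 2.0


def first_upset(vin, supply_samples):
    """Return the index of the first sample where Vin < Vth, or None.

    `supply_samples` is a sequence of (vdd, vss) pairs over time, and
    `vin` is the stable input level representing a stored logic 1.
    """
    for index, (vdd, vss) in enumerate(supply_samples):
        if vin < threshold_voltage(vdd, vss):
            return index
    return None
```

For instance, with a stable input of 1.5 V, a nominal supply of (1.5 V, 0 V) gives a threshold of 0.75 V and no upset, whereas a hypothetical elevated supply of (4.0 V, 2.5 V) raises the threshold to 3.25 V and the stored value flips.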
Figure 5.3: Different node voltages
To model this effect, several modifications have to be made. First of all,
the supply voltage has to be seen by each flip-flop, and the flip-flop has
to be sensitive to the supply voltage. The gate-level model in figure 5.4a
is a logic element whose only inputs are clock, reset, and data in. Thus, a
Verilog flip-flop model is not sensitive to supply noise and therefore is
not an accurate model of the flip-flops in the fault injection experiment.
One modification is to make the flip-flops supply-sensitive by using a
Verilog-AMS description of a flip-flop, as shown in figure 5.4b. By
changing all Verilog flip-flops to supply-sensitive flip-flops, we change
the gate-level simulation of the core to a mixed-signal one. However, the
simulation speed is not adversely affected, since the supply voltage is
expected to stay stable if ESD is not injected. As a result, few
calculations are added. On the other hand, the critical limitation of using
only the supply-sensitive model is that the intra-module delay is not
captured. The inter-module delay can still be modified if a script is
carefully written to process the standard delay file; but to capture the
intra-module delay, a supply-sensitive model has to be created and verified
separately for each type of flip-flop used. Even then, there is no
guarantee that the timing will be accurate.
(a) Verilog DFF model
(b) Verilog-AMS DFF model
(c) proposed model
Figure 5.4: Different register models
We propose the simulation model for flip-flops described in figure 5.4c.
The Verilog model of the design is still used to capture all the delay
information. At the same time, another path is made from the input of the
flip-flop to a supply-sensitive flip-flop model. At each clock edge, the
value in the Verilog flip-flop model is compared against that in the
supply-sensitive model. Any mismatch means that an error has occurred, and
it is reported. In this way, the gate-level manifestation of faults in
response to a SPICE-level ESD event can be observed. Meanwhile, the value
in the supply-sensitive model is forced into the Verilog model to allow the
simulation to continue with the corrupted values. Thus, the application-
and computation-level effects of supply-noise-induced soft failures can be
examined.
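The compare-and-force behavior of the proposed model can be mimicked in a few lines. This is a heavily simplified behavioral sketch, not the Verilog/Verilog-AMS implementation: delays are ignored, the supply-sensitive path is reduced to a comparison of the data voltage against the local threshold, and the class and method names are hypothetical.

```python
class ShadowedFlipFlop:
    """Pairs a timing-accurate logic value with a supply-sensitive one."""

    def __init__(self, initial=0):
        self.verilog_q = initial   # value held by the gate-level model
        self.supply_q = initial    # value held by the supply-sensitive model
        self.errors = []           # clock cycles at which a mismatch occurred

    def clock_edge(self, cycle, data, vin, vth):
        """Sample both models at a clock edge and reconcile them.

        `data` is the logic value seen by the gate-level model; `vin` and
        `vth` are the analog input and local threshold voltages seen by the
        supply-sensitive model (simplifying assumption).
        """
        self.verilog_q = data
        self.supply_q = 1 if vin > vth else 0
        if self.supply_q != self.verilog_q:
            self.errors.append(cycle)          # report the mismatch
            self.verilog_q = self.supply_q     # force the corrupted value
        return self.verilog_q
```

After a run, `errors` lists the cycles at which supply noise corrupted the register, while the forced value lets the rest of the gate-level simulation continue with the corrupted state.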
Another important step is to obtain the supply information. Due to the
different physical distances from each flip-flop cell to the power supply
and the R-C characteristics along the trace, the supply voltage seen by
each flip-flop may differ slightly. Therefore, it is important to find the
supply voltage that actually arrives at each flip-flop cell. After
place-and-route is done, an “extract RC information” option can be found in
the Cadence Virtuoso platform. That option allows the extraction of all or
some of the RC parameters from the place-and-route file. However, the
simulation speed is severely affected if the supply voltage of each cell
comes from a distinct node. This is because analog simulation solvers find
voltage values by solving sets of equations. When the number of variables
increases dramatically, the time needed to reach a converged solution
becomes much longer.
Our proposed model for the power distribution network is based on the
assumption that cells that are physically close to each other have a
negligibly small difference in their supply voltages. To be more precise,
we define cells as physically close to each other if they are located in
the same small unit of the power distribution network. The layout of the
power network is shown in figure 5.5a. The green strips represent the power
network for the Vdd plane in the openMSP430 package. A similar network
shown in blue represents the power network for the Vss plane. Some cells in
the IO ring connect to the Vdd and Vss planes. The supply net is modeled by
resistors. A resistor model for the Vdd net is shown in figure 5.5b. The
resistance per unit length can be found in the technology files. We assume
the Vdd or Vss of all flip-flops in one location unit to be the average of
the Vdd or Vss voltages at its four corners. Therefore, the threshold
voltage is the average of the voltages at its four Vdd corners and four Vss
corners. In fact, the threshold voltage is all that needs to be accessible
to the supply-sensitive flip-flops.
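The grouping of flip-flops into location units and the per-unit threshold can be sketched as follows. This is a minimal illustration under stated assumptions: location units are taken to be squares of a fixed pitch on the layout grid, flip-flop coordinates are given in the same length unit as the pitch, and all names are hypothetical.

```python
from collections import defaultdict


def location_unit(x, y, pitch):
    """Map a cell coordinate to the grid unit of the power network it lies in."""
    return (int(x // pitch), int(y // pitch))


def group_by_unit(flip_flops, pitch):
    """Group flip-flop names by location unit.

    `flip_flops` maps a flip-flop name to its (x, y) placement coordinate.
    """
    units = defaultdict(list)
    for name, (x, y) in flip_flops.items():
        units[location_unit(x, y, pitch)].append(name)
    return dict(units)


def unit_threshold(vdd_corners, vss_corners):
    """Average the four Vdd and four Vss corner voltages of one unit."""
    if len(vdd_corners) != 4 or len(vss_corners) != 4:
        raise ValueError("each unit needs four Vdd and four Vss corners")
    vdd = sum(vdd_corners) / 4.0
    vss = sum(vss_corners) / 4.0
    return (vdd + vss) / 2.0
```

Only one threshold node per occupied unit then has to be exposed to the flip-flop models, which keeps the number of analog nodes far below the number of flip-flops.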
The pseudo-code of the model implementation is shown in algorithm 5.
(a) Verilog DFF model
(b) Resistor model of the Vdd net
Figure 5.5: Modeling the power distribution network
Algorithm 5 High-level description of construction of models to simulate
supply noise induced soft failures
for all flip-flops do
    (name, clock, data_in) = parseFromNetlist()
    constructFlipFlopModel(name, clock, data_in)
    {construct the flip-flop model in figure 5.4c}
    location = parseFromLayout(name)
    location_unit = putIntoUnits(location)
    if location_unit not in location_unit_list then
        append location_unit to location_unit_list
        {find all different units at which a flip-flop is located}
    end if
end for
constructResistorNet(location_unit_list)
{Construct the power distribution network model, and the voltage output
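The grouping step of algorithm 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the instance names, the coordinate system and the rectangular unit size are hypothetical, and the netlist and layout parsers of the real flow are replaced by a plain dictionary.

```python
def location_unit(x, y, unit_width, unit_height):
    """Map a layout coordinate to the (column, row) index of its unit."""
    return (int(x // unit_width), int(y // unit_height))


def collect_location_units(flip_flops, unit_width, unit_height):
    """Return the distinct units that contain at least one flip-flop.

    flip_flops maps a (hypothetical) instance name to its (x, y) position.
    """
    units = []
    for name, (x, y) in flip_flops.items():
        unit = location_unit(x, y, unit_width, unit_height)
        if unit not in units:  # keep each unit only once
            units.append(unit)
    return units


# Hypothetical placement: two flip-flops share a unit, one sits in another.
placement = {"ff_a": (1.0, 1.0), "ff_b": (2.0, 3.0), "ff_c": (12.0, 1.0)}
print(collect_location_units(placement, 10.0, 10.0))
```

Only the units returned here need a node in the resistor network, which is what keeps the analog part of the simulation small.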





The openMSP430 package provides testbenches to verify the functionality of
the processor core and its peripherals. In the injection experiments, we chose
three typical testbenches for simulations:
1. The control intensive testbench with mostly branching instructions
2. The IO intensive testbench with mostly IO instructions such as writing
to and reading from the memory and the GPIO ports
3. The computation intensive testbench with mostly logic and arithmetic
instructions such as XOR
In each testbench, modified instructions will be compiled and loaded into the
program memory of the openMSP430. Each testbench will check the values
of different registers in the core, and if the values of some particular registers
match the expected values, a simulation will pass. Otherwise, a simulation
will fail. Both the gate-level injection and the mixed-signal injections will be
performed using the three testbenches.
6.2 Results for Gate-Level Injections
We performed fault injections over all registers at random times for 1 clock
cycle. The results are shown in figure 6.1. The system-level behavior falls
into the following categories:
1. Pass 0 corrupted indicates that the simulation passes all the value
checks in a particular testbench. At the end of the simulation, when the
register values in the experimental group are compared against those
in the control group, all the register-pairs have identical values. In this
case, although a fault is injected, it is masked and will never lead to a
system failure.
2. Pass 1 corrupted indicates that the simulation passes all the value
checks in a particular testbench. However, at the end of the simulation,
when the register values in the experimental group are compared
against those in the control group, exactly 1 pair of registers has
different values. In this case, it is most likely that the corrupted register
value is not used by any other registers, and therefore the corrupted value
does not propagate.
3. Pass 2+ corrupted indicates that the simulation passes all the value
checks in a particular testbench. However, at the end of the simula-
tion, when the register values in the experimental group are compared
against those in the control group, two or more pairs of registers
have different values. In this case, the fault starts to propagate, but
does not result in a failure.
4. Timeout due to mem error indicates the simulation does not finish
within a predefined amount of time due to memory-related issues. The
memory out-of-bounds detector reports an out-of-bounds access to
the memory. These failures are most likely due to runaway code, i.e. the
program counter is corrupted and instructions are loaded from some
undefined memory space.
5. Timeout due to clk indicates the simulation does not finish within
a predefined amount of time due to clock-related issues. The clock
generator signal of the experimental group stops toggling and the main
clock of the registers becomes unknown. Simulation of the experimental
group will not proceed in this case.
6. Timeout due to other reasons indicates the simulation does not
finish within a predefined amount of time due to other reasons (i.e.
reasons not related to memory or clock). In this case, only the slow-
down detector will indicate a timeout, and no other detectors report
anomalous system behavior.
(a) Results for control intensive testbench
(b) Results for IO intensive testbench
(c) Results for computation intensive testbench
Figure 6.1: Classification results for gate-level fault injection
7. Stimulus not completed indicates some expected tasks are not ex-
ecuted and the simulation fails as a result. This is most likely due to
corruptions in the control registers.
8. Wrong results indicates all expected tasks are executed, but the results
of the tasks do not meet the expected values. This could be due to
corruptions in the data registers.
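The eight categories above can be summarized as a small decision procedure. The sketch below is an assumption-laden illustration: the field names of `TrialOutcome` and the order in which the conditions are tested are hypothetical choices, not the scripts used in the experiments.

```python
from dataclasses import dataclass


@dataclass
class TrialOutcome:
    """Observations collected for one injection trial (hypothetical fields)."""
    checks_passed: bool
    corrupted_registers: int = 0        # register pairs differing from control
    timed_out: bool = False             # slow-down detector fired
    memory_out_of_bounds: bool = False  # out-of-bounds detector fired
    clock_stopped: bool = False         # clock generator no longer toggles
    stimulus_completed: bool = True     # all expected tasks were executed


def classify(trial):
    """Map one trial to one of the eight categories listed above."""
    if trial.timed_out:
        if trial.memory_out_of_bounds:
            return "Timeout due to mem error"
        if trial.clock_stopped:
            return "Timeout due to clk"
        return "Timeout due to other reasons"
    if not trial.stimulus_completed:
        return "Stimulus not completed"
    if not trial.checks_passed:
        return "Wrong results"
    if trial.corrupted_registers == 0:
        return "Pass 0 corrupted"
    if trial.corrupted_registers == 1:
        return "Pass 1 corrupted"
    return "Pass 2+ corrupted"


print(classify(TrialOutcome(checks_passed=True, corrupted_registers=5)))
```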
We make a few observations from the results, and provide explanations for
some of the observations:
1. The likelihood of a simulation passing is very high. Across the three
testbenches, the combined likelihoods for “Pass 0 corrupted”, “Pass
1 corrupted”, and “Pass 2+ corrupted” are 91.8%, 92.7%, and 94.1%,
respectively. Among the passed simulations, the likelihoods for masked
faults are 62.8%, 55.9%, and 69.6%. In other words, most of the injected
faults are masked by the system without causing any failures. This
can happen when the injected register is not used until the next update
of its value.
2. The IO intensive testbench has the lowest likelihood for “Pass
0 corrupted” but the highest likelihood for “Pass 1 corrupted”.
While the combined likelihood for passed simulations is about the same
for the three testbenches, the IO intensive testbench has only a 55.9%
likelihood of masking the fault (type 1), approximately 10% lower
than that in the other two testbenches. However, its likelihood of
the corrupted register value not propagating (type 2) is approximately
10% higher than that in the other two testbenches. This is because
in the IO intensive testbench, many general-purpose registers are not
used in any instructions, and when they are not used, the corrupted
value neither propagates nor recovers. In contrast, in the other two
testbenches, general-purpose registers are used to perform arithmetic
operations or to test different branching conditions. As a result, the
corrupted register values will either lead to a failure or be overwritten
with clean values. Note that the 12 general-purpose registers in this
design account for a total of 384 injections. Injections on the
general-purpose registers may have caused the likelihood differences in type 2
and type 3 passes in different testbenches.
3. The control intensive testbench has a likelihood for stimulus
not completed that is much larger than that in the other
testbenches. The likelihoods for stimulus not completed in the control
intensive, the IO intensive, and the arithmetic intensive testbenches are
1.52%, 0.189%, and 0.7%, respectively. This is because of the number
of branch instructions in the testbench. A testbench with more
branch instructions is more likely to run into runaway code, whereas
testbenches that perform IO tasks and arithmetic tasks are much less
likely to be affected.
4. The likelihood of clock-related failures is much higher in the
control intensive testbench. The likelihood of having a timeout due
to clock issues in the control intensive testbench is 1.52%, much larger
than the 0.28% and 0.47% in the other two testbenches.
To further investigate the experimental results, we plotted histograms of
the number of corrupted registers in passed and failed simulations for each
testbench. The results are shown in figure 6.2. On the left-hand side of each
row (i.e. in figure 6.2a, figure 6.2c and figure 6.2e), the histograms show
the number of corrupted registers in passed simulations in the three different
testbenches; on the right-hand side of each row, the histograms show the
number of corrupted registers in all failed simulations.
For passed simulations, trials with 0 or only 1 corrupted register are not
plotted, since their distributions are already clearly shown in figure 6.1.
Only trials with two or more corrupted registers are shown (the outliers
are not shown). In figure 6.2a, figure 6.2c and figure 6.2e, the number of
corrupted registers is mostly distributed between 2 and 15. The outlier for
the control intensive testbench is 169; the outlier for the IO intensive
testbench is 69; and the outliers for the computation intensive testbench are
56 and 69. In the trials with outliers, the injections occur at relatively
late stages of the simulations and the faults have not propagated to the
registers that are checked at the end of simulations.
In figure 6.2b, figure 6.2d and figure 6.2f, the numbers of corrupted registers
range from 1 to 180, and there are no outliers. Most of the failures with fewer
than 18 corrupted registers are due to the failed result checks, whereas failures
with a relatively larger number of corruptions are related to timeouts. Since
in each of the testbenches, the number of exercised registers is limited due
to the test program, we do not have cases where the number of corrupted
registers is greater than 180.
6.3 Results for Clock Injections
The injections on clocks are performed at gate-level simulations. In the
first set of experiments, we performed three hundred injections to each of
the testbenches using the characterizations in table 5.2. The results of the
injections are shown in figure 6.3.
The failure classification in figure 6.3 is different from that in figure 6.1
because of the different types of system responses observed. “Clock flip” is
a symptom in which the clock signal of the system is flipped to its
inverse value. In other words, the phase of the clock is delayed by (2n + 1)π.
Failed trials are due either to a clock failure, where the clock is no
longer toggling, or to timeouts caused by other reasons. We make the
following observations of the results:
1. Most of the clock injections will not lead to any failures. This is
because the behavior of the clock is characterized by a prolonged period
of logic 1 followed by a prolonged period of logic 0. Therefore, the next
positive clock edge will be detected after the clock glitch. From the
system’s perspective, one clock cycle is longer than usual, but it will
not lead to any failures in the system. However, this observation is
only true for some systems with a single clock domain. For multiple
clock domain systems, once a clock stops toggling for a few clock cycles,
synchronization problems may start to occur.
2. Clock flips do not necessarily result in failures in a single clock
domain system. We observe a noticeable number of passed trials in which
the clock is off-phase by (2n + 1)π. In single clock domain
systems, such flips do not result in failures. However, in multiple clock
domain systems, a clock flip is more likely to cause failures. This is because
the clock off-phase fault remains in the system, whereas a prolonged
clock cycle without a π phase delay disappears after the clock glitch.
Therefore, if there are no transactions between the two parts of the
(a) Histogram of number of corrupted
registers in a passed trial for control
intensive testbench; outliers not shown
(b) Histogram of number of corrupted
registers in a failed trial for the control
intensive testbench
(c) Histogram of number of corrupted
registers in a passed trial for the IO
intensive testbench; outliers not shown
(d) Histogram of number of corrupted
registers in a failed trial for the IO
intensive testbench
(e) Histogram of number of corrupted
registers in a passed trial for the
computation intensive testbench;
outliers not shown
(f) Histogram of number of corrupted
registers in a failed trial for the
computation intensive testbench
Figure 6.2: Histograms showing the number of corrupted registers
system with different clocks, a failure is not likely to occur after the
ESD-induced clock noise. However, if the two clocks have a phase
difference, synchronization issues can arise each time
the two parts exchange any data or control information.
3. Most of the failures observed are caused by a clock no longer
toggling. In other words, the injections on the clock are most likely to
result in sustained failures of the clock signal, which can only be
recovered by a reset. In fact, all failures observed in the control intensive
testbench and in the arithmetic intensive testbench are such failures.
4. Very few failures are caused by non-clock-related corruptions. The
failures caused by non-clock-related corruptions are only observed in
the IO intensive testbench. To try to reproduce similar failures in
the other two testbenches, we repeated the injections there
using the injection parameters of those failures. However, we
could not reproduce similar failures.
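The notion of a clock flip used in the observations above, a phase delay of (2n + 1)π relative to the fault-free clock, can be checked on sampled waveforms. The following is a minimal sketch that assumes a 50% duty cycle and clocks sampled as lists of 0 and 1; the helper names and the sampling scheme are hypothetical.

```python
def ideal_clock(n_samples, half_period, phase_shift=0):
    """Sampled 50% duty-cycle clock; phase_shift is a delay in samples."""
    return [((i - phase_shift) // half_period) % 2 for i in range(n_samples)]


def clock_relation(reference, observed):
    """Compare the clock of the experimental group against the control group."""
    if len(reference) != len(observed):
        raise ValueError("waveforms must have the same number of samples")
    if all(r == o for r, o in zip(reference, observed)):
        return "in phase"
    if all(r != o for r, o in zip(reference, observed)):
        return "flipped"  # delayed by an odd multiple of pi
    return "other"


reference = ideal_clock(8, half_period=2)
# A delay of one half period (pi) or three half periods (3*pi) inverts the clock.
print(clock_relation(reference, ideal_clock(8, 2, phase_shift=2)))
print(clock_relation(reference, ideal_clock(8, 2, phase_shift=6)))
# A delay of one full period (2*pi) is indistinguishable from the reference.
print(clock_relation(reference, ideal_clock(8, 2, phase_shift=4)))
```

A delay by a full period leaves the clock in phase, which matches the observation that a prolonged clock cycle without a π phase delay leaves no lasting fault in the system.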
We performed a deeper analysis of the relationship between the number of
corrupted registers and the number of clock cycles after the injection in all
failed simulations. A register is defined to be corrupted in a failed trial if
its content ever deviates from the expected value. Note that we deem a
register corrupted regardless of whether its value is later restored to the
expected value. Table 6.3, table 6.4, and table 6.5 show the data for all
failed simulations in the control intensive testbench, the IO intensive
testbench, and the arithmetic intensive testbench, respectively. Because of
the large size of the tables, they are shown at the end of this chapter.
Columns 1, 2, and 3 of the three tables provide the injection information so
that any injection trial can be reproduced for further study. Column 1
specifies the start time of the injection in nanoseconds. Column 2 shows the
length of the prolonged period of logic 1 caused by the injection current
going into the clock signal pin, and column 3 shows the length of the
prolonged period of logic 0 caused by the injection current exiting the clock
signal pin. Both time periods in columns 2 and 3 are in nanoseconds. If
columns 1-3 of a failed trial are colored in black, the failure is caused by
a frozen clock signal which no longer toggles due to corruptions in the
control signals for clock generation. Otherwise, if the
data is colored in blue (which only occurs in the IO intensive testbench), the
(a) Results for control intensive testbench
(b) Results for IO intensive testbench
(c) Results for computation intensive testbench
Figure 6.3: Classification results for gate-level clock injection
failure is not caused by a frozen clock. For each failed trial, columns 4
to 11 of the first row show the number of registers newly corrupted at the
cycle of injection, after 1 cycle, 2 cycles, 3-5 cycles, 6-10 cycles, 11-20
cycles, 21-50 cycles, and more than 50 cycles, respectively. The corresponding
columns in the second row, highlighted in gray, give the cumulative number of
register corruptions up to that period.
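The relationship between the two rows of each trial is a running sum, which the short sketch below reproduces for the first trial of table 6.3. The function and variable names are hypothetical; the numbers are taken from the table.

```python
from itertools import accumulate

# Column labels for columns 4 to 11 of tables 6.3 to 6.5.
PERIODS = ["cyc 0", "cyc 1", "cyc 2", "3-5", "6-10", "11-20", "21-50", "50+"]


def cumulative_corruptions(per_period):
    """Second row of a trial: running total of the first-row counts."""
    return list(accumulate(per_period))


# First-row counts of the first trial in table 6.3 (injection start 445.24 ns).
first_trial = [100, 0, 10, 6, 83, 851, 0, 0]

for label, total in zip(PERIODS, cumulative_corruptions(first_trial)):
    print(f"{label:>6}: {total}")
```

The running totals come out as 100, 100, 110, 116, 199, 1050, 1050 and 1050, matching the second row of that trial.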
Three kinds of trends can be observed from the results:
1. Initially small number of corruptions, followed by a large
number of corruptions. This is the most common scenario. In most of
the failure data, there is initially a small number of register
corruptions, usually several tens. Then
the number of corrupted registers increases slowly 2-5 cycles after the
injection and sharply 6-20
cycles after the injection. Very few registers are corrupted more than 20
clock cycles after the injection. The failure waveforms of some trials
were examined, and most of the corrupted registers have unknown values
(X). This indicates that the register values become nondeterministic in
simulations.
2. Initially large number of corruptions, followed by very few
further corruptions. This can be observed in each testbench. In those
failures, right after the injection, the number of corrupted registers goes
up to the order of a thousand, and there are very few register corruptions
after that. The failure waveforms of those trials were examined, and most
of the corrupted registers have unknown values.
3. Initially small number of corruptions, followed by slow
increases. This phenomenon is only associated with failures that are not
related to clock corruptions, and these failures are only observed in the
IO intensive testbench. In those trials, there are initially several tens
of register corruptions, and as the simulation continues, the corruption
propagates to different registers. The total number of corrupted
registers is between 100 and 150 at the end of the testbench. The failure
waveforms were examined, and most of the corrupted registers have a
logic value of either 1 or 0.
The similar likelihoods of having a passed trial and a failed trial suggest
that the failures caused by an ESD on the clock signal are not program
dependent. Rather, they are more likely to be clock-edge dependent: a failure
may result if the clock is corrupted at a certain time within a clock period.
However, when we studied the injection start times and
durations, we found no statistically significant relationship.
6.4 Results for Data Pin Injections
We performed approximately 300 injections on a GPIO data pin in each of the
three testbenches. All the GPIO data pins have similar tasks, and therefore
one GPIO pin is representative of the behavior of any GPIO pin. The
summary of the results is shown in table 6.1. We make the following
observations of the results:
Table 6.1: Results summary for injection on data pins






Control intensive     Pass  100       3-8   No failure
                      Fail  0         N.A.  N.A.
IO intensive          Pass  96.65551  3-8   No failure
                      Fail  3.34449   8-9   Wrong results
Arithmetic intensive  Pass  100       3-8   No failure
                      Fail  0         N.A.  N.A.
1. There are no failures in the control intensive testbench and
the arithmetic intensive testbench. Since there is no
communication between the GPIO and the rest of the core in these two
testbenches, the faulty data on the GPIO pin will not propagate to other
parts of the core. However, the noise is picked up by an IO register
and corrupts that IO register. The IO register then communicates
with the peripheral timer module, and the fault propagates within the
peripheral modules. This also explains the small number of corrupted
registers in the core in the passed simulation trials. Note that, in
the passed trials, all the corrupted register values recover to their
expected values. In other words, if noise on the input IO pin does not
cause a failure or silent data corruption (SDC, which refers to an
undetected data corruption), there is a high likelihood that the
faults get masked.
2. The likelihood that a simulation will pass is very high. A 96.655%
likelihood of passing is observed in the IO intensive testbench. In all
failed cases, the number of corrupted registers is small, because there are
very few data dependencies between different parts of the tasks in the
testbench and the faults stop propagating after some point. A further study
of the injection start times shows that all failed simulations have
injection times that fall between some very specific timestamps, right
before a read instruction is about to be issued from the core. In other
words, a simulation will only fail if a read instruction is issued within the
period of corruption. This is because the input of the GPIO pin is already
driven by some voltage, and after the ESD-induced noise, the input data of
the GPIO will no longer be corrupted. Therefore, the IO registers will
restore their expected values. A fault will only propagate and cause a
failure in the system if it is accessed by the system within the period
of corruption. We conclude that the likelihood of data noise causing
failures is very small. In real workloads, the likelihood that input
noise causes a failure will be lower for the following reasons:
• In this testbench, the frequency of read commands from the core is
orders of magnitude larger than in a typical real workload. Therefore,
the likelihood that a fault will cause a failure is smaller in a real
workload.
• This testbench checks all eight GPIOs as a whole, since they
correspond to the bits in an 8-bit IO register, and therefore any
single-bit corruption will result in a failure. However, in a typical
workload of a micro-controller, different GPIOs serve different
purposes and only one particular bit of the IO register is
used at a time. Therefore, the likelihood that the bit of interest is
corrupted will be eight times lower.
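The failure condition discussed above, namely that a read must be issued within the period of corruption, together with the eightfold reduction for a single bit of interest, can be written down as a small sketch. The function names, the half-open time window and the example timestamps are assumptions made for illustration only.

```python
def injection_causes_failure(injection_start, corruption_duration, read_times):
    """True if any read is issued while the GPIO input is still corrupted.

    All times are in the same (hypothetical) unit, e.g. nanoseconds.
    """
    end = injection_start + corruption_duration
    return any(injection_start <= t < end for t in read_times)


def single_bit_likelihood(likelihood_all_bits, register_width=8):
    """Scale a likelihood measured over all bits down to one bit of interest.

    Assumes each bit of the IO register is equally likely to be corrupted.
    """
    return likelihood_all_bits / register_width


# Hypothetical timestamps: the read at t = 110 falls inside [100, 120).
print(injection_causes_failure(100.0, 20.0, [50.0, 110.0, 300.0]))
# The 3.34449% failure likelihood of table 6.1, scaled to a single GPIO bit.
print(single_bit_likelihood(3.34449))
```

Under these assumptions, the 3.34449% failure likelihood observed in the IO intensive testbench would shrink to roughly 0.42% for a workload that uses a single GPIO bit, before accounting for the lower read frequency of real workloads.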
6.5 Injections on the Supply Net
The injections on the power supply net were conducted using different pre-
charge voltages. We injected ESDs with pre-charge voltages of ±2000 V,
±4000 V, ±6000 V, ±8000 V, ±10000 V, ±12000 V, ±14000 V, and ±16000 V
on the VDDIO pin of one of the VDDIO cells in the system. We found that
none of the injections causes a failure. When taking a closer look at the
voltages at different nodes, we found that the supply voltage fluctuation is
not big enough to pull a flip-flop’s threshold voltage above its input voltage,
as illustrated in figure 5.3. Table 6.2 shows the results.
The fact that no failures are observed in this experiment can be attributed
to two reasons. First, our onboard protection circuits suppress the transient
sharp increase of the IO voltage. The aforementioned primary and secondary
protections in the IO ring prevent the IO voltage from rising too quickly. As
a result, the maximum voltage observed at VDDIO is 5.13 V, and it lasts for
only a few nanoseconds. Second, the core is placed near the center of the
power supply net, while the three VDD cells, which supply a voltage
to the supply net, are placed on the boundaries of the power supply net.
When an injection is made to a VDDIO cell, the injection current flows
from the VDDIO cell to the nearest VDD cell, causing the voltage of that
VDD cell to increase. However, the two other VDD cells are farther from the
point of injection, and the voltage increase seen there is very limited.
Due to the voltage divider law, the node voltages in the center of the supply
net will increase, but they will remain lower than the VDD voltage
near the point of injection. Moreover, since the threshold voltage is half the
sum of VDD and ground, the VDD node voltage has to be pulled up to twice
its value to cause any bit-flips. Thus, a larger voltage increase is needed at
the VDD cells and VDDIO cells to cause a failure.
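The factor of two in the argument above follows directly from the threshold definition. As a minimal sketch, assuming an input that still sits at the nominal logic-1 level and a quiet ground, the bit-flip condition can be checked numerically. The 1.8 V nominal supply used here is a hypothetical value chosen for illustration and is not taken from the test chip.

```python
def threshold_voltage(vdd, vss=0.0):
    """Flip-flop threshold: half the sum of the local Vdd and Vss."""
    return (vdd + vss) / 2.0


def logic_one_misread(input_voltage, vdd_noisy, vss_noisy=0.0):
    """True if supply noise lifts the threshold above a logic-1 input."""
    return threshold_voltage(vdd_noisy, vss_noisy) > input_voltage


def min_vdd_for_flip(input_voltage, vss=0.0):
    """Local Vdd above which a logic-1 input would be read as 0."""
    return 2.0 * input_voltage - vss


NOMINAL_VDD = 1.8  # hypothetical nominal core supply, in volts

print(min_vdd_for_flip(NOMINAL_VDD))         # twice the nominal value
print(logic_one_misread(NOMINAL_VDD, 3.0))   # threshold 1.5 V: no bit-flip
print(logic_one_misread(NOMINAL_VDD, 3.7))   # threshold 1.85 V: bit-flip
```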
We did not perform any injections with pre-charge voltages of
magnitude greater than 16000 V for two reasons:
1. The IEC 61000-4-2 standard only prescribes qualification testing up
to that voltage magnitude. In daily life, the frequency of ESD events
with more than 16000 V in magnitude is low enough to be ignored.
Meanwhile, given that the energy of an ESD discharge is proportional
to the square of its voltage, performing tests with higher magnitudes
may shorten a product’s life and possibly cause hard failures in the






















2. The Cadence simulator has an internal threshold voltage, beyond which
the points of convergence will no longer be accurate. This is a common
limitation when simulating an ESD event in software.
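The quadratic dependence of discharge energy on pre-charge voltage mentioned in the first reason can be made concrete with the stored-energy formula E = C V^2 / 2. The sketch below assumes that the relevant energy is the one stored on the discharge capacitor and uses the 150 pF storage capacitance of the IEC 61000-4-2 discharge network as a default; whether this value matches the injection source used in our simulations is an assumption.

```python
def stored_energy_joules(precharge_volts, capacitance_farads=150e-12):
    """Energy stored on the discharge capacitor before the ESD event."""
    return 0.5 * capacitance_farads * precharge_volts ** 2


for volts in (2000, 8000, 16000):
    energy_mj = stored_energy_joules(volts) * 1e3
    print(f"{volts:>6} V -> {energy_mj:.2f} mJ")

# Doubling the pre-charge voltage quadruples the stored energy.
print(stored_energy_joules(16000) / stored_energy_joules(8000))
```

With these assumptions, a 16000 V discharge carries about 19.2 mJ, four times the energy of an 8000 V discharge.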
6.6 Suggested Recovery Strategies
Based on our experimental results, there are two major kinds of symptoms:
IO-related failures and clock-related errors. The solutions for IO-related
failures have already been addressed in [24, 87]. Meanwhile, the errors caused
by ESD-induced clock noise are typically characterized by (1) delay of the
clock, (2) corruptions of approximately a hundred registers, and (3) failure
of the clock to toggle, with register values becoming nondeterministic. In our
simulations, type (1) errors do not result in any failures because our modified
openMSP430 core is a single clock domain design. In multiple
clock domain systems, a failure will only occur if different functional units
in different clock domains communicate with each other. In that case, the
problem is very similar to IO-related issues, which former studies have
addressed. The second and third types of failures are caused by a number of
register corruptions. To recover from these types of failures, three possible
solutions are briefly suggested and evaluated below:
1. Checkpoint and rollback. This is the most commonly used solution
in distributed systems and supercomputers. The system saves
checkpoints while executing tasks, and rolls back to the latest clean
checkpoint if an error is detected. The disadvantage of this solution
is its significant time or area cost. Since the scale of corruption is
large and the physical locations of the corrupted registers are not
predictable, the values of a wide range of registers need to be saved. This
can be achieved either by introducing duplicated registers, or by spending
additional instructions to save register contents into
the memory.
2. Pulling down the system clock. If all the registers of a design are
positive clock-edge triggered, then pulling down the clock upon the
detection of the ESD event will help reduce the chance of a
failure. This is because pulling down the clock signal disables all updates
to registers during the period of clock corruption. After
the clock becomes clean again, the pull-down is released and
the device can continue to function normally. However, in this case
we have to assume that the detection latency plus the latency to issue the
pull-down is within one clock cycle, since most corruptions occur
instantaneously after the ESD event. This can be very hard to achieve
in modern high-frequency processors.
3. Resetting the system. If a high correlation is found between the
ESD detector’s event indication and an actual soft failure, then a quick
and efficient way to recover is to issue a reset. We deem resetting
the best solution for ESD-induced soft failures in the context of
microprocessors and micro-controllers for the following reasons:
• Given that microprocessors and micro-controllers are low-cost
products, device area is a very significant consideration in
their design. Given that the frequency of ESD
events is low in real life, we should minimize the additional cost
of recovery from those failures. Reset-based solutions do not
incur additional area costs.
• Most microprocessors and micro-controllers perform real-time tasks.
In other words, they are used to sample data and respond to the
sampled data. Therefore, future computations do not
often depend on past computations. Thus, once
an error is detected and a reset is issued, new tasks can be executed
normally. On the other hand, backward-propagation-based solutions
such as checkpoint-and-rollback are more suitable for complex
computations that may take days or months, where previous
computations should not be invalidated by a soft failure.
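The effect of the clock pull-down strategy (solution 2 above) on positive-edge-triggered registers can be illustrated with a toy waveform. This is a sketch under strong assumptions: the clock is a list of sampled logic levels, the ESD detector output is an ideal per-sample flag with zero latency, and all names are hypothetical.

```python
def rising_edges(samples):
    """Count the 0 -> 1 transitions that would trigger the registers."""
    return sum(1 for prev, cur in zip(samples, samples[1:])
               if prev == 0 and cur == 1)


def pull_down(clock, esd_detected):
    """Force the clock low wherever the (ideal) ESD detector is asserted."""
    return [0 if hold else level for level, hold in zip(clock, esd_detected)]


# Hypothetical noisy clock: samples 3 to 6 contain spurious glitch edges.
noisy_clock = [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
detector    = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

gated_clock = pull_down(noisy_clock, detector)
print(rising_edges(noisy_clock))  # edges seen without protection
print(rising_edges(gated_clock))  # edges seen with the pull-down
```

In this toy case the registers see five rising edges without the pull-down and three with it, since the two glitch edges inside the detection window are suppressed. The sketch deliberately ignores detection latency, which is the practical difficulty noted above.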
Table 6.3: Number of corrupted registers with respect to number of clock
cycles in the control intensive testbench
start      s.t.1  s.t.0  cyc 0  cyc 1  cyc 2    3-5   6-10  11-20  21-50    50+
445.24     17.56  38.05    100      0     10      6     83    851      0      0
                           100    100    110    116    199   1050   1050   1050
1175.49    13.39  36.71     31      0     19      6     88    880      0      0
                            31     31     50     56    144   1024   1024   1024
605.38     10.19  31.27     48      0      8      6    734    221      0      1
                            48     48     56     62    796   1017   1017   1018
525.56     18.44  36.77     48      0      8      9     12    946      0      0
                            48     48     56     65     77   1023   1023   1023
725.78     11.52  39.38     21      0     20     10     81    887      1      0
                            21     21     41     51    132   1019   1020   1020
785.63     18.24  38.32     33      0      7      8     98    876      0      0
                            33     33     40     48    146   1022   1022   1022
1285.44    17.27  25.75     54      0      7      5    739    215      0      0
                            54     54     61     66    805   1020   1020   1020
1065.15    15.79  30.68   1066      0      0      3      3      0      0      1
                          1066   1066   1066   1069   1072   1072   1072   1073
1595.84    11.27  27.92     23      0     27      5    750    218      0      0
                            23     23     50     55    805   1023   1023   1023
745.85     18.36  29.73     21      0     22      3    820    156      0      0
                            21     21     43     46    866   1022   1022   1022
655.83      12.1  34.91     16      0     27      7    748    226      0      0
                            16     16     43     50    798   1024   1024   1024
815.44      18.9  38.11     32      0     14      6     13    959      0      0
                            32     32     46     52     65   1024   1024   1024
1665.93    15.72  30.84     22      0     11     18    102    869      0      0
                            22     22     33     51    153   1022   1022   1022
775.26     13.19  35.87     42      0      7      4     97    848      0      0
                            42     42     49     53    150    998    998    998
575.76     11.75  37.34     17      0     14     18    101    869      0      0
                            17     17     31     49    150   1019   1019   1019
1295.5     15.19  30.31     20      0     10      4    123    837      0      0
                            20     20     30     34    157    994    994    994
1535.36    16.64  27.84     31      0     20      7    672    293      0      0
                            31     31     51     58    730   1023   1023   1023
1725.45    15.03  26.31     21      0      7      9    112    846      0      1
                            21     21     28     37    149    995    995    996
645.94     14.01  35.68     23      0      8     20     16    958      0      0
                            23     23     31     51     67   1025   1025   1025
675.24     18.37  39.48    140      0      7      3    252    628      0      0
                           140    140    147    150    402   1030   1030   1030
485.58     19.23  27.58     53      0     23      7     83    875      0      0
                            53     53     76     83    166   1041   1041   1041
Table 6.4: Number of corrupted registers with respect to number of clock
cycles in the IO intensive testbench
start      s.t.1  s.t.0  cyc 0  cyc 1  cyc 2    3-5   6-10  11-20  21-50    50+
335.57     11.84  27.85     29      0     42      2    844    142     41      0
                            29     29     71     73    917   1059   1100   1100
1545.3     16.95  34.86     34      0      5     15     31     11     17     24
                            34     34     39     54     85     96    113    137
825.31     12.37   29.2     41      0     10      0     25     12     16     24
                            41     41     51     51     76     88    104    128
1245.33    11.18  26.44     27      0     18     23    109    865      0      0
                            27     27     45     68    177   1042   1042   1042
1255.24    17.59  29.15     85      0     26     10     47    876      1      0
                            85     85    111    121    168   1044   1045   1045
1095.6     12.09  26.03     18      0     22      3    114    886      0      0
                            18     18     40     43    157   1043   1043   1043
775.45     16.65  29.59     28      0     10     23     25     15     13     20
                            28     28     38     61     86    101    114    134
1735.85    12.38  39.42     18      0     16      8     38    644    337      0
                            18     18     34     42     80    724   1061   1061
1425.65    18.58  37.17     20      0     17      4     47    952      1      0
                            20     20     37     41     88   1040   1041   1041
1305.92    13.85  28.69     25      0     10     12     43    971      0      0
                            25     25     35     47     90   1061   1061   1061
425.31     10.61  34.02     51      0      8     14     38    984      0      0
                            51     51     59     73    111   1095   1095   1095
465.48     17.94  26.26     28      0      6      8     85    621    341      1
                            28     28     34     42    127    748   1089   1090
595.13     13.57  38.77   1067    116      0      0      0      1      0      0
                          1067   1183   1183   1183   1183   1184   1184   1184
725.31     19.84  28.26     30      0      2     14     15     18     22     21
                            30     30     32     46     61     79    101    122
645.69     12.37  26.14     18      0     10     18     40    973      0      0
                            18     18     28     46     86   1059   1059   1059
1425.39     18.0   26.7     20      0     17      4     47    953      0      0
                            20     20     37     41     88   1041   1041   1041
1525.65    12.83  26.17     21      0     11     19     38    969      0      0
                            21     21     32     51     89   1058   1058   1058
715.57     17.05  30.95     29      0     19      0     21     14     25     22
                            29     29     48     48     69     83    108    130
1455.3     19.04  31.01     37      0     13     11     34     53    916      0
                            37     37     50     61     95    148   1064   1064
1565.27    14.07  38.15     29      0      7     24     29     47    926      0
                            29     29     36     60     89    136   1062   1062
1185.25    18.24  37.14     55      0     14     11     38    612    329      0
                            55     55     69     80    118    730   1059   1059
Table 6.5: Number of corrupted registers with respect to number of clock
cycles in the arithmetic intensive testbench
start      s.t.1  s.t.0  cyc 0  cyc 1  cyc 2    3-5   6-10  11-20  21-50    50+
435.67     17.67  26.02     21      0     24      6    789    197      0      0
                            21     21     45     51    840   1037   1037   1037
835.34     18.29  31.96     24      0     12      9     38    939      0      0
                            24     24     36     45     83   1022   1022   1022
705.38     14.23  28.97     29      0      7     24    109    855      0      0
                            29     29     36     60    169   1024   1024   1024
565.29     17.13  35.46     72      0      7     24    246    666      0      1
                            72     72     79    103    349   1015   1015   1016
1205.35    16.62  37.38     42      0      0     24     10    946      0      0
                            42     42     42     66     76   1022   1022   1022
905.36     19.52  32.68     56      0     15      1     76    872      0      0
                            56     56     71     72    148   1020   1020   1020
345.68     17.57  26.07     37      0      4      4    675    327     25      0
                            37     37     41     45    720   1047   1072   1072
1275.12    14.46  27.61    957    136      0      0      0      0      0      0
                           957   1093   1093   1093   1093   1093   1093   1093
335.24     19.03  37.26    157      0     15      2    305    589      6      0
                           157    157    172    174    479   1068   1074   1074
1115.14    14.34  29.06   1222      0      0      0      0      0      0      0
                          1222   1222   1222   1222   1222   1222   1222   1222
1325.92    13.56  30.85     16      0     16      8    706    275      0      0
                            16     16     32     40    746   1021   1021   1021
1315.31    12.43  33.48     39      0      8     10    112    854      0      0
                            39     39     47     57    169   1023   1023   1023
515.46     16.23  39.11     88      0     17      7    274    610      0      0
                            88     88    105    112    386    996    996    996
1005.68    10.78  30.81     26      0     20      6    726    245      0      0
                            26     26     46     52    778   1023   1023   1023
925.56      16.8  26.54    103      0      3      4    886      0      0      0
                           103    103    106    110    996    996    996    996
395.92     11.85  27.31      8      0     29      5    704    303     17      0
                             8      8     37     42    746   1049   1066   1066
455.39     11.68   27.5     60      0     22     15    947      1      0      0
                            60     60     82     97   1044   1045   1045   1045
1085.71    11.02  34.04     28      0     15      5    110     12      5     34
                            28     28     43     48    158    170    175    209
1545.33    13.75  36.65     62      0      0     12     11    936      0      0
                            62     62     62     74     85   1021   1021   1021
715.3      19.68  26.19     69      0     26      3    898      2      0      0
                            69     69     95     98    996    998    998    998
1445.9     17.21  38.52     29      0      4     21     20    951      0      0
                            29     29     33     54     74   1025   1025   1025
665.12     10.93  39.43    374    672      0      0      0      0      0      0
                           374   1046   1046   1046   1046   1046   1046   1046
1245.44    12.59  35.64     41      0      6      5    739    226      0      1
                            41     41     47     52    791   1017   1017   1018
695.58     14.06  30.51     21      0     16      6    678    300      0      0





Although much work has been done to study ESD-induced soft failures,
very little research focuses on the application-level effects of ESD-induced
soft failures [34]. In [34], the application-level failures are observed, but no
analysis was conducted at the RTL level or architecture level. In [18] and
[33], experiments were conducted to observe the bit-flips caused by
electrostatic discharge using hardware-based injection methods, and the test
chip was designed only for such tests. To the best of our knowledge, we are the
first to inject ESD at the physical level, to observe and diagnose the
chip at the application level, and to find a correlation between the circuit
level voltage fluctuation and the logic level corruption. Meanwhile, no prior
researchers have taped out test chips as large as ours for investigating
ESD-induced soft failures. Furthermore, for the first time, we established
fault models for ESD-induced soft failures.
We have studied application-level ESD-induced soft failures for a variety
of commercial chips. As we diagnosed the failures, we realized the
limitations of using commercial chips, namely their lack of observability
and controllability. We overcame those limitations by designing a custom
test chip with DFT features and ESD event detectors. We selected a
silicon-proven open source processor core whose instruction set
architecture is widely used in industry, and we implemented approximately
the same level of ESD protection on the test chip. Therefore, the
characterized failure types will be representative of common soft failures
in micro-controllers. We performed simulation-based fault injections in
mixed-signal simulation to study the correlation of faults at the physical
and logical levels. To model the effect of the supply fluctuation seen by
each individual flip-flop without losing much simulation speed, we
proposed a new mixed-signal flip-flop simulation model. In our
experiments, we found that most ESD-induced soft failures lead to
widespread corruption of register values; in light of this, we propose
that the most suitable recovery solution for soft failures is a reset.
7.2 Future Work
The duration and scope of this project are beyond those reported in this
thesis. There is further research to be carried out after the completion of
this thesis.
7.2.1 Validating Simulation Results in Silicon
By the end of this thesis, the test chip design is complete but has not
yet been taped out. Further research should and will be done to validate
the results of the simulation-based fault injections using hardware-based
fault injection on the real chip. The observability issue will be
addressed mostly by the scan chain: programs should be written to read out
the register contents using another micro-controller and to map the binary
sequence read out to the value of each register. The ESD event detectors
will be set at different sensitivity levels to detect ESD events, and the
correlation between the triggering of the detectors and ESD-induced soft
failures will be studied. The proposed high-level detectors will reduce
the amount of effort needed to conduct failure diagnosis.
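The mapping from the scanned-out binary sequence to register values could
look roughly like the following sketch. It is only an illustration: the
layout table SCAN_LAYOUT, the register names and widths, and the helper
names decode_scan_dump and diff_registers are hypothetical and are not
taken from the test chip; a real layout would come from the scan-insertion
report of the design.

```python
from typing import Dict, List, Tuple

# Hypothetical scan-chain layout: (register name, width in bits), listed in
# the order in which the bits leave the scan-out pin. Assumed for
# illustration only; not the layout of the actual test chip.
SCAN_LAYOUT: List[Tuple[str, int]] = [
    ("pc", 16),
    ("sr", 16),
    ("r4", 16),
]


def decode_scan_dump(bits: str, layout: List[Tuple[str, int]]) -> Dict[str, int]:
    """Split a scan-out bit string into per-register integer values.

    Assumes each register occupies a contiguous run of bits, MSB first.
    """
    expected = sum(width for _, width in layout)
    if len(bits) != expected:
        raise ValueError(f"expected {expected} bits, got {len(bits)}")
    values: Dict[str, int] = {}
    pos = 0
    for name, width in layout:
        values[name] = int(bits[pos:pos + width], 2)
        pos += width
    return values


def diff_registers(golden: Dict[str, int], observed: Dict[str, int]) -> Dict[str, int]:
    """Return, for each corrupted register, the XOR mask of its flipped bits."""
    return {
        name: golden[name] ^ observed[name]
        for name in golden
        if golden[name] != observed[name]
    }
```

Comparing a dump taken after an ESD injection against a fault-free
reference dump with diff_registers would then show which registers were
corrupted and which bits flipped.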
7.2.2 Further Recovery Solutions
System reset is by far the most cost-efficient recovery solution for
ESD-induced soft failures. However, it rests on the hidden assumption that
our ESD event detectors detect only ESD events that are significant enough
to cause failures. Otherwise, false alarms will trigger system resets,
which may degrade the performance of ASICs. Even if the false alarm rate
cannot be reduced to zero, a probabilistic study can be conducted, using
the accuracy of the ESD event detectors and the expected time to reset the
system, to find an optimal decision on whether or not to reset the system.
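One simple way to frame such a study is an expected-cost comparison, as in
the sketch below. This is an assumed model, not a result of this thesis:
the probability that a detector alarm corresponds to a real failure
(p_failure_given_alarm), the reset time, and the cost of an unrecovered
failure (expressed here as equivalent downtime in seconds) are all
hypothetical parameters that would have to be measured.

```python
def expected_costs(p_failure_given_alarm: float,
                   reset_time_s: float,
                   failure_cost_s: float) -> tuple:
    """Return (cost of resetting, expected cost of ignoring the alarm).

    Assumed model: a reset always costs reset_time_s; ignoring the alarm
    costs failure_cost_s only when the alarm was a real failure.
    """
    if not 0.0 <= p_failure_given_alarm <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    cost_reset = reset_time_s
    cost_ignore = p_failure_given_alarm * failure_cost_s
    return cost_reset, cost_ignore


def should_reset(p_failure_given_alarm: float,
                 reset_time_s: float,
                 failure_cost_s: float) -> bool:
    """Reset only when resetting is cheaper in expectation than ignoring."""
    cost_reset, cost_ignore = expected_costs(
        p_failure_given_alarm, reset_time_s, failure_cost_s)
    return cost_reset < cost_ignore


def break_even_precision(reset_time_s: float, failure_cost_s: float) -> float:
    """Detector precision above which resetting on every alarm pays off."""
    return reset_time_s / failure_cost_s
```

Under this model, a detector whose alarms are real failures less often
than break_even_precision would make an unconditional reset policy more
costly than ignoring the alarm.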
7.2.3 New Transient Error Type
The faults due to ESDs have been characterized, and further research can
treat ESD-induced faults as a new error type to study the faulty behavior
of computer systems when this type of fault is injected. A higher level simulator can




[1] C. Prasad, “Advanced CMOS reliability challenges,” in VLSI Design,
Automation and Test (VLSI-DAT), 2014 International Symposium on.
IEEE, 2014, pp. 1–2.
[2] JEDEC, “System level ESD part 1: Common misconceptions and rec-
ommended basic approaches,” 2011.
[3] JEDEC, “System level ESD part 2: Implementation of effective ESD
robust designs,” 2013.
[4] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, “Basic con-
cepts and taxonomy of dependable and secure computing,” IEEE Trans-
actions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11–33,
2004.
[5] C.-Y. Lin, M.-D. Ker, P.-H. Chang, and W.-T. Wang, “Study on the
ESD-induced gate-oxide breakdown and the protection solution in 28nm
high-k metal-gate CMOS technology,” in Nanotechnology Materials and
Devices Conference (NMDC), 2015 IEEE. IEEE, 2015, pp. 1–4.
[6] A. A. Salman, R. Gauthier, C. Putnam, P. Riess, M. Muhammad,
M. Woo, and D. E. Ioannou, “ESD-induced oxide breakdown on self-
protecting GG-nMOSFET in CMOS technology,” IEEE Transactions
on Device and Materials Reliability, vol. 3, no. 3, pp. 79–84, 2003.
[7] J. Wu and E. Rosenbaum, “Gate oxide reliability under ESD-like pulse
stress,” IEEE Transactions on Electron Devices, vol. 51, no. 7, pp. 1192–
1196, 2004.
[8] J. Wu, P. Juliano, and E. Rosenbaum, “Breakdown and latent damage
of ultra-thin gate oxides under ESD stress conditions,” Microelectronics
Reliability, vol. 41, no. 11, pp. 1771–1779, 2001.
[9] T. W. Chen, C. Ito, W. Loh, W. Wang, S. Mitra, and R. W. Dut-
ton, “Macro-model for post-breakdown 90nm and 130nm transistors and
its applications in predicting chip-level function failure after ESD-CDM
events,” in Reliability Physics Symposium, 2007. Proceedings. 45th An-
nual. IEEE International. IEEE, 2007, pp. 78–85.
[10] A. J. Walker, H. Puchner, and S. P. Dhanraj, “High-voltage CMOS ESD
and the safe operating area,” IEEE Transactions on Electron Devices,
vol. 56, no. 8, pp. 1753–1760, 2009.
[11] Z. Pan, S. Holland, D. Schroeder, and W. H. Krautschneider, “Under-
standing the mechanisms of degradation and failure observed in ESD
protection devices under system-level tests,” IEEE Transactions on De-
vice and Materials Reliability, vol. 10, no. 2, pp. 187–191, 2010.
[12] Z. Tan, D. Wu, and J. Liu, “Failure mechanism of microwave semi-
conductor transistor caused by electrostatic discharge,” High Voltage
Engineering, 2014.
[13] E. A. Amerasekera and C. Duvvury, ESD in Silicon Integrated Circuits.
John Wiley & Sons, 2002.
[14] C. Duvvury and H. Gossner, System Level ESD Co-design. John Wiley
& Sons, 2015.
[15] Y. Cao, U. Glaser, J. Willemen, F. Magrini, M. Mayerhofer, S. Frei, and
M. Stecher, “ESD simulation with Wunsch-Bell based behavior model-
ing methodology,” in Electrical Overstress/Electrostatic Discharge Sym-
posium (EOS/ESD), 2011 33rd. IEEE, 2011, pp. 1–10.
[16] L. Lou, C. Duvvury, A. Jahanzeb, and J. Park, “SPICE simula-
tion methodology for system level ESD design,” in Electrical Over-
stress/Electrostatic Discharge Symposium (EOS/ESD), 2010 32nd.
IEEE, 2010, pp. 1–10.
[17] D. Johnsson and H. Gossner, “Study of system ESD codesign of a re-
alistic mobile board,” in Electrical Overstress/Electrostatic Discharge
Symposium (EOS/ESD), 2011 33rd. IEEE, 2011, pp. 1–10.
[18] N. Thomson, Y. Xiu, R. Mertens, M.-S. Keel, and E. Rosenbaum,
“Custom test chip for system-level ESD investigations,” in Electrical
Overstress/Electrostatic Discharge Symposium (EOS/ESD), 2014 36th.
IEEE, 2014, pp. 1–10.
[19] J. Lee, J. Lim, C. Jo, B. Seol, A. Nandy, T. Li, and D. Pommerenke, “A
study of a measurement and simulation method on ESD noise causing
soft-errors by disturbing signals,” in Electrical Overstress/Electrostatic
Discharge Symposium (EOS/ESD), 2011 33rd. IEEE, 2011, pp. 1–5.
[20] Y. Xiu, N. Thomson, R. Mertens, and E. Rosenbaum, “A mech-
anism for logic upset induced by power-on ESD,” in Electrical
Overstress/Electrostatic Discharge Symposium (EOS/ESD), 2014 36th.
IEEE, 2014, pp. 1–10.
[21] R. Mertens, N. Thomson, Y. Xiu, and E. Rosenbaum, “Theory of active
clamp response to power-on ESD and implications for power supply
integrity,” in Electrical Overstress/Electrostatic Discharge Symposium
(EOS/ESD), 2014 36th. IEEE, 2014, pp. 1–10.
[22] P. Besse, K. Abouda, and C. Abouda, “Identifying electrical mechanisms
responsible for functional failures during harsh external ESD and EMC
aggression,” Microelectronics Reliability, vol. 51, no. 9, pp. 1597–1601,
2011.
[23] E. Ibe, H. Taniguchi, Y. Yahagi, K.-i. Shimbo, and T. Toba, “Impact of
scaling on neutron-induced soft error in SRAMs from a 250 nm to a 22
nm design rule,” IEEE Transactions on Electron Devices, vol. 57, no. 7,
pp. 1527–1538, 2010.
[24] P. Ramachandran, S. K. S. Hari, M. Li, and S. V. Adve, “Hardware
fault recovery for I/O intensive applications,” ACM Transactions on
Architecture and Code Optimization (TACO), vol. 11, no. 3, p. 33, 2014.
[25] S. K. Sastry Hari, R. Venkatagiri, S. V. Adve, and H. Naeimi, “GangES:
Gang error simulation for hardware resiliency evaluation,” in ACM
SIGARCH Computer Architecture News, vol. 42, no. 3. IEEE Press,
2014, pp. 61–72.
[26] S. K. Sastry Hari, R. Venkatagiri, S. V. Adve, and H. Naeimi, “GangES:
Gang error simulation for hardware resiliency evaluation,” in ACM
SIGARCH Computer Architecture News, vol. 42, no. 3. IEEE Press,
2014, pp. 61–72.
[27] K. Pattabiraman, N. Nakka, Z. Kalbarczyk, and R. Iyer, “SymPLFIED:
Symbolic program-level fault injection and error detection framework,”
in Dependable Systems and Networks with FTCS and DCC, 2008. DSN
2008. IEEE International Conference on. IEEE, 2008, pp. 472–481.
[28] K. Pattabiraman, Z. Kalbarczyk, and R. K. Iyer, “Automated derivation
of application-aware error detectors using static analysis: The trusted
illiac approach,” IEEE Transactions on Dependable and Secure Com-
puting, vol. 8, no. 1, pp. 44–57, 2011.
[29] H. Cho, S. Mirkhani, C.-Y. Cher, J. A. Abraham, and S. Mitra, “Quan-
titative evaluation of soft error injection techniques for robust system
design,” in Proceedings of the 50th Annual Design Automation Confer-
ence. ACM, 2013, p. 101.
[30] S. S. Mukherjee, J. Emer, and S. K. Reinhardt, “The soft error problem:
An architectural perspective,” in High-Performance Computer Architec-
ture, 2005. HPCA-11. 11th International Symposium on. IEEE, 2005,
pp. 243–247.
[31] S. Mitra, M. Zhang, N. Seifert, T. Mak, and K. S. Kim, “Built-in soft
error resilience for robust system design,” in Integrated Circuit Design
and Technology, 2007. ICICDT’07. IEEE International Conference on.
IEEE, 2007, pp. 1–6.
[32] M. Zhang, S. Mitra, T. Mak, N. Seifert, N. J. Wang, Q. Shi, K. S.
Kim, N. R. Shanbhag, and S. J. Patel, “Sequential element design with
built-in soft error resilience,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 14, no. 12, pp. 1368–1378, 2006.
[33] N. A. Thomson, Y. Xiu, and E. Rosenbaum, “Soft-failures induced by
system-level ESD,” IEEE Transactions on Device and Materials Relia-
bility, vol. 17, no. 1, pp. 90–98, 2017.
[34] B. Orr, P. Maheshwari, H. Gossner, D. Pommerenke, and W. Stadler,
“A systematic method for determining soft-failure robustness of a sub-
system,” in Electrical Overstress/Electrostatic Discharge Symposium
(EOS/ESD), 2013 35th. IEEE, 2013, pp. 1–8.
[35] O. Girard, “openMSP430,” May 2015. [Online]. Available:
https://opencores.org/project,openmsp430
[36] K. S. Yim, Z. Kalbarczyk, and R. K. Iyer, “Measurement-based analysis
of fault and error sensitivities of dynamic memory,” in Dependable Sys-
tems and Networks (DSN), 2010 IEEE/IFIP International Conference
on. IEEE, 2010, pp. 431–436.
[37] H. Cho, S. Mirkhani, C.-Y. Cher, J. A. Abraham, and S. Mitra, “Quan-
titative evaluation of soft error injection techniques for robust system
design,” in Proceedings of the 50th Annual Design Automation Confer-
ence. ACM, 2013, p. 101.
[38] A. Gaisler, “Leon3 processor.” [Online]. Available:
http://www.gaisler.com
[39] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi,
A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti,
R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and
D. A. Wood, “The gem5 simulator,” SIGARCH Comput. Archit.
News, vol. 39, no. 2, pp. 1–7, Aug. 2011. [Online]. Available:
http://doi.acm.org/10.1145/2024716.2024718
[40] D. Mueller-Gritschneder, P. R. Maier, M. Greim, and U. Schlichtmann,
“SystemC-based multi-level error injection for the evaluation of
fault-tolerant systems,” in Integrated Circuits (ISIC), 2014 14th International
Symposium on. IEEE, 2014, pp. 460–463.
[41] M. Maniatakos, N. Karimi, C. Tirumurti, A. Jas, and Y. Makris,
“Instruction-level impact analysis of low-level faults in a modern micro-
processor controller,” IEEE Transactions on Computers, vol. 60, no. 9,
pp. 1260–1273, 2011.
[42] Z. Sheng, W. Xiao, and L. Mao, “An automated technique for circuit-
level simulation-based soft fault injection,” Journal of Computer-Aided
Design & Computer Graphics, vol. 3, p. 011, 2009.
[43] S. Nimara, A. Amaricai, and M. Popa, “Sub-threshold CMOS circuits
reliability assessment using simulated fault injection based on simula-
tor commands,” in Applied Computational Intelligence and Informatics
(SACI), 2015 IEEE 10th Jubilee International Symposium on. IEEE,
2015, pp. 101–104.
[44] A. Pellegrini, R. Smolinski, L. Chen, X. Fu, S. K. S. Hari, J. Jiang,
S. V. Adve, T. Austin, and V. Bertacco, “CrashTest’ing SWAT: Ac-
curate, gate-level evaluation of symptom-based resiliency solutions,” in
Proceedings of the Conference on Design, Automation and Test in Eu-
rope. EDA Consortium, 2012, pp. 1106–1109.
[45] M. Dimitrov and H. Zhou, “Unified architectural support for soft-error
protection or software bug detection,” in Proceedings of the 16th Interna-
tional Conference on Parallel Architecture and Compilation Techniques.
IEEE Computer Society, 2007, pp. 73–82.
[46] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and
Y. Zhou, “Understanding the propagation of hard errors to software and
implications for resilient system design,” in ACM SIGARCH Computer
Architecture News, vol. 36, no. 1. ACM, 2008, pp. 265–276.
[47] G. Lyle, S. Chen, K. Pattabiraman, Z. Kalbarczyk, and R. Iyer, “An
end-to-end approach for the automatic derivation of application-aware
error detectors,” in Dependable Systems & Networks, 2009. DSN’09.
IEEE/IFIP International Conference on. IEEE, 2009, pp. 584–589.
[48] M. Pflanz, K. Walther, C. Galke, and H. T. Vierhaus, “On-line error
detection and correction in storage elements with cross-parity check,”
in On-Line Testing Workshop, 2002. Proceedings of the Eighth IEEE
International. IEEE, 2002, pp. 69–73.
[49] G. Aydos and G. Fey, “Exploiting error detection latency for parity-
based soft error detection,” in Design and Diagnostics of Electronic Cir-
cuits & Systems (DDECS), 2016 IEEE 19th International Symposium
on. IEEE, 2016, pp. 1–6.
[50] Y. Guo, G. Li, and M. Lou, “Detection and recovery of memory-resident
corrupted data of mobile communication billing system based on cyclic
redundancy check,” International Journal of Digital Content Technology
and Its Applications, vol. 6, no. 18, p. 309, 2012.
[51] C.-H. Lin, J.-C. Liu, W.-C. Hsu, H.-W. Wang, W.-C. Lin, and J.-W. Li,
“Image tampering detection and recovery using dual watermarks and
cyclic redundancy checks,” in Advanced Communication and Network-
ing. Springer, 2010, pp. 134–143.
[52] S. Banerjee, A. Chatterjee, and J. A. Abraham, “Checksum based error
detection in linearized representations of non linear control systems,” in
Test Symposium (LATS), 2016 17th Latin-American. IEEE, 2016, pp.
182–182.
[53] D. Barbará, R. Goel, and S. Jajodia, “A checksum-based corruption
detection technique,” Journal of Computer Security, vol. 11, no. 3, pp.
315–329, 2003.
[54] N. G. Bardis, O. P. Markovskyi, C. Cepisca, and K. Argiri, “Burst error
control based on weighted checksum,” Proceeding ICCOM, vol. 10, 2010.
[55] S. Pontarelli, G.-C. Cardarilli, M. Re, and A. Salsano, “Error detection
in addition chain based ECC point multiplication,” in On-Line Testing
Symposium, 2009. IOLTS 2009. 15th IEEE International. IEEE, 2009,
pp. 192–194.
[56] F. M. Alzahrani and T. Chen, “On-chip triple-error correction and
quadruple-error detection ECC structure for ultra-large, single-chip
memories,” Computers & Electrical Engineering, vol. 26, no. 5, pp. 317–
335, 2000.
[57] S. Ben-Jamaa, C. Weidmann, and M. Kieffer, “Analytical tools for op-
timizing the error correction performance of arithmetic codes,” IEEE
Trans. Commun, vol. 56, no. 9, pp. 1458–1468, 2008.
[58] A. Avizienis, “Low-cost residue and inverse residue error-detecting codes
for signed-digit arithmetic,” in Proc. 5th IEEE Symposium on Computer
Arithmetic, 1981, pp. 165–168.
[59] V. Dzhigan, “Simple algorithm of single error correction under decoding
of Reed-Solomon codes,” Automatic Control and Computer Sciences,
vol. 28, no. 6, pp. 34–38, 1994.
[60] M. S. Sundary and V. Logisvary, “Multiple error detection and
correction over GF(2^m) using novel cross parity code,” in Intelligent Systems
and Control (ISCO), 2016 10th International Conference on. IEEE,
2016, pp. 1–6.
[61] F. Lenkeit, D. Wübben, and A. Dekorsy, “Reliability-aware iterative
detection scheme (RAID) for distributed IDM space-time codes in re-
lay systems,” EURASIP Journal on Advances in Signal Processing, vol.
2013, no. 1, p. 70, 2013.
[62] L. Zhiming, S. Jichang, Y. Xiaohua, and W. Yaping, “Fault detection for
high availability RAID system,” in Networked Computing and Advanced
Information Management (NCM), 2010 Sixth International Conference
on. IEEE, 2010, pp. 27–32.
[63] Y. Qiu, Q. Tan, J. Xu, and Y. Zhao, “An efficient approach to detect
and correct control-flow errors for linear assembly,” in Biomedical En-
gineering and Informatics (BMEI), 2015 8th International Conference
on. IEEE, 2015, pp. 560–565.
[64] T. Hong, Y. Li, S.-B. Park, D. Mui, D. Lin, Z. A. Kaleq, N. Hakim,
H. Naeimi, D. S. Gardner, and S. Mitra, “QED: Quick error detection
tests for effective post-silicon validation,” in Test Conference (ITC),
2010 IEEE International. IEEE, 2010, pp. 1–10.
[65] D. Lin, T. Hong, Y. Li, F. Fallah, D. S. Gardner, N. Hakim, and S. Mi-
tra, “Overcoming post-silicon validation challenges through quick error
detection (QED),” in Proceedings of the Conference on Design, Automa-
tion and Test in Europe. EDA Consortium, 2013, pp. 320–325.
[66] K. Pattabiraman, G. P. Saggese, D. Chen, Z. Kalbarczyk, and R. K. Iyer,
“Dynamic derivation of application-specific error detectors and their im-
plementation in hardware,” in Dependable Computing Conference, 2006.
EDCC’06. Sixth European. IEEE, 2006, pp. 97–108.
[67] N. J. Wang and S. J. Patel, “Restore: Symptom-based soft error detec-
tion in microprocessors,” IEEE Transactions on Dependable and Secure
Computing, vol. 3, no. 3, pp. 188–201, 2006.
[68] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and
Y. Zhou, “Swat: An error resilient system,” Proceedings of SELSE, 2008.
[69] S. K. Sastry Hari, M.-L. Li, P. Ramachandran, B. Choi, and S. V. Adve,
“mSWAT: Low-cost hardware fault detection and diagnosis for multicore
systems,” in Proceedings of the 42nd Annual IEEE/ACM International
Symposium on Microarchitecture. ACM, 2009, pp. 122–132.
[70] M. Bushnell and V. Agrawal, Essentials of Electronic Testing for Digital,
Memory and Mixed-Signal VLSI Circuits. Springer Science & Business
Media, 2004, vol. 17.
[71] K.-J. Lee, P.-H. Tang, and M. A. Kochte, “An on-chip self-test archi-
tecture with test patterns recorded in scan chains,” in Test Conference
(ITC), 2016 IEEE International. IEEE, 2016, pp. 1–10.
[72] H. Chen, Z. Qi, L. Wang, and C. Xu, “A scan chain optimization method
for diagnosis,” in Computer Design (ICCD), 2015 33rd IEEE Interna-
tional Conference on. IEEE, 2015, pp. 613–620.
[73] M. Rithesh, G. Harish, B. B. Ram, and S. Yellampalli, “Detection and
analysis of hardware trojan using scan chain method,” in VLSI Design
and Test (VDAT), 2015 19th International Symposium on. IEEE, 2015,
pp. 1–6.
[74] F. A. Siddiqui and P. Gour, “Scan-chain-based multiple error recovery in
TMR systems (SMERTMR),” in Computational Intelligence on Power,
Energy and Controls with Their Impact on Humanity (CIPECH), 2014
Innovative Applications of. IEEE, 2014, pp. 374–378.
[75] M. Ebrahimi, S. G. Miremadi, H. Asadi, and M. Fazeli, “Low-cost scan-
chain-based technique to recover multiple errors in TMR systems,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21,
no. 8, pp. 1454–1468, 2013.
[76] M. Ebrahimi, S. G. Miremadi, and H. Asadi, “Sctmr: A scan chain-
based error recovery technique for TMR systems in safety-critical appli-
cations,” in Design, Automation & Test in Europe Conference & Exhi-
bition (DATE), 2011. IEEE, 2011, pp. 1–4.
[77] H.-M. Lin and J.-M. Li, “Column parity and row selection (CPRS): A
BIST diagnosis technique for multiple errors in multiple scan chains,”
in Test Conference, 2005. Proceedings. ITC 2005. IEEE International.
IEEE, 2005, pp. 9–16.
[78] D. Chatterjee, C. McCarter, and V. Bertacco, “Simulation-based signal
selection for state restoration in silicon debug,” in Proceedings of the In-
ternational Conference on Computer-Aided Design. IEEE Press, 2011,
pp. 595–601.
[79] A. Nahir, A. Ziv, R. Galivanche, A. Hu, M. Abramovici, A. Camilleri,
B. Bentley, H. Foster, V. Bertacco, and S. Kapoor, “Bridging pre-silicon
verification and post-silicon validation,” in Proceedings of the 47th De-
sign Automation Conference. ACM, 2010, pp. 94–95.
[80] K. Basu, P. Mishra, and P. Patra, “Constrained signal selection for post-
silicon validation,” in High Level Design Validation and Test Workshop
(HLDVT), 2012 IEEE International. IEEE, 2012, pp. 71–75.
[81] K. Basu and P. Mishra, “RATS: Restoration-aware trace signal selec-
tion for post-silicon validation,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 21, no. 4, pp. 605–613, 2013.
[82] H. F. Ko and N. Nicolici, “Automated trace signals selection using the
RTL descriptions,” in Test Conference (ITC), 2010 IEEE International.
IEEE, 2010, pp. 1–10.
[83] K. Basu and P. Mishra, “Efficient trace signal selection for post sili-
con validation and debug,” in VLSI Design (VLSI Design), 2011 24th
International Conference on. IEEE, 2011, pp. 352–357.
[84] S. Ma, D. Pal, R. Jiang, S. Ray, and S. Vasudevan, “Can’t see the forest
for the trees: State restoration’s limitations in post-silicon trace signal
selection,” in Proceedings of the IEEE/ACM International Conference
on Computer-Aided Design. IEEE Press, 2015, pp. 1–8.
[85] K. Kundert and O. Zinke, The Designer’s Guide to Verilog-AMS.
Springer Publishing Company, Incorporated, 2013.
[86] S. Vora, R. Jiang, S. Vasudevan, and E. Rosenbaum, “Application level
investigation of system-level ESD-induced soft failures,” in Electrical
Overstress/Electrostatic Discharge Symposium (EOS/ESD), 2016 38th.
IEEE, 2016, pp. 1–10.
[87] H. Cho, C.-Y. Cher, T. Shepherd, and S. Mitra, “Understanding soft
errors in uncore components,” in Design Automation Conference (DAC),
2015 52nd ACM/EDAC/IEEE. IEEE, 2015, pp. 1–6.
