Techniques for the evaluation and the improvement of emergent technologies’ behavior facing random errors

Enrico Costenaro

HAL Id: tel-01279142
https://tel.archives-ouvertes.fr/tel-01279142
Submitted on 25 Feb 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
THESIS

To obtain the degree of

DOCTOR OF THE UNIVERSITÉ DE GRENOBLE ALPES

Specialty: Nano Electronics and Nano Technologies

Ministerial decree: 7 August 2006

Presented by

Enrico COSTENARO

Thesis supervised by Mihail NICOLAIDIS
and co-supervised by Marian-Dan ALEXANDRESCU

prepared at the TIMA laboratory
within the École Doctorale Électronique, Électrotechnique, Automatique et Traitement du signal

Techniques for the evaluation and improvement of the behavior of emerging technologies facing random faults

Thesis publicly defended on 9 December 2015,
before a jury composed of:

Dr. Salvador MIR
CNRS (Laboratoire TIMA, Grenoble), President

Professor Massimo VIOLANTE
Politecnico di Torino, Reviewer

Professor Abbas DANDACHE
Université de Lorraine, Reviewer

Dr. Mihail NICOLAIDIS
CNRS (Laboratoire TIMA, Grenoble), Thesis supervisor
## Acknowledgments

Firstly, I would like to express my sincere gratitude to my thesis supervisor Michael Nicolaidis, head of ARIS group of TIMA, for the continuous support of my PhD study and related research, for his guidance, his criticism and encouragement, as well as good humor and optimism.

My sincere thanks also go to my thesis co-supervisor Dan Alexandrescu, CEO of IROC Technologies. I am grateful for his encouragement, for granting me time to write my thesis, for his help in proofreading my manuscript and especially for his friendship.

I wish to express my thanks to Prof. Massimo Violante and Prof. Abbas Dandache for serving on my thesis committee and providing their feedback.

I would like to thank Dr. Salvador Mir, head of TIMA Laboratory, for agreeing to be the president of my thesis jury.

I thank my colleagues at IROC Technologies for all the good times we have had since I joined the company. In particular, I am grateful to Dr. Adrian Evans, with whom I collaborated on several works, and to Jocelyne Baudoin, who helped proofread all the French texts I wrote.

Last, but certainly not least, I would like to thank my family: my wife, Sara, my parents, Giovanni and Mariella, and my brothers, Daniele and Alberto, for encouraging and supporting me in my choices, in my work and throughout the writing of this thesis.
## Contents

1 Introduction
   1.1 Introduction
   1.2 Radiation Environments and Anomalies
      1.2.1 Space Radiation Environment
      1.2.2 Terrestrial Radiation Environment
   1.3 Single Event Effects - Mechanism and Classification
      1.3.1 Particles and Interactions
      1.3.2 Single Event Effects Classification

2 Single Event Effect Analysis
   2.1 Introduction
   2.2 Technology SER Characterization
      2.2.1 Memory Intrinsic SER Characterization
      2.2.2 Standard Cell Intrinsic SER Characterization
   2.3 Masking Effects
      2.3.1 Electrical De-Rating
      2.3.2 Logic De-Rating
      2.3.3 Temporal De-Rating
      2.3.4 Functional De-Rating
      2.3.5 Memory De-Rating
   2.4 Overall SER Computation

3 Single Event Transient Analysis
   3.1 Introduction
   3.2 SET Characterization of the Standard Cell Library
      3.2.1 TFIT Overview
      3.2.2 Per-cells state SER figures
      3.2.3 Overall per-cell SER figures
      3.2.4 Transistor Contribution to the Overall Cell SER
      3.2.5 Single Event Transient SER: Influencing Factors
   3.3 SET Propagation Analysis
      3.3.1 Classic serial fault simulation approach
      3.3.2 Accelerated SET simulation
      3.3.3 Static, probabilistic fault propagation approaches
   3.4 Conclusions

4 Single Event Analysis for Sequential Logic
   4.1 Introduction
   4.2 Single Event Effects in Sequential Cells
   4.3 SER Analysis of Sequential Cell States
      4.3.1 SEU results for standard flip-flops
      4.3.2 SET results for standard flip-flops
   4.4 Master and Slave Temporal De-Rating
      4.4.1 Long Paths
      4.4.2 Short Paths
      4.4.3 Further Comments
   4.5 Cell State Analysis in Complex Designs
   4.6 State-Aware SER Improvement
      4.6.1 Preference Towards Lower SER Data State
   4.7 Conclusions

5 Derating Analysis of a Complex CPU
   5.1 Introduction
      5.1.1 Statistical Confidence
   5.2 Logic De-Rating Analysis
   5.3 Memory De-Rating Analysis
   5.4 Functional De-Rating Analysis
      5.4.1 Fault Classification
   5.5 Conclusion

6 Conclusion
   6.1 Single Event Transient Analysis
   6.2 Single Event Analysis for Sequential Logic
   6.3 Derating Analysis of a Complex CPU
   6.4 Summary

A Flip-Flop SEU Reduction through Minimization of the Temporal Vulnerability Factor (TVF)
   A.1 Introduction
   A.2 Overview of SEU Masking Factors
   A.3 TVF Optimization
   A.4 Area and Power Constraints
   A.5 Experimental Results
   A.6 Conclusions

Bibliography
## Glossary

**ASIC** Application Specific Integrated Circuit.

**BEOL** Back End Of Line.

**BGA** Ball Grid Array.

**BPSG** Boron Doped Phosphosilicate Glass.

**CMOS** Complementary Metal Oxide Semiconductor.

**CPU** Central Processing Unit.

**DoE** Design of Experiment.

**DRAM** Dynamic Random Access Memory.

**DUE** Detectable Uncorrectable Error.

**DUT** Device Under Test.

**ECC** Error Correcting Code.

**EDA** Electronic Design Automation.

**EDR** Electrical De-Rating.

**FDR** Functional De-Rating.

**FET** Field Effect Transistor.

**FIT** Failure In Time.

**FPGA** Field Programmable Gate Array.

**GCR** Galactic Cosmic Rays.

**IC** Integrated Circuit.

**LA** Low Alpha.

**LDR** Logical De-Rating.

**LET** Linear Energy Transfer.

**MBU** Multiple Bit Upset.

**MCU** Multiple Cell Upset.

**MDR** Memory De-Rating.

**MOS** Metal Oxide Semiconductor.

**MOSFET** Metal Oxide Semiconductor Field Effect Transistor.

**PC** Program Counter.

**PIPB** Propagation Induced Pulse Broadening.

**PPSFP** Parallel Pattern Single Fault Propagation.

**PW** Pulse Width.

**RTL** Register Transfer Language.

**SAA** South Atlantic Anomaly.

**SBU** Single Bit Upset.

**SDC** Silent Data Corruption.

**SDF** Standard Delay Format.

**SE** Soft Error.

**SEB** Single Event Burnout.

**SECDED** Single Error Correct, Double Error Detect.

**SEE** Single Event Effect.

**SEFI** Single Event Functional Interrupt.

**SEGR** Single Event Gate Rupture.

**SEL** Single Event Latchup.

**SEMT** Single Event Multiple Transient.

**SEMU** Single Event Multiple Upset.

**SESB** Single Event induced Snap Back.

**SEU** Single Event Upset.

**SoCFIT** System on Chip FIT. A tool developed by IROC Technologies to estimate the SER sensitivity of complex circuits.

**SRAM** Static Random Access Memory.

**TCAD** Technology CAD.

**TDR** Temporal De-Rating.

**TFIT** Transistor FIT. A tool developed by IROC Technologies to estimate the SER sensitivity of individual cells. These can be memory cells, sequential cells or combinatorial cells.

**TID** Total Ionizing Dose.

**UBM** Under Bump Metallurgy.

**ULA** Ultra Low Alpha.

**VCD** Value Change Dump.

**VPI** Verilog Procedural Interface.
CHAPTER 1

Introduction

Contents

1.1 Introduction
1.2 Radiation Environments and Anomalies
  1.2.1 Space Radiation Environment
    1.2.1.1 Particles trapped by the earth’s magnetic field
    1.2.1.2 Galactic Cosmic Rays
    1.2.1.3 Solar Particle Events
  1.2.2 Terrestrial Radiation Environment
    1.2.2.1 Thermal Neutrons
    1.2.2.2 Muons
    1.2.2.3 Alpha Particles
1.3 Single Event Effects - Mechanism and Classification
  1.3.1 Particles and Interactions
    1.3.1.1 Gamma and X-Ray Ionization
    1.3.1.2 Energetic Particle Ionization
    1.3.1.3 Cumulative Radiation Effects
  1.3.2 Single Event Effects Classification
    1.3.2.1 Soft Errors
    1.3.2.2 Hard Errors

### 1.1 Introduction

The continuing evolution of technology allows increasingly complex electronic devices to be built, integrating more and more functions. This evolution is not free of problems or, more appropriately, challenges to overcome. A growing source of reliability problems in new technological processes is the perturbation induced by energetic particles, known as Single Event Effects (SEEs). Interest in SEEs was at first limited to specific applications: aerospace, high-reliability systems, nuclear facility equipment and implantable medical devices. However, technological advances keep shrinking the transistor size, which makes components more sensitive to radiation-induced perturbations. Thus, it is no longer possible to ignore Single Events in present and future technologies operating in a natural environment.
The focus of this thesis is on soft error analysis and mitigation techniques for very large circuits. The main contributions are advances in the analysis of Single Event Transients and Single Event Upsets.

The manuscript is organized as follows. The first chapter is divided into two parts: the first provides an overview of the radiation environments in which microelectronic devices and integrated circuits may be used; the second presents the main single event effects and their triggering mechanisms, including some basic notions of nuclear physics. Chapter 2 introduces the soft error analysis methodology and presents the basic de-rating effects. Chapter 3 presents a detailed Single Event Transient analysis flow, from technology characterization to propagation analysis. Chapter 4 describes an advanced Single Event analysis for sequential logic, including some considerations about the Single Event Transient (SET) sensitivity of sequential cells. A low-cost Single Event Upset (SEU) mitigation solution is proposed and validated on a large (190 kFF) design.

Chapter 5 presents the results of the functional analysis of a single-core implementation of a complex commercial Central Processing Unit (CPU) with about 250k flip-flops. Three representative benchmarks were considered for this analysis. For each benchmark scenario, three fault injection campaigns were performed. A mitigation scenario is proposed on the basis of the fault injection results. The results show that the failure rates, both Silent Data Corruption (SDC) and Detectable Uncorrectable Error (DUE), can be reduced considerably by hardening a limited percentage of flip-flop instances.

Chapter 6 follows with the conclusions and a discussion of plans for future work. In addition to the main topics of the thesis, work performed in the context of a collaborative project is presented in Appendix A: a technique for the mitigation of flip-flop soft errors through an optimization of the Temporal De-Rating (TDR).

Overall, the work presented in this manuscript proposes new techniques and methodologies that enable a finer-grained analysis of the effects of faults while retaining industry-level performance.

### 1.2 Radiation Environments and Anomalies

Microelectronic devices and Integrated Circuits (ICs) can be exposed to a wide range of radiation environments. The types of particles, their energies, fluxes, and fluences (or total dose) can vary considerably from one radiation environment to another. These differences can lead to large variations in radiation-induced degradation. In this section, we present an overview of the different radiation environments.

#### 1.2.1 Space Radiation Environment

The concentrations and types of particles in the space environment vary significantly with altitude, angle of inclination and solar activity. As such, it is nearly impossible to define a typical space environment [Schwank 2008]. Particles present in the space radiation environment include:

1. Particles trapped by the earth’s magnetic field - high-energy particles (Galactic Cosmic Rays (GCR) and Solar Particles) that are not deflected by the magnetic field and become trapped in the planet’s magnetic field (Figure 1.1).

2. Galactic Cosmic Rays (GCR) - high-energy atomic nuclei coming from outside the solar system, from which all of the surrounding electrons have been stripped away during their passage through the galaxy.

3. Solar Particle Events - high-energy nuclei associated with solar activity; they are ejected from the Sun due to plasma heating, acceleration, and numerous other forces.

##### 1.2.1.1 Particles trapped by the earth’s magnetic field

The Earth’s magnetic field creates a geomagnetic cavity known as the magnetosphere [Faruk 2012] (Figure 1.1). The magnetic field lines trap low-energy charged particles. These trapped particles consist primarily of electrons and protons, although some heavy ions are also trapped.

![Figure 1.1: Solar wind and Earth’s magnetosphere.](image)

The trapped particles gyrate spirally around the magnetic field lines and are reflected back and forth between the poles where the fields are confined. The motion of the trapped particles is illustrated in Figure 1.2 [Faruk 2012]. This motion of charged particles forms bands of electrons and protons around the earth, which constitute the two primary radiation belts (Figures 1.1, 1.3) or Van-Allen Belts.

Figure 1.2: Motion of trapped particles in the earth’s magnetosphere [Faruk 2012].

Figure 1.3: Van Allen radiation belts.

The boundaries of these domains at the equator are illustrated in Figure 1.4 [Faruk 2012]. Because of variations in the magnetic field lines with latitude, the boundaries of the radiation belts vary with latitude (angle of inclination). The domains can be divided into five regions. Trapped protons exist primarily in regions one and two that extend from slightly above 1 earth radius to 3.8 earth radii (purple bar in Figure 1.4 [Stassinopoulos 1988]). The distribution of proton flux as a function of energy and radial distance is given in Figure 1.5 [Stassinopoulos 1988].
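Because the distances in this section are geocentric and expressed in earth radii, converting them to altitudes above the surface can help when comparing them with orbit heights. The following Python sketch uses the value of 6380 km per earth radius quoted in the footnote of this section; the function names are illustrative assumptions and do not come from the cited models.

```python
# Minimal sketch: convert between geocentric distance (earth radii) and altitude (km).
EARTH_RADIUS_KM = 6380.0  # value quoted in the footnote of this section


def altitude_km(distance_earth_radii: float) -> float:
    """Altitude above the surface for a geocentric distance given in earth radii."""
    if distance_earth_radii < 1.0:
        raise ValueError("a distance below 1 earth radius lies inside the earth")
    return (distance_earth_radii - 1.0) * EARTH_RADIUS_KM


def distance_earth_radii(altitude: float) -> float:
    """Geocentric distance in earth radii for an altitude given in km."""
    if altitude < 0.0:
        raise ValueError("altitude must be non-negative")
    return 1.0 + altitude / EARTH_RADIUS_KM


if __name__ == "__main__":
    # Outer edge of the trapped-proton regions one and two.
    print(f"3.8 earth radii -> {altitude_km(3.8):.0f} km altitude")
    # A low orbit expressed as a geocentric distance.
    print(f"500 km altitude -> {distance_earth_radii(500.0):.3f} earth radii")
```

With these definitions, the outer edge of the trapped-proton regions at 3.8 earth radii corresponds to an altitude of roughly 17 900 km.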

![Figure 1.4: Boundaries of the domains for solar flare and trapped protons and outer and inner zone electrons [Faruk 2012].](image)

Trapped protons in the earth’s magnetosphere can have energies as high as 500 MeV [Faruk 2012]. Note that the altitude corresponding to the peak in flux decreases with proton energy. Protons with energies greater than 10 MeV primarily occupy regions one and two [Faruk 2012]. Protons originating from solar flares are present predominantly in regions four and five (Figure 1.4 [Stassinopoulos 1988]).

**South Atlantic Anomaly** Above the Atlantic Ocean, centered off the coast of South America, the earth’s inner Van Allen radiation belt comes closest to the Earth’s surface, causing a region of increased proton flux at relatively low altitudes. This region is called the South Atlantic Anomaly (SAA). It exists because the Van Allen radiation belts are aligned with the magnetic axis of the Earth, which is tilted by 11 degrees from the rotation axis of the Earth, and are therefore not symmetrically placed with respect to the Earth’s surface [Barth 1997].

In this region, the flux of protons with energies greater than 30 MeV can be as much as $10^4$ times higher than in comparable altitudes over other regions of the earth. At higher altitudes, the magnetic sphere is more uniform and the South Atlantic Anomaly disappears [Barth 1997], [Petersen 1981]. Figure 1.6 shows maps of the proton belt structure at altitudes of 500 km, 1000 km, and 3000 km, indicating the location of the SAA at low altitudes and the emergence of the background Van Allen belt structure at 3000 km [Barth 1997].

---

1 Distances are specified in earth radii (one earth radius is equal to 6380 km) referenced to the center of the earth, i.e., one earth radius is at the earth’s surface.

Electrons are present predominantly from region one to region four [Faruk 2012]. The electron domain is divided into two zones, an inner and an outer zone (Figure 1.4). The outer zone electrons have higher fluxes (10 times higher) and higher energies than the inner zone electrons. The maximum energy of trapped electrons is approximately 7 MeV in the outer zone, whereas it is less than 5 MeV in the inner zone [Faruk 2012]. At these energies, electron interactions are not a threat for single-event effects, but they must be considered for total-dose effects.

Fluxes of electrons and protons in particular orbits can be estimated from existing models. The primary models of Earth’s radiation belts that are in widespread use are AP-8 [Sawyer 1976] and AE-8 [Vette 1991]. The AP-8 models are of trapped protons and include AP-8 MAX and AP-8 MIN, valid for periods of solar maximum and solar minimum, respectively. The AE-8 models for trapped electrons similarly include AE-8 MAX and AE-8 MIN. Experimental data indicate that trapped particle populations are highly dynamic. Large solar events are known to have created temporary proton belts and have enhanced the electron belts [Gussenhoven 1991, Gussenhoven 1993, Dyer 1993]. These results indicate that the static AP-8 and AE-8 models may significantly underestimate the concentration of protons and electrons [Vette 1977, Abel 1994, Heynderickx 1996]. New, dynamic models of the trapped particle radiation environment are being developed to describe
the solar-modulated environment more accurately [Piet 2006, Brautigam 2004].

Figure 1.6: Integral proton flux contours as a function of latitude and longitude at 500 km, 1000 km, and 3000 km.

##### 1.2.1.2 Galactic Cosmic Rays

Galactic cosmic rays (GCR) originate from sources outside our solar system and are always present. In the absence of solar activity, cosmic radiation is composed entirely of galactic radiation. GCRs are atomic nuclei from which all of the surrounding electrons have been stripped away during their high-speed passage through the galaxy. Outside of our solar system, the spectrum of galactic cosmic rays is believed to be uniform. Its composition as a function of atomic mass is given in Figure 1.7 [Sexton 1992, Meyer 1974]. It consists mostly of protons (Hydrogen nuclei) (85%) and alpha particles (helium nuclei) (14%). Less than 1% of the galactic cosmic ray spectrum is composed of high-energy heavy-ions.

![Figure 1.7: Flux of galactic cosmic ray particles for atomic masses up to 59 [Sexton 1992, Meyer 1974].](image)

Figure 1.7 shows that the flux of protons is more than two orders of magnitude higher than the flux of either carbon or oxygen. The energy spectrum of galactic cosmic rays is given in Figure 1.8 [Adams 1981]. Note that the x-axis unit of Figure 1.8 is MeV/nucleon; thus, for carbon with 12 nucleons, the point at 100 MeV/nucleon on the x-axis corresponds to an energy of 1.2 GeV. For most ions, the flux peaks between 100 and 1000 MeV/nucleon. For carbon, the peak flux is at an energy of approximately 2.4 GeV. For protons and alpha particles, the energy of the ion can be more than 100 GeV/nucleon. At these high energies, it is nearly impossible to shield electronic devices from cosmic rays.
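The conversion between the per-nucleon energy read from the x-axis of Figure 1.8 and the total kinetic energy of an ion is a simple multiplication by the number of nucleons. A minimal Python sketch of this arithmetic follows; the function names are illustrative assumptions.

```python
# Minimal sketch: energy per nucleon (MeV/nucleon) <-> total ion energy (GeV).

def total_energy_gev(energy_per_nucleon_mev: float, nucleons: int) -> float:
    """Total kinetic energy in GeV of an ion with the given number of nucleons."""
    if nucleons < 1:
        raise ValueError("an ion has at least one nucleon")
    return energy_per_nucleon_mev * nucleons / 1000.0


def energy_per_nucleon_mev(total_gev: float, nucleons: int) -> float:
    """Energy per nucleon in MeV for an ion of the given total energy in GeV."""
    if nucleons < 1:
        raise ValueError("an ion has at least one nucleon")
    return total_gev * 1000.0 / nucleons


if __name__ == "__main__":
    # Carbon (12 nucleons) at 100 MeV/nucleon, as in the text.
    print(f"{total_energy_gev(100.0, 12):.1f} GeV")
    # Carbon peak flux at about 2.4 GeV, expressed per nucleon.
    print(f"{energy_per_nucleon_mev(2.4, 12):.0f} MeV/nucleon")
```

For carbon, the quoted peak at approximately 2.4 GeV therefore sits at about 200 MeV/nucleon, inside the 100 to 1000 MeV/nucleon range stated for most ions.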

##### 1.2.1.3 Solar Particle Events

The frequency and severity of solar particle events is naturally dependent on the solar activity. Solar particle events (most commonly referred to as solar flares, but also including larger events such as coronal mass ejections) are random in nature, but follow the 11-year cycle of solar activity.

Figure 1.9 shows solar event proton fluences for solar cycles 20-22, superimposed over a plot of the sunspot number [Barth 1997].

High-fluence proton events occur during solar active years. Interestingly, the galactic cosmic ray flux is anti-correlated to the solar cycle, with maximum galactic cosmic ray flux occurring during solar minimum conditions [Nuntiyakul 2014].

After a solar flare occurs, particles begin to arrive near the earth within tens of minutes, peak in intensity within two hours to one day, and are gone within a few days to one week (except for some solar flare particles which are trapped in the earth’s radiation belts).

In a solar flare, energetic protons, alpha particles and heavy ions are emitted. In most solar flares, the majority of emitted particles are protons (90-95%) and alpha particles. Heavy ions constitute only a small fraction of the emitted particles, and the number of heavy ions is normally insignificant compared to the background concentration of heavy ions from galactic cosmic rays. In a large solar flare, the number of protons and alpha particles can be greatly enhanced ($10^4$ times) over the background galactic cosmic ray spectrum, whereas the number of heavy ions for a large solar flare approaches up to 50% of the background galactic cosmic ray concentration of heavy ions [Adams 1981]. Associated with a solar flare is the solar wind or solar plasma. The solar wind usually arrives near the earth within one to two days after a solar flare [Shea 1988]. As the solar wind strikes the magnetosphere, it can cause disturbances in the geomagnetic fields (geomagnetic storm), compressing them towards the earth. As a result, the solar wind can enhance the total dose received by devices in low-earth orbits.

Figure 1.8: Energy spectrum of galactic cosmic rays [Adams 1981].

Figure 1.9: Correlation of proton solar event fluence to sunspot number for solar cycles 20-22. Sunspot number shown by the solid line plot [Barth 1997].

![Figure 1.10: Flux of cosmic ray particles at solar maximum, at solar minimum, and for Adams’s [Adams 1982] 10% worst-case environment.](image)

Figure 1.10 is a plot of the angular flux of cosmic ray particles (both solar and galactic) during solar minimum and maximum inside a spacecraft in a geosynchronous orbit with 25 mils of aluminum shielding, as a function of the Linear Energy Transfer (LET).

The term linear energy transfer (LET) is frequently used to describe the energy loss per unit path length of a particle as it passes through a material. LET has units of $\text{MeV} \cdot \text{cm}^2/\text{mg}$. Because the energy loss per unit path length (in $\text{MeV/cm}$) is normalized by the density of the target material (in $\text{mg/cm}^3$), LET may be quoted roughly independent of the target. The LET of a particle is easily related to its charge deposition per unit path length: in silicon, an LET of 97 $\text{MeV} \cdot \text{cm}^2/\text{mg}$ corresponds to a charge deposition of 1 $\text{pC}/\mu\text{m}$.
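The correspondence between 97 $\text{MeV} \cdot \text{cm}^2/\text{mg}$ and 1 $\text{pC}/\mu\text{m}$ can be checked numerically. The Python sketch below is only an illustration under two assumptions that are not stated in the text: a silicon density of 2.33 g/cm³ and a mean energy of 3.6 eV to create one electron-hole pair in silicon. The function name is hypothetical.

```python
# Minimal sketch: LET (MeV*cm^2/mg) -> deposited charge per unit length (pC/um) in silicon.
# Assumed material constants (not given in the text):
SILICON_DENSITY_MG_PER_CM3 = 2330.0   # 2.33 g/cm^3
EV_PER_ELECTRON_HOLE_PAIR = 3.6       # mean pair-creation energy in silicon
ELEMENTARY_CHARGE_C = 1.602176634e-19


def let_to_charge_pc_per_um(let_mev_cm2_per_mg: float) -> float:
    """Charge deposited per micrometre of track, in picocoulombs."""
    if let_mev_cm2_per_mg < 0.0:
        raise ValueError("LET must be non-negative")
    # Undo the density normalisation: energy loss per unit length in MeV/cm.
    mev_per_cm = let_mev_cm2_per_mg * SILICON_DENSITY_MG_PER_CM3
    # MeV -> eV (x 1e6) and cm -> um (/ 1e4).
    ev_per_um = mev_per_cm * 1e6 / 1e4
    pairs_per_um = ev_per_um / EV_PER_ELECTRON_HOLE_PAIR
    coulomb_per_um = pairs_per_um * ELEMENTARY_CHARGE_C
    return coulomb_per_um * 1e12  # C -> pC


if __name__ == "__main__":
    print(f"LET 97 -> {let_to_charge_pc_per_um(97.0):.3f} pC/um")
```

Under these assumptions the result is within about one percent of 1 pC/µm, consistent with the rule of thumb quoted above.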

#### 1.2.2 Terrestrial Radiation Environment

When cosmic rays reach the earth’s atmosphere, they collide with atomic nuclei in the air and create cascades of interactions and reaction products (leptons, photons, hadrons), including neutrons, called an air shower (Figure 1.12). Figure 1.11 shows the energy required for protons (blue) and other ions (red) to penetrate the magnetosphere [Stassinopoulos 1988].

Figure 1.11: Total energy required to penetrate the magnetosphere
Figure 1.12: Schematic view of cosmic rays causing cascades of particles. Annotations in the figure: primaries disappear at ~25 km; low-energy particles are deflected; ~1600/m²/s cascade to sea level; ~100/cm²/s at 12000 m; ~1/cm²/s sea-level flux.
The intensity of cosmic-ray-induced neutrons (and other secondary cosmic radiation, including protons) in the atmosphere varies with altitude (Figure 1.13) [O’Brien 1971, O’Brien 1978], location in the geomagnetic field and solar magnetic activity. Atmospheric shielding at a given altitude is determined by the mass thickness per unit area of the air above, called areal density or atmospheric depth.

![Figure 1.13: Particle flux vs. altitude at 54° latitude [O’Brien 1971, O’Brien 1978].](image)

The location and conditions for the reference cosmic-ray-induced terrestrial neutron differential flux have been chosen to be New York City (Figure 1.12) outdoors at sea level at a time of average solar activity. Per the JEDEC specification [JESD89A 2006], the reference neutron flux in New York City is 13 neutrons/cm²/hour [Gordon 2004].

##### 1.2.2.1 Thermal Neutrons

Thermal Neutrons are neutrons that have lost kinetic energy until they reach a state where they are in thermal equilibrium with their environment. Certain nuclear fission reactions become much more probable with these low-energy neutrons and result in reactions yielding charged particles.

The most common such reaction is with the $^{10}$B isotope of boron. Boron is used as a p-type dopant and is also used as an implant in insulating layers formed of Boron Doped Phosphosilicate Glass (BPSG). Thermal neutrons can interact with these materials, and the charged particles generated from this reaction can induce soft errors [Baumann 2005].

Recent work on thermal neutrons [Fang 2014, Wen 2010a, Wen 2010b] has shown that even when BPSG is not used in the fabrication process, devices can be sensitive to thermal neutrons.

##### 1.2.2.2 Muons

Atmospheric muons represent an important part of the natural radiation constraint at ground level. Muons belong to the Meson component in the atmospheric cosmic ray cascades and are the products of the decay of charged pions via the weak interaction. They are the most abundant charged particles at sea level [O’Brien 1971, O’Brien 1978]. Both negative and positive muons can lose their kinetic energy by ionization when they travel through matter [Serre 2012]. However, this interaction with matter is weak, so muons can travel large distances in matter and penetrate deeply into circuit materials. Ziegler and Lanford were the first authors to describe precisely how muons can interact with matter at relatively low incident primary energies [Ziegler 1979]. They decompose the interaction into three primary processes:

1. **Muon direct ionization wake.** A charged muon loses its kinetic energy passing through semiconductor material by excitation of bound electrons and frees electron-hole pairs along its path as a result.

2. **Electromagnetic scattering** which induces energetic coulomb silicon nucleus recoil.

3. **Capture** of the negative muons by atomic nuclei when they are quasi stopped in matter. This complex capture mechanism releases recoiling heavy nuclei with a simultaneous emission of light particles (neutrons, protons, deuterons, α particles, etc.).

Muon-induced SEEs were predicted in a number of early works on microelectronic reliability: Wallmark and Marcus provided a brief investigation of the role of these particles as one of the fundamental physical limits to continued microelectronic scaling [Wallmark 1962]. Ziegler and Lanford provided a much expanded investigation of cosmic ray induced error rates and predicted a dramatic increase in errors with decreasing critical charge [Ziegler 1979]. Recent experimental results [Sierawski 2010, Sierawski 2011] indicate that technology scaling increases the sensitivity of microelectronics to soft errors from low-energy muons.

##### 1.2.2.3 Alpha Particles

Alpha particles are a type of ionizing radiation emitted through the decay of unstable isotopes. They consist of two protons and two neutrons bound together into a particle identical to a helium nucleus. In semiconductor devices, the main source of alpha particles is from packaging materials. There are three radioactive decay chains (Figure 1.14) that are primarily responsible for the α particles:
1. The Thorium chain, which starts with Thorium-232 and finishes with Lead-208.

2. The Uranium-238 chain, which starts with Uranium-238 and finishes with Lead-206.

3. The Uranium-235 chain, which starts with Uranium-235 and finishes with Lead-207.

Certain reactions along these decay chains result in the emission of an alpha particle as shown by the red arrows in Figure 1.14.

![Thorium and Uranium Decay Chains](image)

Figure 1.14: Thorium and Uranium Decay Chains.

Alpha-particle-induced soft errors are primarily of concern for Ball Grid Array (BGA) packages, especially flip-chips. Figure 1.15 shows the cross section of a typical flip-chip package. Any material within 100 µm of the die could potentially emit alpha particles that affect the die. The energy of alpha particles is attenuated as they traverse the Back End Of Line (BEOL) materials before they reach the substrate (Figure 1.16).

Figure 1.15: Cross Section of Typical Flip-Chip Package.

Figure 1.16: Cross-section schematic of a UBM and solder bump for a flip-chip interconnect.

**Specifications of Materials Alpha Particle Emissivity** There are no formal definitions or standards (JEDEC, ISO, IEC, IEEE, etc.) for acceptable levels of materials alpha particle emissivity. The following two terms are commonly used in industry [Wilkinson 2011]:

- **Low Alpha (LA)**: \(\text{emissivity} \leq 0.05\ \alpha/\text{cm}^2/\text{hour}\) (sometimes expressed as \(50\ \alpha \cdot \text{khr}^{-1} \cdot \text{cm}^{-2}\))

- **Ultra Low Alpha (ULA)**: \(\text{emissivity} \leq 0.002\ \alpha/\text{cm}^2/\text{hour}\) (sometimes expressed as \(2\ \alpha \cdot \text{khr}^{-1} \cdot \text{cm}^{-2}\))
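The two emissivity classes and the per-hour to per-thousand-hours conversion can be illustrated with a minimal Python sketch. The function names and the `"unclassified"` label are hypothetical; only the two thresholds come from the text.

```python
# Thresholds quoted in the text, in alpha / cm^2 / hour.
LA_LIMIT = 0.05
ULA_LIMIT = 0.002


def to_per_khr(emissivity_per_hr: float) -> float:
    """Convert alpha/cm^2/hr to alpha/khr/cm^2 (counts per 1000 hours)."""
    return emissivity_per_hr * 1000.0


def classify(emissivity_per_hr: float) -> str:
    """Return the industry grade for a material emissivity (hypothetical helper)."""
    if emissivity_per_hr <= ULA_LIMIT:
        return "ULA"
    if emissivity_per_hr <= LA_LIMIT:
        return "LA"
    return "unclassified"
```

For instance, a material measured at 0.01 α/cm²/hour would fall in the LA class but not in the ULA class.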

### 1.3 Single Event Effects - Mechanism and Classification

Single Event Effects (SEEs) are induced by the interaction of an ionizing particle with electronic components. Ionizing particles can be primary (such as heavy ions in space environment or alpha particles produced by radioactive isotopes contained in the die or its packaging), or secondary (recoils) created by the nuclear interaction of a particle, like a neutron or a proton with silicon, oxygen or any other atom of the die.

Energetic particles can ionize atoms (directly or indirectly), generating electron-hole pairs. As long as the energies of the generated electrons and holes are higher than the minimum energy required to create an electron-hole pair, these new electrons and holes can generate additional electron-hole pairs. A single high-energy incident photon, electron, or proton can thus create thousands of electron-hole pairs.

This section describes the physical mechanisms that induce SEEs, then defines and classifies the different ways they alter circuit operation. First, a brief introduction to nuclear physics, interaction mechanisms, and energy deposition is given. Then, the effects observed in electronic devices are defined and classified.

### 1.3.1 Particles and Interactions

#### 1.3.1.1 Gamma and X-Ray Ionization

Photons interact with material through three different processes, namely the photoelectric (or fluorescent) effect, the Compton effect, and the pair production effect [McLean 1987]. These processes are illustrated in Figure 1.17. For each of these processes, the primary result of the interaction is the creation of energetic secondary electrons.

Low-energy photons interact with material predominantly through the photoelectric effect. The photoelectric effect is illustrated in Figure 1.17.a. The incident photon excites an electron from an inner shell of the target atom to a high enough state to be emitted free of the target atom. The incident photon is completely absorbed. Thus, the photoelectric effect creates a free electron (photoelectron) and an ionized atom. In addition, as the photoelectron is emitted, an electron in an outer orbit of the atom falls into the spot vacated by the photoelectron, causing a low-energy photon to be emitted.

For higher-energy photons, Compton scattering is the most probable type of interaction. Compton scattering is illustrated in Figure 1.17.b. The photon collides with an atom and transfers a fraction of its energy to an electron of the target atom, giving the electron sufficient energy to be emitted free of the target atom. A photon of lower energy is also created, which is free to interact with other target atoms. Like the photoelectric effect, Compton scattering creates a free electron and an ionized atom.

Pair production occurs only for very-high energy photons ($E > 3\text{MeV}$). This process is illustrated in Figure 1.17.c. The incident photon collides with a target atom creating an electron-positron pair. A positron has the same properties as an electron (charge and mass), except that the charge is positive. The incident photon is completely annihilated in pair production.

![Figure 1.17: Schematic drawing of three processes through which photons interact with material: a) photoelectric effect, b) Compton scattering, and c) pair production [McLean 1987].](image)

#### 1.3.1.2 Energetic Particle Ionization

There are two primary methods by which energetic particle radiation releases charge in a semiconductor device: direct ionization by the incident particle itself, and ionization by secondary particles created by nuclear reactions between the incident particle and the struck device. Both mechanisms can lead to integrated circuit malfunction.

**Direct Ionization**  As an energetic particle passes through a material, it loses energy by excitation and ionization of atoms, creating a very high density electron-hole plasma along the path of the particle. The amount of energy that a particle deposits per unit depth in a material is given by its stopping power. The mass-stopping power is defined as the linear energy transfer, LET. The integral of \( \text{LET} \) over path length gives the total deposited energy. Figure 1.19 [Sexton 1992] is a plot of stopping power (\( \text{LET} \)) for a 2.5 MeV alpha particle as a function of depth in silicon. The point of maximum stopping power is called the Bragg peak. The \( \text{LET} \) for a given particle depends on the target material and the particle’s energy.

Figure 1.18: Particle Interactions Methods: Direct and Indirect Ionization

Figure 1.19: Stopping power (LET) versus depth for an alpha particle in silicon [Sexton 1992].
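The relation between LET and deposited energy can be sketched numerically. The snippet below integrates a sampled LET profile with the trapezoidal rule and converts the result to collected charge. The profile values are invented for illustration and are not read from Figure 1.19; the 3.6 eV per electron-hole pair in silicon is a commonly quoted figure that does not appear in the text and is used here as an assumption.

```python
# Hypothetical LET profile: (depth in um, LET in MeV/um). Illustrative values only.
PROFILE = [(0.0, 0.15), (2.0, 0.17), (4.0, 0.20), (6.0, 0.25), (8.0, 0.20), (9.0, 0.0)]

EV_PER_PAIR_SI = 3.6      # assumed mean energy per electron-hole pair in silicon (eV)
ELEMENTARY_CHARGE = 1.602e-19  # coulombs


def deposited_energy_mev(profile):
    """Trapezoidal integral of LET over depth, giving deposited energy in MeV."""
    total = 0.0
    for (x0, let0), (x1, let1) in zip(profile, profile[1:]):
        total += 0.5 * (let0 + let1) * (x1 - x0)
    return total


def charge_fc(energy_mev):
    """Charge (in fC) generated by the given deposited energy."""
    pairs = energy_mev * 1e6 / EV_PER_PAIR_SI
    return pairs * ELEMENTARY_CHARGE * 1e15
```

Under these assumptions, each MeV deposited in silicon corresponds to roughly 44.5 fC of generated charge.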

![Figure 1.20: Schematic diagram and time dependence for charge collection by drift, funneling, and diffusion [Schwank 2008].](image)

If an energetic particle passes through a \( p - n \) junction, charge can be collected at the electrodes by drift of carriers from the depletion region. The drift of carriers to the electrodes occurs within hundreds of picoseconds after a particle strike. This is represented as \( Q_D \) in Figure 1.20 [Schwank 2008].

The amount of charge that is collected by drift of carriers within the depletion region can be greatly enhanced by field funneling [Hsieh 1981] (\(Q_F\) in Figure 1.20). The density of the electron-hole plasma created by the ion strike is considerably greater than the doping concentration of typical p-n junctions [McLean 1982]. The high concentrations of electrons and holes in the plasma distort the original depletion region of the junction along the path of the ion. As a consequence, the junction field region creates a funnel region that extends down into the substrate as depicted in Figure 1.20.

The funnel exists as long as the concentration of electron-hole pairs in the plasma created by the ion strike is large compared to the doping concentration of the substrate. Diffusion of carriers to the edge of the junction depletion or funnel region contributes another component to the collected charge. The diffusion of carriers takes much longer (nanoseconds to microseconds) than the drift component. The diffusion of carriers is noted as \( Q_{DF} \) in Figure 1.20.

**Indirect Ionization**  Light particles (neutrons and protons) usually do not produce enough charge to cause SEEs by direct ionization [Weulersse 2011]. However, protons and neutrons can both produce significant SEE rates due to indirect mechanisms. As a high-energy proton or neutron enters the semiconductor lattice, it may undergo a nuclear interaction with a target nucleus. Any one of several nuclear reactions may occur, including:

1. Elastic Interaction
2. Inelastic Interaction
3. Inelastic Collision
4. Nuclear Fission

**Elastic Interaction**  In the elastic process (Figure 1.21), the recoil nucleus is identical to the target nucleus. In the collision, the total kinetic energy and the momentum of the neutron-target nucleus system are conserved. A fraction of the energy of the neutron is given to the nucleus [Nicolaidis 2011, Chapter 2].

![Figure 1.21: Elastic Interaction](image)

**Inelastic Interaction**  Nonelastic interactions (spallation) group all the interactions that result in a fragmentation of the nucleus into two or more recoil fragments. Generally, the lighter recoil is used to denote the reaction: \((n, p)\), \((n, \alpha)\), \((n, d)\). The heavier element is obtained by balancing the number of neutrons and protons before and after the reaction. With \(^{28}\text{Si}\) as the target nucleus, the \((n, p)\) reaction results in a proton and an Al recoil, while the \((n, \alpha)\) reaction results in He and Mg recoils (Figure 1.22) [Nicolaidis 2011, Chapter 2].
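The nucleon balance used to identify the heavier recoil can be written out as a short Python sketch. The helper name and the small symbol table are hypothetical; the only physics used is conservation of proton number and mass number across the reaction.

```python
# Element symbols for the few atomic numbers needed in this illustration.
SYMBOLS = {12: "Mg", 13: "Al", 14: "Si"}

# (Z, A) of the incident neutron and of the light ejectiles named in the text.
NEUTRON = (0, 1)
EJECTILES = {"p": (1, 1), "d": (1, 2), "alpha": (2, 4)}


def residual_nucleus(target, ejectile, projectile=NEUTRON):
    """Return the heavy recoil, e.g. '28Al', by conserving Z and A."""
    z = target[0] + projectile[0] - EJECTILES[ejectile][0]
    a = target[1] + projectile[1] - EJECTILES[ejectile][1]
    return f"{a}{SYMBOLS[z]}"


SI28 = (14, 28)
```

With `SI28` as the target, `residual_nucleus(SI28, "p")` gives an aluminium recoil and `residual_nucleus(SI28, "alpha")` gives a magnesium recoil, in line with the reactions quoted above.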

![Inelastic Interaction Diagram](image)

**Inelastic Collision** - \((n, n')\)  In this reaction, the incident neutron is absorbed in the target nucleus and a short time later a neutron is ejected with a lower energy, sharing a part of the total kinetic energy with the recoil target nucleus (Figure 1.23) [Nicolaidis 2011, Chapter 2].

![Inelastic Collision Diagram](image)

**Fission**  Two isotopes of boron exist, \(^{10}\text{B}\) (19.9% abundance) and \(^{11}\text{B}\) (80.1% abundance). Unlike other isotopes, \(^{10}\text{B}\) is highly unstable when exposed to neutrons. Furthermore, while other isotopes emit only gamma photons after absorbing a neutron, the \(^{10}\text{B}\) nucleus fissions, producing an excited \(^{7}\text{Li}\) recoil nucleus and an alpha particle (Figure 1.24). Although neutrons with any energy can induce fission, the probability decreases rapidly with increasing neutron energy. Therefore, only thermal neutrons need to be considered [Nicolaidis 2011, Chapter 1].

#### 1.3.1.3 Cumulative Radiation Effects

Electronics used in space or highly radioactive environments may be degraded due to the cumulative effect of exposure to radiation.

**Total Ionizing Dose** Total Ionizing Dose (TID) effect results from charge being trapped in the oxide layer and causing a change in the characteristics of the transistor. Cumulative long term ionizing damage due to protons and electrons can cause devices to suffer threshold voltage shifts, increased device leakage (power consumption), timing changes, decreased functionality, etc.

**Displacement Effects** Highly energized particles may displace atoms in the silicon lattice (Figure 1.25) of active devices and thereby affect their function. Bipolar devices and especially optical devices may be very sensitive to this effect. Complementary Metal Oxide Semiconductor (CMOS) integrated circuits are normally not considered to suffer degradation by displacement damage.

### 1.3.2 Single Event Effects Classification

SEE is a general term that groups all the possible effects induced by the interaction of an ionizing particle with electronic components. These effects are classified into hard errors and soft errors: hard errors are non-recoverable errors, whereas soft errors may be recovered by a reset, a power cycle or simply a rewrite of the information. The following sections present these two classes.

Figure 1.25: Displacement Effect

Figure 1.26: Single Event Transient production: interaction of an ionizing particle with an inverter.

### 1.3.2.1 Soft Errors

**Single Event Transient**  Single Event Transients (SETs) are momentary voltage or current disturbances affecting combinational gates. In the case where a single particle strike (the particle itself or its recoils) affects two or more combinational gates, the SEE is called *Single Event Multiple Transient (SEMT)* [Harada 2011, Rossi 2005]. An SET only causes a transient at the output of the gate struck by the recoil ion, but this transient may propagate through subsequent gates and eventually cause a Soft Error when it reaches a memory element (Figure 1.27 [Gadlage 2009]).

![Figure 1.27: Single Event Transient: generation and propagation](Gadlage 2009)

**Single Event Upset**  A Single Event Upset (SEU) occurs when an ionizing particle strike modifies the electrical state of a storage cell, such that an error is produced when the cell is read. In a *Static Random Access Memory (SRAM)* or a flip-flop, the state of the memory is reversed. In a *Dynamic Random Access Memory (DRAM)*, the charge stored can be slightly modified and interpreted as a wrong value by the read circuitry. In the case where a single particle strike (the particle itself or its recoils) affects two or more storage cells, the SEE is called *Single Event Multiple Upset (SEMU)* [Dodd 2003].

**Single Bit Upset, Multi Cell Upset, Multi Bit Upset - SBU, MCU, MBU**  Single Bit Upsets (SBUs) are events, equivalent to SEUs, induced in a memory by SEEs. The interaction of an ionizing particle with the memory depends on the type of the memory. As an example, the production mechanism of the SBU in an SRAM device is similar to the SEU mechanism described earlier for sequential cells (Figure 1.28).

Figure 1.28: Single Event Upset mechanism.

In the case where a single particle strike (the particle itself or its recoils) affects two or more memory cells, the SEE is called a Multiple Cell Upset (MCU). If the physically neighboring cells affected by the particle interaction belong to the same logical word, then a Multiple Bit Upset (MBU) is produced.

The MCU/MBU analysis is particularly interesting when considering the eventual error protection mechanisms. SBUs are corrected by the most common error-mitigation techniques - the Single Error Correct, Double Error Detect (SECDED) codes, such as the Hamming code. Thus, the SBUs in SECDED protected memories will not need particular care. However, the MBUs will not be corrected by this code, causing further errors in the circuit. The MBU probability can be considerably reduced by implementing column multiplexing.
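The benefit of column multiplexing can be illustrated with a small Python sketch. It assumes a simple bit-interleaved layout in which physically adjacent columns of a row belong to different logical words; the mapping and function names are hypothetical and real memory layouts may differ.

```python
def logical_location(physical_col: int, mux: int):
    """Map a physical column to (word index in the row, bit index in the word),
    assuming adjacent columns are interleaved across `mux` words."""
    return physical_col % mux, physical_col // mux


def worst_bits_per_word(first_col: int, cluster_width: int, mux: int) -> int:
    """Largest number of upset bits landing in one logical word when a single
    strike flips `cluster_width` adjacent cells starting at `first_col`."""
    hits = {}
    for col in range(first_col, first_col + cluster_width):
        word, _ = logical_location(col, mux)
        hits[word] = hits.get(word, 0) + 1
    return max(hits.values())
```

In this model, a three-cell cluster produces a three-bit MBU without multiplexing (`mux=1`), whereas with `mux=4` each affected word sees a single flipped bit, which a SECDED code can correct.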

**Single Event Functional Interrupt - SEFI**

Single Event Functional Interrupt (SEFI) is a broad term that refers to an anomalous behavior observed in complex devices (flash memories, DRAM, SRAM, Field Programmable Gate Array (FPGA), microprocessors, micro-controllers, etc.). It can be the result of the upset of some registers or latches that are used in the configuration of the working modes of these complex devices. The effect of a SEFI is detectable and it does not result in permanent damage. A SEFI can be recovered by resetting or power-cycling the device.

**Single Event Latchup - SEL**

A Single Event Latchup (SEL) is a potentially catastrophic condition where a low-resistance path develops between power supply and ground of a device and remains after the triggering event is removed [Sexton 2003]. When currents are sufficiently high, metal traces can vaporize, bond wires can fuse open, and silicon regions can be melted due to thermal runaway. Once latched, this high current condition will continue until power is removed from the device or it fails catastrophically. Figure 1.29 shows the parasitic structure leading to latchup on the cross-section of a Bulk CMOS technology. Figure 1.30 shows the $I - V$ characteristic of a latchup.

### 1.3.2.2 Hard Errors

**Single Event Burnout - SEB**

A Single Event Burnout (SEB) occurs when the passage of an energetic heavy ion causes a power Field Effect Transistor (FET) to enter second breakdown [Sexton 2003]. If not rapidly quenched, the resultant high current causes the device to go into thermal runaway, resulting in destructive failure. Single Event Gate Rupture (SEGR) is often observed simultaneously with SEB in power Metal Oxide Semiconductor Field Effect Transistors (MOSFETs). Figure 1.31 shows the parasitic structure responsible for SEB on the cross-section of a vertical power MOSFET.

**Single Event Gate Rupture - SEGR**

Single-Event Gate-Rupture (SEGR) is a condition where the gate dielectric isolating the gate and channel regions fails [Sexton 2003]. The SEGR process [Sexton 2003] is initiated when a heavy ion strikes the device in the neck region (the neck region is the area between the p-body diffusions at the surface). The ion strike creates a filament of electron-hole pairs. For an n-channel power MOSFET, the generated holes drift toward the interface and the electrons toward the drain contact due to the electric field resulting from the positive drain bias. Upon reaching the interface, the holes start to pile up at the interface and leak off, only slowly, toward the source contact. This pool of positive charge increases the electric field in the oxide, and when the field exceeds a critical value, oxide breakdown occurs. The collected holes then discharge through the oxide, heating the structure locally. If the breakdown current lasts long enough, it creates a permanent short-circuit through the oxide. Figure 1.32 shows the SEGR mechanism.

Figure 1.29: Structure Leading to Latchup Highlighted on the Cross-Section of a Bulk CMOS Technology (n-substrate material).

Figure 1.30: I-V Characteristic of a Latchup and Thyristor Formed by PNPN Junction.

**Single Event induced Snap Back - SESB**  A Single Event induced Snap Back (SESB) is a stable regenerative condition similar to latchup, caused by drain-to-source breakdown in normal n-channel Metal Oxide Semiconductor (MOS) transistors [Sexton 2003]. Like latchup, a high current condition results that can cause permanent damage to a device. Unlike latchup, a p-n-p-n four layer structure is not necessary for snap back. For this reason, it is often referred to as transistor latchup.

All device types have the epitaxial n-layer on the highly doped n+ material. Since the off-state N-MOS transistors are responsible for the snapback in a CMOS circuit, we expect a well developed depletion region around the drain as shown in Figure 1.33.a. Soon after the passage of an ion through the depletion region, the electron-hole pairs commence movement along the field lines. Most electrons travel toward the drain, whereas the holes move mainly toward the source (Figure 1.33.b). Some holes, however, travel through the p-regions toward the ground plane. At this stage the parasitic bipolar transistor can be turned on as shown in Figure 1.33.c. Once the parasitic transistor is turned on and the regenerative breakdown condition has occurred, the transistor can be shut off only when the current between the drain and the source is reduced below the cut-off (sustaining) current level. The effect of funneling may accelerate the onset of the snapback. The introduction of the p-well feature slightly complicates the picture since additional parasitic (vertical) bipolar transistors become active. Nevertheless, the basic model describes the main snapback mechanism, i.e., a low resistance path is formed between the source and the drain of the off-state N-MOS transistor [Ochoa 1983].

Figure 1.32: SEGR Mechanism [Sexton 2003]

Figure 1.33: SESB Mechanism [Ochoa 1983]

### 2.1 Introduction

Hardware is intrinsically unreliable. External and internal perturbations can cause data corruption, faulty states and unpredictable circuit behavior. Single Event Effects (SEEs) are a particularly representative example of such issues. SEEs are caused by energetic particles from the environment (neutrons, protons, heavy ions, muons, ...) or from the device’s own materials (alpha particles emitted by radioactive contaminants).

Hardware (low-level) faults represent the direct outcome of a Single Event on the output of the affected cell: Single Event Transients in combinational cells and Single Event Upsets in sequential cells. These faults must propagate in the logic network up to the input of a memorization element: a memory block or a sequential cell. Then, if the event is latched in the memorization element, the fault becomes a Soft Error. If the Soft Error causes an observable modification of the expected system behavior, the occurrence is termed a Functional Failure. Analyzing the effect of faults induced by Soft Errors (SEs) in complex integrated circuits remains challenging. The vast majority of faults do not propagate due to the various de-rating (or masking) effects: Electrical De-Rating (EDR), Logic De-Rating (LDR), Temporal De-Rating (TDR), Memory De-Rating (MDR) and Functional De-Rating (FDR). In the following sections, these de-rating (or masking) mechanisms are described.

The following sections present the three main steps of a Soft Error Rate (SER) analysis methodology: the technology SER characterization of standard cells and memory blocks, the various de-rating factors and finally the overall SER calculation. The SER methodology presented focuses on non-destructive SEEs: Bit Upsets in memory blocks and sequential cells, Transients in combinational cells.

### 2.2 Technology SER Characterization

The technology SER characterization is the first step of the presented SER methodology. Raw SER data should be provided in terms of raw (intrinsic) rate/probability of occurrence of logic SEU or SET for combinational, sequential and memory cells for a specific environment.

The final operating environment should also be carefully analyzed. Of particular interest to most commercial and aeronautical applications is the natural terrestrial background environment, characterized by atmospheric neutrons and by internal alpha particles from contaminants. The neutron SER is specific to the technology and the environment (altitude and location). The alpha contribution depends strongly on the sensitivity of the cell to alpha particles and on the alpha emissivity rate of the packaging materials.

### 2.2.1 Memory Intrinsic SER Characterization

SRAM memory blocks are prime targets for any reliability-related initiative. Their high integration levels, reduced feature sizes and small critical charge make them sensitive to Single Event Effects. Thus, the first task consists in characterizing the SEE performance of the memory instances with regard to Single Event Upsets affecting data stored in the memorization cells, but also Single Event Effects on addressing/decoding/control logic. This analysis should cover both qualitative aspects (type, manifestation and outcome of the events) and quantitative aspects (event rate/type for a given working environment).

The set of memory characterization data is comprised of: Single Bit Upset (SBU) rates and Multiple Cell Upset (MCU) rates for each possible pattern and Multiple Bit Upsets (MBU) rates for different column multiplexing configurations.

### 2.2.2 Standard Cell Intrinsic SER Characterization

SEUs - Single Event Upsets affecting the sequential cells of the design have an obvious impact on the circuit reliability. Single Event Transients - SETs affecting combinational cells are much more difficult to characterize: the transient pulses at the output of the cell have various shapes, amplitudes and durations. Moreover, the SET parameters strongly depend on the state and the neighborhood of the cell. Lastly, the standard cell library may contain an order of magnitude more combinational cells with various functions and drive strengths than sequential cells, requiring an adequate characterization effort through radiation testing or simulation. Moreover, the actual cell behavior also depends on the circuit implementation and the usage of the cells for a given workload. The fan-out of a combinational cell will have a strong impact on SET characteristics. The signal values on the inputs of the cells will determine the sensitive transistors and implicitly the occurrence probability of the SEE.

Standard cell SER characterization data should be provided in terms of raw (intrinsic) rate/probability of occurrence of logic SEU or SET for combinational and sequential cells for a given environment. Pulse width distributions of logic, rectangular SETs should be also provided.

### 2.3 Masking Effects

Not all radiation induced faults propagate and produce errors because of the numerous masking effects. The raw rate of faults can be de-rated to obtain an effective error rate using a de-rating factor. In this work, a de-rating factor of 1 indicates that all faults propagate and a value of 0 indicates that all faults are blocked. The definitions are important as some authors use the term masking factor to indicate the fraction of faults that are masked, which is the opposite of a de-rating factor.
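As a minimal sketch of how a raw fault rate is de-rated, the Python snippet below multiplies a raw rate by a set of de-rating factors, each between 0 and 1 as defined in this work. Treating the factors as independent multipliers is a simplifying assumption made for illustration, and the function name and the example values are hypothetical.

```python
from math import prod


def effective_error_rate(raw_rate: float, deratings: dict) -> float:
    """De-rate a raw fault rate with factors in [0, 1].

    A factor of 1 means all faults propagate, 0 means all faults are blocked.
    Assumes (for illustration) that the factors combine multiplicatively.
    """
    for name, value in deratings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"de-rating factor {name} must lie in [0, 1]")
    return raw_rate * prod(deratings.values())


# Invented example values, not measured data.
EXAMPLE = {"EDR": 1.0, "LDR": 0.5, "TDR": 0.4, "FDR": 0.1}
```

With these invented values, a raw rate of 1000 (in any rate unit) would be reduced to an effective rate of about 20 in the same unit.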

### 2.3.1 Electrical De-Rating

The Electrical De-Rating (EDR) quantifies the electrical attenuation of an SET and thus its capability to propagate through the logic network.

One aspect of electrical de-rating is accounted for by considering how the induced analog pulse is mapped to a digital pulse. The shape of a radiation induced pulse is shown in Figure 2.1; based on the logic threshold voltage $V_{th}$, it can be modeled as a digital pulse of width $PW$ (Pulse Width). Pulses whose amplitude never reaches $V_{th}$ are masked [Hane 2008, Tanaka 2009]. A further aspect of EDR relates to the fact that when the pulse duration is short (comparable to the gate rise/fall times), it may be attenuated as it passes through downstream gates.
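The mapping from an analog pulse to a digital pulse width can be sketched as a threshold-crossing computation on a sampled waveform. The Python function below is a hypothetical helper that assumes a positive-going pulse and linear interpolation between samples; it is not the characterization flow used in this work.

```python
def digital_pulse_width(times, volts, v_th):
    """Total time the sampled waveform spends at or above v_th.

    Crossings are located by linear interpolation between samples.
    Returns 0.0 when the pulse never reaches v_th (electrically masked).
    """
    width = 0.0
    samples = list(zip(times, volts))
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        above0, above1 = v0 >= v_th, v1 >= v_th
        if above0 and above1:
            width += t1 - t0
        elif above0 != above1:
            # Interpolated crossing time inside this segment.
            tc = t0 + (v_th - v0) / (v1 - v0) * (t1 - t0)
            width += (tc - t0) if above0 else (t1 - tc)
    return width
```

For a triangular pulse that rises from 0 V to 1 V in 100 ps and falls back in another 100 ps, a threshold of 0.5 V yields a digital pulse of 100 ps, while a threshold above 1 V yields no pulse at all.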

### 2.3.2 Logic De-Rating

Logic De-Rating (LDR) consists in evaluating the propagation of the logic fault from the output of the affected cell to the inputs of a sequential/memory cell. According to the state of the circuit (the values of the signals and cell outputs), the propagation of the fault is subject to logic blocking [Vilchis 2012], [Nicolaidis 2011, Chapter 5]. Figure 2.2 presents the concept of Logic De-Rating for SEUs, within the clock cycle when the upset occurs (Figure 2.2.a) and over several clock cycles (Figure 2.2.b), and for SETs (Figure 2.2.c).
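Logic blocking can be illustrated on a toy circuit by exhaustive fault injection. The Python sketch below uses a hypothetical circuit `y = (a AND b) OR c`, flips the AND gate output, and counts the input vectors for which the flip reaches `y`. Equiprobable input vectors are assumed, which is an idealization of a real workload.

```python
from itertools import product


def circuit(a, b, c, flip_and=False):
    """Toy circuit y = (a AND b) OR c, with an optional fault on the AND output."""
    n1 = a & b
    if flip_and:
        n1 ^= 1  # inject the fault: invert the AND gate output
    return n1 | c


def logic_derating():
    """Fraction of input vectors for which the injected fault reaches the output."""
    vectors = list(product((0, 1), repeat=3))
    propagated = sum(circuit(*v) != circuit(*v, flip_and=True) for v in vectors)
    return propagated / len(vectors)
```

The fault is blocked whenever `c` is 1, since the OR gate output is then fixed, so half of the input vectors mask the fault in this example.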
Chapter 2. Single Event Effect Analysis

Figure 2.1: Electrical De-Rating

Figure 2.2: Logic De-Rating of SEUs and SETs

### 2.3.3 Temporal De-Rating

Temporal de-rating (TDR$^{1}$) relates to the opportunity window of a fault (SET or SEU) to be latched in a downstream memorizing element (flip-flop, latch, memory). If the fault is not stored in a register or a memory cell, there is no impact on the functioning of the circuit (the fault is dropped). The memorization of the fault depends on its type:

**SEU Temporal De-Rating** SEUs must arrive in the affected register early in the clock period in order to propagate through the logic network and reach the next sequential stage. Figure 2.3 shows the cases of masked and unmasked SEUs.

An SEU affecting a sequential cell, unless logically-masked, will propagate through the downstream logic combinational network and reach the next sequential logic stage. The SEU will remain on the flip-flop outputs until the next latching (clock) cycle.

![Figure 2.3: SEU Temporal derating](image)

Since the SEU arrival is a random phenomenon (like the SET), the SEU Temporal De-Rating is defined as the ratio between the opportunity window and the clock period; the opportunity window depends on the path delays and on the setup/hold times of the downstream flip-flop. The TDR is given by equation 2.1 [Nicolaidis 2011, Chapter 5].

\[
SEU \text{ Temporal De-Rating} = \frac{t_{\text{slack}} + \frac{t_{\text{setup}}}{2} - \frac{t_{\text{hold}}}{2}}{T_{\text{clock}}} \tag{2.1}
\]

$^{1}$ In some works [Seifert 2004, Ghahroodi 2011, Bramnik 2013, Nguyen 2005] the term Temporal Vulnerability Factor (TVF) is used instead of TDR.

If the downstream paths are relaxed (high slack), the opportunity window for an SEU is quite large. If the flip-flop belongs to the critical path or similarly-timed paths, the slack is very low and thus the opportunity window is reduced, with a sharp decrease in the TDR values. Intuitively, it is clear that the same circuit will exhibit worse sequential SER values at lower clock frequencies.
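Equation 2.1 can be evaluated directly, as in the following Python sketch. The function name is hypothetical, the clamping of the result to the range 0 to 1 is an added safeguard that is not part of the equation, and the timing values in the usage note are invented.

```python
def seu_tdr(t_slack, t_setup, t_hold, t_clock):
    """SEU Temporal De-Rating per equation 2.1 (all times in the same unit).

    The result is clamped to [0, 1]; the clamp is an assumption of this sketch.
    """
    window = t_slack + t_setup / 2 - t_hold / 2
    return max(0.0, min(1.0, window / t_clock))
```

Consider a path whose slack is 0.2 ns at a 1 ns clock period, with 0.05 ns setup and 0.03 ns hold times: the TDR is 0.21. If the same path is clocked at a 2 ns period, its slack grows to 1.2 ns and the TDR rises to about 0.6, which matches the observation that sequential SER worsens at lower clock frequencies.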

**SET Temporal De-Rating**  
SETs must cause an incorrect value on the input of a memorizing element during the latching window. Figure 2.4 shows the cases of masked (Figure 2.4(a)) and unmasked (Figure 2.4(b)) SETs. Single Event Transients manifest as short pulses on the output of the affected cell. SETs are possible in the case of combinational cells but also in the case of sequential elements such as flip-flops and latches when the Single Event only affects the output stages but not the inner memorization loop [Alexandrescu 2013].

![Figure 2.4: SET Temporal De-Rating](image)

An SET, unless logically masked, will propagate through the downstream combinational network and reach the next sequential logic stage. The TDR represents the probability of the SET being memorized. Since the SET is a random phenomenon, the TDR depends on the pulse width ($PW$) and the clock period. To a first approximation, the TDR is proportional to the ratio of the induced pulse width to the clock period as shown in equation 2.2, where $p_i$ is the probability of having a transient with $PW = i$.

$$TDR_{SET} = \frac{\sum_{i=\min PW}^{\max PW} p_i \cdot i}{T_{clk}} \tag{2.2}$$

For recent technology nodes, most combinational cells exhibit pulse durations of tens to hundreds of picoseconds [Evans 2013b, Costenaro 2013a], which is still short compared with a nanosecond clock period. Thus the temporal de-rating is strong for SETs and reduces their relative criticality. However, extensive test results have shown that at higher frequencies, the TDR factor increases and the effect of SETs is more severe [Mahatme 2011, Nguyen 2005, Gill 2009].
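Equation 2.2 amounts to dividing the mean pulse width by the clock period. The Python sketch below evaluates it for a discrete pulse width distribution; the function name and the distribution values are invented for illustration and are not characterization data.

```python
def set_tdr(pw_distribution: dict, t_clk: float) -> float:
    """SET Temporal De-Rating per equation 2.2.

    pw_distribution maps a pulse width i to its probability p_i;
    widths and t_clk must use the same time unit.
    """
    if abs(sum(pw_distribution.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * pw for pw, p in pw_distribution.items()) / t_clk


# Hypothetical distribution: pulse widths in ps mapped to probabilities.
PW_DIST = {50.0: 0.5, 100.0: 0.3, 200.0: 0.2}
```

With this invented distribution the mean pulse width is 95 ps, giving a TDR of 0.095 at a 1000 ps clock period and 0.19 at 500 ps, consistent with SETs becoming more severe at higher frequencies.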

In figure 2.5 the possible alignments of a pulse compared to the sampling clock edge are shown, both for the case when the pulse is longer and shorter than the setup-hold window. For the cases of $PW > t_{setup} + t_{hold}$, the Overlapping Width, $OW$, is defined to be the extent of the pulse that lies within the setup-hold window.

![Figure 2.5: Possible Alignments of Pulses](image)

The error capture probability is taken to be proportional to the ratio of $OW$ to the full setup-hold window: $\frac{OW}{t_{\text{hold}} + t_{\text{setup}}}$.

For the cases of $PW < t_{\text{setup}} + t_{\text{hold}}$, the two violation cases are considered together, and the error latching probability is taken to be linear with the overlapping width ratio as before. An SET can occur at any time in the clock period with a uniform probability. In both cases above, averaged over the full clock period, the overall error latching probability is $\frac{PW}{T_{\text{clk}}}$. The calculated error probability for each case is shown in Table 2.1.
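The averaging argument can be checked numerically. The Python sketch below sweeps the pulse start time uniformly over one clock period, takes the capture probability at each alignment as the overlapping width divided by the setup-hold window, and averages. The function names, the placement of the window inside the period, and the midpoint-rule integration are choices of this sketch, not of the cited analysis.

```python
def overlap(a0, a1, b0, b1):
    """Length of the intersection of intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def mean_capture_probability(pw, t_setup, t_hold, t_clk, steps=20_000):
    """Average of OW / (t_setup + t_hold) over uniformly distributed pulse starts."""
    window = t_setup + t_hold
    if pw + window > t_clk:
        raise ValueError("sketch assumes the pulse and window fit in one period")
    w0 = pw  # window offset chosen so that no overlapping pulse wraps around
    total = 0.0
    for k in range(steps):
        start = (k + 0.5) * t_clk / steps  # midpoint rule over the period
        total += overlap(start, start + pw, w0, w0 + window) / window
    return total / steps
```

With a 50 ps setup-hold window and a 1000 ps clock period, both a 100 ps pulse (longer than the window) and a 30 ps pulse (shorter than the window) give an average equal to the pulse width divided by the clock period, as stated above.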

![Figure 2.5: SET Pulse Alignment Cases](Evans 2013a, Costenaro 2013b)

It is important to note that in this analysis, the pulse width of interest is that at the input to the sampling flip-flop. The shape of a radiation induced pulse may be distorted as it propagates through a combinatorial network. This effect is referred to as Propagation Induced Pulse Broadening (PIPB) (Figure 2.6) and has been extensively studied in [Cavrois 2008, Sterpone 2011, Ferlet-Cavrois 2010].

### 2.3.4 Functional De-Rating

Functional De-Rating (FDR) [Silburt 2009], [Nicolaidis 2011, Chapter 5] evaluates whether the Soft Error has any observable impact (failure classes) on the functioning of the circuit, board or system. It takes into account the actual usage of the circuit and the function of the system.

The observability criteria could involve both objective and subjective aspects to discriminate between the fault-free and faulty states of the system. An objective example of the discrimination criteria could be the comparison of the primary outputs of the circuit under test to the reference, fault-free ones. Any difference could be legitimately classified as a failure. However, a subjective observer may add his own weight/criticality to the observed difference. The recorded primary output difference may not be relevant for specific cases or applications, removing the observed failure.

Table 2.1: SET Pulse Alignment and Capture Probabilities [Evans 2013a, Costenaro 2013b]

<table>
<thead>
<tr>
<th>Pulse Width</th>
<th>Case</th>
<th>Case Error Prob.</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>$PW &gt; t_{setup} + t_{hold}$</td>
<td>a</td>
<td>0.0</td>
<td>Correct value latched</td>
</tr>
<tr>
<td></td>
<td>b</td>
<td>$\frac{1}{T_{clk}} \int_0^{t_{setup}} \frac{OW}{t_{hold} + t_{setup}} \, dOW$</td>
<td>Set-up time violation</td>
</tr>
<tr>
<td></td>
<td>c</td>
<td>$\frac{1}{T_{clk}} \int_0^{PW - t_{setup} - t_{hold}} \frac{OW}{t_{hold} + t_{setup}} \, dOW$</td>
<td>Hold time violation</td>
</tr>
<tr>
<td></td>
<td>d</td>
<td>$\frac{1}{T_{clk}} \int_0^{PW - t_{setup}} \frac{OW}{t_{hold} + t_{setup}} \, dOW$</td>
<td>Wrong value latched</td>
</tr>
<tr>
<td>$PW &lt; t_{setup} + t_{hold}$</td>
<td>e</td>
<td>$\frac{1}{T_{clk}} \int_0^{t_{hold}} \frac{OW}{t_{hold} + t_{setup}} \, dOW$</td>
<td>Set-up time violation</td>
</tr>
<tr>
<td></td>
<td>f</td>
<td>$\frac{1}{T_{clk}} \int_0^{t_{hold}} \frac{OW}{t_{hold} + t_{setup}} \, dOW$</td>
<td>Hold time violation</td>
</tr>
<tr>
<td></td>
<td>g</td>
<td>0.0</td>
<td>Correct value latched</td>
</tr>
<tr>
<td>Overall</td>
<td>h</td>
<td>0.0</td>
<td>Correct value latched</td>
</tr>
<tr>
<td></td>
<td>i</td>
<td>$\frac{1}{T_{clk}} \int_0^{t_{setup}} \frac{OW}{t_{hold} + t_{setup}} \, dOW$</td>
<td>Set-up time violation</td>
</tr>
<tr>
<td></td>
<td>j</td>
<td>$\frac{PW}{T_{clk} (t_{hold} + t_{setup} - PW)}$</td>
<td>Metastability</td>
</tr>
<tr>
<td></td>
<td>k</td>
<td>$\frac{1}{T_{clk}} \int_0^{t_{hold}} \frac{OW}{t_{hold} + t_{setup}} \, dOW$</td>
<td>Hold time violation</td>
</tr>
<tr>
<td></td>
<td>l</td>
<td>0.0</td>
<td>Correct value latched</td>
</tr>
</tbody>
</table>

Figure 2.6: Propagation Induced Pulse Broadening - PIPB

The failure classification implies some degree of subjectivity, since a criticality parameter can also be added to the fault classes. Furthermore, the usage of the circuit has a strong impact on the failure analysis, since different functional modes or applications exhibit different SE-related failures. If the circuit under test has a clearly stated function, then the functional de-rating has to consider this specific function and the effect of the relevant parameters. However, for general-purpose circuits such as CPUs, the possible application field is quite large, which makes the FDR analysis much more complex.

As stated, the FDR computation takes into account the propagation of the Soft Error during several clock cycles. The implied goal is to evaluate whether the SE

- has been silently discarded (dropped), without any further impact on the functioning of the circuit
- remains in the circuit in a latent state without any observable degradation of the function or structure of the circuit
- produces a functional failure.

Unlike the other de-rating factors, the FDR is not a single number: for a given application, there are as many FDR values as failure classes considered.
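The idea of one FDR value per failure class can be illustrated with a small Python sketch. The outcome labels, the class names (`data_corruption`, `hang`) and the campaign counts are hypothetical; the sketch only assumes that each injected soft error is classified into exactly one outcome.

```python
from collections import Counter


def functional_derating(outcomes, failure_classes):
    """Return one FDR value per failure class: the fraction of injected
    soft errors whose simulation ended in that class."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {cls: counts[cls] / total for cls in failure_classes}


# Hypothetical fault-injection campaign: each entry is the outcome observed
# for one injected soft error after several clock cycles.
outcomes = (["dropped"] * 70 + ["latent"] * 10
            + ["data_corruption"] * 15 + ["hang"] * 5)

fdr = functional_derating(outcomes, failure_classes=["data_corruption", "hang"])
print(fdr)
```

Dropped and latent errors do not contribute to any failure class here, which is why the per-class FDR values do not sum to one.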

2.3.5 Memory De-Rating

The Memory De-Rating [Alexandrescu 2011] represents the portion of time during which the data stored in a memory will eventually be read and thus used by the application. This interval is called the vulnerability window and corresponds to the time between a write access to an address and the last read access to that address before the end of the simulation or before another write access to that address (Figure 2.7). Conversely, during the time between the last read access and the next write access, the data is still in the memory but will not be used by the circuit. If an upset occurs during this time, it has no effect because the corrupted data will never be read and will be overwritten by the write access. Figure 2.7 shows a series of read and write accesses to a memory address; the corresponding memory de-rating is given by equation 2.3.

\[
Memory \ De-Rating = \frac{(t_2 - t_0) + (t_4 - t_3) + (t_7 - t_6)}{T_e - T_s} \quad (2.3)
\]
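A minimal Python sketch of this computation for a single memory address is given below. The access trace is hypothetical (it is not the sequence of Figure 2.7), and the function name and the `(time, kind)` encoding of accesses are assumptions made for illustration.

```python
def memory_derating(accesses, t_start, t_end):
    """Sum the vulnerability windows (write -> last read before the next
    write) of one address and divide by the observed time span."""
    vulnerable = 0.0
    last_write = None
    last_read = None
    for time, kind in sorted(accesses):
        if kind == "W":
            # Close the window opened by the previous write, if it was read.
            if last_write is not None and last_read is not None:
                vulnerable += last_read - last_write
            last_write, last_read = time, None
        elif kind == "R" and last_write is not None:
            last_read = time
    # Close the window that is still open at the end of the simulation.
    if last_write is not None and last_read is not None:
        vulnerable += last_read - last_write
    return vulnerable / (t_end - t_start)


# Hypothetical trace: the write at t=5 is never read, so it adds no window.
trace = [(0, "W"), (1, "R"), (2, "R"), (3, "W"), (4, "R"),
         (5, "W"), (6, "W"), (7, "R")]
mdr = memory_derating(trace, t_start=0, t_end=10)
print(mdr)
```

For this trace the windows are (2 - 0), (4 - 3) and (7 - 6), which mirrors the structure of the numerator of equation 2.3.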

Figure 2.7: Sequence of memory accesses

2.4 Overall SER Computation

The overall SER analysis combines the data obtained from the technology SER characterization with the de-rating information to provide application- and environment-dependent SER figures for the considered design.

For a given application and considering a specific failure class \( j \), the failure rate for a chip can be calculated with the following equations:

\[
SER_{\text{chip},j} = SER_{\text{sequential},j} + SER_{\text{comb},j} + SER_{\text{memory},j} + SER_{\text{clock},j} \tag{2.4}
\]

The contributions of the sequential, combinational and memory portions of the circuit to the overall failure rate can be calculated with the following equations:

\[
SER_{\text{sequential},j} = \sum_{i\in \text{Flip-Flop}} \text{SEU-FIT}_i \cdot LDR_i \cdot TDR_i \cdot FDR_{i,j} \tag{2.5}
\]

\[
SER_{\text{comb},j} = \sum_{i\in \text{Gate}} FDR_{i,j} \cdot LDR_{i,j} \cdot \int_{w=\min}^{w=\max} \text{SET-FIT}_i(w) \cdot TDR_i(w) \cdot EDR_i(w) \, dw \tag{2.6}
\]

\[
SER_{\text{memory},j} = \sum_{i\in \text{Memory}} \text{FIT}(i) \cdot MDR_{i,j} \tag{2.7}
\]

where:

- \( j \) represents a class of failure (all equations)
- \( \text{SEU-FIT}_i \) represents the intrinsic rate of occurrence of SEUs for the sequential instance \( i \) (equation 2.5)
- \( w \) represents the pulse width (equation 2.6)
- \( \text{SET-FIT}_i(w) \) represents the intrinsic rate of occurrence of SETs for the combinational instance \( i \) (equation 2.6)
- \( \text{FIT}(i) \) represents the uncorrectable error rate for the memory instance \( i \) (equation 2.7)
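Equations 2.4 to 2.7 can be sketched in Python as follows, for one failure class \( j \). All numeric values are hypothetical. The integral over the pulse width in equation 2.6 is approximated by a sum over pulse-width bins, where each bin is assumed to carry the SET FIT already integrated over its width; the clock contribution is passed in as a given number since no equation is provided for it here.

```python
def ser_sequential(flip_flops):
    """Equation 2.5: sum of SEU-FIT * LDR * TDR * FDR over flip-flops."""
    return sum(f["seu_fit"] * f["ldr"] * f["tdr"] * f["fdr"] for f in flip_flops)


def ser_comb(gates):
    """Equation 2.6 with the integral over w replaced by a sum over
    pulse-width bins (set_fit is the FIT falling inside the bin)."""
    total = 0.0
    for g in gates:
        integral = sum(b["set_fit"] * b["tdr"] * b["edr"] for b in g["bins"])
        total += g["fdr"] * g["ldr"] * integral
    return total


def ser_memory(memories):
    """Equation 2.7: sum of FIT * MDR over memory instances."""
    return sum(m["fit"] * m["mdr"] for m in memories)


def ser_chip(flip_flops, gates, memories, ser_clock=0.0):
    """Equation 2.4: total failure rate for one failure class."""
    return (ser_sequential(flip_flops) + ser_comb(gates)
            + ser_memory(memories) + ser_clock)


# Hypothetical design data for a single failure class.
flip_flops = [{"seu_fit": 100.0, "ldr": 0.5, "tdr": 0.5, "fdr": 0.2}]
gates = [{"fdr": 0.5, "ldr": 0.5,
          "bins": [{"set_fit": 10.0, "tdr": 0.1, "edr": 1.0},
                   {"set_fit": 20.0, "tdr": 0.2, "edr": 0.5}]}]
memories = [{"fit": 1000.0, "mdr": 0.01}]

total_fit = ser_chip(flip_flops, gates, memories)
print(total_fit)
```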

A high-level view of the SER analysis from faults to system-level failures is shown in figure 2.8 [Evans 2014]. On the left are the faults that can be induced in flip-flops, logic gates and memories. The de-rating effects are shown in the middle, and the failure rate for each failure class is on the right.
Figure 2.8: Soft Error Effect Analysis [Evans 2014]
3.1 Introduction

The continuing evolution of technology allows building increasingly complex electronic devices integrating more and more functions. This evolution is not free of problems or, more appropriately, challenges to overcome. An increasing source of reliability problems in new technological processes is the perturbation induced by energetic particles (the SEEs). The first applications to raise interest in SEEs were limited to specific domains: aerospace [Binder 1975], high-reliability applications, nuclear facilities equipment and implantable medical devices [Bradley 1998]. However, technological advances make possible the continuous reduction of the transistor size, rendering the components more sensitive to perturbations induced by radiation. Thus, it is no longer possible to ignore Single Events for present and future technologies working in a natural environment [Normand 1996].

Memory devices are invariably amongst the first circuits to be implemented in a new process. Their highly regular structure makes them perfect candidates, as well as a highly effective benchmark and test vehicle, for estimating performance and reliability metrics, including the SER [Cannon 2004]. In contrast, logic networks have a much more complex internal structure that allows SEEs to manifest in very diverse ways with varying levels of criticality. Evaluating the sensitivity of a circuit with respect to SEEs in a neutron environment is not an easy task. Intrinsic (raw) cell sensitivity figures must be provided as a starting point, which is a challenge in itself [Vial 1998, Tosaka 1999]. The Single Event Upsets (SEUs) affecting the sequential cells of the design have an obvious impact on the circuit reliability. Single Event Transients affecting combinational cells are much more difficult to characterize: the transient pulses at the output of the cell have various shapes, amplitudes and durations. Moreover, the SET parameters strongly depend on the state and the neighborhood of the cell. Lastly, the standard cell library may contain an order of magnitude more combinational cells, with various functions and drive strengths, than sequential cells, requiring an adequate characterization effort through radiation testing or simulation.

The research community offers a wealth of solutions for each step of the design flow and for any practical representation of the circuit, with a wide spectrum of performance and ease of use. Obviously, the prime targets of the SER characterization and improvement efforts are the memory and sequential cells. In previous technology generations, combinational cells arguably had a limited criticality, due to their lower intrinsic sensitivity (compared with same-process sequential cells) and the stronger electrical and timing de-rating of SET events. However, SETs are expected [Sanda 2005] to become a critical contributor to the overall circuit SER, which justifies the marked increase in interest that both academia and industry show in these events. We present in this paper the results of a practical, candid approach to a possible exhaustive SET evaluation flow in an industrial setting. The main steps of this process consist in:

- Fully characterize the standard cell library using a process and library-aware SER tool.
- Evaluate SET effects in the logic networks of the circuit using a variety of dynamic (simulation-based) and static (probabilistic) methods.
- Compute overall SET figures.

The considered library is the Nangate 45nm Open Cell Library [Nangate 2008], characterized using a 45nm generic SER process database. A purely-combinational

3.2 SET Characterization of the Standard Cell Library

The radiation testing of combinational cells has the specific advantage of measuring the cells' sensitivity in the intended working environment. To accomplish this task, dedicated test vehicles have to be designed [Perez 2006, Nicolaidis 2003, Eaton 2004]. These sensors provide an accurate measure of the SET rates and also allow the pulse width of the observed events to be measured. Another approach to the combinational cell SER study consists in using software tools, either standard (full 3D-Technology CAD (TCAD) tools, SPICE simulators using double-exponential SET and/or $Q_{crit}$ models) or dedicated, such as those listed below.

• A first class of tools that can be used to study radiation-induced single event transients consists of the nuclear physics software tools: HETC [Townsend 2005]; GNASH [Young 1977]; MCNP [Forster 2004]; MC-RED [Wrobel 2001]; Geant4 [Agostinelli 2003]. These tools are quite complex and may require multiple competencies in physics and semiconductors.

• A second class of tools, using statistical methods, has also been proposed: the BGR method [Letaw 1991] and the DASIE tool [Hubert 2001], which allow cell designers to approach the SEE phenomena with a good degree of comfort.

• A third class of prediction tools, using Monte-Carlo techniques [Reed 2013], is able to accurately take into account the geometry/topology of the device: MC-ORACLE [Wrobel 2011]; the MUSCA SEP3 platform [Hubert 2009]; the SEMM [Murley 1996] and SEMM-2 [Tang 2008] tools.
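The double-exponential SET current model mentioned above for SPICE-level studies is commonly written as $I(t) = \frac{Q}{\tau_f - \tau_r}\left(e^{-t/\tau_f} - e^{-t/\tau_r}\right)$. The Python sketch below evaluates this pulse and checks numerically that it integrates to the collected charge $Q$. The parameter values (charge and time constants) are hypothetical and do not come from any of the tools listed above.

```python
import math


def double_exponential_current(t, charge, tau_rise, tau_fall):
    """Double-exponential current pulse; with charge in fC and times in ps
    the result is in fC/ps (i.e. mA)."""
    if t < 0.0:
        return 0.0
    scale = charge / (tau_fall - tau_rise)
    return scale * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))


def collected_charge(charge, tau_rise, tau_fall, t_end, steps=20_000):
    """Integrate the current pulse from 0 to t_end with the midpoint rule."""
    dt = t_end / steps
    return sum(
        double_exponential_current((k + 0.5) * dt, charge, tau_rise, tau_fall) * dt
        for k in range(steps)
    )


# Hypothetical parameters: 10 fC deposited, 5 ps rise and 50 ps fall constants.
q_total = collected_charge(charge=10.0, tau_rise=5.0, tau_fall=50.0, t_end=1000.0)
print(q_total)
```

A $Q_{crit}$ style criterion would then compare the collected charge with a cell-dependent critical charge; that threshold is not modeled here.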

Recently, dedicated tools such as the TFIT tool [Hane 2008, Belhaddad 2006, Belhaddad 2008] have been specifically developed to predict and improve the cell SER performance. This generation of tools allows reasonably accurate calculation of the electrical effect of a particle impact on a transistor, a cell, or a circuit early in the design flow, at much faster speeds than traditional 3D-TCAD simulations (the 3D-TCAD approach, moreover, does not apply to neutron impacts or circuit-level analysis). The tool is able to provide SER data for a variety of operating environments characterized by the type of particles.

![Figure 3.1: Single Event Transient - Production and Modeling: Interaction of a charged particle with the transistor](image)

![Figure 3.2: Single Event Transient - Production and Modeling: SET: Analog transient and logic model](image)

The SET is a direct consequence of a SEE. The production mechanism (Figure 3.1) and its manifestation at the output of the affected cell (Figure 3.2) have been presented in the literature [Dodd 2004, Hass 1999b, Gadlage 2004]. We use a transient logic fault model with an occurrence probability, the SET SER. One Failure In Time (FIT) is one SET during a billion working hours; the figures are given for a MegaCell, which is $2^{20}$ (1048576) identical cells. Several factors (external and internal) may have an impact on the SET characteristics and propagation:

- The state of the cell.
- The supply voltage.
- The capacitive load on the charge collection.
- Threshold voltages of the downstream cell.

We only consider transient pulses that have sufficient voltage amplitude to switch the transistors of the following fan-out cell ($V_{SET} > V_{THRESHOLD}$) and a transient pulse duration ($PW$) large enough to propagate through at least a few levels of combinational cells. We have selected a practical minimum pulse width of 25 ps.

### 3.2.1 TFIT Overview

**TFIT** [Hane 2008, Belhaddad 2006, Belhaddad 2008] is a fast simulation tool that is used to predict and improve the SER and the FIT performance of cell designs before production. **TFIT** allows accurate calculation of the electrical effect of a particle impact on a transistor, a cell, or a circuit early in the design flow, at much faster speeds than traditional 3D-TCAD simulations (the 3D-TCAD approach, moreover, does not apply to neutron impacts or circuit-level analysis). **TFIT** interfaces with SPICE simulators, so the electrical impact of the particle on a transistor is analyzed at the level of a whole cell or circuit. Particles can be either neutrons (cosmic rays), alpha particles or heavy ions. Figure 3.3 shows the main modules of **TFIT**.

**TFIT** reads the input design in SPICE netlist format and calculates the single event effects, the cross sections or the FIT values according to the options provided in a configuration file. For each transistor specified in the input, **TFIT** extracts a networking environment and then characterizes these environments by means of SPICE simulation. This characterization, along with the process technology and the electrical data in the SPICE netlist, is used to generate the current pulses representing the currents induced by particle impacts.

### 3.2.1.1 Technology Response Model

For every given process node, the responses of both OFF-state NMOS and PMOS transistors struck by ionizing particles are computed using 3D-TCAD simulation. The LET, impact point and angle of the ionizing particles, as well as the electrical environment of the device (the supply voltage applied to the device), are taken into account. An appropriate **Design of Experiment (DoE)** is run to build the **TFIT** Technological Response Model, which consists of a collection of current pulses corresponding to the various cases presented above. These two sets of curves (corresponding to OFF-NMOS and OFF-PMOS transistors) are stored in two databases that are then used during the **TFIT** simulation.
3.2.1.2 Nuclear Database

TFIT uses a Nuclear Database to evaluate any possible secondary particle produced by a nuclear reaction between a neutron and the silicon atoms. Direction and energy of those secondary particles are studied to account for their interaction with the sensitive volumes of the cell (previously computed by the tool). Depending on the type of interaction, a current is injected while the output of the cell is monitored to observe any possible electrical event (i.e. Single Event Transients).

3.2.1.3 TFIT Analysis on NANGATE 45nm OpenCell Library

The Nangate 45nm Open Cell Library [Nangate 2008] is an open-source, standard-cell library provided for the purposes of testing and exploring Electronic Design Automation (EDA) flows. We have used the October 2008 SP1 version of the package. The library contains 134 standard cells: 9 non-functional (fill, logic0-1 and antenna), 16 flip-flops (standard, with Set/Reset/Both, Scan), 5 latches, 102 combinational cells, a half-adder and a full-adder. The combinational cells offer several logic functions (AND, NAND, OR, NOR, XOR, XNOR, OR-AND, AND-OR, Buffers and Inverters) with different drive strengths.

TFIT has been used to characterize the whole library. For each cell, the effects of neutrons and alpha particles have been studied. The data gathered during the analysis is too voluminous to be presented here. The following paragraphs and tables present smaller sets of data for selected cases. The data presented in this paper is in no way associated with real measurements or with field or test data. We do not benchmark the Open Cell library against other solutions. However, the library does not show any particular weakness or negative aspect. On the contrary, the Open Cell library is a very good vehicle for advanced studies and tool development.

3.2.2 Per-cell state SER figures

All the SER numbers are expressed in FIT (for a MegaCell). The overall SER value is computed assuming equi-probable cell states; table 3.1 presents the SER figures for the NAND2_X1 cell and table 3.2 those for the INV_X1 cell. The observed SETs exhibit short pulse widths, with a very low rate of events longer than 75 ps.

<table>
<thead>
<tr>
<th>State</th>
<th>Pulse Width</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>&gt; 25ps</td>
</tr>
<tr>
<td>A1=0, A2=0</td>
<td>24</td>
</tr>
<tr>
<td>A1=1, A2=0</td>
<td>227</td>
</tr>
<tr>
<td>A1=0, A2=1</td>
<td>67.7</td>
</tr>
<tr>
<td>A1=1, A2=1</td>
<td>37.3</td>
</tr>
<tr>
<td>Overall SER</td>
<td>89.0</td>
</tr>
</tbody>
</table>

Table 3.1: SER Values for NAND2_X1

<table>
<thead>
<tr>
<th>State</th>
<th>Pulse Width</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>&gt; 25ps</td>
</tr>
<tr>
<td>A=0</td>
<td>56.4</td>
</tr>
<tr>
<td>A=1</td>
<td>21.3</td>
</tr>
<tr>
<td>Overall SER</td>
<td>38.8</td>
</tr>
</tbody>
</table>

Table 3.2: SER Values for INV_X1
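The "Overall SER" rows of tables 3.1 and 3.2 follow from averaging the per-state figures under the equi-probable state assumption. The Python sketch below reproduces that average from the tabulated values; the optional `state_probabilities` argument is a hypothetical extension for non-uniform state distributions and is not part of the reported flow.

```python
def overall_ser(per_state_fit, state_probabilities=None):
    """Weighted average of per-state SER values (FIT per MegaCell).
    Without explicit probabilities, all states are equi-probable."""
    states = list(per_state_fit)
    if state_probabilities is None:
        state_probabilities = {s: 1.0 / len(states) for s in states}
    return sum(per_state_fit[s] * state_probabilities[s] for s in states)


# Per-state values (pulse width > 25 ps) taken from tables 3.1 and 3.2.
nand2_x1 = {"A1=0,A2=0": 24.0, "A1=1,A2=0": 227.0,
            "A1=0,A2=1": 67.7, "A1=1,A2=1": 37.3}
inv_x1 = {"A=0": 56.4, "A=1": 21.3}

print(round(overall_ser(nand2_x1), 2))
print(round(overall_ser(inv_x1), 2))
```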

According to the results in table 3.3, the most sensitive combinational cell is the full adder FA_X1. The sensitivity of this cell is comparable to that of sequential cells. In addition, the cell exhibits a significant SER for long SETs.

As a quick comparison, table 3.4 presents the results for the DFF_X1 (flip-flop) and DFFR_X1 (flip-flop with reset) cells. The reset state of the DFFR cell is not sensitive and is not presented in the table.

3.2.3 Overall per-cell SER figures

Table 3.5 presents the overall SER figures (assuming equi-probable states) for a selection of combinational cells. The SET pulse width threshold is set to 25 ps.

The results of the sequential cells characterization are in very good agreement with internal results from the radiation testing of same-generation test vehicles. The

Table 3.3: SER Values for FA_X1

<table>
<thead>
<tr>
<th>State</th>
<th>Pulse Width</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>&gt;25ps</td>
</tr>
<tr>
<td>A=0, B=0, CI=0</td>
<td>277</td>
</tr>
<tr>
<td>A=0, B=0, CI=1</td>
<td>231</td>
</tr>
<tr>
<td>A=1, B=0, CI=0</td>
<td>244</td>
</tr>
<tr>
<td>A=1, B=0, CI=1</td>
<td>309</td>
</tr>
<tr>
<td>A=0, B=1, CI=0</td>
<td>209</td>
</tr>
<tr>
<td>A=0, B=1, CI=1</td>
<td>377</td>
</tr>
<tr>
<td>A=1, B=1, CI=0</td>
<td>200</td>
</tr>
<tr>
<td>A=1, B=1, CI=1</td>
<td>145</td>
</tr>
<tr>
<td>Overall SER</td>
<td>249</td>
</tr>
</tbody>
</table>

Table 3.4: SER Values for DFF_X1 and DFFR_X1

<table>
<thead>
<tr>
<th>CK</th>
<th>D</th>
<th>Q</th>
<th>DFF X1</th>
<th>DFFR X1</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Neutron</td>
<td>Alpha</td>
<td>Neutron</td>
<td>Alpha</td>
</tr>
<tr>
<td>0</td>
<td>179</td>
<td>84.2</td>
<td>169</td>
<td>246</td>
</tr>
<tr>
<td>0</td>
<td>151</td>
<td>192</td>
<td>200</td>
<td>109</td>
</tr>
<tr>
<td>1</td>
<td>146</td>
<td>125</td>
<td>321</td>
<td>305</td>
</tr>
<tr>
<td>1</td>
<td>283</td>
<td>298</td>
<td>82.1</td>
<td>108</td>
</tr>
<tr>
<td>0</td>
<td>179</td>
<td>84.2</td>
<td>157</td>
<td>246</td>
</tr>
<tr>
<td>0</td>
<td>161</td>
<td>192</td>
<td>200</td>
<td>88</td>
</tr>
<tr>
<td>1</td>
<td>108</td>
<td>125</td>
<td>312</td>
<td>305</td>
</tr>
<tr>
<td>1</td>
<td>289</td>
<td>298</td>
<td>88.5</td>
<td>108</td>
</tr>
<tr>
<td>Overall SER</td>
<td>187</td>
<td>175</td>
<td>191</td>
<td>189</td>
</tr>
</tbody>
</table>

presented combinational SER data do not contradict the few available radiation-testing results. Independent studies [Nakamura 2010] show the feasibility of using TFIT for the SER characterization of standard cells and the good correlation of the tool-provided data with results obtained from radiation testing of dedicated test vehicles. The execution time of the tool for evaluating the complete library (125 cells) is around 12 hours on a Quad-CPU Core i7 (bi-core) server with 8GB of RAM. The execution time per cell ranges from a couple of minutes (most inverters and buffers) to 20 minutes (most sequential cells and the more complex combinational cells). Thus, it is perfectly feasible to characterize complete, complex standard cell libraries in a reasonable amount of time.


### Table 3.5: Selected Cells SER

<table>
<thead>
<tr>
<th>Cell</th>
<th>AND2_X1</th>
<th>AND2_X2</th>
<th>AND2_X4</th>
<th>AND3_X1</th>
<th>AOI21_X1</th>
</tr>
</thead>
<tbody>
<tr>
<td>SER</td>
<td>85.6</td>
<td>65.4</td>
<td>54</td>
<td>81.7</td>
<td>88.5</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cell</th>
<th>AOI211_X1</th>
<th>BUF_X1</th>
<th>AND4_X1</th>
<th>AND4_X2</th>
<th>AND4_X4</th>
</tr>
</thead>
<tbody>
<tr>
<td>SER</td>
<td>75.3</td>
<td>70.7</td>
<td>75.3</td>
<td>58.5</td>
<td>31.5</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cell</th>
<th>NOR2_X1</th>
<th>NOR2_X2</th>
<th>NOR2_X4</th>
<th>OR2_X1</th>
<th>NAND2_X1</th>
</tr>
</thead>
<tbody>
<tr>
<td>SER</td>
<td>54.6</td>
<td>45.8</td>
<td>27.8</td>
<td>102</td>
<td>89</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cell</th>
<th>NAND2_X2</th>
<th>NAND2_X4</th>
<th>XOR2_X1</th>
<th>XOR2_X2</th>
<th>XNOR2_X1</th>
</tr>
</thead>
<tbody>
<tr>
<td>SER</td>
<td>63.5</td>
<td>53.7</td>
<td>145</td>
<td>144</td>
<td>143</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cell</th>
<th>XNOR2_X2</th>
<th>HA_X1</th>
<th>INV_X1</th>
<th>INV_X2</th>
<th>INV_X4</th>
</tr>
</thead>
<tbody>
<tr>
<td>SER</td>
<td>151</td>
<td>147</td>
<td>38.8</td>
<td>21.6</td>
<td>17.1</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cell</th>
<th>INV_X8</th>
<th>INV_X16</th>
<th>INV_X32</th>
<th>OR3_X1</th>
<th>OR3_X2</th>
</tr>
</thead>
<tbody>
<tr>
<td>SER</td>
<td>14.4</td>
<td>1.9</td>
<td>0</td>
<td>92.2</td>
<td>67.9</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cell</th>
<th>OR3_X4</th>
<th>OR4_X1</th>
<th>OR4_X2</th>
<th>OR4_X4</th>
<th>BUF_X2</th>
</tr>
</thead>
<tbody>
<tr>
<td>SER</td>
<td>64.6</td>
<td>88.4</td>
<td>65.8</td>
<td>59.4</td>
<td>53.5</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cell</th>
<th>BUF_X4</th>
<th>BUF_X32</th>
<th>OR2_X2</th>
<th>OR2_X4</th>
<th>NAND4_X1</th>
</tr>
</thead>
<tbody>
<tr>
<td>SER</td>
<td>46.4</td>
<td>9.2</td>
<td>75.2</td>
<td>64.4</td>
<td>92.5</td>
</tr>
</tbody>
</table>
### Table 3.6: Transistor SER Contribution

<table>
<thead>
<tr>
<th>State/SER</th>
<th>Cell</th>
<th>M0</th>
<th>M1</th>
<th>M2</th>
<th>M3</th>
</tr>
</thead>
<tbody>
<tr>
<td>A1=0, A2=0</td>
<td>24</td>
<td>0</td>
<td>24</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>A1=0, A2=1</td>
<td>67.7</td>
<td>0</td>
<td>0</td>
<td>67.7</td>
<td>0</td>
</tr>
<tr>
<td>A1=1, A2=0</td>
<td>227</td>
<td>165</td>
<td>62.4</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>A1=1, A2=1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>18.6</td>
<td>18.6</td>
</tr>
<tr>
<td><strong>Overall SER</strong></td>
<td>89.0</td>
<td>41.25</td>
<td>8.53</td>
<td>4.65</td>
<td>4.65</td>
</tr>
</tbody>
</table>

Figure 3.4: NAND2 Cell Schematic

#### 3.2.5.1 Cell Drive Strength

Figures 3.5 and 3.6 show how the SET SER varies with respect to the drive strength of the cell.

For the inverters (Figure 3.5) the SER decreases when the drive strength increases. Buffers (Figure 3.6) represent a particular case since the sensitivity to long transients is higher. This happens because if the first inverting stage generates a transient, the second stage stretches the pulse making it longer. This behavior can be observed in any non-inverting gate (AND, OR, AO, OA, etc.).

#### 3.2.5.2 Output Load Capacitance

Figure 3.7 shows the dependency of the SET SER on the output load capacitance. The figure shows that the output load capacitance affects only short pulses, whereas the sensitivity to longer pulses is constant regardless of the output load capacitance.

**Figure 3.5:** SET SER vs. Drive Strength for INV from X1 to X32

**Figure 3.6:** SET SER vs. Drive Strength for BUF from X1 to X32

#### 3.2.5.3 Supply Voltage

Figure 3.8 shows the dependency of the SET SER on the supply voltage. The figure shows that for different values of Vdd the sensitivity to short pulses is constant, while the sensitivity to longer pulses decreases.

Figure 3.7: SET SER vs. Output Load Capacitance for MUX2

Figure 3.8: SET SER vs. Supply Voltage for MUX2
3.3 SET Propagation Analysis

Since the characterization of the standard cell library is now complete, we can use the results for the SER analysis of any design implemented using the library. The multiplier, used as the Device Under Test (DUT), has been mapped to the Open Cell library, yielding a gate-level netlist with a total of 1657 cell instances. The cell counts for the most frequent cells are listed in table 3.7. Several sensitive cells, such as the full adder, are predominant in the netlist. The netlist is accompanied by an SDF file that contains timing information for the cell instances and interconnections.

Table 3.7: Post Synthesis Design: Cell Count

<table>
<thead>
<tr>
<th>Cell</th>
<th>Count</th>
</tr>
</thead>
<tbody>
<tr>
<td>FA_X1</td>
<td>327</td>
</tr>
<tr>
<td>AOI22_X1</td>
<td>286</td>
</tr>
<tr>
<td>XNOR2_X1</td>
<td>277</td>
</tr>
<tr>
<td>XOR2_X1</td>
<td>182</td>
</tr>
<tr>
<td>NAND2_X1</td>
<td>128</td>
</tr>
<tr>
<td>INV_X1</td>
<td>109</td>
</tr>
<tr>
<td>MUX2_X2</td>
<td>70</td>
</tr>
<tr>
<td>NOR2_X1</td>
<td>51</td>
</tr>
<tr>
<td>HA_X1</td>
<td>31</td>
</tr>
<tr>
<td>Other</td>
<td>196</td>
</tr>
</tbody>
</table>

The intended purpose of the multiplier block is to be used in a complex design. Thus, the combinational multiplier will be sandwiched between two layers of sequential cells (usually flip-flops). Accordingly, the outputs of the block have been connected to 32 DFF_X1 flip-flops. The flip-flops have been synthesized with the remainder of the circuit in order to obtain accurate timing information (propagation/setup/hold).
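As a rough illustration of how the per-cell figures relate to a design, the cell counts of table 3.7 can be combined with the overall per-cell SET SER values of tables 3.3 and 3.5, scaled from a MegaCell to individual instances. The Python sketch below is only a partial, pre-de-rating estimate under stated assumptions: AOI22_X1, MUX2_X2 and the "Other" cells are left out because their per-cell figures are not listed in this section, and the function name is hypothetical.

```python
MEGACELL = 2 ** 20  # SER figures are given for 2^20 identical cells

# Overall per-cell SET SER in FIT per MegaCell (tables 3.3 and 3.5).
CELL_FIT = {
    "FA_X1": 249.0, "XNOR2_X1": 143.0, "XOR2_X1": 145.0, "NAND2_X1": 89.0,
    "INV_X1": 38.8, "NOR2_X1": 54.6, "HA_X1": 147.0,
}

# Instance counts from table 3.7, restricted to cells with a known figure.
CELL_COUNT = {
    "FA_X1": 327, "XNOR2_X1": 277, "XOR2_X1": 182, "NAND2_X1": 128,
    "INV_X1": 109, "NOR2_X1": 51, "HA_X1": 31,
}


def raw_set_fit(cell_fit, cell_count):
    """Raw (not de-rated) SET rate of the listed instances, in FIT."""
    return sum(cell_count[c] * cell_fit[c] / MEGACELL for c in cell_count)


partial_raw_fit = raw_set_fit(CELL_FIT, CELL_COUNT)
print(f"{partial_raw_fit:.4f} FIT before de-rating (partial cell list)")
```

The electrical, logic and temporal de-rating factors discussed next would then reduce this raw figure.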

The overall goal of the SET propagation analysis is to evaluate the percentage of SETs that are memorized in the subsequent sequential/memory cells (flip-flops in our case). As attested by numerous research papers [Dodd 2004, Hass 1999b, Gadlage 2004, Alexandrescu 2002, Reorda 2003, Nguyen 2005], the transformation of a SET fault into a Soft Error is mitigated by several factors, namely the electrical, logic and temporal de-rating:

- The EDR represents the ability of the fault to propagate when taking into account the deformation of the transient through the logic network.
- The LDR models the propagation of the fault through the logic network from a purely logic perspective, according to the circuit state.
- The TDR represents the probability of the propagated fault being present on the sequential cell data input at the latching instant.

In order to extract reference results and to evaluate the effort required for the analysis of the design, the most straightforward approach, non-optimized serial fault simulation, has been applied. Then, more sophisticated methods have been benchmarked against the results previously gathered.

3.3.1 Classic serial fault simulation approach

This approach consists in providing a testbench for the circuit (including a clock signal to the output flip-flops), applying random test vectors on the inputs of the circuit, injecting a transient fault on selected cell outputs and observing the output of the flip-flops. If the memorized data is different from the reference values, then the injected SET has been propagated and memorized. Several simulation scenarios are presented in the following paragraphs.

3.3.1.1 Logic De-Rating evaluation

The simulation campaign consists of 40 simulation runs during which the circuit is exercised with 2000 test vectors. One injection site has been randomly selected for each run. The primary (non-registered) outputs have been observed to evaluate the fault propagation. The main result of this campaign is the average logic de-rating factor, which is 32.29%. The relatively large de-rating percentage reflects the mathematical function of the circuit and its relatively simple internal structure.
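The principle of such a campaign can be sketched in Python on a toy circuit: a fault is injected by inverting the output of one net, random vectors are applied, and the faulty primary output is compared with the fault-free one. The five-input netlist below is entirely hypothetical and far smaller than the multiplier; it is a zero-delay logic model, not the simulation environment used for the reported results.

```python
import random

# Hypothetical toy netlist in topological order: (output, gate, inputs).
NETLIST = [
    ("n1", "AND", ("a", "b")),
    ("n2", "OR", ("c", "d")),
    ("n3", "XOR", ("n1", "n2")),
    ("y", "AND", ("n3", "e")),
]
GATES = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}


def simulate(inputs, fault_net=None):
    """Evaluate the netlist; if fault_net is given, invert that net."""
    values = dict(inputs)
    for out, gate, ins in NETLIST:
        value = GATES[gate](*(values[i] for i in ins))
        if out == fault_net:
            value ^= 1  # injected logic fault
        values[out] = value
    return values["y"]


def logic_derating(fault_net, n_vectors=2000, seed=0):
    """Fraction of random vectors for which the fault reaches output y."""
    rng = random.Random(seed)
    propagated = 0
    for _ in range(n_vectors):
        vector = {name: rng.randint(0, 1) for name in "abcde"}
        if simulate(vector) != simulate(vector, fault_net):
            propagated += 1
    return propagated / n_vectors


for site in ("n1", "n3", "y"):
    print(site, logic_derating(site))
```

In this toy circuit a fault on `n1` always passes the XOR gate and is then masked whenever `e` is 0, so its logic de-rating is expected to be close to one half.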

The considered multiplier block is exercised using two random 32-bit input vectors. The input vectors are meant to represent binary32 IEEE 754-2008 single-precision floating-point numbers. As such, the input vectors can represent random numbers, and all the multiplier features (sign, exponent and significand computation blocks) are used. The simulation can also accommodate specific configurations (test cases) where specific number ranges are used in accordance with the needs of the application. Accordingly, the logic de-rating numbers are expected to change, reflecting the relative criticality of the individual cell instances with regard to the considered test case.

3.3.1.2 Electrical/temporal de-rating evaluation

The objective of this campaign is to evaluate the deformation (Propagation Induced Pulse Broadening - PIPB) of the SET from the origin to the primary output. In addition, it will also indicate the shortest pulse duration that the logic network is able to propagate. This information is particularly interesting since the SER characterization of the cell library indicates that most logic cells only exhibit short SETs (<100ps).

As an example of such a simulation, we have selected two of the injection sites with a high LDR, thus maximizing the number of propagated faults. The output transient pulse width has been measured on each line of the 32-bit output; the output transient duration is measured as shown in figure 3.9. We have performed several simulation runs with different injected SET pulse widths and 500 test vectors per run. The table of results is quite extensive, so figures 3.10.a, 3.10.b and 3.10.c present only the average output pulse width (ps) versus the original SET width (ps) for three different logic gates.

![Figure 3.9: Output SET evaluation](image)

The results show that short transients are not able to propagate through the logic network. Depending on the path, the observed behavior can be very different. The case shown in figure 3.10.a presents a quasi-linear growth of the output pulse width for pulses longer than 60 ps. Figure 3.10.b shows a very discontinuous behavior for pulses shorter than 350 ps; beyond that, the output pulse width converges to a constant value. The last scenario, figure 3.10.c, shows a linear growth for pulses shorter than 400 ps; then, as in the case presented in figure 3.10.b, the output pulse width converges to a constant value.

A more in-depth analysis of the circuit shows that the source-to-output paths can be classified into linear and re-convergent paths. The linear paths will cause a single transition for the starting edge of the fault and a second one for the ending edge. The output fault width can be calculated as the original fault width plus a deformation (positive or negative) caused by the difference in the various transition times of the cells on the path. However, the results show that this linear dependency only holds for large (>200 ps) transients, while shorter faults are characterized by a non-linear regime. This may also be an artifact of the SDF timing models, requiring an alternative approach. Re-convergent paths cause two sets of multiple transitions associated with the two edges of the original fault, with a correct signal value between the two events.
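For a linear (non-re-convergent) path, the relation described above can be illustrated with a minimal Python sketch. The function name and the two per-edge delay parameters are assumptions introduced for illustration only; in practice the delays would be extracted from the timing simulation, and the text above notes that this linear regime only holds for large transients.

```python
def predict_output_pw(input_pw_ps, leading_edge_delay_ps, trailing_edge_delay_ps):
    """Predict the output SET width on a linear path (illustrative model).

    The leading edge of the fault reaches the output after leading_edge_delay_ps,
    the trailing edge after trailing_edge_delay_ps. Their difference is the
    deformation (positive: broadening, negative: narrowing).
    """
    deformation_ps = trailing_edge_delay_ps - leading_edge_delay_ps
    # A pulse narrowed below zero width is considered filtered out.
    return max(0.0, input_pw_ps + deformation_ps)


# Hypothetical delays, chosen only to show broadening and filtering.
print(predict_output_pw(300.0, 120.0, 135.0))  # broadened pulse
print(predict_output_pw(50.0, 150.0, 90.0))    # filtered pulse
```

The clamp at zero is a simplification: it models complete filtering of a narrowed pulse but not the non-linear regime observed for short transients.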

The overall simulation time was around 5 minutes for 1 injection site, 24 pulse width values and 2000 test vectors. In this case, a straightforward approach consists in using optimized serial fault evaluation techniques [Alexandrescu 2002]. The duration of the possible fault(s) at the output of the affected cell instance is retrieved from the database, according to the cell characteristics, neighborhood and circuit state. Individual SETs are serially injected, propagated and evaluated. The fault waveform on the destination net is then integrated in order to retrieve an effective pulse width.
Figure 3.10: Output SET vs. input SET for three different instances of the design

3.3.1.3 Temporal/logic de-rating evaluation

The primary goal of this simulation campaign consists in performing a fully featured analysis of the SET propagation and memorization. Firstly, an injection site is selected. Secondly, the SET occurrence time (fault injection instant) is varied from the beginning to the end of the clock cycle by discretizing the clock period into 100 fault injection instants. Lastly, the simulation environment evaluates whether the fault is memorized or not, for each fault injection instant. In addition, several clock period values and several pulse widths have been analyzed. This approach provides the following results: the dependence of the fault memorization probability on the fault occurrence instant (expressed as a percentage: instant over clock period) (Figures 3.11.a, 3.11.b and 3.11.c), and a variety of de-rating factors. As an example, the evolution of the overall de-rating vs. the working frequency and the source SET pulse width is shown in table 3.8 and table 3.9, respectively.

Table 3.8: Overall De-rating Factors for Different Frequencies Assuming a Fixed SET PW

<table>
<thead>
<tr>
<th>Source SET PW = 75 ps</th>
<th>Clock frequency [MHz]</th>
<th>Overall de-rating</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>100</td>
<td>0.85%</td>
</tr>
<tr>
<td></td>
<td>150</td>
<td>1.22%</td>
</tr>
<tr>
<td></td>
<td>200</td>
<td>1.63%</td>
</tr>
</tbody>
</table>

Table 3.9: Overall De-rating Factors for Different SET PWs Assuming a Fixed Frequency

<table>
<thead>
<tr>
<th>Frequency = 100 MHz</th>
<th>SET PW [ps]</th>
<th>Overall de-rating</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>75</td>
<td>0.85%</td>
</tr>
<tr>
<td></td>
<td>100</td>
<td>1.90%</td>
</tr>
<tr>
<td></td>
<td>125</td>
<td>2.10%</td>
</tr>
<tr>
<td></td>
<td>150</td>
<td>2.675%</td>
</tr>
<tr>
<td></td>
<td>200</td>
<td>3.642%</td>
</tr>
</tbody>
</table>

The overall simulation time was around 35 minutes for each fault injection campaign: 1 injection site, 1 clock frequency and 1 pulse width value.
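The sweep over fault injection instants can be illustrated with a small Python sketch. It does not reproduce the event-driven simulation used in this work: it assumes, for illustration only, that the SET reaches the flip-flop input undeformed and is memorized whenever it overlaps a setup/hold window around the sampling clock edge. The function name and all numeric values are hypothetical.

```python
def latching_fraction(pulse_width_ps, clock_period_ps, setup_ps, hold_ps,
                      n_instants=100):
    """Fraction of injection instants for which the SET is memorized.

    Simplified model (assumption): the sampling edge is at the end of the
    clock cycle and the fault is latched if the pulse overlaps the
    [edge - setup, edge + hold] window.
    """
    edge_ps = clock_period_ps
    window_start = edge_ps - setup_ps
    window_end = edge_ps + hold_ps
    latched = 0
    for k in range(n_instants):
        # Injection instant k, uniformly spread over the clock period.
        t_start = k * clock_period_ps / n_instants
        t_end = t_start + pulse_width_ps
        if t_start < window_end and t_end > window_start:
            latched += 1
    return latched / n_instants


# Hypothetical values: 200 ps pulse, 10 ns period, 50 ps setup and hold.
print(latching_fraction(200.0, 10_000.0, 50.0, 50.0))
```

With 100 instants the resolution of the result is limited to 1%, which illustrates why a fine discretization (and hence a long simulation) is needed for short pulses.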

The results of the serial simulation approach are certainly interesting. The overall de-rating factor (which is the product of the electrical, logic and temporal de-rating factors) is on the order of a few percent. This is a reasonable result because:

- SETs have a very small opportunity window to be memorized in a sequential cell. Figures 3.11.a, 3.11.b and 3.11.c show that the vulnerable time slice can be at different instants of the clock cycle but is always very short.

- For short transient pulses the circuit has a strong de-rating factor (typical for most combinational cells). Tables 3.8 and 3.9 show that the de-rating factor grows with the pulse width but also with the frequency of the circuit.

- Combinational cells have an intrinsic logic de-rating factor that depends on the state of the signals. For example, an AND2 gate will not propagate a SET coming from one of its inputs if the other input is at zero. Consequently, combinational logic networks present an inherent logic de-rating factor that could potentially affect the propagation of SEE-induced faults.

Figure 3.11: SET latching probability vs. SET occurrence instant for three different instances.

All these effects, coupled with an intrinsically low SEE sensitivity, seem to indicate that the overall contribution of combinational cells to the total circuit SER is limited. The simulation is far from complete, considering the limited number of test vectors and injection sites, and yet we have spent considerable CPU time on a very small logic block.

3.3.2 Accelerated SET simulation

In the following, we will investigate some of the available methods that will reduce considerably the analysis time. We will make use of the following principles:

1. The ability to separately simulate the normal activity of the circuit (as induced by a new test vector/clock cycle) and the events caused by a SET [Alexandrescu 2002, Nguyen 2005].

   The justification of this approach consists in proving that the SET must arrive on the affected net after the net stabilization (following a new test vector) in order to have an opportunity of reaching the correct opportunity window of the memorization cell. Thus, we can safely simulate the normal circuit events, wait for the circuit to stabilize, inject a fault, evaluate the fault events and, when finished, inject a new fault. This differential technique will considerably reduce the simulation time. A further optimization consists in evaluating the normal circuit events using a fast (non-timing) simulator.

2. The use of a mathematical temporal de-rating computed as a ratio of the opportunity window and the clock period [Nakamura 2010].

   The temporal de-rating can be quickly estimated using one of the following equations 3.1 where $\delta$ represents the transient pulse width at the destination (data input of the memorizing flip-flop).

\[
P = \frac{1}{2} \cdot \frac{t_{\text{HOLD}} + t_{\text{SETUP}} + \delta}{T_{\text{CK}}} = \frac{\delta}{T_{\text{CK}}}
\]
Using these formulas, we don’t have to simulate different occurrence instants of the SET, with a huge economy of CPU time. The fault simulation will provide the accurate $\delta$ values.

3. The use of a pulse prediction method based on the measured fault deformation [Nakamura 2010].

   This method relies on the linearity of the logic network, which adds a constant deformation to larger (hundreds of ps) SETs. The implementation consists in separately simulating the starting and ending edges of the SET and measuring the deformation of the fault. Then, any output fault pulse width can be computed as the sum of the original fault width and the measured deformation, eliminating the need to simulate the individual faults. However, since the SER characterization of the standard cell library shows that most SETs are shorter than 200ps, this method is not yet applicable.

4. The use of classic fault universe reduction methods.

   Reducing the number of cells considered during the analysis will also reduce the time required for the simulation. However, some of the techniques available for classic (permanent) faults are not directly applicable to SETs.

   A very good optimization could consist in using this method in conjunction with method 3: start by evaluating the fan-out of the cell that drives a linear (non-re-convergent) path, then separately simulate the two edges of the fault (as in method 3) and measure the delay up to and from any cell on the fan-out path. Then, a mathematical formula will allow the computation of the deformation induced by the path from any downstream cell to the logic network outputs, eliminating the need for simulating the downstream cells. Again, this is not yet applicable. Finally, we have implemented a simple fault dropping algorithm that eliminates some short faults based on the results of previous fault injections.
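The ratio used in principle 2 above can be sketched in a few lines of Python. Only the simplified form $\delta / T_{\text{CK}}$ is implemented here; the function name and the example values are assumptions made for illustration, and $\delta$ is the pulse width at the flip-flop data input, not the source pulse width.

```python
def temporal_derating(delta_ps, clock_mhz):
    """Simplified temporal de-rating: destination pulse width over clock period."""
    if delta_ps < 0 or clock_mhz <= 0:
        raise ValueError("pulse width must be >= 0 and frequency must be > 0")
    t_ck_ps = 1e6 / clock_mhz  # 1 MHz corresponds to a period of 1e6 ps
    # A pulse longer than the clock period is always sampled.
    return min(1.0, delta_ps / t_ck_ps)


# Hypothetical destination pulse width of 75 ps at three clock frequencies.
for f_mhz in (100, 150, 200):
    print(f_mhz, temporal_derating(75.0, f_mhz))
```

Because the result depends only on the pulse width and the clock period, a single fault simulation per site is enough to obtain the de-rating at any frequency.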

The proposed techniques have been implemented as a shared Verilog Procedural Interface (VPI) library that is loaded by the event-driven simulator at the beginning of the simulation. In conjunction with the testbench, the fault simulation library applies a new test vector on the inputs of the circuit, waits for the circuit to stabilize and then sequentially injects faults on all the cell instance outputs while recording the events observed on the primary (non-registered) outputs. Then, it uses a mathematical setup/hold model for a virtual flip-flop to compute the overall TDR according to equation 3.1.

The results are identical to those of the reference serial fault simulation approach, with the added benefit of a huge reduction in simulation time. As an indication, the simulation of 10000 test vectors with fault injections in each cell instance has been performed in 17.8 minutes, while the equivalent serial simulation would require an impractical amount of time (several days).

The conclusion of this analysis step is that optimized methods considerably improve the performance of the SET simulation. However, for a full multi-million cell design, the long time required for the fault simulation, together with the engineering effort required for setting up the simulation environment, may make this approach impractical. In the following, we briefly present a purely static, mathematical method that tries to provide a very quick estimate of the overall de-rating factor.

3.3.3 Static, probabilistic fault propagation approaches

The research community provides a wealth of solutions [Alexandrescu 2007, Hass 1999a, Asadi 2005, Brglez 1984, Benso 2002] for the static analysis of the logic de-rating in logic networks. We have implemented a very straightforward method that uses a fault propagation probability metric associated with each cell. As an example, inverters, buffers and XOR gates will always propagate transitions on their inputs. AND or OR gates will conditionally propagate the transitions according to the state of the other inputs. We denote the probability for an input to be 1 as the state probability \( S(input) \) and the propagation probability as \( P(input) \). For the AND2_X1 gate, we can write: \( P(A1) = S(A2) \); INV_X1 has \( P(A) = 1 \) and so on.

The propagation probability of a non-re-convergent path can be accurately described using the non-dependent state probabilities of its nets and the propagation probabilities of the composing cells. Re-convergent paths may be explored by using Shannon expansion to separate common terms (state probabilities) in some equations. The problem associated with this approach is that for some complex paths (quite easy to find in the multiplier block), the equation quickly becomes too complex, requiring considerable CPU time and memory. To keep the analysis time within reasonable limits, we have implemented this algorithm with an optional hard-coded limit for the number of variables to consider and a pessimistic approach (i.e. logic de-rating = 1) for the equations that cannot be correctly evaluated.
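For a non-re-convergent path, the computation described above reduces to a product of per-cell propagation probabilities. The following Python sketch illustrates this under the independence assumption stated in the text; the helper names and the example path are hypothetical, and the OR2 rule (propagation when the other input is 0) is the standard counterpart of the AND2 rule given above.

```python
from math import prod


def p_and2(s_other):
    """AND2: a transition on one input propagates when the other input is 1."""
    return s_other


def p_or2(s_other):
    """OR2: a transition on one input propagates when the other input is 0."""
    return 1.0 - s_other


P_ALWAYS = 1.0  # inverters, buffers and XOR gates always propagate


def path_propagation_probability(cell_probabilities):
    """Propagation probability of a non-re-convergent path.

    Assumes the side-input state probabilities are independent, so the
    per-cell propagation probabilities can simply be multiplied.
    """
    return prod(cell_probabilities)


# Hypothetical path: AND2 (side input S = 0.5) -> INV -> OR2 (S = 0.2) -> XOR
path = [p_and2(0.5), P_ALWAYS, p_or2(0.2), P_ALWAYS]
print(path_propagation_probability(path))
```

Re-convergent paths break the independence assumption, which is why the text resorts to Shannon expansion or to the pessimistic value of 1 for those cases.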

The state probabilities of the logic network nets can be computed using the same static, probabilistic method as used for the fault propagation probability computation, or can be gathered during a reference simulation run. We have added a specific feature to the VPI simulation library that computes the state probability for each circuit net as the ratio between the overall interval of time during which the net/signal is in the high logic state and the total simulation time. This operation only requires a single reference simulation that is relatively inexpensive to perform. It can also provide state probabilities for the various test cases, thus improving the accuracy of the subsequent static fault propagation probability evaluation.
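The time-ratio definition above can be sketched in Python as follows. This is not the VPI implementation; it assumes, for illustration, that the value changes of a net are available as a time-ordered list of `(time, value)` pairs starting at the beginning of the simulation. The function and parameter names are hypothetical.

```python
def state_probability(events, t_end):
    """Ratio of time spent at logic 1 to total simulation time.

    events: list of (time, value) pairs sorted by time, value in {0, 1};
            the first pair gives the initial value at the simulation start.
    t_end:  time at which the simulation ends.
    """
    if not events or t_end <= events[0][0]:
        raise ValueError("need at least one event before t_end")
    time_high = 0.0
    # Pair each event with the next one (or with the end of the simulation).
    following = events[1:] + [(t_end, None)]
    for (t, value), (t_next, _) in zip(events, following):
        if value == 1:
            time_high += t_next - t
    return time_high / (t_end - events[0][0])


# Hypothetical trace: low for 10, high for 30, low for 30, high for 30.
trace = [(0, 0), (10, 1), (40, 0), (70, 1)]
print(state_probability(trace, 100))
```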

Table 3.10 presents the logic de-rating factors obtained through the previous fault injection and simulation efforts and the logic de-rating factors obtained through static methods. The circuit-wide, overall average and standard deviation logic de-rating figures are computed from the de-rating factors associated with each cell instance. The Full Fault Injection and Simulation row presents the results obtained through simulation, without any optimization of any kind. Thus, these results act as the reference data against which the following methods are benchmarked. The Full static signal probabilities and fault propagation analysis row presents the results obtained using the static, probabilistic approach applied to both the signal/net state probabilities and the fault propagation analysis. The final Simulation-based signal probabilities and static fault propagation analysis approach uses signal state probabilities obtained through a reference simulation and fault propagation data obtained through the static method. The results from this case are in good agreement with the reference results and have been obtained with a reasonable investment of computational resources. The full static data is the least accurate and also the least computationally intensive. While the static analysis method deserves to be improved, we can observe that the pessimistic results are always greater than the reference data. Thus, we can use this inexpensive approach to easily establish upper bounds on the logic de-rating factors for the various cell instances of the circuit.

Table 3.10: Overall Circuit Logic De-Rating Factors

<table>
<thead>
<tr>
<th>Method</th>
<th>Full Fault Injection &amp; Simulation</th>
<th>Static Fault Propagation Analysis</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Uniform Signal Probabilities</td>
<td>Simulation-Based Signal Probabilities</td>
</tr>
<tr>
<td>Average LDR</td>
<td>32.3%</td>
<td>26.7%</td>
</tr>
<tr>
<td></td>
<td>38.6%</td>
<td>31.4%</td>
</tr>
<tr>
<td>LDR Standard Deviation</td>
<td>36.2%</td>
<td>32.5%</td>
</tr>
<tr>
<td></td>
<td>33.0%</td>
<td>32.0%</td>
</tr>
</tbody>
</table>

The logic de-rating factor for each cell needs to be accompanied by the temporal de-rating. The chosen approach consists in using the state probabilities to evaluate the fault deformation through each cell and compute the destination SET pulse width using the initial SET PW and the deformation added by the path.

We have implemented the proposed approach as a stand-alone tool [Shi-Jie 2008, Chapman 2010] using third-party Verilog and SDF file parsers. The multiplier block is processed in a few seconds. A few select extracts from the results provided by the static tool are shown in table 3.11. The presented data is computed using the pessimistic tool option.

3.4 Conclusions

We have presented the results of a practical SET analysis flow that shows a possible approach to the SET evaluation of a 45nm cell library and a design, efforts performed in an industrial setting. The primary objectives of this project have been the following: evaluate intrinsic combinational cell SER, analyze SET effects in the design and contribute with tools and methodologies to the understanding of the phenomena.

Table 3.11: A few static de-rating results

<table>
<thead>
<tr>
<th>Instance</th>
<th>Cell Type</th>
<th>SET PW</th>
<th>Simulation Approach</th>
<th>Full Static Approach</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td>LDR</td>
<td>TDR</td>
</tr>
<tr>
<td>I0.p214...</td>
<td>AND2_X4</td>
<td>75ps</td>
<td>100%</td>
<td>0.82%</td>
</tr>
<tr>
<td>mul_2683_33.p2...</td>
<td>FA_X1</td>
<td>100ps</td>
<td>54.2%</td>
<td>1.24%</td>
</tr>
<tr>
<td>I11.inc_add.p33...</td>
<td>HA_X1</td>
<td>100ps</td>
<td>23.5%</td>
<td>1.14%</td>
</tr>
</tbody>
</table>

The results seem to indicate that the combinational cell SER is usually several times lower than that of same-library flip-flops and that the SET events are strongly de-rated by electrical and temporal factors, reducing their contribution to the overall SER. However, reliability engineers and designers need to be equipped with least-effort tools and methodologies in order to be prepared for future challenges.
4.1 Introduction

Hardware is intrinsically unreliable. External and internal perturbations can cause data corruption, faulty states and unpredictable circuit behavior. Single Event Effects (SEEs) are a particularly representative example of such issues. SEEs are caused by energetic particles from the environment (neutrons, protons, heavy ions, muons, ...) or from the device’s own materials (alpha particles emitted by radioactive contaminants). The particles deposit energy in the device structures and cause transient currents that affect internal signal states and data stored in memorizing structures (memory cells, sequential cells).

Soft Errors (Single Bit/Cell Upset, Multiple Cell Upset, Multiple Bit Upset) in memory blocks can be efficiently mitigated by well-known methods, such as Error Correcting Code (ECC) [Mavis 2008].

Given the non-uniform structure of logic networks, code-based approaches cannot be universally applied to individual elementary cells from the structure of complex circuits. Thus, Single Event Upsets (SEUs) affecting elementary sequential cells (such as flip-flops and latches) require innovative hardening methodologies and solutions [Makihara 2005, Jagannathan 2011, Loveless 2011]. Obviously, it can be proven that the device can be made arbitrarily resilient, with corresponding costs in terms of development efforts [Almukhaizim 2008, Mavis 2007], silicon area, power or performance overheads [Mohanram 2003]. While few areas (aerospace, medical, ...) require an absolute SEU insensitivity (and thus the associated costs), many high-reliability applications (networking, automotive, ...) need to integrate some level of SEE resiliency. Accordingly, current efforts in both industry and academia aim at establishing a library of hardening approaches, at offering tools and methodologies to evaluate and improve circuit behavior with regard to SEUs and, ultimately, at providing an adequate equilibrium between costs and benefits.

The work presented in this chapter contributes to the analysis and mitigation of SEU effects in combinational logic networks. We will address subtleties in the SEE behavior of sequential cells, evaluating the sensitive elements of the cell in each primary cell state and the possible SET/SEU effects in the cell. We will show that transients in the internal clock circuitry can cause erroneous flip-flop activation [Seifert 2005]. In the considered technology (45nm), this effect is significant for neutron-induced SEEs but not for alpha-related issues. Moreover, SET faults are possible in the transparent slave latch or output stages. We will de-rate their contribution using Temporal De-Rating principles [Seifert 2004]. SEU and SET effects are highly dependent on the cell (and thus the circuit) state. We will highlight the need for a fine-granularity, state-aware analysis of individual cell instances, which allows workload-dependent SER results with a better accuracy than using a single SER value per cell type.

By analyzing simulation traces from typical applications, it is seen that in many designs, a significant share of the flip-flops are biased towards storing a specific logic value. For example, flip-flops that hold block enable signals are biased to 1 whereas flip-flops that hold counters for rare error events are biased to 0. By very minor logic manipulation, circuits can be modified so that these flip-flops actually store the more stable value, from an SER perspective. In this way, a modest improvement in overall circuit SER can be achieved with virtually zero area or power overhead. The validation campaign supports the interest of the presented approach.

The final goal is to sensitize reliability engineers and designers to the need for high-fidelity SER analysis of sequential cells. Failure to perform an accurate SER analysis for sequential cells can result in mis-estimation of the chip-level SER and potentially over-design through unnecessary or incorrect mitigation.

### 4.2 Single Event Effects in Sequential Cells

Flip-flops are the most widely used type of sequential cell and they usually consist of two latches (master and slave - Figure 4.1).

SEEs can affect any sensitive cell transistor. The conventional approach is to consider SEEs affecting elements from the currently memorizing (closed) latch as

As an example, during the second clock half-period, blocked transistors from the master latch inverters are susceptible to transient events, causing the inverter output to change, perturbation that will be propagated by the opposing inverter, eventually altering the stored value.

Obviously, the current cell state dictates the sensitive elements. Moreover, their contribution to the overall SER depends on a wide selection of factors (implementation process, transistor sizing, node capacitance, physical and electrical neighborhood, ...) [Heijmen 2004]. In all, the complete characterization of a sequential cell with regard to SEEs represents a sizable amount of effort and should result in a set of event rates (expressed in FIT, Cross-Section) per condition, where condition means a combination of any relevant parameters. However, our observation is that current industrial SER efforts do not seem to meet this requirement. Evaluating SER figures for complex designs (millions of flip-flops) is usually treated as a rather straightforward process. A single value (or a very limited set of values) per cell type is used as a raw, intrinsic SER. Generic values (i.e. a single SER for all flip-flop types) used as baselines are not unheard of. Given that the actual cell SER is a function of the intrinsic per-state SER and that significant differences can exist between distinct cell states, it is possible that the straightforward approach produces overall, intrinsic SER values that are not representative of the circuit behavior for the actual workload. (Please note that we are not considering application-related de-rating or vulnerability factors. The present discussion only addresses intrinsic, raw SER values that are the base for further calculations.)

SEUs in the master and slave latches have been considered the predominant SEE-induced issues in sequential cells. However, the primary physical phenomenon, which consists of transient currents injected in the cell internal nets, may cause a variety of effects. Depending on the occurrence site, we can enumerate them as follows:

- SEEs in the latches’ transistors or pass gates. The current assumption is that this effect can mostly cause SEUs. If the deposited energy/charge is not enough to change the stored value in the latch loop, then it is extremely improbable that it propagates as a SET to the primary cell outputs, as it will be strongly filtered by electrical aspects. We will show that, while possible, SETs caused by particle impacts in the memorization loop transistors do not significantly impact the overall SER figures.

- SEEs in the output buffer/inverter stages. Primary cell outputs (Q/QN) are typically driven by internal inverters, similar in function and structure to the corresponding stand-alone combinational cells. SET effects consist of transient faults appearing on one or both outputs. While in terms of intrinsic event rates SETs are comparable (same order of magnitude) to SEUs, we will show that a strong temporal de-rating can reduce the criticality of these events, at least for low and medium-speed designs.

- SEEs in the internal clock circuitry. Clock signals are internally buffered by inverter cells, equally susceptible to SET effects. If a transient fault appears on the output of the clock inverters/buffers, the flip-flop will effectively perform a new sampling. Depending on the value present on the flip-flop data input, this event may or may not corrupt the previously stored value, with a corresponding propagation of the fault to the primary cell outputs. From a circuit perspective, this effect is similar to a SEU. The clock inverters exhibit similar susceptibilities to SETs as comparable standard cells, representing a significant contribution to the overall SER.

![Flip-Flop Structure](image-url)

Finally, we propose a methodology based on detailed SER information for the elementary cells and the evaluation of signal/state probabilities in the circuit using data from functional simulation. We enumerate the requirements for a complete SER analysis of sequential cells, requirements usable with both hardware (radiation testing) and software (prediction tools) methods. Consequently, we will present a framework for evaluating the per-state SER values for sequential cells, including tools and methodologies.

### 4.3 SER Analysis of Sequential Cell States

In this section we will present the requirements of effective SER analysis of sequential cells and practical ways to address them through hardware radiation testing and software prediction tools. We propose a canonical set of 8 basic cell states that reflect the various data/clock/input values cases. For each enumerated cell state, SEU and SET event rates should be provided.

The required data can be obtained through actual measurement (radiation testing, system testing, field data, ...) or prediction (simulation tools, analytic models, ...).

On existing circuits, flip-flop SER can be tested using existing scan chains. This approach is limited to static testing: a pattern is fed through the scan-in input, the circuit is irradiated and scan-out data is verified to observe SEU effects. Through careful pattern preparation and direct clock signal control (if feasible), the complete table can be filled out with actual testing data. One major disadvantage is that SET effects in output or clock circuitry cannot be easily discriminated from proper SEUs in the observed events.

Table 4.1: DFF_X1 SER results

<table>
<thead>
<tr>
<th>State</th>
<th>S0</th>
<th>S1</th>
<th>S2</th>
<th>S3</th>
<th>S4</th>
<th>S5</th>
<th>S6</th>
<th>S7</th>
</tr>
</thead>
<tbody>
<tr>
<td>Clock</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Output</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Data</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>

Dedicated SEU test vehicles are also feasible. In addition to a better control of the flip-flop activity and adequate instance count, they allow dynamic testing (which would allow a better understanding of output stages SETs), study of Multiple Cell Upsets and so on.

In addition to testing, SER prediction through software tools is a useful approach to cell SER analysis since it is able to provide earlier data and information, allowing for an effective and timely SER management process. The work presented in this chapter involved an SER analysis tool (TFIT) that uses nuclear physics knowledge, in-depth process information offered by technology providers (foundries) and a Single Event Effect analysis capability for completely characterizing sequential cell SER in a given working environment and operating conditions. TFIT has been extensively validated by the technology providers themselves; 90nm, 65nm, 55nm, 45/40nm, 32/28nm, 20nm, 16/14nm results have been correlated with radiation test data from multiple foundries.

The Nangate 45nm Open Cell Library is an open-source, standard-cell library provided for the purposes of testing and exploring EDA flows. We have used the July 2009 version of the package [Nangate 2008]. The library contains 134 standard cells, of which 16 are flip-flops (standard, with Set/Reset/Both, Scan) and 5 are latches. Full library characterization has been performed with the TFIT tool using a 45nm generic SER process database. As we are targeting a terrestrial (atmospheric) working environment, we have studied the effects of neutrons and alpha particles. The data gathered during the analysis is too voluminous to be presented here. The following paragraphs and tables present smaller sets of data for selected cases. The full set of data is available on request.

The SER information presented in this work is in no way associated with real measurements or with field or test data. We do not benchmark the Open Cell library against other solutions. Neither the Open Cell library nor the presented SER data are intended for benchmarking or comparison against other standard cell libraries or silicon devices. Moreover, the data is not meant to indicate any particular weaknesses or negative aspects of the library. On the contrary, the Open Cell library is a very good vehicle for advanced studies and tool development.

4.3.1 SEU results for standard flip-flops

Table 4.2 presents the SEU (and SEU-like) event rates [FIT] for the DFF_X1 cell. The alpha results can be simplified to just four values, one for each combination of clock and output (stored data) values. There is no sensitivity to the input value. In contrast, the neutron data can be grouped according to the same four primary (clock, output) states, with an additional impact of the input data on the state SER.

This behavior is consistent with the statements from the previous section. The slight difference in the neutron results is caused by the contribution of SET effects in the clock circuitry.

<table>
<thead>
<tr>
<th>State</th>
<th>CLK</th>
<th>Out</th>
<th>Data</th>
<th>Neutrons</th>
<th>Alpha</th>
</tr>
</thead>
<tbody>
<tr>
<td>S0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>151</td>
<td>192</td>
</tr>
<tr>
<td>S1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>161</td>
<td>192</td>
</tr>
<tr>
<td>S2</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>179</td>
<td>84.2</td>
</tr>
<tr>
<td>S3</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>179</td>
<td>84.2</td>
</tr>
<tr>
<td>S4</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>283</td>
<td>298</td>
</tr>
<tr>
<td>S5</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>289</td>
<td>298</td>
</tr>
<tr>
<td>S6</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>146</td>
<td>125</td>
</tr>
<tr>
<td>S7</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>108</td>
<td>125</td>
</tr>
</tbody>
</table>

Neutron-related sensitivity in states where the data input is different from the stored value (S1, S2, S5, S6) is generally higher than in states where the input and stored values are identical (S0, S3, S4, S7). The added contribution is basically the SET susceptibility of the clock inverters. In the following, we will prove this statement. As a support, figure 4.2 shows the internal organization of the DFF_X1 cell.

The TFIT tool is also able to report the contribution of individual transistors to the overall SER. The results are presented in table 4.3. Please note that zero values have been replaced with dots for better clarity. The presented data show that most transistors have the same SER contribution, regardless of the data input state. The only notable differences are caused by the following transistors: MMP1 and MMN1, MMP2 and MMN2. Figure 4.2 shows that those transistors belong to the clock inverters (Figure 4.1). We can thus prove that SETs in the internal clock inverters can cause an erroneous activation/sampling of the flip-flop, an effect that, depending on the data input and stored data state, ultimately manifests as a SEU.

It is worth discussing the minimum pulse width required for a clock SET to cause a SEU. We have found minimal SET pulse widths (PW) of 50 – 100 ps, depending on the cell state. This estimate fits the minimum CK pulse width from the flip-flop datasheet (60 – 90 ps typical). The computed event rates for SETs larger than 50 ps in the clock inverters are very close (+/− 20%) to the SET SER data of the standard INV\_X2 cell, which is similar to the actual clock inverter.

Figure 4.2: DFF_X1 internal schematic

The presented results also suggest the importance of taking into consideration all 8 states of the flip-flop in any SER analysis effort, whether through testing or prediction. As an example, a radiation testing method that uses the internal scan-chain should include test patterns that exercise the flip-flops in all the recommended states.

Obviously, flip-flops with Set/Reset/Scan Enable/Scan In inputs will exhibit a higher number of possible cell states. However, from a SEE perspective, the recommended 8-state analysis is still valid.

Lastly, alpha results show no sensitivity to the input state. The associated transistor SER data show no alpha-induced SETs in the clock inverters. This observation correlates well with the very low or no sensitivity of comparable standard cells.

Table 4.3: DFF\_X1 transistor SER contribution

<table>
<thead>
<tr>
<th>Transistor</th>
<th>S0</th>
<th>S1</th>
<th>S2</th>
<th>S3</th>
<th>S4</th>
<th>S5</th>
<th>S6</th>
<th>S7</th>
</tr>
</thead>
<tbody>
<tr>
<td>MMN2</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>27</td>
<td>.</td>
</tr>
<tr>
<td>MMN6</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>70.2</td>
<td>70.2</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN11</td>
<td>.</td>
<td>.</td>
<td>66.5</td>
<td>66.5</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN10</td>
<td>48</td>
<td>48</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN13</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMP2</td>
<td>.</td>
<td>.</td>
<td>1e-3</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMP6</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMP7</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>53.4</td>
<td>53.4</td>
<td>.</td>
</tr>
<tr>
<td>MMP3</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>53.4</td>
<td>53.4</td>
<td>.</td>
</tr>
<tr>
<td>MMP4</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMP5</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>20.9</td>
<td>20.9</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN7</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>95.9</td>
<td>95.9</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMP1</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>6.23</td>
<td>11</td>
<td>.</td>
</tr>
<tr>
<td>MMP8</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMP9</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>51.3</td>
<td>50.6</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMP12</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>51.3</td>
<td>50.6</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMP11</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMP10</td>
<td>.</td>
<td>.</td>
<td>5.03</td>
<td>5.03</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMP13</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN4</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>95.9</td>
<td>95.9</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN3</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN5</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>0.922</td>
<td>0.922</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN1</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>11.6</td>
<td>5e-2</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN8</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN9</td>
<td>.</td>
<td>.</td>
<td>53.9</td>
<td>53.9</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>MMN12</td>
<td>.</td>
<td>.</td>
<td>53.9</td>
<td>53.9</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
</tbody>
</table>
4.3.2 SET results for standard flip-flops

As a further contribution of this work, we have performed a SET analysis of the sequential cells. The purpose of this effort is to a) evaluate the SET susceptibility of the combinational stages (output buffers, internal inverters, etc.) of the sequential cell and b) investigate whether a SET is possible in the internal memorization loop when blocked (memorizing).

As a first vehicle for this analysis, we have evaluated the SER behaviour of a D-Latch in the memorizing or transparent states.

Figure 4.3: Latch structure

Table 4.4 shows the SET and SEU event rates of the DLH_X1 cell for different transient pulse durations. The gap between the clock-high and clock-low results can be explained by Figure 4.3, which shows the cell’s structure for the two clock states. The light-red areas highlight the SET-sensitive parts of the circuit; the light-blue area highlights the SEU-sensitive part of the latch. When the clock is high, the latch is transparent: since the memory loop is open, a SEU is impossible, but the cell is still sensitive to SETs. When the clock is low, the cell is in its memorizing state and SETs predominantly occur in the output stages. With respect to SETs, flip-flops behave very similarly to latches, with the added SER contribution of the transparent slave latch.

While the presented SET event rate of 134 FIT for SETs larger than 25 ps is comparable to the SEU event rate of 145 FIT, we must recall the need for a de-rating based on an opportunity window metric [Alexandrescu 2002] (TDR - Temporal De-Rating). In its simplest expression, the de-rating is given by equation 4.1.
Table 4.4: SET and SEU SER for a D-Latch

<table>
<thead>
<tr>
<th>DLH_X1</th>
<th>SET [FIT] for PW &gt; xx ps</th>
<th>SEU SER [FIT]</th>
</tr>
</thead>
<tbody>
<tr>
<td>CLK High</td>
<td>134.0 84.8 35.3 23.7 0 0</td>
<td>0</td>
</tr>
<tr>
<td>CLK Low</td>
<td>38.8 10.5 0 0 0</td>
<td>145.0</td>
</tr>
</table>

\[ TDR_{\text{SET}} = \frac{\text{SET Pulse Width}}{\text{Clock Period}} \] (4.1)

Obviously, this de-rating can only be computed when the period of the clock driving the flip-flop or latch is known. As an indication, Table 4.5 presents a few typical de-rating values. According to Tables 4.4 and 4.5, a conservative (pessimistic) evaluation of the SET contribution to the overall SER would only allow for a few FIT of de-rated SET SER, compared to hundreds of SEU FIT. These results seem to indicate that sequential SET SER has a relatively low impact (in a 45nm process). For more advanced technologies, the SET contribution is expected to worsen, owing to increasing intrinsic SET rates and weaker de-rating.
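As a minimal numeric sketch of equation 4.1, the Python fragment below computes the temporal de-rating for a given pulse width and clock period and applies it to a SET event rate. The function names and the 1 ns clock period are illustrative assumptions; only the 134 FIT rate and the 25 ps pulse width are taken from the text above.

```python
def set_tdr(pulse_width_ps: float, clock_period_ps: float) -> float:
    """Equation 4.1: fraction of the clock period covered by the SET pulse."""
    if clock_period_ps <= 0:
        raise ValueError("clock period must be positive")
    return pulse_width_ps / clock_period_ps


def derated_set_ser(set_fit: float, pulse_width_ps: float, clock_period_ps: float) -> float:
    """Apply the temporal de-rating to a raw SET event rate (in FIT)."""
    return set_fit * set_tdr(pulse_width_ps, clock_period_ps)


# 134 FIT for SETs wider than 25 ps comes from the text;
# the 1 ns (1000 ps) clock period is an assumed example value.
tdr = set_tdr(25.0, 1000.0)
ser = derated_set_ser(134.0, 25.0, 1000.0)
print(f"TDR = {tdr:.2%}, de-rated SET SER = {ser:.2f} FIT")
```

With these inputs the de-rated SET rate is a few FIT, the order of magnitude discussed above.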

From an experimental perspective, sequential SET contribution can be measured by comparing the results of a dedicated, high-speed sequential test vehicle for two testing frequencies.

Table 4.5: SET TDR de-rating

<table>
<thead>
<tr>
<th>Freq/Period</th>
<th colspan="6">SET TDR [%] for PW of xx ps</th>
</tr>
<tr>
<th></th>
<th>25</th>
<th>50</th>
<th>75</th>
<th>100</th>
<th>125</th>
<th>150</th>
</tr>
</thead>
<tbody>
<tr>
<td>500MHz/2ns</td>
<td>2.5</td>
<td>5</td>
<td>7.5</td>
<td>10</td>
<td>12.5</td>
<td>15</td>
</tr>
<tr>
<td>200MHz/5ns</td>
<td>1.25</td>
<td>2.5</td>
<td>3.75</td>
<td>5</td>
<td>6.25</td>
<td>7.5</td>
</tr>
</tbody>
</table>

4.4 Master and Slave Temporal De-Rating

The Master and Slave latches of the flip-flop may exhibit different SER sensitivities. Moreover, the SEU propagation through the circuit network and its subsequent memorization in a downstream cell is highly dependent on the fan-out path delay.

4.4.1 Long Paths

If the delay of the activated paths from the output of the affected flip-flop to the input of the following flip-flop is larger than the first clock half-cycle, then only errors which occur early in the clock cycle can be captured in the following flip-flop. Thus, the upsets must occur in the master latch. Any upsets in the slave can be discarded, since they arrive too late to cause any damage.

In this case, the time derating of the master latch can be expressed as a function of the event opportunity interval and the clock period:

\[ TDR_{\text{Master}} = \frac{T_{\text{CLK}} - \text{Delay}}{T_{\text{CLK}}} \] (4.2)

The de-rated Soft Error Rate of the flip-flop will be:

\[ SER_{FF} = SER_{\text{Master}} \cdot TDR_{\text{Master}} \] (4.3)

4.4.2 Short Paths

If the activated paths are short, then, some errors from the slave when \( CLK = 0 \) and all errors from the master when \( CLK = 1 \) can be latched in the downstream flip-flop. In this case, the time de-rating factors are the following:

\[ TDR_{\text{Master}} = \frac{T_{\text{CLK}=1}}{T_{\text{CLK}}} \] (4.4)

\[ TDR_{\text{Slave}} = \frac{T_{\text{CLK}=0} - \text{Delay}}{T_{\text{CLK}}} \] (4.5)

The de-rated Soft Error Rate of the flip-flop will be:

\[ SER_{FF} = SER_{\text{Master}} \cdot TDR_{\text{Master}} + SER_{\text{Slave}} \cdot TDR_{\text{Slave}} \] (4.6)
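The long-path and short-path cases can be combined in a short Python sketch. This is only an illustration of equations 4.2 to 4.6 under stated assumptions: the "first clock half-cycle" is taken to be the CLK = 1 phase, the slave de-rating is clamped at zero, and the function name and the numeric SER values in the example are hypothetical.

```python
def flip_flop_derated_ser(ser_master, ser_slave, t_clk, t_high, delay):
    """De-rated flip-flop SER following equations 4.2 to 4.6.

    t_high is the duration of the CLK = 1 phase; CLK = 0 lasts t_clk - t_high.
    Assumption: the first clock half-cycle is the CLK = 1 phase.
    """
    if not 0 < t_high < t_clk or not 0 <= delay <= t_clk:
        raise ValueError("expected 0 < t_high < t_clk and 0 <= delay <= t_clk")
    t_low = t_clk - t_high
    if delay > t_high:
        # Long path (4.2, 4.3): only early master upsets can be captured.
        tdr_master = (t_clk - delay) / t_clk
        tdr_slave = 0.0
    else:
        # Short path (4.4, 4.5): all master upsets and some slave upsets.
        tdr_master = t_high / t_clk
        # Clamped at zero (assumption) for duty cycles where t_low < delay.
        tdr_slave = max(0.0, t_low - delay) / t_clk
    return ser_master * tdr_master + ser_slave * tdr_slave


# Hypothetical values: 100 and 80 FIT, 2 ns clock with a 50% duty cycle.
long_path = flip_flop_derated_ser(100.0, 80.0, t_clk=2.0, t_high=1.0, delay=1.5)
short_path = flip_flop_derated_ser(100.0, 80.0, t_clk=2.0, t_high=1.0, delay=0.4)
print(f"long path: {long_path:.1f} FIT, short path: {short_path:.1f} FIT")
```

With a 50% duty cycle the two branches agree when the delay equals exactly half a period, so the de-rated SER varies continuously with the path delay.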

4.4.3 Further Comments

The propagation delay can be different for two outputs: \( Q \) and \( QN \). However, the difference will be small enough to safely ignore. This is especially true when considering that the timing de-rating includes external delays which are bigger than internal flip-flop delays. We can usually assume that the delay of the output stage is insignificant with regard to external delay.

Please observe that the presented equations can be simplified back to the usual

\[ TDR_{\text{Flip–Flop}} = \frac{T_{\text{CLK}} - \text{Delay}}{T_{\text{CLK}}} \] (4.7)

\[ SER_{FF} = SER_{\text{Flip–Flop}} \cdot TDR_{\text{Flip–Flop}} \] (4.8)

when a single \( SER \) value is available and the contribution of master and slave latches is supposed equal.

Please note that the in-clock-cycle temporal de-rating concept should be accompanied by a supplementary analysis dealing with the effect of the clock gating, widely used in complex designs.
4.5 Cell State Analysis in Complex Designs

In this section, we present practical approaches and a tool framework able to evaluate the state distribution of the various flip-flop instances in complete designs.

The design under test is an industrial 32 bit CPU design containing 190k sequential instances. Using a straightforward approach based on a single, average SER value of 381 FIT, representing neutron and alpha contribution, we can compute the overall intrinsic sequential SER as:

\[ 190000 \text{ cells} \cdot 381 \text{ FIT (per megacell)} = 71 \text{ FIT} \] (4.9)

This initial analysis can be improved by using per-instance cell state probability values. Our methodology relies on a VPI simulation library, compatible with all the principal simulation tools, that is able to monitor the activity of all design flip-flop instances or sequential High Level Synthesis (Register Transfer Level (RTL)) signals. As an alternative to the VPI simulation library, we have also developed tools able to read and analyze a waveform file (Value Change Dump (VCD)). Independently of the selected method, the proposed approach is able to evaluate very complex designs with a reasonable investment in time, resources and effort.

The results of a typical workload simulation are presented in Figure 4.4. The histogram bins the sequential cell instances according to their state probability (time @ 0). Using per-state SER data from the previous SER characterization of the sequential cells and state probability data from the simulation, the table 4.7 shows a more accurate evaluation of the overall SER.

In the presented case, we have observed a slight (+7%) aggravation of the SER with respect to the straightforward method. In addition, a significant number of flip-flop instances present a strong bias towards one of the 1 or 0 states. These instances are therefore good candidates for a SER improvement approach based on preferring the flip-flop states with lower sensitivity.

<table>
<thead>
<tr>
<th>% of time storing &quot;0&quot;</th>
<th>Instance Count</th>
<th>SER @ 0</th>
<th>SER @ 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0%</td>
<td>31697</td>
<td>15.6</td>
<td>0</td>
</tr>
<tr>
<td>5%</td>
<td>3872</td>
<td>1.8</td>
<td>0</td>
</tr>
<tr>
<td>10%</td>
<td>4451</td>
<td>2</td>
<td>0.1</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>95%</td>
<td>4029</td>
<td>0.1</td>
<td>0.9</td>
</tr>
<tr>
<td>100%</td>
<td>31732</td>
<td>0</td>
<td>7.4</td>
</tr>
</tbody>
</table>

TOTAL SER: 58.4 (SER @ 0) + 17.7 (SER @ 1) = 76.1 FIT
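The state-aware computation described in this section can be sketched in Python as follows. The sketch is a simplification under stated assumptions: it collapses the eight cell states into the two stored values, the waveform is given as a list of value changes (as could be extracted from a VCD file), and the function names, the waveform and the per-state SER values are hypothetical.

```python
from typing import Sequence, Tuple


def fraction_at_zero(changes: Sequence[Tuple[int, int]], t_end: int) -> float:
    """Fraction of the observed time during which a flip-flop output is 0.

    changes: time-ordered (time, value) pairs, e.g. extracted from a VCD file.
    """
    t_start = changes[0][0]
    if t_end <= t_start:
        raise ValueError("t_end must be later than the first value change")
    zero_time = 0
    for i, (t, value) in enumerate(changes):
        t_next = changes[i + 1][0] if i + 1 < len(changes) else t_end
        if value == 0:
            zero_time += t_next - t
    return zero_time / (t_end - t_start)


def state_weighted_ser(p_zero: Sequence[float], ser_at_0: float, ser_at_1: float) -> float:
    """Sum over instances of p0 * SER(storing 0) + (1 - p0) * SER(storing 1)."""
    return sum(p * ser_at_0 + (1.0 - p) * ser_at_1 for p in p_zero)


# Hypothetical waveform of one flip-flop and hypothetical per-state SER values.
p0 = fraction_at_zero([(0, 0), (30, 1), (50, 0)], t_end=100)
total = state_weighted_ser([p0, 0.0, 1.0], ser_at_0=2.0, ser_at_1=4.0)
print(f"p0 = {p0:.2f}, design SER = {total:.2f} (arbitrary units)")
```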
4.6 State-Aware SER Improvement

4.6.1 Preference Towards Lower SER Data State

A potentially interesting SER improvement technique consists in connecting the flip-flops in either a direct or an inverted configuration. The cost of this modification consists in adding inverters on the input and output of the flip-flop (Figure 4.5). If an inverted output is available, the output inverter is not needed. Similarly, the combinational logic network can be re-synthesized at no extra cost to provide an inverted output. Potentially, this scheme has very low or no overhead and can reduce the overall SER by a significant amount.

![Figure 4.5: Optimisation through state reversal](image)

Applying the method on the presented case study allowed an intrinsic SER improvement of 19.3%, from the original 76.1 FIT to the new value of 61.4 FIT.
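The selection of the preferred polarity can be illustrated with a small Python sketch. It assumes a two-state view of each flip-flop (stored 0 or stored 1); the function names, the state probabilities and the per-state SER values are hypothetical and are not the data of the case study.

```python
from typing import Sequence, Tuple


def choose_polarity(p_zero: float, ser_at_0: float, ser_at_1: float) -> Tuple[bool, float]:
    """Return (invert, ser): invert when storing the complemented value is less sensitive."""
    direct = p_zero * ser_at_0 + (1.0 - p_zero) * ser_at_1
    inverted = (1.0 - p_zero) * ser_at_0 + p_zero * ser_at_1
    return inverted < direct, min(direct, inverted)


def optimize_design(p_zero_list: Sequence[float], ser_at_0: float, ser_at_1: float):
    """Return (original SER, optimized SER, number of inverted flip-flops)."""
    original = sum(p * ser_at_0 + (1.0 - p) * ser_at_1 for p in p_zero_list)
    choices = [choose_polarity(p, ser_at_0, ser_at_1) for p in p_zero_list]
    optimized = sum(ser for _, ser in choices)
    n_inverted = sum(1 for invert, _ in choices if invert)
    return original, optimized, n_inverted


# Hypothetical example: state 0 is twice as sensitive as state 1.
original, optimized, n_inverted = optimize_design(
    [0.0, 0.05, 0.95, 1.0], ser_at_0=4.0, ser_at_1=2.0
)
improvement = (original - optimized) / original
print(f"{original:.1f} -> {optimized:.1f} ({improvement:.1%} lower), "
      f"{n_inverted} flip-flops inverted")
```

Only the instances that spend most of their time in the more sensitive state are inverted, which is why strongly biased flip-flops are the best candidates.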

The design has 1.2M combinatorial cells and 190k flip-flops. 70,000 X1 inverters have been used, corresponding to a +5% cell count increase, at a cost of less than 1% area overhead. Since the inverters are placed on mostly stable signals, the activity factor is low and thus so is the dynamic power consumption overhead. The timing overhead is equal to the propagation delay of an inverter.

Table 4.7: SER contribution per states

<table>
<thead>
<tr>
<th>% of time storing &quot;0&quot;</th>
<th>Instance Count</th>
<th>SER @ 0</th>
<th>SER @ 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0%</td>
<td>31697</td>
<td>7.4</td>
<td>0</td>
</tr>
<tr>
<td>5%</td>
<td>3872</td>
<td>0.9</td>
<td>0.1</td>
</tr>
<tr>
<td>10%</td>
<td>4451</td>
<td>0.9</td>
<td>0.2</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>95%</td>
<td>4029</td>
<td>0.1</td>
<td>0.9</td>
</tr>
<tr>
<td>100%</td>
<td>31732</td>
<td>0</td>
<td>7.4</td>
</tr>
</tbody>
</table>

TOTAL SER: 42.2 (SER @ 0) + 19.2 (SER @ 1) = 61.4 FIT

The previous technique can be enhanced by the usage of flip-flops with a selective transistor hardening. Their SEE figures may be highly unbalanced, presenting a resilient, preferred state. This way, the potential hardening costs and efforts will be minimized while conserving an overall good SER.

4.7 Conclusions

This chapter aims at raising the awareness regarding the current need for a more in-depth, sophisticated Soft Error Analysis of Single Event Upsets in sequential cells. We proposed a set of requirements concerning the per-state SER analysis of Flip-Flops and a framework of tools to fulfill the selected requirements. The detailed data is then used together with updated temporal de-rating techniques and state-dependent SER computations based on design state analysis. The overall methodology allows better, more accurate overall results, that are representative of the actual circuit application and usage. Consequently, we have investigated a selection of SER improvement efforts based on state-aware optimization. The proposed techniques allow for a significant reduction in the overall SER at very low or no extra cost.
5.1 Introduction

Silicon design must incorporate the effect of SEEs from the early stages of an Application Specific Integrated Circuit (ASIC) design flow. The analysis methodology should take into account the SER characterization data of the underlying implementation technology, derive circuit SER metrics from the information available at different design stages and elaborate a SER characterization datasheet that predicts circuit behavior in the field.

The TFIT tool [Hane 2008, Belhaddad 2006, Belhaddad 2008] aims at predicting the SEE performance of standard and memory cells. The tool has the support and backing of leading technology providers and has been proven to match real test results for a variety of process nodes, down to the latest FinFET technologies.

The SoCFIT platform provides modules and tools to analyze the propagation of the SEE-induced faults from the output of the affected cell through the inner paths of the circuit up to system and application. The tool implements the techniques presented in chapter 3 and chapter 4 to generate an extensive report for the device under test, providing intrinsic (raw) SER data (obtained from the TFIT tool), derating factors and the final, application and implementation-specific SER figures, representative of the actual behavior of the circuit in the system.
Chapter 5. Derating Analysis of a Complex CPU

The provided SER information will then help the designers to direct implementation choices, select a design hardening methodology, establish a failure recovery/mitigation strategy and help the support engineers to accompany the final users of the design in building reliable systems. SER prediction tools can assist in the decision making of when and where to use a protection scheme on memories, use of hardened-by-design flip-flops, or a globally optimized SER resiliency.

This chapter presents the results of the SoCFIT analysis of a complex commercial CPU core. The analysis\textsuperscript{1} includes the following tasks:

- Logic De-Rating (LDR) analysis
- Memory De-Rating (MDR) analysis
- Functional De-Rating (FDR) analysis

The FIT results reported in this chapter are calculated from the raw SER reported in [Alexandrescu 2013] for a DFF_X1:

- Neutron Sensitivity: 187.0 FIT/Mbit
- Alpha Particle Sensitivity: 349.6 FIT/Mbit
  - assuming an alpha particle emissivity of: 0.002 $\alpha$/cm$^2$/hour

The design considered in this work was a single-core implementation of an industrial 32-bit CPU design containing 244,083 flip-flops and 20.7 Mbit of SRAM.

5.1.1 Statistical Confidence

When we perform a series of fault-injection simulation runs in order to estimate the probability, $p$, of a given outcome such as a SDC or DUE failure, we are performing a series of Bernoulli trials.

There are different ways to estimate a confidence interval [Leveugle 2009, Evans 2012] for a given value of $\alpha$, where $1 - \alpha$ is the level of confidence: $\alpha = 0.05$ for 95% confidence. If we assume that the distribution of the estimate of $p$ is normal, which for large values of $n$ (number of injected faults) holds approximately due to the central limit theorem, then a confidence interval is given by the following equation:

$$
\bar{p}_n \pm z_{\alpha/2} \sqrt{\frac{\bar{p}_n (1 - \bar{p}_n)}{n}} \qquad (5.1)
$$

where, for 95% confidence,

$$
z_{\alpha/2} = 1.96 \qquad (5.2)
$$

and

$$
\bar{p}_n = \frac{1}{n} \sum_{i=1}^{n} \chi_i \qquad (5.3)
$$

\textsuperscript{1}The analysis focused only on flip-flops and memories, since a synthesized netlist of the CPU was not available. For the same reason, Temporal De-Rating (TDR) calculation was not possible.

Here, \( n \) is the number of injected faults and \( \chi_i \) is a random variable that represents the simulation outcome: \( \chi_i = 1 \) in case of error and \( \chi_i = 0 \) otherwise.

The results reported in Table 5.3 and in Table 5.4 include the error bars, which are calculated using the formulas reported in this paragraph.
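As a minimal sketch of equations 5.1 to 5.3, the following Python function computes the estimated failure probability and the half-width of its confidence interval. The function name is hypothetical, and the input of 13 errors out of 5000 injections is an illustrative choice that corresponds to a 0.260% error probability.

```python
import math


def proportion_confidence_interval(errors: int, n: int, z: float = 1.96):
    """Normal-approximation interval of equation 5.1 (z = 1.96 for 95% confidence)."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p_hat = errors / n  # equation 5.3: mean of the 0/1 outcomes
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, half_width


# Illustrative input: 13 failures observed over 5000 injected faults.
p_hat, half_width = proportion_confidence_interval(errors=13, n=5000)
print(f"{p_hat:.3%} ±{half_width:.3%}")
```

The normal approximation becomes poor when the number of observed errors is very small, so the interval should be read with care for rare failure classes.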

5.2 Logic De-Rating Analysis

As a first task of the analysis, the Logic De-Rating (LDR) was calculated. As presented in section 2, the LDR represents the propagation probability of the logic fault\(^2\) from the output of the affected cell to the inputs of a sequential/memory cell. Depending on the state of the circuit (the values of the signals and cell outputs\(^3\)), the propagation of the fault is subject to logic blocking. As an example, an AND gate with a low input will block any faults on the other inputs. The evaluation of the logic de-rating is a common and well-researched subject in the reliability community. Many research papers, methodologies and tools are available to help the designer evaluate the logic de-rating of circuits. Standard simulators (compiler-driven, table-driven or event-driven) are perfectly able to inject and simulate faults in very complex circuits and designs. In this work, a specific fault simulation technique called Parallel Pattern Single Fault Propagation (PPSFP) was used [Alexandrescu 2012]. The results of the analysis are presented in Figure 5.1 and in Table 5.1.

<table>
<thead>
<tr>
<th>Flip-Flop Raw SER FIT</th>
<th>Average LDR</th>
<th>Flip-Flop derated SER FIT</th>
</tr>
</thead>
<tbody>
<tr>
<td>124.9</td>
<td>74.1%</td>
<td>92.6</td>
</tr>
</tbody>
</table>

The logic de-rating (LDR), although it cannot be used for application-specific selective mitigation purposes, allows a quick (low cost)\(^4\) circuit SER estimation. The set-up and CPU costs are much lower than for the functional de-rating analysis, as LDR is calculated using static and probabilistic algorithms.
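The figures of Table 5.1 can be reproduced with a few lines of Python. The sketch assumes that 1 Mbit corresponds to 2^20 cells, a convention inferred here because it reproduces the reported 124.9 FIT from the flip-flop count and the per-Mbit sensitivities given in the introduction; the function name is hypothetical.

```python
MBIT = 2 ** 20  # assumption: 1 Mbit = 2**20 cells (reproduces the 124.9 FIT figure)


def raw_flip_flop_ser(n_flip_flops: int, neutron_fit_per_mbit: float,
                      alpha_fit_per_mbit: float) -> float:
    """Raw (intrinsic) SER in FIT for a population of identical flip-flops."""
    return n_flip_flops * (neutron_fit_per_mbit + alpha_fit_per_mbit) / MBIT


raw = raw_flip_flop_ser(244_083, 187.0, 349.6)  # values from section 5.1
derated = raw * 0.741                            # average LDR of 74.1%
print(f"raw = {raw:.1f} FIT, LDR de-rated = {derated:.1f} FIT")
```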

5.3 Memory De-Rating Analysis

As introduced in section 2, the memory de-rating (MDR) [Alexandrescu 2011] accounts for the fact that embedded memories do not always contain user-useful data. Furthermore, the memory locations are only vulnerable during a limited opportunity window, allowing us to compute an objective de-rating factor. Figure 5.2 shows an example sequence of memory accesses. Equation 5.4 illustrates the formula used to calculate the memory de-rating.

\(^2\)Single Event Upset (SEU) for flip-flop
\(^3\)SoCFIT considers uniform input probability
\(^4\)The LDR calculation feature of SoCFIT can be very useful: it offers a quick solution for de-rating the intrinsic raw SER by an objective, upper bound factor. If the SER requirement is not very tight, LDR and TDR calculation can be sufficient to prove that the overall circuit SER meets the requirement. If the application is very demanding, requiring very low final SER values, FDR factors will help computing a lower, applicative, overall SER value than LDR (FDR includes LDR).

$$\text{Memory DeRating} = \frac{(t_2 - t_0) + (t_4 - t_3) + (t_7 - t_6)}{(T_e - T_s)} \qquad (5.4)$$

The memory de-rating was obtained by running all the test-cases included with the core source code and also by running the three benchmarks implemented for the fault injection campaigns. All the read and write accesses were analyzed independently for each word in each memory; the results of this task are presented in Table 5.2.

This analysis was performed using a VPI library that monitored all memory reads and writes and computed the fraction of time during which data was active (between the write and the last read of the address in the memory). The simulation transcripts were then analyzed with SoCFIT to calculate the memory de-rating for each instance.

Table 5.2: Average Memory De-Rating

<table>
<thead>
<tr>
<th rowspan="2">Application</th>
<th colspan="2">Average Memory De-Rating</th>
</tr>
<tr>
<th>Actual</th>
<th>Pessimistic</th>
</tr>
</thead>
<tbody>
<tr>
<td>bit count</td>
<td>0.4824</td>
<td>0.7268</td>
</tr>
<tr>
<td>qsort</td>
<td>0.4860</td>
<td>0.8185</td>
</tr>
<tr>
<td>simple math</td>
<td>0.4820</td>
<td>0.6777</td>
</tr>
<tr>
<td>app1</td>
<td>0.4563</td>
<td>0.6326</td>
</tr>
<tr>
<td>app2</td>
<td>0.0305</td>
<td>0.0659</td>
</tr>
<tr>
<td>app3</td>
<td>0.4308</td>
<td>0.6002</td>
</tr>
<tr>
<td>app4</td>
<td>0.4397</td>
<td>0.6045</td>
</tr>
<tr>
<td>app5</td>
<td>0.4308</td>
<td>0.6002</td>
</tr>
<tr>
<td>app6</td>
<td>0.3914</td>
<td>0.3919</td>
</tr>
<tr>
<td>app7</td>
<td>0.0279</td>
<td>0.0358</td>
</tr>
<tr>
<td>app8</td>
<td>0.5061</td>
<td>0.7067</td>
</tr>
<tr>
<td>app9</td>
<td>0.4007</td>
<td>0.6236</td>
</tr>
<tr>
<td>app10</td>
<td>0.1624</td>
<td>0.1624</td>
</tr>
<tr>
<td>app11</td>
<td>0.3620</td>
<td>0.3620</td>
</tr>
<tr>
<td>app12</td>
<td>0.3455</td>
<td>0.3455</td>
</tr>
<tr>
<td>app13</td>
<td>0.4957</td>
<td>0.4957</td>
</tr>
<tr>
<td>app14</td>
<td>0.0402</td>
<td>0.0723</td>
</tr>
<tr>
<td>app15</td>
<td>0.0368</td>
<td>0.0660</td>
</tr>
<tr>
<td>app16</td>
<td>0.0351</td>
<td>0.0678</td>
</tr>
<tr>
<td>app17</td>
<td>0.3715</td>
<td>0.5543</td>
</tr>
<tr>
<td>app18</td>
<td>0.3715</td>
<td>0.5637</td>
</tr>
</tbody>
</table>

There is some uncertainty as to whether the overall address utilization in the considered test-cases is representative of a real working system. To mitigate this risk, SoCFIT was designed with two options for computing the memory de-rating. In the Actual mode, it is assumed that the address utilization of the test-cases is correct for the considered application: the addresses in a memory that are never used are assumed to be insensitive and are thus de-rated to zero. Conversely, in the Pessimistic mode, it is assumed that the addresses that were never accessed in a test-case would actually be accessed in the real system, and thus the average de-rating of the used addresses is applied to the unused addresses. Both values are reported in Table 5.2.
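The per-address computation and the two averaging modes can be sketched as follows. This is an illustration only, not the SoCFIT implementation: the access-trace format, the function names and the choice to ignore reads that precede the first write are assumptions.

```python
def address_derating(accesses, t_start, t_end):
    """Fraction of [t_start, t_end] during which one address holds live data.

    accesses: time-ordered (time, op) pairs with op in {"W", "R"}.
    Data is live from a write until the last read before the next write.
    Reads that precede the first write are ignored (assumption).
    """
    if t_end <= t_start:
        raise ValueError("t_end must be later than t_start")
    live = 0
    write_time = last_read = None
    for t, op in accesses:
        if op == "W":
            if write_time is not None and last_read is not None:
                live += last_read - write_time
            write_time, last_read = t, None
        elif op == "R" and write_time is not None:
            last_read = t
    if write_time is not None and last_read is not None:
        live += last_read - write_time
    return live / (t_end - t_start)


def memory_derating(per_address, n_addresses):
    """Return (actual, pessimistic) averages over a memory of n_addresses words.

    per_address: dict mapping each accessed address to its de-rating.
    """
    used = list(per_address.values())
    actual = sum(used) / n_addresses                       # unused words count as 0
    pessimistic = sum(used) / len(used) if used else 0.0   # unused words get the used average
    return actual, pessimistic


# Hypothetical trace with three live intervals, similar in shape to equation 5.4.
trace = [(0, "W"), (10, "R"), (20, "R"), (30, "W"), (40, "R"), (60, "W"), (70, "R")]
d = address_derating(trace, t_start=0, t_end=100)
actual, pessimistic = memory_derating({0: d, 1: 0.2}, n_addresses=4)
print(f"address: {d:.2f}, actual: {actual:.2f}, pessimistic: {pessimistic:.2f}")
```

When every address of a memory is accessed, both modes return the same value, which is the situation of the applications in Table 5.2 whose two columns are identical.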

Since all the SRAMs embedded in the CPU were protected with ECC, the overall memory SER was not calculated (it would be 0). However, ECC is not always an option; in such applications, the memory de-rating can help designers decide whether or not to implement ECC for a specific memory.
5.4 Functional De-Rating Analysis

As introduced in section 2, the Functional De-Rating (FDR) [Shi-Jie 2008] evaluates whether the Soft Error has any observable impact on the functioning of the circuit, board or system, considering its actual usage. For this study, three representative benchmarks were selected and implemented. For each benchmark scenario, three fault injection campaigns were performed. An initial fault injection campaign allowed the calculation of an initial, design-wide FDR factor. A second fault injection campaign, based on a clustering approach [Evans 2013c, Evans 2014], provided a finer-grained (per instance) characterization of the design. Lastly, a final fault injection campaign enabled the evaluation of a hypothetical mitigation scenario implemented using the data calculated from the initial and the clustered fault injection campaigns.

5.4.1 Fault Classification

To identify all the possible classes of failure, a preliminary fault injection campaign was performed with a further application (Bubble Sort). The results obtained from this campaign enabled the following failure classification (aligned to the de-facto standard nomenclature for CPU design [Mukherjee 2008]):

- Silent Data Corruption (SDC):
  - The simulation completed but the result was either wrong, incomplete or missing.

- Detectable Uncorrectable Error (DUE):
  - The run failed in an observable way:
    * Simulation Completed and either an Error Indication or an Interrupt signal toggled.
    * CPU Timeouts (simulation not completed).

The cases where the simulation completed successfully but the simulation time was not identical to the reference (golden) have not been considered errors.
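The classification above can be expressed as a small decision function. The sketch below is illustrative: the argument names and the "OK" label for runs without observable failure are hypothetical, and giving priority to DUE when an error indication and a wrong result occur together is an assumption.

```python
def classify_run(completed: bool, error_flag: bool, interrupt_flag: bool,
                 result, golden_result) -> str:
    """Classify one fault-injection run as "DUE", "SDC" or "OK"."""
    if not completed:
        return "DUE"  # CPU timeout: the simulation did not complete
    if error_flag or interrupt_flag:
        return "DUE"  # completed, but an error indication or interrupt toggled
    if result != golden_result:
        return "SDC"  # completed silently with a wrong, incomplete or missing result
    # A run that only differs from the golden run in simulation time is not an error.
    return "OK"


print(classify_run(True, False, False, [1, 2, 3], [1, 2, 3]))
print(classify_run(True, False, False, [1, 3, 2], [1, 2, 3]))
print(classify_run(False, False, False, None, [1, 2, 3]))
```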

5.4.1.1 Benchmark Application

Each application considered for the fault injection campaigns was designed to have the same overall sequence (presented in the following and shown in Figure 5.3):

1. INIT Phase
   (a) The CPU is initialized
   (b) The program and the data are loaded

2. BENCHMARK Phase
   (a) The application runs to completion

3. REPORT Phase
   (a) The results of the application are copied into a test-bench memory, to be verified during the analysis phase

As shown in figure 5.3 the actual injection of fault was performed during the benchmark phase, so that

![Fault Injection Diagram]

Figure 5.3: Fault Injection

The following benchmark applications, obtained from the automotive set of the MiBench suite [Guthaus 2001, MiBench 2001], were adapted and used for the fault injection campaign.

1. Quick sort (QSORT)
2. Simple mathematical operations (SMATH)
3. Bit count algorithms (BITCNT)

**Quick Sort Algorithm Benchmark** The benchmark executed a quick sort algorithm on a 700-item array. The array size was selected in order to have a reasonable execution (CPU) time.

The reference (fault-free) simulation took 58318 clock cycles (benchmark duration), and the overall simulation required 7.0 minutes of execution (CPU) time.

**Simple Mathematical Operations Benchmark** The benchmark executed a series of simple mathematical operations:

- 40 cubic equations
- 100 integer square roots
- 30 radians to degrees conversions
- 30 degrees to radians conversions

The reference (fault-free) simulation took 67327 clock cycles (benchmark duration), and the overall simulation required 4.5 minutes of execution (CPU) time.

**Bit Count Algorithms Benchmark** The benchmark executed 7 different bit count algorithms. Each of them was executed 200 times.

The reference (fault-free) simulation took 60265 clock cycles (benchmark duration), and the overall simulation required 5.5 minutes of execution (CPU) time.

5.4.1.2 Fault Injection

For each benchmark, three fault injection campaigns were performed:

- Initial fault injection campaign (5000 SEU injected)
- Clustered fault injection campaign (20000 SEU injected)
- Final fault injection campaign (5000 SEU injected)

**Initial Fault Injection** A random fault-injection analysis was performed in order to obtain a first estimate of the actual FIT rate after functional de-rating. One of the problems with a random, flat analysis is that identifying the critical flip-flops requires running multiple fault injections for every instance, which is simply not possible in a design with millions of flip-flops.

Table 5.3 reports the initial fault injection functional de-rating results and the circuit failure rate calculated for each application for each class of failure.

Table 5.3: Initial Fault Injection Results and Overall SER

<table>
<thead>
<tr>
<th>Application</th>
<th>Failure Class</th>
<th>Functional De-Rating</th>
<th>Failure Rate [FIT]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Quick Sort</td>
<td>SDC</td>
<td>0.260% ±0.141%</td>
<td>0.32 [0.50-0.15]</td>
</tr>
<tr>
<td></td>
<td>DUE</td>
<td>1.000% ±0.276%</td>
<td>1.25 [1.59-0.90]</td>
</tr>
<tr>
<td>Math Functions</td>
<td>SDC</td>
<td>0.400% ±0.175%</td>
<td>0.50 [0.71-0.28]</td>
</tr>
<tr>
<td></td>
<td>DUE</td>
<td>0.941% ±0.276%</td>
<td>1.18 [1.52-0.83]</td>
</tr>
<tr>
<td>Bit Count Algorithms</td>
<td>SDC</td>
<td>0.720% ±0.234%</td>
<td>0.90 [1.19-0.61]</td>
</tr>
<tr>
<td></td>
<td>DUE</td>
<td>0.920% ±0.265%</td>
<td>1.15 [1.48-0.82]</td>
</tr>
<tr>
<td>Average</td>
<td>SDC</td>
<td>0.460% ±0.108%</td>
<td>0.57 [0.71-0.44]</td>
</tr>
<tr>
<td></td>
<td>DUE</td>
<td>0.954% ±0.156%</td>
<td>1.19 [1.39-1.00]</td>
</tr>
</tbody>
</table>

SEUs were injected using the fault injection capabilities of the SoCFIT VPI simulation library.
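The failure rates of Table 5.3 appear consistent with multiplying the raw flip-flop SER of Table 5.1 (124.9 FIT) by the functional de-rating and by its confidence bounds. The following Python sketch illustrates this reading; the function name is hypothetical and the relation itself is inferred from the reported numbers rather than stated explicitly.

```python
def derated_fit(raw_fit: float, fdr: float, fdr_error: float):
    """Return (nominal, upper, lower) failure rates in FIT for one failure class."""
    nominal = raw_fit * fdr
    upper = raw_fit * (fdr + fdr_error)
    lower = raw_fit * max(0.0, fdr - fdr_error)  # a rate cannot be negative
    return nominal, upper, lower


# Quick Sort, SDC class: 0.260% ±0.141% applied to the 124.9 FIT raw SER.
nominal, upper, lower = derated_fit(124.9, 0.00260, 0.00141)
print(f"{nominal:.2f} [{upper:.2f}-{lower:.2f}]")
```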

**Clustering** An exhaustive analysis of individual flip-flop instances is cumbersome and costly for large designs. Objectively ranking the instances in terms of SER contribution is not easily feasible, and partial hardening schemes are difficult to apply efficiently.

A more scalable approach consists in grouping together flip-flops that have a similar function or are related through objective criteria. Accordingly, the flip-flops of a cluster are expected to exhibit similar SER vulnerabilities. For example, in a processor design, it would not make sense to harden some bits of the program counter (PC) and not others [Evans 2013c, Evans 2014].

A clustering technique was thus deployed during this project. The flip-flops in each block were grouped into clusters by matching the instance names against regular expressions. Flip-flops in the same bus or with very similar names were grouped into a single cluster.
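
As an illustration of name-based clustering, the following minimal sketch groups instance names that differ only by a bus index. The regular expression and the instance names are hypothetical; the actual expressions used in the project are not given in the text.

```python
import re
from collections import defaultdict

# Hypothetical rule: strip a trailing bus index such as "[12]" or "_12"
# so that all bits of the same register share one cluster key.
_INDEX = re.compile(r"(\[\d+\]|_\d+)$")

def cluster_key(instance_name):
    """Return the instance name without its trailing bit index."""
    return _INDEX.sub("", instance_name)

def cluster_flip_flops(instance_names):
    """Group flip-flop instance names by their cluster key."""
    clusters = defaultdict(list)
    for name in instance_names:
        clusters[cluster_key(name)].append(name)
    return dict(clusters)

# Made-up instance names, for illustration only.
names = [
    "core/pc_reg[0]",
    "core/pc_reg[1]",
    "core/alu/acc_reg_0",
    "core/alu/acc_reg_1",
    "core/irq_pending_reg",
]
clusters = cluster_flip_flops(names)
for key, members in clusters.items():
    print(key, len(members))
```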

Figure 5.4 shows the cluster size distribution.

![Cluster Size Distribution](image)

Using the simulation environment, fault-injection simulations were performed. The number of fault injections was chosen in proportion to the number of flip-flops grouped in the same cluster, up to a maximum of 102 fault injections (100 + 2 to allow for possible execution issues). Clusters smaller than approximately 32 bits were not simulated. These clusters represent 0.35% of the total number of flip-flops.
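
The per-cluster injection budget can be sketched as follows. The cap of 102 and the threshold of about 32 bits come from the text; the proportionality factor `per_flop` is a hypothetical parameter, since the actual ratio is not stated.

```python
import math

def injections_for_cluster(size, per_flop=1.0, cap=102, min_size=32):
    """Number of fault injections for a cluster of `size` flip-flops.

    per_flop is an assumed proportionality factor; cap and min_size
    follow the values quoted in the text.
    """
    if size < min_size:
        return 0  # small clusters were not simulated
    return min(cap, math.ceil(per_flop * size))

for size in (16, 64, 500):
    print(size, injections_for_cluster(size))
```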

The simulation transcripts (initial and clustered fault injection) were then parsed and the errors were classified as SDC or DUE.

Figure 5.5 and Figure 5.6 show, for the three benchmarks, the portion of the design (%Flip-Flop) that caused DUE and SDC respectively.

Figure 5.7 shows, for each benchmark application, the portion of the design (%Flip-Flop) that caused DUE only, SDC only and both SDC and DUE.

Figure 5.5: Portion of the design (%Flip-Flop) that produced DUE

Figure 5.6: Portion of the design (%Flip-Flop) that produced SDC

**Final Fault Injection** A further flat, design-wide, fault-injection analysis was performed to evaluate the effectiveness of the hypothetical mitigation scenarios. The data gathered from the initial and the clustered fault simulation campaigns have been used together to rank the flip-flops (clustered) in decreasing order of criticality.

Table 5.4 reports the final fault injection functional de-rating results and the circuit failure rate calculated for each application for each class of failure.

Table 5.4: Final Fault Injection Results and Overall SER before mitigation

<table>
<thead>
<tr>
<th>Application</th>
<th>Failure class</th>
<th></th>
<th>Functional de-rating</th>
<th>Circuit failure rate</th>
</tr>
</thead>
<tbody>
<tr>
<td>Quick Sort</td>
<td>SDC</td>
<td>124.9</td>
<td>0.280% ±0.146%</td>
<td>0.35 [0.53-0.17]</td>
</tr>
<tr>
<td></td>
<td>DUE</td>
<td></td>
<td>0.800% ±0.247%</td>
<td>1.00 [1.31-0.69]</td>
</tr>
<tr>
<td>Math Functions</td>
<td>SDC</td>
<td></td>
<td>0.380% ±0.171%</td>
<td>0.47 [0.69-0.26]</td>
</tr>
<tr>
<td></td>
<td>DUE</td>
<td></td>
<td>0.660% ±0.224%</td>
<td>0.82 [1.10-0.54]</td>
</tr>
<tr>
<td>Bit Count Algorithms</td>
<td>SDC</td>
<td></td>
<td>0.660% ±0.224%</td>
<td>0.82 [1.10-0.54]</td>
</tr>
<tr>
<td></td>
<td>DUE</td>
<td></td>
<td>1.100% ±0.289%</td>
<td>1.37 [1.73-1.01]</td>
</tr>
<tr>
<td>Average</td>
<td>SDC</td>
<td>124.9</td>
<td>0.440% ±0.106%</td>
<td>0.55 [0.68-0.42]</td>
</tr>
<tr>
<td></td>
<td>DUE</td>
<td></td>
<td>0.854% ±0.147%</td>
<td>1.07 [1.25-0.88]</td>
</tr>
</tbody>
</table>

**Mitigation Scenarios** Flip-Flops to be hardened have been selected based on clustering results: clusters with higher failure rates (SDC and/or DUE) were protected first.

After all the clusters that produced an error were protected, the failure rate of the processor was not zero. This is because the number of fault injections in a given cluster may not have been sufficient to observe a failure. In order to extend the available data, the flip-flops in clusters that are "close", within the design hierarchy, to the clusters which produced failures were considered for protection.

Initially, clusters at the same 3rd level of hierarchy were merged with the clusters that were already protected. Then those at the 2nd and 1st levels of hierarchy were merged in as well. This approach makes it possible to extract the maximum benefit from the relatively modest number of fault injection simulations. Four criteria were used, in sequence, to select the flip-flops to harden:

1. Harden flip-flops with highest failure rate (from clustering) first
2. Harden flip-flops in third-level-modules (third hierarchical level) with highest percentage of hardened flip-flops first
3. Harden flip-flops in second-level-modules (second hierarchical level) with highest percentage of hardened flip-flops first
4. Harden flip-flops in first-level-modules (first hierarchical level) with highest percentage of hardened flip-flops first
The improvement obtained with each criterion is highlighted (background color) in the following charts (Figure 5.8 to Figure 5.16).

- The light-red portion highlights the SER improvement obtained with the first criterion (clustering information).
- The light-green portion highlights the SER improvement obtained with the second criterion.
- The light-blue portion highlights the SER improvement obtained with the third criterion.
- The light-yellow portion highlights the SER improvement obtained with the fourth criterion.
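
The four selection criteria listed above can be sketched as a single ordering routine. This is only one possible reading of the procedure: it assumes that instance paths are '/'-separated, that the hierarchy level is counted from the top of the design, and that the percentage of hardened flip-flops in a module is evaluated once at the start of each step. All names and data below are hypothetical.

```python
from collections import defaultdict

def module_of(path, level):
    """Return the first `level` components of a '/'-separated instance path."""
    return "/".join(path.split("/")[:level])

def hardening_order(clusters):
    """Order clusters for hardening using the four criteria in sequence.

    clusters: list of dicts with keys 'path', 'size' and 'fail_rate'.
    """
    # Criterion 1: clusters with an observed failure rate, highest first.
    order = sorted((c for c in clusters if c["fail_rate"] > 0),
                   key=lambda c: c["fail_rate"], reverse=True)
    chosen = {c["path"] for c in order}

    # Criteria 2 to 4: widen the scope from level-3 to level-1 modules.
    for level in (3, 2, 1):
        total = defaultdict(int)
        hardened = defaultdict(int)
        for c in clusters:
            module = module_of(c["path"], level)
            total[module] += c["size"]
            if c["path"] in chosen:
                hardened[module] += c["size"]

        def share(c):
            # Fraction of the enclosing module that is already hardened.
            module = module_of(c["path"], level)
            return hardened[module] / total[module]

        extra = [c for c in clusters
                 if c["path"] not in chosen and share(c) > 0]
        extra.sort(key=share, reverse=True)
        order.extend(extra)
        chosen.update(c["path"] for c in extra)
    return order

# Made-up clusters, for illustration only.
example = [
    {"path": "cpu/core/alu/acc", "size": 32, "fail_rate": 0.05},
    {"path": "cpu/core/alu/tmp", "size": 32, "fail_rate": 0.0},
    {"path": "cpu/core/lsu/addr", "size": 64, "fail_rate": 0.02},
    {"path": "cpu/core/lsu/buf", "size": 192, "fail_rate": 0.0},
    {"path": "cpu/dbg/ctl/state", "size": 32, "fail_rate": 0.0},
    {"path": "cpu/core/dec/op", "size": 32, "fail_rate": 0.0},
]
order = [c["path"] for c in hardening_order(example)]
print(order)
```

In this reading, a cluster that never produced a failure is still scheduled for hardening once a neighbouring cluster in the hierarchy has been protected, which mirrors the "close" clusters argument made above.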

The following figures (Figure 5.8 to Figure 5.16) show the flip-flop SER improvement with respect to the percentage of hardened flip-flops, for each benchmark application.

- Figure 5.8 and Figure 5.9 show the SDC and DUE rates improvement for the quick-sort benchmark.
- Figure 5.10 and Figure 5.11 show the SDC and DUE rates improvement for the math functions benchmark.
- Figure 5.12 and Figure 5.13 show the SDC and DUE rates improvement for the bit count benchmark.
- Figure 5.14 and Figure 5.15 show the SDC and DUE rates improvement when the flip-flops are ranked in order of criticality using the DUE and SDC rates averaged over the three benchmarks.
- Figure 5.16 shows the SDC and DUE rates improvement when the flip-flops are ranked in order of criticality using the sum of the DUE and SDC rates of the three benchmarks.

The results obtained (Figure 5.8 to Figure 5.16) show that the failure rates, both SDC and DUE, can be reduced considerably by hardening a limited percentage of flip-flop instances. As an example, an SER reduction of ∼5X can be obtained by protecting only ∼10% of the circuit flip-flops.

The key to this success consists in ranking the circuit features according to their vulnerability using a clustering approach and then selectively protecting the most critical candidates. Furthermore, this result can be achieved by running a limited fault simulation campaign (90,000 SEUs injected overall).

The SoCFIT platform has been instrumental in implementing this strategy and has minimized the implementation and simulation costs.
Chapter 5. Derating Analysis of a Complex CPU

Figure 5.8: Quick Sort SDC and DUE Rates Reduction Versus % Hardened Flip-Flops when Selection is Made Based on Observed DUE Rate

Figure 5.9: Quick Sort SDC and DUE Rates Reduction Versus % Hardened Flip-Flops when Selection is Made Based on Observed SDC Rate
5.4. Functional De-Rating Analysis

Figure 5.10: S-Math SDC and DUE Rates Reduction Versus % Hardened Flip-Flops when Selection is Made Based on Observed DUE Rate

Figure 5.11: S-Math SDC and DUE Rates Reduction Versus % Hardened Flip-Flops when Selection is Made Based on Observed SDC Rate

Figure 5.12: Bit Count SDC and DUE Rates Reduction Versus % Hardened Flip-Flops when Selection is Made Based on Observed DUE Rate

Figure 5.13: Bit Count SDC and DUE Rates Reduction Versus % Hardened Flip-Flops when Selection is Made Based on Observed SDC Rate

Figure 5.14: SDC and DUE Rates Reduction Versus % Hardened Flip-Flops when Selection is Made Based on Observed average DUE Rate

Figure 5.15: SDC and DUE Rates Reduction Versus % Hardened Flip-Flops when Selection is Made Based on Observed average SDC Rate
### 5.5 Conclusion

In this chapter we presented an overview of a comprehensive SER analysis flow that can be used to evaluate and improve the overall SER reliability metrics of complex devices. The goals of such an SER analysis are multiple.

First, the reliability engineer is able to characterize and document the reliability figures of the design in order to prove that stringent customer-driven expectations are successfully met.

Second, the application-specific SER figures of the design are evaluated using a scalable approach that consists in grouping together flip-flops (clustering). The implemented methodology for selective mitigation is based on coarse-grained fault injection: Starting with the full set of flip-flops (i), they are grouped into clusters using static information (ii). Statistical fault injection is performed for each of the clusters to rank them based on their sensitivity (iii) and then a set of the most sensitive clusters is selected for mitigation (iv) [Evans 2013c].

Third, the design flow-oriented methodology allows the reliability and design engineer to share a common goal and use the same concepts and tools.

Finally, the outcomes of the analysis help the reliability engineer to choose the optimal error handling methodology in order to meet harsh reliability constraints, to ensure adequate data protection, to respect pre-defined system up-time constraints and to provide support and maintenance to the final user during the lifetime of the product. Moreover, the methodology provides early results that can be used in improving the circuit SER resilience through architectural and design choices with the firm goal of improving customer experience when using high availability products.

The results obtained allowed the development and evaluation of a hypothetical mitigation scenario that aims to significantly improve the reliability of the circuit at the lowest cost. They show that the failure rates, both SDC (Silent Data Corruption) and DUE (Detectable Uncorrectable Errors), can be significantly reduced by hardening a limited percentage of flip-flop instances. For the considered design, with \( \sim 250,000 \) flip-flops, running only \( 90,000 \) fault injection simulations showed that an SER reduction of \( \sim 5X \) (for both SDC and DUE) can be obtained by hardening \( \sim 10\% \) of the circuit flip-flops.
Chapter 6

Conclusion

Contents

6.1 Single Event Transient Analysis
6.2 Single Event Analysis for Sequential Logic
6.3 Derating Analysis of a Complex CPU
6.4 Summary

Radiation-induced soft errors are one of the major sources of failures in nanoscale digital circuits. In addition to traditional safety and mission critical applications, mainstream systems must be designed to consider the reliability impact of soft errors. The physics behind soft error phenomena is well understood, although the impact of new process technologies such as FinFETs and FD-SOI is still being studied. The research community offers a wealth of solutions for each step of the design flow and for any practical representation of the circuit, with a wide spectrum of performance and ease of use. Nevertheless, assessing the effect of soft errors on a complex design remains a challenging task and the approaches taken in industry often involve large approximations. The techniques proposed in this thesis contribute to better addressing these problems. As with any research, this document provides a snapshot of the current state of the work; additional research is required to further develop the techniques that have been proposed.

### 6.1 Single Event Transient Analysis

The first contribution of this thesis (chapter 3) concerns the Single Event Transient (SET) analysis.

SETs affecting combinational logic are considerably more difficult to model, simulate and analyze than SEUs. The working environment may cause a myriad of distinctive transient pulses in various cell types that are used in widely different configurations.

Chapter 3 presents a practical SET analysis flow that shows a possible approach to the SET evaluation of a 45 nm cell library and a design, an effort carried out in an industrial setting. The analysis flow consists of three main steps. The first step concerns the technology characterization of the standard cell library: the TFIT tool was used to characterize the Nangate 45 nm Open Cell Library. For each cell, the effects of neutrons and alpha particles have been studied. The second step concerns the SET propagation analysis; this task included the Logic De-Rating (LDR), the Electrical De-Rating (EDR, or Propagation Induced Pulse Broadening, PIPB) and the Temporal De-Rating (TDR). Finally, the third step computes the overall SET figures, taking into account the particularities of the circuit implementation and its environment.

The results indicate that the combinational cell SER is usually several times lower than that of same-library flip-flops and that the SET events are strongly de-rated by electrical and temporal factors, reducing their contribution to the overall SER. However, reliability engineers and designers need to be equipped with least-effort tools and methodologies in order to be prepared for future challenges.

As a further contribution, some of the factors influencing the SET SER were analyzed, including the cell drive strength, the output load capacitance and the cell supply voltage.

### 6.2 Single Event Analysis for Sequential Logic

The second contribution of this thesis (chapter 4) concerns the Single Event Upset (SEU) analysis.

Single Event Effects in sequential logic cells represent the current target for analysis and improvement efforts in both industry and academia.

Chapter 4 presents a state-aware analysis methodology that improves the accuracy of Soft Error Rate data for individual sequential instances based on the circuit and application: a set of requirements concerning the per-state SER analysis of Flip-Flops are presented. The detailed data is then used together with updated temporal de-rating techniques and state-dependent SER computations based on design state analysis. The overall methodology allows better, more accurate overall results, which are representative of the actual circuit application and usage.

Furthermore, the intrinsic imbalance between the SEU susceptibility of different flip-flop states is exploited, by evaluating the state probability of each sequential cell, to implement a low-cost SEU SER improvement strategy.

As a further contribution, an SET analysis of the sequential cells was performed. The interest of this effort is, first, to evaluate the SET susceptibility of the combinational stages (output buffers, internal inverters, etc.) of the sequential cell and, second, to investigate whether an SET is possible in the internal memorization loop when it is blocked (memorizing). The results obtained with the TFIT tool for a D-latch show a strong difference ($\sim 3.5X$) between the transparent state (clock high, more sensitive) and the latching state (clock low, less sensitive).

### 6.3 Derating Analysis of a Complex CPU

Chapter 5 presents the results of a comprehensive functional analysis of an industrial 32-bit CPU containing 244,083 flip-flops and 20.7 Mbit of SRAM. The analysis included the logic, memory and functional de-rating computation. Accelerated simulation techniques (probabilistic calculations, clustering, parallel simulations) have been proposed and evaluated in order to develop an industrial validation environment able to handle very complex circuits with reasonable effort (manpower and CPU time).

The LDR was calculated using a specific fault simulation technique called PPSFP, assuming uniform input probabilities, while the dynamic (simulation-based) analyses (MDR and FDR) were performed using three representative applications taken from the automotive set of the MiBench benchmark suite.

For each benchmark application, three fault injection campaigns were performed. An initial fault injection campaign allowed the calculation of an initial, design-wide, FDR factor. A second fault injection campaign, based on a clustering approach, provided a finer-grained (per instance) characterization of the design. Lastly, a final fault injection campaign enabled the evaluation of a hypothetical mitigation scenario implemented using the data calculated from the initial and the cluster fault injection campaigns.

This methodology [Evans 2013c] used for selective mitigation is based on coarse-grained fault injection: starting with the full set of flip-flops (i), they are grouped into clusters using static information (ii). Statistical fault injection is performed for each of the clusters to rank them based on their sensitivity (iii) and then a set of the most sensitive clusters is selected for mitigation (iv).

The results obtained allowed the development and evaluation of a hypothetical mitigation scenario that aims to significantly improve the reliability of the circuit at the lowest cost. They show that the failure rates, both SDC (Silent Data Corruption) and DUE (Detectable Uncorrectable Errors), can be significantly reduced by hardening a limited percentage of flip-flop instances. For the considered design, with \( \sim 250,000 \) flip-flops, running only 90,000 fault injection simulations showed that an SER reduction of \( \sim 5X \) (for both SDC and DUE) can be obtained by hardening \( \sim 10\% \) of the circuit flip-flops.

### 6.4 Summary

In the years to come, electronics will play an increasingly critical role in all aspects of everyday life. The evolution of technology will allow building more and more complex devices integrating increasingly complex functionalities. In many applications the highest level of reliability is essential and there is a great need for techniques to analyze complex systems and evaluate the impact of technology-level faults.

Analysis of complex systems requires adequate techniques in order to scale with the increasing complexity. The key contribution of this thesis is to provide industrial solutions and methodologies for terrestrial applications requiring ultimate reliability (telecommunications, automotive, medical devices, etc.), complementing previous work on soft errors that was traditionally oriented towards aerospace, nuclear and military applications. The methodologies, the algorithms and the CAD tools proposed and validated as part of the work are intended for industrial use and have been included in a commercial CAD framework that offers a complete solution for assessing the reliability of circuits and complex electronic systems.
Appendix A

Flip-Flop SEU Reduction through Minimization of the Temporal Vulnerability Factor (TVF)

This appendix presents the results of a collaborative research project between IROC Technologies and Intel Israel. The content is based on a paper which was presented at IOLTS 2015 [Evans 2015].

### A.1 Introduction

In large SoCs, managing the effects of soft-errors in flip-flops is essential. Numerous hardened flip-flop designs [Calin 1996, Lee 2010, Mitra 2007, Nicolaidis 2008, Omana 2010] exist and there are many approaches to selectively replace the most functionally critical flip-flops with hardened cells [Seshia 2007, Mirkhani 2005]. This approach is very effective, but it can only be used when a hardened flip-flop design is available. In some cases, such cells are not available and alternative approaches for mitigating flip-flop SER must be explored. It is well known that not all SEUs in flip-flops propagate due to logical and temporal masking effects. The logic function of a circuit is determined by the design requirements, so little can be done to modify the logical masking factor. However, flip-flop SEUs are most likely to propagate along paths where there is significant slack. By inserting padding on these paths, the temporal vulnerability factor (TVF) can be minimized, reducing the probability that SEUs are propagated.

In this appendix, we review how the TVF of a circuit can be modelled and we investigate the extent to which the TVF of a circuit can be modified in order to minimize the propagation of SEUs. Of course, the added delay must not impact the critical paths, meaning the setup constraints must be respected. The amount of padding to be added can be calculated using linear programming. The extra cost of the padding, in terms of area and power can be estimated and included as constraints in the optimization problem. The proposed technique has been applied to eight benchmark circuits.

### A.2 Overview of SEU Masking Factors

**SEU Temporal Vulnerability** It is well known that the majority of SEUs in flip-flops do not propagate. In a synchronous circuit, in order for a SEU to propagate, it must be sampled by the downstream sequential elements. This implies that the upset must occur sufficiently early in the clock period in order for the erroneous value to meet the setup time of one or more downstream flip-flops. In this work, we focus exclusively on synchronous circuits using master-slave flip-flops and make the assumption of a clock with a symmetric duty cycle, as shown in figure A.1. During the first half of the clock cycle (CLK=1), the master latch is opaque and thus susceptible to SEUs and during the second half of the clock period (CLK=0), the slave latch is opaque, and thus vulnerable.

An early, in-depth study of the temporal masking of SEUs is presented in [Seifert 2004]. In this work, the authors propose the notion of the Temporal Vulnerability Factor (TVF), which is a measure of the likelihood that an upset will be captured in a downstream flip-flop, based on temporal considerations. The TVF is shown to decrease linearly as the logic propagation time ($\Delta t_{prop}$) increases, as shown in equation A.1. The authors performed a series of circuit simulations using SPICE while injecting faults at different points in time through a full clock cycle. A graph of the observed results for latches and flip-flops is reproduced in figure A.2.

$$TVF \propto \frac{T_{clk} - (\Delta t_{prop} + \Delta t_{setup} + \Delta t_{clk})}{T_{clk}} \quad (A.1)$$

If the nominal SER value for the flip-flop is defined as the sum of the static SER of the master and slave latches ($SER_{FF} = SER_m + SER_s$), then it is clear that the TVF can never exceed 0.5, since only one of the master or slave latches is susceptible at any point in time. In some cases, the intrinsic SER of a flip-flop is expressed as $SER_{FF} = 0.5 \cdot SER_m + 0.5 \cdot SER_s$. With this definition, the TVF lies in the range 0 to 1. In fact, the results of the simulations in [Seifert 2004] show that when $\Delta t_{prop} \approx 0$, the TVF value is actually lower than 0.5, as seen in figure A.2. This is because there are intrinsic delays in the flip-flop (e.g. CLK $\rightarrow$ Q) as well as in the interconnect. This simple relationship is further complicated by the fact that when $\Delta t_{prop} \approx T_{clk}$, the clock nodes are highly susceptible to upsets which produce jitter and can provoke a downstream setup violation.

![Figure A.1: Master-Slave Flip-Flops](image)

![Figure A.2: Spice Simulation of SEUs in Latches and Flip-Flops [Seifert 2004]](image)
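
As a small numerical illustration of equation A.1, the sketch below evaluates the TVF for one path. It assumes a proportionality constant of 1 (which corresponds to the second definition of the flip-flop SER given above) and clamps the result to the interval [0, 1]; the function name and the timing values are hypothetical.

```python
def tvf(t_clk, t_prop, t_setup, t_clk_unc):
    """Equation A.1 with an assumed proportionality constant of 1.

    All arguments are in the same time unit (e.g. picoseconds).
    """
    window = t_clk - (t_prop + t_setup + t_clk_unc)
    return max(0.0, min(1.0, window / t_clk))

# A 1000 ps clock, 300 ps of logic, 50 ps setup and 50 ps clock uncertainty.
print(tvf(1000.0, 300.0, 50.0, 50.0))
# A path with no remaining margin masks every upset in this model.
print(tvf(1000.0, 950.0, 50.0, 50.0))
```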

In a real circuit, one flip-flop can feed multiple end-points through logic paths with different delays, as shown in figure A.3. Thus, depending on when an upset occurs, it may propagate to some, but not all, of the end-points. In order for a SEU to be fully blocked, it must occur sufficiently late in the cycle that it cannot propagate along even the shortest path. Thus, a safe value for the TVF, which bounds the probability that an upset propagates to at least one of the end-points, is given by equation A.2, where $\Delta t_{\text{prop}}^{i,j}$ is the propagation delay from start-point $i$ to end-point $j$.

$$TVF_{\text{min}}(i) \propto 1 - \min_j \left( \frac{\Delta t_{\text{prop}}^{i,j} + \Delta t_{\text{setup}} + \Delta t_{\text{clk}}}{T_{\text{clk}}} \right) \quad (A.2)$$

![Figure A.3: Multiple Downstream Paths](image)
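
Equation A.2 can be illustrated in the same way for a flip-flop that drives several end-points. The sketch assumes a proportionality constant of 1; the function name and the delays are hypothetical.

```python
def tvf_min(t_clk, path_delays, t_setup, t_clk_unc):
    """Equation A.2: the TVF determined by the shortest downstream path."""
    shortest = min(path_delays)
    return max(0.0, 1.0 - (shortest + t_setup + t_clk_unc) / t_clk)

# One start-point feeding two end-points through 300 ps and 700 ps of logic.
print(tvf_min(1000.0, [300.0, 700.0], 50.0, 50.0))
```

Only the shortest path matters here: padding the 700 ps path alone would leave this value unchanged.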

**SEU Logical Vulnerability**  A fault is logically masked if it cannot propagate through the gates in the combinatorial logic network based on their logic function. In processor applications, the notion of the architectural vulnerability factor (AVF) [Mukherjee 2008] is used to measure the probability that a fault propagates. In other work, this notion is referred to as the logical de-rating factor (LDR) [Nguyen 2005].

This probability depends on the input vectors or, for processors, on the application that is running. In this work, since the focus is on TVF, we assume that an LDR factor can be calculated for each flip-flop. In the experimental work, the LDR is calculated based on fault simulation as described in [Alexandrescu 2012]. The total SEU rate can then be estimated using equation A.3.

$$\text{SER}_{SEU}^{FF} = \sum_{i \in FF} (\text{SER}_i^{\text{nom}} \cdot TVF_{\text{min}}(i) \cdot LDR(i)) \quad (A.3)$$

**SET Contribution** The overall SER contribution from single event transients (SETs) is normally smaller than that from SEUs [Gill 2009], primarily due to the additional latch-window masking effect. The proposed approach to SER minimization consists of adding gates. In effect, we are trading a small increase in SET SER in exchange for additional masking of SEUs.

Many authors have proposed approaches for computing the overall effect of SETs incorporating all the masking effects (electrical, logical and latch-window) [Hayes 2007, Miskov-Zivanov 2006, Ramakrishnan 2008, Rajaraman 2006]. The focus in this work is to use a simple model for SETs that can: (i) account for the SET contribution in the original circuit and (ii) account for the increase in SETs as the result of the added delay gates. Therefore, we ignore the electrical masking and consider the logical and latch-window masking independently, as shown in equation A.4. This equation gives the rate at which SETs get sampled in a flip-flop. A further de-rating could be applied, since not every error that reaches a flip-flop propagates to a primary output.

\[
SER^{SET} = \sum_{i \in \text{gates}} \left( \sum_{pw \in \text{PW}} SER_i^{nom}(pw) \cdot \frac{pw}{T_{clk}} \cdot LDR(i) \right) \quad (A.4)
\]

\(SER_i^{nom}(pw)\) represents the FIT\(^2\) rate for transient pulses of width \(pw\) for the given gate. This sensitivity can be represented as a distribution where the FIT rate for transients within a set of discrete ranges can be determined by simulation [Costenaro 2013a], as illustrated in figure A.4 for two gates taken from the Nangate 45nm library [Nangate 2008]. The \(\frac{pw}{T_{clk}}\) factor represents the latch window masking effect. Finally, the \(LDR(i)\) factor represents the probability that a fault in a gate will propagate logically and can be calculated using fault simulation.

Figure A.4: SET Sensitivity of two 45nm Combinatorial Gates
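
The simplified SET model of equation A.4 can be evaluated with a few lines of code. In the sketch below, each gate carries a discrete pulse-width distribution (FIT per pulse-width bin) and an LDR value; the data layout and all numbers are hypothetical and only serve to show how the three factors combine.

```python
def set_ser(gates, t_clk):
    """Equation A.4: SET rate sampled by flip-flops, ignoring electrical masking.

    gates: list of dicts with 'fit_by_pw' (pulse width -> FIT) and 'ldr'.
    Pulse widths and t_clk must use the same time unit.
    """
    total = 0.0
    for gate in gates:
        for pw, fit in gate["fit_by_pw"].items():
            # pw / t_clk is the latch-window masking factor.
            total += fit * (pw / t_clk) * gate["ldr"]
    return total

# Made-up sensitivity data for two gates, pulse widths in picoseconds.
gates = [
    {"fit_by_pw": {100.0: 2e-4, 200.0: 1e-4}, "ldr": 0.5},
    {"fit_by_pw": {100.0: 1e-4}, "ldr": 1.0},
]
print(set_ser(gates, t_clk=1000.0))
```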

\(^1\)By reducing the timing margin on certain paths, it is also possible that there is a small increase in the risk of delay faults.

\(^2\)FIT = Failure in Time. One FIT is one failure in \(10^9\) operating hours.
### A.3 TVF Optimization

In [Bramnik 2013], it was shown that by increasing the logic propagation delay, the TVF, and thus the overall SER, can be reduced. This work, however, did not discuss how to select on which paths TVF should be optimized, nor did it quantitatively consider the resulting area and power overheads, nor did it consider the increase in combinatorial SER due to the SETs that may occur in the additional gates. In the current work, we show how linear programming can be used to apply the TVF optimization technique while still respecting bounds on the acceptable area and power overheads. Using a linear model is attractive since modern solvers can deal with a large number of variables. In [Alexandrescu 2014], linear programming was used effectively to select optimal ECCs for memories and here we expand this to TVF minimization.

**Problem Formulation** Starting with an original circuit, as shown in figure A.5(a), the problem consists of selecting the amount of extra delay that can be inserted either at the input or at the output of the flip-flops, as shown in figure A.5(b). The problem thus consists of selecting values for \( d_{1q} \ldots d_{nq} \) and \( d_{1d} \ldots d_{nd} \) subject to the correct constraints.

Using a static timing analysis (STA), the slack between any two flip-flops can be extracted. This array can be expressed as an N×N matrix where each entry indicates the slack from a given start point to a given end point, as shown in equation A.5 for the example circuit in figure A.5(a). Note that the slack time reported by the STA tool includes the time required to meet the setup constraints (\( \Delta t_{setup} \)) and also accounts for the uncertainty in the clock network (\( \Delta t_{clk} \)). The slack thus represents the total amount of additional delay that could potentially be added on the given path.

Note that, for simplicity, the timing paths from the primary inputs and to the primary outputs have been ignored in this work. Including such paths does not pose any particular difficulty, but would require knowledge of the timing constraints of the adjacent circuits.

\[
\begin{bmatrix}
0 & 0 & 0 & 0 & s_{15} & 0 \\
0 & 0 & 0 & 0 & s_{25} & 0 \\
0 & 0 & 0 & 0 & 0 & s_{36} \\
0 & 0 & 0 & 0 & 0 & s_{46} \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \quad (A.5)
\]

For the example circuit, the constraint that must be respected when selecting \(d_{1q}..d_{nq}\) and \(d_{1d}..d_{nd}\) is that the setup time must still be met. In other words, the added delay must not exceed the available slack, which can be expressed as:

\[
\begin{aligned}
d_{1q} + d_{5d} & \leq s_{15} \\
d_{2q} + d_{5d} & \leq s_{25} \\
d_{3q} + d_{6d} & \leq s_{36} \\
d_{4q} + d_{6d} & \leq s_{46}
\end{aligned} \quad (A.6)
\]

We propose to introduce padding by simply adding chains of buffers, either on the inputs or outputs of the flip-flops. However, we must pose a constraint on the delay variables \((d_{nq}, d_{nd})\) such that they must be either zero (no modification), or at least one minimal buffer delay. This type of constraint is supported by the solver [Berkelaar 2008] through the use of semi-continuous variables.
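
The constraints described above can be checked for a candidate padding assignment with a short routine. This is only a feasibility check, not the linear-programming solution itself; the slack values, the minimal buffer delay and the function name are hypothetical, and the path indices follow the example circuit of figure A.5(a).

```python
def delays_feasible(slack, d_q, d_d, min_buf_delay):
    """Check the setup constraints (A.6) and the semi-continuous condition.

    slack: dict mapping (start, end) to the available slack s_ij.
    d_q, d_d: dicts mapping a flip-flop index to the padding added at its
    output (q) or input (d); missing entries mean no padding.
    """
    for d in list(d_q.values()) + list(d_d.values()):
        # Each delay is either zero or at least one minimal buffer delay.
        if d < 0 or 0 < d < min_buf_delay:
            return False
    return all(d_q.get(i, 0.0) + d_d.get(j, 0.0) <= s
               for (i, j), s in slack.items())

# Made-up slack values in picoseconds for the four paths of the example.
slack = {(1, 5): 200.0, (2, 5): 120.0, (3, 6): 300.0, (4, 6): 80.0}
print(delays_feasible(slack, {1: 100.0, 3: 200.0}, {5: 100.0}, 50.0))
print(delays_feasible(slack, {1: 100.0}, {5: 150.0}, 50.0))
```

Padding placed at the input of flip-flop 5 consumes slack on both paths that end there, which is why the second assignment violates the constraint on the path from flip-flop 1.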

**Cost Function** The equation for TVF based on the fastest path (equation A.2) is not linear and thus cannot be directly used as a cost function. When a flip-flop has paths with multiple end points, adding padding such that the propagation is prevented on some paths can still reduce the risk that a SEU produces an error, even if the fault still propagates to some endpoints. Therefore, we use an average TVF along all the paths, as shown in equation A.7.

\[
TVF_{avg}(i) \propto 1 - \underset{j}{\operatorname{avg}} \left( \frac{\Delta t_{prop}^{i,j} + \Delta t_{setup} + \Delta t_{clk}}{T_{clk}} \right) \quad (A.7)
\]

Building on the equation for average TVF, the sum of the SEU and SET SER of a circuit can be expressed as shown in equation A.8. In this equation, the \(TVF_{avg}(i)\) for flip-flops has a linear relationship with the variables for the added delay \((d_{1q}..d_{nq}, d_{1d}..d_{nd})\).

The number of buffers that are actually inserted must be an integer but depends linearly on the amount of delay. These additional buffers contribute to the second term which represents the combinatorial SER. The intrinsic SET sensitivity of the buffer \((SER_{i}^{nom}(pw))\) is fixed, based on the type of gate. For each buffer inserted at the output of a flip-flop, the LDR for SETs in that buffer is identical to the LDR of that flip-flop, as the SET would have to propagate through the same downstream logic. For buffers added at the input of a flip-flop, the LDR is 1.0, as there is no intervening logic. In this way, the effect of the added delay can be modelled as having a linear effect on the overall circuit SER.

There are limitations on the accuracy of this linear model but it makes it possible to quickly optimize the amount of inserted delay, by preferentially placing it where errors are most likely to propagate, while still respecting the setup constraints.

\[
\begin{aligned}
SER(d_{1q}..d_{nq}, d_{1d}..d_{nd}) = {} & \sum_{i \in FFs} \left( SER_i^{nom} \cdot TVF_{avg}(i) \cdot LDR(i) \right) \\
& + \sum_{i \in gates} \left( \sum_{pw \in PW} SER_i^{nom}(pw) \cdot \frac{pw}{T_{clk}} \cdot LDR(i) \right)
\end{aligned}
\] (A.8)
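As a rough illustration of how the two sums of the SER expression combine, the sketch below evaluates them for a toy circuit. The dictionary layout, the function name `circuit_ser` and all numeric values are assumptions made for the example, not data from this work.

```python
def circuit_ser(flip_flops, gates, t_clk):
    """Sum of the flip-flop (SEU) and combinatorial (SET) SER terms."""
    # First sum: nominal SER of each flip-flop, de-rated by TVF and LDR.
    seu = sum(ff["ser_nom"] * ff["tvf_avg"] * ff["ldr"] for ff in flip_flops)
    # Second sum: for each gate and pulse width pw, the nominal SET SER
    # scaled by the latching window pw / t_clk and by the gate LDR.
    set_ser = sum(
        ser_pw * (pw / t_clk) * gate["ldr"]
        for gate in gates
        for pw, ser_pw in gate["ser_nom_by_pw"].items()
    )
    return seu + set_ser


# Hypothetical circuit: one flip-flop and one buffer, 1 ns clock period.
ffs = [{"ser_nom": 400.0, "tvf_avg": 0.5, "ldr": 0.2}]
bufs = [{"ldr": 0.5, "ser_nom_by_pw": {0.1: 10.0, 0.2: 5.0}}]
total = circuit_ser(ffs, bufs, t_clk=1.0)
```

In this toy case the flip-flop contributes 40 and the buffer 1, so the added buffer costs little compared with the SEU term it helps to reduce.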

In this section, we have identified a cost function which expresses the overall SER as a function of the average TVF which itself depends on the propagation delays between flip-flops. This cost function also accounts for the added SET sensitivity due to the additional buffers. This function can be minimized, subject to the setup time constraints which have been expressed in terms of the available slack reported by a STA tool. In the following section, we develop additional constraints which account for the area and power overhead associated with the additional padding delay.

### A.4 Area and Power Constraints

**Area Cost Constraints** It is assumed that the incremental delay of a single buffer (BUF_X1) is fixed (≈50 ps) based on a standard loading of 0.48 fF. Only an integer number of buffers can be inserted, thus the actual delay may be slightly lower than the solution from the linear solver, which uses continuous variables. The total area overhead is estimated by multiplying the number of added buffers by the area of a single BUF_X1 cell (0.798 $\mu$m$^2$) and does not account for the impact of additional routing.
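The rounding and area estimate described above can be sketched as follows. The constants are the values quoted in the text; the function names and the example delays are hypothetical, and rounding down is an assumption based on the remark that the actual delay may be slightly lower than the solver's solution.

```python
import math

BUF_DELAY_PS = 50.0   # approximate incremental delay of one BUF_X1 (from the text)
BUF_X1_AREA = 0.798   # area of one BUF_X1 cell (from the text)


def buffers_for_delay(delay_ps):
    """Whole number of buffers that fits in a continuous delay value."""
    # Assumption: round down, so the inserted delay never exceeds the solution.
    # The small epsilon guards against values such as 149.9999999.
    return int(math.floor(delay_ps / BUF_DELAY_PS + 1e-9))


def area_overhead(delays_ps):
    """Estimated cell area of all inserted buffers (routing ignored)."""
    return sum(buffers_for_delay(d) for d in delays_ps) * BUF_X1_AREA


# Hypothetical solver output: one continuous delay (in ps) per padding point.
solution = [0.0, 50.0, 120.0, 260.0]
overhead = area_overhead(solution)
```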

**Power Cost Constraint** From fault simulations, the switching activity factor ($AF_i$) for each flip-flop is known. The total power for the added gates can be expressed as the sum of a fixed static component ($P_{\text{static}}$) plus a dynamic component that depends on the frequency of the circuit and the activity factor of the signal, as shown in equation A.9.

\[
P = \sum_{i \in BUF} \left( P_{\text{static}} + AF_i \cdot FREQ \cdot P_{\text{dyn}} \right)
\] (A.9)
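A minimal sketch of the power estimate, assuming each added buffer is described only by the activity factor of the signal it sits on. The function name and the numeric values and units below are hypothetical.

```python
def added_buffer_power(activity_factors, freq, p_static, p_dyn):
    """Total power of the added buffers: static plus activity-scaled dynamic."""
    return sum(p_static + af * freq * p_dyn for af in activity_factors)


# Hypothetical example: two buffers at 1 GHz, static power of 1 nW per buffer
# and a dynamic cost of 1 fJ per transition.
power = added_buffer_power([0.1, 0.5], freq=1e9, p_static=1e-9, p_dyn=1e-15)
```

With these assumed values the dynamic component dominates, so buffers placed on low-activity signals are much cheaper in power than buffers on frequently toggling ones.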

### A.5 Experimental Results

**Original Circuits** Six benchmark circuits from the ISCAS and ITC [Corno 2000] suites and two arithmetic and cryptographic circuits from Opencores [Opencores 2013] were selected, covering different design sizes and operating frequencies. These were synthesized into the Nangate 45nm library [Nangate 2008] based on typical process and the clock frequency was maximized (zero slack), so that the clock period was appropriate based on the number of layers of logic. In practice, some positive timing margin is required to allow for variability.

The SER of all gates has been determined using simulation techniques described in [Costenaro 2013a], taking into account the neutron contribution (JEDEC) and the alpha contribution for an alpha emissivity of $0.001\ \alpha/cm^2/hour$. Using fast parallel fault simulation, the LDR for flip-flops and combinatorial gates was simulated for uniform random input vectors based on 100 000 fault injections per flip-flop or gate. Then, using equation A.3, the initial SER of each of these circuits was calculated. The results prior to mitigation are shown in table A.1.

During this analysis, only the internal timing paths from flip-flops to flip-flops were considered. The paths to and from primary inputs and outputs were ignored. Nothing in the proposed approach precludes considering external paths but this would require making assumptions on the input and output timing constraints.

The FIT rates are reported in units of $\mu$FIT since the total FIT contribution of such tiny circuits is very small. The column labelled Eff SEU SER shows the SEU FIT rate after the TVF and logical de-rating have been applied.

<table>
<thead>
<tr>
<th>Circuit</th>
<th>Num FFs</th>
<th>Num Gates</th>
<th>Period (ns)</th>
<th>Num Paths</th>
<th>Raw SEU SER ($\mu$FIT)</th>
<th>Eff SEU SER ($\mu$FIT)</th>
</tr>
</thead>
<tbody>
<tr>
<td>s1488</td>
<td>5</td>
<td>379</td>
<td>0.72</td>
<td>78</td>
<td>2034</td>
<td>254</td>
</tr>
<tr>
<td>b06</td>
<td>9</td>
<td>47</td>
<td>0.37</td>
<td>63</td>
<td>3195</td>
<td>341</td>
</tr>
<tr>
<td>s400</td>
<td>21</td>
<td>114</td>
<td>0.5</td>
<td>311</td>
<td>7455</td>
<td>1308</td>
</tr>
<tr>
<td>b05</td>
<td>34</td>
<td>421</td>
<td>1.37</td>
<td>1064</td>
<td>12070</td>
<td>4663</td>
</tr>
<tr>
<td>b07</td>
<td>49</td>
<td>397</td>
<td>0.51</td>
<td>1559</td>
<td>17395</td>
<td>2370</td>
</tr>
<tr>
<td>s9234</td>
<td>145</td>
<td>701</td>
<td>0.76</td>
<td>3754</td>
<td>51475</td>
<td>19951</td>
</tr>
<tr>
<td>or1200</td>
<td>214</td>
<td>5192</td>
<td>5.86</td>
<td>13716</td>
<td>75970</td>
<td>25488</td>
</tr>
<tr>
<td>aes_core</td>
<td>554</td>
<td>12252</td>
<td>1.09</td>
<td>14986</td>
<td>196670</td>
<td>66637</td>
</tr>
</tbody>
</table>
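As a small worked example, the ratio between the effective and raw SEU SER columns gives the combined de-rating (TVF and logical) of each circuit before mitigation. The numbers below are copied from the table; the variable names are illustrative.

```python
# (raw SEU SER, effective SEU SER) in microFIT, copied from the table above.
TABLE_A1 = {
    "s1488": (2034, 254),
    "b06": (3195, 341),
    "s400": (7455, 1308),
    "b05": (12070, 4663),
    "b07": (17395, 2370),
    "s9234": (51475, 19951),
    "or1200": (75970, 25488),
    "aes_core": (196670, 66637),
}

# Fraction of the raw SEU SER that remains after TVF and logical de-rating.
remaining = {name: eff / raw for name, (raw, eff) in TABLE_A1.items()}

for name, frac in remaining.items():
    print(f"{name:9s} {100 * frac:5.1f}% of raw SEU SER remains")
```

The smaller circuits already retain only about 10 to 20% of their raw SEU SER, whereas b05, s9234, or1200 and aes_core retain roughly a third or more.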

The slack histograms for two of the circuits are shown in figure A.6. We see that in the b06 circuit (figure A.6(a)), there are no paths with large slack, thus intuitively, we expect that by adding a small amount of padding on the remaining paths, the TVF can be made very low. On the other hand, in the OR1200, there are many paths with large slacks. It is thus clear that a huge number of gates are required to pad all these paths, especially given the longer clock period (5.86 ns).

**Unconstrained Results** The solver was initially run with no area or power constraints. In this case, the paths were padded to the maximum amount possible. The results, in terms of SER reduction and increased area are shown in table A.2.
We see that the results are highly variable. Small circuits such as b06 which has a very short clock period and a tight slack histogram (figure A.6(a)), show significant SER reduction by adding just a few gates. Larger circuits with a longer clock period, such as OR1200, require significantly more area to achieve the maximal SER reduction.

In the absence of constraints, the solver adds padding until the paths are maximally long. Because one flip-flop can feed multiple end-points and these end-points may share common start points, it is not possible to make the slack zero on all paths. Consider the example shown in figure A.7. Let us assume that the paths from FF1 → FF4 and FF3 → FF5 have zero slack. Then, even though the paths starting from FF2 have slack, it is not possible to pad either the start or end points. This limitation comes from the fact that we only allow padding at the flip-flops; in theory, the intervening logic could be modified to have zero slack.

Note that when the SER of the circuits was re-evaluated after the padding, the TVF calculation was performed based on equation A.2, using the path with the minimum propagation delay. It is possible that the approximation of the TVF used as the cost function (equation A.7) results in a slightly non-optimal solution. However, the reported reduction does indeed represent a valid solution. Furthermore, the additional SET contribution has been included, thus the reported value represents a true net reduction.

**Area and Power Constrained Results** To force the solver to find the best SER reduction within a given area or power budget ($A_{max}$ or $P_{max}$), an additional constraint was added to the system. The system was repeatedly solved for increasing values of $A_{max}$ or $P_{max}$ and the results are shown in figures A.8 and A.9. The vertical scale shows the net SER reduction. This accounts for the combinatorial SER increase due to the added gates, thus the SER reduction in these graphs is lower than that seen in table A.2.

For the s400 circuit, a reduction of about 40% in overall SER was possible with a 5% area overhead. Beyond this point, the added combinatorial SER from the additional buffers starts to offset the benefit from reduced SEU SER. For larger circuits (or1200), the SEU SER could be driven down further, with additional area cost.

### A.6 Conclusions

In this work, we have studied the extent to which the TVF of a circuit can be intentionally reduced in order to minimize the flip-flop SER. The analysis considered varied benchmark circuits and took into account the additional combinatorial SER that results from the added logic gates. The proposed modifications do not impact the critical timing paths.

It is well known that adding logic for delay padding is costly, as seen when fixing hold time violations. As a result, the proposed technique for SER reduction is inherently less effective than approaches based on selective replacement with hardened flip-flops [Evans 2013c, Ebrahimi 2014] where it is common to achieve a 5x SER reduction by replacing fewer than 5% of the flip-flops.

The proposed technique might apply well in flash-based FPGAs, where the configuration storage bits are already quite robust to radiation effects. For space applications, common practice is to triplicate flip-flops, which is very costly. Instead, unused interconnect resources could be used to insert padding, reducing the TVF and thus minimizing flip-flop SER. In FPGAs, there is no area penalty associated with the use of such unused resources, although it still impacts dynamic power.

In this work, we considered a single, typical timing condition. In terrestrial applications, what is important for SER analysis is the average value of TVF across a large population of devices, thus evaluating TVF based on typical timing is reasonable. When adaptive voltage scaling (AVS) is used, the average timing slack in the population is reduced and the typical timing may yield a conservative estimate of the average TVF.

The approach to inserting delay proposed in this work consisted simply of adding a chain of minimum sized buffers, which results in increased combinatorial SER. Instead, a glitch filter of the type shown in figure A.10 can be used. Such a circuit has the advantage that it adds delay while also reducing combinatorial SER. Future work will look at using optimization techniques to selectively insert glitch filters to simultaneously minimize TVF and reduce SETs.

Figure A.9: Power Constrained TVF Minimization

![Power Constrained TVF Minimization](image)

We further note that in this contribution, we formulated the problem based on inserting delay at the input or the output of flip-flops, in order to reduce the size of the solution space. In some cases, it is more effective to insert delay in the middle of the combinatorial logic (see figure A.3). Future work will look at how the optimization problem can be generalized to consider a broader solution space.

In summary, optimization techniques to reduce SER subject to constraints on acceptable penalties are attractive. The proposed technique provides the designer with an additional tool for SER minimization when hardened flip-flops are not available.


Bibliography


IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE), 2010. (Cited on page 68.)




Techniques for the evaluation and the improvement of emergent technologies’ behaviour facing random errors

Abstract: The main objective of this thesis is to develop analysis and mitigation techniques that can be used to counter the effects of radiation-induced soft errors: external and internal disturbances produced by radioactive particles, which affect the reliability and operational safety of complex microelectronic circuits. This thesis aims to provide industrial solutions and methodologies for terrestrial applications requiring ultimate reliability (telecommunications, medical devices, ...), complementing previous work on soft errors, which has traditionally been oriented toward aerospace, nuclear and military applications.

The work presented uses a decomposition of the error sources, inside the current circuits, to highlight the most important contributors.

Single Event Effects in sequential logic cells represent the current target for analysis and improvement efforts in both industry and academia. This thesis presents a state-aware analysis methodology that improves the accuracy of Soft Error Rate data for individual sequential instances based on the circuit and application. Furthermore, the intrinsic imbalance between the SEU susceptibility of different flip-flop states is exploited to implement a low-cost SER improvement strategy.

Single Event Transients affecting combinational logic are considerably more difficult to model, simulate and analyze than the closely-related Single Event Upsets. The working environment may cause a myriad of distinctive transient pulses in various cell types that are used in widely different configurations. This thesis presents a practical approach to a possible exhaustive Single Event Transient evaluation flow in an industrial setting. The main steps of this process are to: a) fully characterize the standard cell library using a process- and library-aware SER tool, b) evaluate SET effects in the logic networks of the circuit using a variety of dynamic (simulation-based) and static (probabilistic) methods and c) compute overall SET figures taking into account the particularities of the implementation of the circuit and its environment.

Fault-injection remains the primary method for analyzing the effects of soft errors. This document presents the results of functional analysis of a complex CPU. Three representative benchmarks were considered for this analysis. Accelerated simulation techniques (probabilistic calculations, clustering, parallel simulations) have been proposed and evaluated in order to develop an industrial validation environment, able to take into account very complex circuits. The results obtained allowed the development and evaluation of a hypothetical mitigation scenario that aims to significantly improve the reliability of the circuit at the lowest cost. The results obtained show that the error rate, SDC (Silent Data Corruption) and DUE (Detectable Uncorrectable Errors) can be significantly reduced by hardening a small part of the circuit (Selective mitigation).

In addition to the main axis of research, some tangential topics were studied in collaboration with other teams. One of these consisted of the study of a technique for the mitigation of flip-flop soft errors through an optimization of the Temporal De-Rating (TDR) by selectively inserting delay on the input or output of flip-flops.

The methodologies, algorithms and CAD tools proposed and validated as part of this work are intended for industrial use and have been included in a commercial CAD framework that offers a complete solution for assessing the reliability of complex electronic circuits and systems.

**Keywords:** Single-Event Effects, Single-Event Upsets, Single-Event Transients, Soft Errors, Fault-Injection, Selective Mitigation

**ISBN:** 978-2-11-129206-2

Techniques pour l’évaluation et l’amélioration du comportement des technologies émergentes face aux fautes aléatoires

Abstract: The main objective of this thesis is to develop analysis and mitigation techniques able to counter the effects of Single Event Effects: external and internal disturbances produced by radioactive particles, which affect the reliability and operational safety of complex microelectronic circuits. This thesis aims to offer industrial solutions and methodologies for terrestrial application domains requiring ultimate reliability (telecommunications, medical devices, ...), complementing previous work on soft errors, which was traditionally oriented toward aerospace, nuclear and military applications.

The work presented uses a decomposition of the error sources in current circuits, aiming to highlight the most important contributors.

Upsets (SEUs), the Single Event Effects in sequential logic cells, are currently the main target of analysis and improvement efforts in both industry and academia. This thesis presents an analysis methodology that takes into account the sensitivity of each logic state of a cell (state-awareness), an approach that considerably improves the accuracy of the event-rate results for individual sequential instances. In addition, the intrinsic imbalance between the susceptibility of the different flip-flop states is exploited to implement a very low-cost SER improvement strategy.

Transient faults (SETs) affecting combinational logic are much more difficult to model, simulate and analyze than SEUs. The radiation environment can cause a multitude of transient pulses in the various cell types, which are used in multiple configurations. This thesis presents a practical approach to SET analysis, applicable to very complex industrial circuits. The main steps of this process are: a) fully characterizing the standard cell library, b) evaluating SETs in the logic networks of the circuit using static and dynamic methods and c) computing the overall SET rate, taking into account the particularities of the implementation of the circuit and its environment.

Fault injection remains the main analysis method for studying the impact of the faults, errors and malfunctions caused by single events. This document presents the results of a functional analysis of a complex processor in the presence of faults, for a selection of representative applications (benchmarks). Simulation acceleration techniques (probabilistic calculations, clustering, parallel simulations) were proposed and evaluated in order to build an industrial validation environment able to handle very complex circuits. The results obtained enabled the development and evaluation of a hypothetical mitigation scenario that aims to significantly improve the reliability of the circuit under test at the lowest cost. The results show that the SDC (Silent Data Corruption) and DUE (Detectable Uncorrectable Errors) error rates can be considerably reduced by hardening a small part of the circuit (selective protection).

Other specific techniques were also deployed: mitigation of the flip-flop soft-error rate through an optimization of the Temporal De-Rating, by selectively inserting delay on the input or output of flip-flops, and biasing of the circuit to favor the less sensitive states.

The methodologies, algorithms and CAD tools proposed and validated in this work are intended for industrial use and have been integrated into a commercial CAD platform that aims to offer a complete solution for assessing the reliability of complex electronic circuits and systems.

**Keywords:** Single-Event Effects, Single-Event Upsets, Single-Event Transients, Soft Errors, Fault Injection, Selective Protection

**ISBN:** 978-2-11-129206-2

French Summary (Résumé en Français)

Contents

1.1 Introduction

1.1.1 Radiation Environments

1.1.1.1 The Space Environment

1.1.1.2 The Atmospheric Environment

1.1.1.3 Alpha Particles

1.1.2 Single Event Effects

1.2 Analysis of Single Event Effects

1.2.1 Technology Characterization

1.2.2 De-Rating Effects

1.2.2.1 Electrical De-Rating

1.2.2.2 Logical De-Rating

1.2.2.3 Temporal De-Rating

1.2.2.4 Functional De-Rating

1.2.2.5 Memory De-Rating

1.2.3 Computing the Overall Error Rate of a Circuit

1.3 Analysis of Single Event Transients

1.4 Analysis of Single Event Effects in Sequential Logic

1.5 De-Rating Analysis of a Microprocessor

1.6 Appendix: Flip-Flop SEU Reduction through Minimization of the Temporal Vulnerability Factor (TVF)

1.1 Introduction

1.1.1 Radiation Environments

1.1.1.1 The Space Environment

The space environment (Figure 1.1) is a rather hostile environment for electronic equipment, since three sources of radiation are encountered there:

1. Particles trapped by the Earth's magnetic field (the radiation belts)
   - High-energy particles (cosmic rays and solar particles) that are not deflected by the Earth's magnetic field and end up trapped by it.
2. Galactic cosmic rays (GCR)
   - High-energy particles originating from outside the solar system
3. Solar particles
   - High-energy particles (mainly protons and alpha particles) ejected by the Sun (solar wind, solar flares, ...).

Figure 1.1: The space environment

1.1.1.2 The Atmospheric Environment

The atmospheric (or terrestrial) environment is considerably less aggressive than the space environment. A large fraction of the cosmic rays and solar particles is filtered out by the Earth's magnetic field (Figure ??). The remaining primary particles then interact with the atoms of the atmosphere (Figure ??); these interactions produce a multitude of secondary particles (neutrons, protons, muons, pions, electrons, ...), which are predominant at aircraft flight altitudes but can also reach the ground.

1.1.1.3 Alpha Particles

Another source of ionizing particles encountered at ground level is the radioactive decay of heavy atoms, which are present in very small quantities in all materials (used for packaging integrated circuits). The radioactive decay (Figure 1.3) of these atoms can in particular emit ionizing particles, such as alpha particles, that can cause errors.

1.1.2 Single Event Effects

Single Event Effects (SEE) arise from the interaction of an energetic particle with a microelectronic circuit (Figure 1.4). Ionizing particles or radiation can strip or excite the electrons of the atoms of the material they pass through. This disturbance can alter the operation of an electronic circuit in an unexpected way.

Single event effects can produce permanent, irreversible effects or reversible, non-destructive effects. Non-destructive effects include:

- Single Event Transient (SET) or Single Event Multiple Transient (SEMT)
- Single Event Upset (SEU) or Single Event Multiple Upset (SEMU)
- Single Bit Upset (SBU), Multi Cell Upset (MCU) and Multi Bit Upset (MBU) for memories
- Single Event Functional Interrupt (SEFI)
- Single Event Latchup (SEL)

Permanent effects include:

- Single Event Burnout (SEB)
- Single Event induced Snapback (SES)
- Single Event Gate Rupture (SEGR)

Figure 1.3: Radioactive decays

Figure 1.4: Particle interaction mechanisms: direct and indirect ionization

1.2 Analysis of Single Event Effects

Analyzing the impact of faults induced by soft errors (SE) in integrated circuits remains difficult. The vast majority of faults do not propagate (they are masked) because of de-rating effects. The following section presents the three main steps of a soft error analysis methodology: technology characterization, de-rating effects and, finally, the computation of the overall error rate of a circuit.

1.2.1 Technology Characterization

SER characterization of the technology is the first step of the SER analysis method presented. The raw SER data should be provided in terms of probability or rate of occurrence for a specific environment.

1.2.2 De-Rating Effects

1.2.2.1 Electrical De-Rating

Electrical de-rating quantifies the electrical attenuation of a SET and hence its ability to propagate through the combinational network.

Figure 1.5: Electrical De-Rating

1.2.2.2 Logical De-Rating

Logical de-rating consists of evaluating the propagation of the logic error from the output of the affected cell to the inputs of a sequential or memory cell. Depending on the state of the circuit (the values of the signals and cell outputs), the propagation of the error may be blocked by the logic. This is the case, for example, of a fault that propagates to input A of an AND cell. If the value of the signal on input B is 0, the output of the gate will be 0 regardless of the value of the signal on input A.
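The AND-gate example above can be checked exhaustively with a few lines of Python. This is only an illustrative sketch and the function names are hypothetical.

```python
def and_gate(a, b):
    """Two-input AND on single-bit values (0 or 1)."""
    return a & b


def fault_on_a_propagates(a, b):
    """True if flipping input A changes the output of the AND gate."""
    return and_gate(a, b) != and_gate(a ^ 1, b)


# Enumerate all input states: the fault is masked whenever B is 0.
for a in (0, 1):
    for b in (0, 1):
        print(f"A={a} B={b} propagates={fault_on_a_propagates(a, b)}")
```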

![Figure 1.6: Logical De-Rating](image)

1.2.2.3 Temporal De-Rating

Temporal de-rating is tied to the concept of the temporal window of opportunity for a fault (SET or SEU) to be latched in a sequential or memory cell. If the fault is not sampled in a register or memory cell, it has no impact on the operation of the circuit. The concept of temporal de-rating differs according to the type of fault considered:

- A SEU must occur in the affected register early enough in the clock period to propagate through the logic network and reach the next sequential stage.

- A SET must cause an incorrect value on the input of a sequential or memory cell during the sampling window.

1.2.2.4 Functional De-Rating

Functional de-rating evaluates whether a soft error has a visible impact on the operation of the circuit, a board or a system. It takes into account the actual use of the circuit and the operation of the system.

1.2.2.5 Memory De-Rating

Memory de-rating represents the fraction of time during which the data stored in a memory will eventually be read and therefore used by the application. This metric is called the vulnerability window, and it corresponds to the time between a write access to an address and the last read access to that address before the end of the simulation or before another write access to that address.
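The vulnerability window defined above can be computed from a memory access trace. The sketch below is a minimal illustration; the trace format `(time, operation, address)`, the function name and the example trace are assumptions, not the format used by the tools in this work.

```python
def vulnerable_time(accesses, addr):
    """Total time during which data stored at addr will still be read.

    accesses: iterable of (time, op, address) with op equal to "W" or "R".
    Each window runs from a write to the last read of the same address
    that occurs before the next write or the end of the trace.
    """
    total = 0
    write_t = None       # time of the most recent write
    last_read_t = None   # time of the last read since that write
    for t, op, a in sorted(accesses):
        if a != addr:
            continue
        if op == "W":
            # A new write closes the previous window, if it was ever read.
            if write_t is not None and last_read_t is not None:
                total += last_read_t - write_t
            write_t, last_read_t = t, None
        elif op == "R" and write_t is not None:
            last_read_t = t
    # Close the window that is still open at the end of the trace.
    if write_t is not None and last_read_t is not None:
        total += last_read_t - write_t
    return total


# Hypothetical trace for one address over a 20-cycle simulation.
trace = [(0, "W", 0x10), (4, "R", 0x10), (7, "R", 0x10),
         (9, "W", 0x10), (12, "R", 0x10), (15, "W", 0x10)]
mdr = vulnerable_time(trace, 0x10) / 20
```

In this trace the data is vulnerable from cycle 0 to 7 and from cycle 9 to 12; the value written at cycle 15 is never read, so it contributes nothing.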

1.2.3 Computing the Overall Error Rate of a Circuit

The computation of the overall error rate of a circuit combines the technology sensitivity data (technology characterization) with information about fault propagation and the effects of the faults on the application.

1.3 Analysis of Single Event Transients

Single event transients (SETs) affecting combinational logic are much more difficult to model, simulate and analyze than single event upsets (SEUs). The working environment can cause a multitude of distinct transient pulses in various cell types that are used in widely different configurations.

This chapter presents a detailed SET analysis flow in an industrial setting. The analysis method consists of three main steps:

1. The first step is the characterization of the technology
   - The TFIT tool was used to characterize the Nangate 45nm Open-Cell library. For each cell, the effects of neutrons and alpha particles were studied.

2. The second step is the analysis of SET propagation
   - The circuit considered for the analysis is a floating-point (FP) multiplier, synthesized with the Nangate 45nm Open-Cell library.

3. Finally, the overall SET SER is computed, taking into account the particularities of the implementation of the circuit and its environment.

The results indicate that the SET SER (combinational logic) is usually several times lower than the SEU SER (sequential logic), because of very strong temporal and electrical de-rating.

In addition, as a further contribution, this chapter presents the results of a parametric analysis of the SET SER, highlighting the factors that influence SET sensitivity:

- Cell Drive Strength: the overall SET sensitivity decreases as the drive strength increases. For non-inverting logic gates (buffer, AND, OR, ...), the SET sensitivity decreases for short pulses but increases for long ones (Figures 1.9 and 1.10).

- Output Load Capacitance: the overall SET sensitivity decreases as the output load increases for short pulses and remains constant for long ones (Figure 1.11).

- Supply Voltage: the overall SET sensitivity decreases as the supply voltage increases for long pulses and remains constant for short ones (Figure 1.12).

Figure 1.9: SET SER vs. Drive Strength for INV from X1 to X32

Figure 1.10: SET SER vs. Drive Strength for BUF from X1 to X32

Figure 1.11: SET SER vs. Output Load Capacitance for MUX2

Figure 1.12: SET SER vs. Supply Voltage for MUX2

1.4 Analysis of Single Event Effects in Sequential Logic

Upsets (SEUs), the single event effects specific to sequential logic, are currently the main target of analysis and improvement efforts in both industry and academia.

This section presents an analysis methodology that takes into account the sensitivity of each logic state of a cell (state-awareness), an approach that considerably improves the accuracy of the event-rate results for individual sequential instances. This section shows that using an average SER value (over the 8 states) can introduce a considerable error in the computation of the overall SER of a circuit. For the case study considered (an industrial microprocessor with ∼190,000 flip-flops), using an average SEU SER introduces an error of almost 10% (an underestimate) in the overall SER of the circuit.

The more detailed SER data were then used to propose new techniques for computing the TDR of master-slave flip-flops.

The intrinsic imbalance between the SEU susceptibility of the different flip-flop states is exploited to implement a low-cost SEU SER improvement strategy (Figure 1.13). This technique was evaluated on an industrial microprocessor (∼190,000 flip-flops, ∼1,200,000 combinational gates); the state probability of each sequential cell was obtained by simulation. The SEU SER can be reduced by ∼20% (from 76.1 FIT to 61.4 FIT) (Tables 1.1 and 1.2) by adding ∼70,000 inverters, which represent less than 1% area overhead.
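The state-inversion idea can be sketched as a per-flip-flop decision. The model below is an assumption made for illustration: it treats the SER of a flip-flop as the state-probability-weighted sum of two per-state SER values and compares the direct and inverted polarities. The function names and the per-state values are hypothetical; only the 76.1 FIT and 61.4 FIT totals come from the text.

```python
def flip_flop_ser(p_zero, ser_at_0, ser_at_1):
    """State-weighted SER of one flip-flop (assumed two-state model)."""
    return p_zero * ser_at_0 + (1.0 - p_zero) * ser_at_1


def best_polarity(p_zero, ser_at_0, ser_at_1):
    """Pick the storage polarity with the lower state-weighted SER."""
    direct = flip_flop_ser(p_zero, ser_at_0, ser_at_1)
    # Inverting the stored value swaps the time spent in each state.
    inverted = flip_flop_ser(1.0 - p_zero, ser_at_0, ser_at_1)
    return ("inverted", inverted) if inverted < direct else ("direct", direct)


# Hypothetical flip-flop that stores "0" 90% of the time and is twice as
# sensitive in that state.
choice, ser = best_polarity(p_zero=0.9, ser_at_0=2.0, ser_at_1=1.0)

# Relative reduction reported for the microprocessor.
reduction = (76.1 - 61.4) / 76.1
```

The reported totals correspond to a reduction of about 19.3%, consistent with the ∼20% quoted above.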

Table 1.1: Microprocessor SER before protection

<table>
<thead>
<tr>
<th>% of time storing &quot;0&quot;</th>
<th>Instance Count</th>
<th>SER @ 0</th>
<th>SER @ 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0%</td>
<td>31697</td>
<td>15.6</td>
<td>0</td>
</tr>
<tr>
<td>5%</td>
<td>3872</td>
<td>1.8</td>
<td>0</td>
</tr>
<tr>
<td>10%</td>
<td>4451</td>
<td>2</td>
<td>0.1</td>
</tr>
<tr>
<td>95%</td>
<td>4029</td>
<td>0.1</td>
<td>0.9</td>
</tr>
<tr>
<td>100%</td>
<td>31732</td>
<td>0</td>
<td>7.4</td>
</tr>
<tr>
<td><strong>TOTAL SER</strong></td>
<td><strong>58.4</strong></td>
<td><strong>17.7</strong></td>
<td><strong>76.1</strong></td>
</tr>
</tbody>
</table>

In addition, as a further contribution, this chapter presents the results of an SET analysis performed on sequential cells. The purpose of this effort is, first, to evaluate the sensitivity of the combinational parts of the sequential cell (output buffers, internal inverters, etc.) and, second, to examine whether an SET is possible in the internal storage loop when the cell is latched (storage mode). The results obtained with the TFIT tool for a D-latch show a very large difference (∼3.5×) between the transparent state (clock high, more sensitive) and the storage state (clock low, less sensitive).

### 1.5 De-rating analysis of a microprocessor

Fault injection remains the main analysis method for studying the impact of faults, errors and malfunctions caused by single events. This section presents the results of a functional analysis of a complex processor in the presence of faults, for a selection of representative applications (benchmarks). Accelerated simulation techniques (probabilistic computation, clustering, parallel simulation) were proposed and evaluated in order to develop an industrial validation environment able to handle very complex circuits with reasonable effort (manpower and CPU time).

The LDR was computed with a technique called PPSFP (Parallel Pattern Single Fault Propagation), using random input vectors (Figure 1.14). The dynamic, simulation-based analyses (MDR and FDR) were performed using three applications from the MiBench suite:

- The Quick Sort algorithm
- A series of mathematical operations
- Several bitcount algorithms
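To illustrate the parallel-pattern idea behind PPSFP mentioned above, the following sketch packs 64 random input patterns into one integer, evaluates a tiny logic cone once per word with bitwise operators, flips a single flip-flop output in all patterns, and counts how often the flip reaches the observed output. The logic cone, the function names and the word width are assumptions made for this example; it is not the tool flow used in the thesis. The sketch reports the propagation probability; whether the LDR figure quoted in the text denotes this quantity or its complement (the masking probability) is not stated in this section.

```python
import random

WIDTH = 64                # number of patterns simulated in parallel per word
MASK = (1 << WIDTH) - 1   # all-ones word: flips the signal in every pattern


def logic_cone(a: int, b: int, c: int) -> int:
    """Hypothetical cone y = (a AND b) OR c, evaluated on WIDTH patterns at once."""
    return ((a & b) | c) & MASK


def propagation_probability(fault_index: int, n_words: int = 200, seed: int = 0) -> float:
    """Fraction of random patterns for which flipping one input changes the output."""
    rng = random.Random(seed)
    propagated = 0
    for _ in range(n_words):
        inputs = [rng.getrandbits(WIDTH) for _ in range(3)]
        golden = logic_cone(*inputs)
        faulty_inputs = list(inputs)
        faulty_inputs[fault_index] ^= MASK       # single fault, all patterns
        faulty = logic_cone(*faulty_inputs)
        propagated += bin(golden ^ faulty).count("1")
    return propagated / (n_words * WIDTH)


p_a = propagation_probability(0)  # flip on input a: visible only when b=1 and c=0
p_c = propagation_probability(2)  # flip on input c: visible only when (a AND b)=0
print(f"propagation: a={p_a:.3f}, c={p_c:.3f}; masking: a={1 - p_a:.3f}, c={1 - p_c:.3f}")
```

For uniformly random inputs, the analytical propagation probabilities of this cone are 0.25 for `a` and 0.75 for `c`, which gives a quick sanity check on the sampled estimates.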

![Figure 1.13: Optimization by state inversion](image)

![Diagram](image)

**Table 1.2: SER of the microprocessor after protection**

<table>
<thead>
<tr>
<th>% of time storing &quot;0&quot;</th>
<th>Instance Count</th>
<th>SER @ 0</th>
<th>SER @ 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0%</td>
<td>31697</td>
<td>7.4</td>
<td>0</td>
</tr>
<tr>
<td>5%</td>
<td>3872</td>
<td>0.9</td>
<td>0.1</td>
</tr>
<tr>
<td>10%</td>
<td>4451</td>
<td>0.9</td>
<td>0.2</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>95%</td>
<td>4029</td>
<td>0.1</td>
<td>0.9</td>
</tr>
<tr>
<td>100%</td>
<td>31732</td>
<td>0</td>
<td>7.4</td>
</tr>
<tr>
<td><strong>TOTAL SER (per state)</strong></td>
<td></td>
<td><strong>42.2</strong></td>
<td><strong>19.2</strong></td>
</tr>
<tr>
<td><strong>TOTAL SER</strong></td>
<td colspan="3"><strong>61.4</strong></td>
</tr>
</tbody>
</table>

Figure 1.14: Distribution of the flip-flop Logic De-Rating values (average LDR = 74.1%)

For each application, three fault injection campaigns were carried out. The first campaign provided an initial FDR factor. The second campaign, based on a clustering approach, provided a better, per-instance characterization of the processor. Finally, the last campaign evaluated a hypothetical error mitigation scenario built from the data obtained in the first and second campaigns.

The results show that both the SDC (Silent Data Corruption) and DUE (Detectable Uncorrectable Errors) rates can be considerably reduced by hardening a limited percentage of the flip-flops. For the processor considered (∼250,000 flip-flops), injecting only 90,000 faults was enough to show a fivefold reduction of the error rate (SDC and DUE) when ∼10% of the flip-flops are hardened (Figure 1.15).
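A selective-hardening scenario of this kind can be sketched as a simple ranking problem, assuming (as a simplification not stated in the text) that a hardened flip-flop no longer contributes any failure and that each flip-flop's contribution is known from the fault injection campaigns. The function names and the failure counts are hypothetical and are not the thesis results.

```python
from typing import Dict, Set


def select_for_hardening(contribution: Dict[str, float], fraction: float) -> Set[str]:
    """Pick the given fraction of flip-flops with the largest failure contribution."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    ranked = sorted(contribution, key=contribution.get, reverse=True)
    return set(ranked[:round(len(ranked) * fraction)])


def reduction_factor(contribution: Dict[str, float], hardened: Set[str]) -> float:
    """Error rate before hardening divided by the residual error rate."""
    before = sum(contribution.values())
    after = sum(v for name, v in contribution.items() if name not in hardened)
    return float("inf") if after == 0 else before / after


# Hypothetical failure counts observed per flip-flop (SDC and DUE combined).
FAILURES = {
    "ff0": 60, "ff1": 8, "ff2": 7, "ff3": 6, "ff4": 5,
    "ff5": 5, "ff6": 4, "ff7": 3, "ff8": 1, "ff9": 1,
}

hardened = select_for_hardening(FAILURES, fraction=0.10)
factor = reduction_factor(FAILURES, hardened)
print(f"hardening {sorted(hardened)} reduces the error rate by a factor of {factor:.1f}")
```

The benefit of such a scheme depends entirely on how skewed the per-instance contributions are, which is why the per-instance characterization from the clustering campaign matters.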

### 1.6 Appendix: Flip-Flop SEU Reduction through Minimization of the Temporal Vulnerability Factor (TVF)

This appendix presents the results of a collaborative research project between IROC Technologies and Intel Israel. Its content is based on a publication presented at the IOLTS 2015 conference.

Figure 1.15: Reduction of the SDC and DUE rates for each application

This section presents a technique for improving the soft error rate of flip-flops by optimizing temporal masking (Temporal De-Rating, TDR, or TVF) through the selective insertion of delays (implemented with buffers) at the input or output of the flip-flops. In its simplest form, the TDR is defined as the slack divided by the clock period: critical paths are therefore the least sensitive, and more relaxed paths are the most sensitive.
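The simplest form of this definition can be written down directly. The sketch below uses a hypothetical function name and made-up timing values; it only encodes the ratio of slack to clock period described above.

```python
def temporal_vulnerability(slack: float, clock_period: float) -> float:
    """Simplest form of the factor: slack divided by the clock period."""
    if clock_period <= 0:
        raise ValueError("clock period must be positive")
    if not 0.0 <= slack <= clock_period:
        raise ValueError("slack must lie within [0, clock_period]")
    return slack / clock_period


CLOCK_PERIOD_NS = 2.0  # hypothetical clock period

critical = temporal_vulnerability(0.125, CLOCK_PERIOD_NS)     # near-critical path
relaxed = temporal_vulnerability(1.5, CLOCK_PERIOD_NS)        # relaxed path
padded = temporal_vulnerability(1.5 - 1.0, CLOCK_PERIOD_NS)   # same path with 1 ns of added delay

print(f"critical={critical}, relaxed={relaxed}, relaxed after padding={padded}")
```

The third value shows the lever used by the technique: consuming slack on a relaxed path with added delay moves its factor toward that of a critical path.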

The proposed technique aims to reduce the slack, and thus increase temporal masking, on the more relaxed paths by adding delays (buffers). The optimization was formulated as a linear programming problem so that area and power constraints can be taken into account. The technique was evaluated on several (relatively simple) circuits, and the results show reductions of the overall SER (SEU and SET, with the contribution of the buffers taken into account) between 20% and 50%, for an area overhead of ∼10%.
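The sketch below is a greedy stand-in for the budgeted delay insertion, not the linear program described above. It assumes every buffer adds the same delay and costs one unit of a buffer budget, takes the effective SER of a flip-flop as its raw SER scaled by slack over clock period, and ignores the SET contribution of the buffers themselves (which the evaluation above does account for). All names and values are hypothetical.

```python
from typing import Dict, Tuple

Paths = Dict[str, Tuple[float, float]]  # name -> (slack in ns, raw SER)


def effective_ser(paths: Paths, plan: Dict[str, int], period: float, buf_delay: float) -> float:
    """Sum of raw SER scaled by (remaining slack / clock period)."""
    return sum(raw * (slack - plan.get(name, 0) * buf_delay) / period
               for name, (slack, raw) in paths.items())


def greedy_padding(paths: Paths, buf_delay: float, max_buffers: int,
                   margin: float = 0.0) -> Dict[str, int]:
    """Spend the buffer budget on the highest raw-SER flip-flops first.

    Each buffer removes buf_delay of slack; at least `margin` of slack is kept.
    """
    plan = {name: 0 for name in paths}
    remaining = max_buffers
    for name, (slack, _raw) in sorted(paths.items(), key=lambda kv: kv[1][1], reverse=True):
        if remaining == 0:
            break
        fit = int((slack - margin) // buf_delay)  # buffers this path can absorb
        used = max(0, min(fit, remaining))
        plan[name] = used
        remaining -= used
    return plan


PERIOD_NS = 2.0
BUF_DELAY_NS = 0.25
PATHS: Paths = {"a": (1.5, 10.0), "b": (1.0, 4.0), "c": (0.25, 8.0)}

plan = greedy_padding(PATHS, BUF_DELAY_NS, max_buffers=7, margin=0.25)
before = effective_ser(PATHS, {}, PERIOD_NS, BUF_DELAY_NS)
after = effective_ser(PATHS, plan, PERIOD_NS, BUF_DELAY_NS)
print(f"plan={plan}, effective SER {before} -> {after}")
```

With identical buffer costs and a gain per buffer proportional to the raw SER, ranking by raw SER is a natural heuristic; a linear programming formulation becomes useful once area and power costs differ per cell or several constraints interact.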