New Developments in FPGA Devices: SEUs and Fail-Safe Strategies from the NASA Goddard Perspective by Pellish, Jonathan et al.
Melanie Berg, AS&D in support of NASA/GSFC
Melanie.D.Berg@NASA.gov
Kenneth LaBel, NASA/GSFC 
Jonathan Pellish, NASA/GSFC
To be presented by Melanie D. Berg at 2016 SERESSA 12th International School on the Effects of Radiation on Embedded Systems for Space Applications, 
Montreal, Canada, November 10, 2016.
https://ntrs.nasa.gov/search.jsp?R=20160014661
Acknowledgements
• Some of this work has been sponsored by the 
NASA Electronic Parts and Packaging (NEPP) 
Program and the Defense Threat Reduction 
Agency (DTRA).
• Thanks are given to the NASA Goddard Radiation 
Effects and Analysis Group (REAG) for their 
technical assistance and support. REAG is led by 
Kenneth LaBel and Jonathan Pellish.
Contact Information:
Melanie Berg: NASA Goddard REAG FPGA 
Principal Investigator:
Melanie.D.Berg@NASA.GOV
Acronyms
Application specific integrated circuit (ASIC) 
Block random access memory (BRAM) 
Block Triple Modular Redundancy (BTMR) 
Clock (CLK or CLKB)
Clock period counter (T) 
Combinatorial logic (CL) 
Computer aided design tool (CAD) 
Configurable Logic Block (CLB) 
Data-path delay (τdelay)
Digital Signal Processing Block (DSP) 
Distributed triple modular redundancy (DTMR) 
Edge-triggered flip-flops (DFFs)
Equivalence Checking (EC)
Error detection and correction (EDAC) 
Exclusive OR (XOR)
Field programmable gate array (FPGA) 
Finite state machine (FSM)
Gate Level Netlist (EDF, EDIF, GLN) 
Global triple modular redundancy (GTMR) 
Hardware Description Language (HDL) 
Input – output (I/O)
Linear energy transfer (LET)
Local triple modular redundancy (LTMR) 
Look up table (LUT)
NASA Electronic Parts and Packaging (NEPP) 
Mean fluence to failure (MFTF)
Mean time to failure (MTTF) 
Multiple bit upset (MBU) 
Operational frequency (fs)
Power on reset (POR) 
Place and Route (PR)
Probability of a Configuration single event upset (PSEU) 
Probability of a flip-flop single event upset (PDFFSEU) 
Probability of a functional path single event upset (P(fs)functionalLogic)
Probability that a data-path will not be logically masked (Plogic)
Probability that an SET can propagate to a logic cone end-point (Pprop)
Probability of a system incurring a single event functional interrupt (PSEFI)
Probability that a DFF will capture a SET (P(fs)DFFSET)
Probability of system malfunction due to a single event upset (P(fs)error)
Probability of generating a single event transient (PSET)
Radiation Effects and Analysis Group (REAG)
Single error correct double error detect (SECDED) 
Single event functional interrupt (SEFI)
Single event effects (SEEs) 
Single event latch-up (SEL) 
Single event transient (SET) 
Single event upset (SEU)
Single event upset cross-section (σSEU)
Static random access memory (SRAM)
Static timing analysis (STA)
System on a chip (SOC) 
System Clock period (Tclk)
Triple modular redundancy (TMR) 
Width of a single event transient (τwidth)
Agenda
• FPGA Devices and Their Tool Environment.
• FPGAs and Critical Applications.
• Single Event Upsets (SEUs) in FPGAs and Fail-Safe Overview.
• Single Event Upsets and FPGA Configuration.
• Single Event Upsets in an FPGA’s Functional Data Path and Fail-Safe Strategies.
• Fail-Safe Strategies for FPGA Critical Applications.
• Fail-Safe State Machines.
FPGA Devices and Their Design Tool 
Environment
Definitions
• A Field-Programmable Gate Array (FPGA) is a semiconductor device containing configurable logic components called "logic blocks", and configurable interconnects. Logic blocks can be configured to perform the function of basic logic gates such as AND and XOR, or more complex combinational functions such as decoders or mathematical functions.
• An application-specific integrated circuit (ASIC) is a semiconductor device designed for a particular use. ASIC designs are considered more custom.
• An FPGA is an ASIC designed to have a “sea” of configurable logic for general purpose usage.
Creating A Design in An Integrated 
Circuit Device (FPGA or ASIC)
• The objective is to create a hardware 
design using hardware description
language (HDL):
– Clocks,
– Resets,
– Sequential elements 
(e.g., flip-flops),
– Combinatorial logic.
• The description gets synthesized into a hardware gate-level-
netlist (GLN: file listing gates and connectivity).
• The synthesized hardware gates are mapped and placed into 
the cell library (or logic blocks) of the target FPGA or ASIC.
• ASIC flow produces a mask that is handed to a foundry. FPGA
flow produces a configuration file.
Design Tools
• Design tools are used for each step of the design process.
• Synthesis: maps HDL into logic blocks (cells) and outputs gate-level netlists.
• Place and route (PR): optimizes where the logic blocks and their interconnects should be within the device.
• Synthesis and PR tools contain optimization algorithms within their tool sets.
– These algorithms are used to optimize area, power, and logic function.
– Tools are difficult to create. Poorly designed tools can create designs that are functionally incorrect, too large to fit into the target device, or consume too much power. Hence, a bad tool can produce unusable designs.
– Equivalence checking (EC) verifies that the tool output matches the HDL.
Best practice is to use a proven vendor’s tool set; otherwise the product might be unreliable or unusable.
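The idea behind equivalence checking can be illustrated with a toy example. The sketch below is only an illustration, not how commercial EC tools work (they rely on formal methods rather than brute-force enumeration). The functions `hdl_reference`, `synthesized_netlist`, and `broken_netlist` are hypothetical stand-ins for a designer's intent and for tool output.

```python
from itertools import product


def hdl_reference(a: int, b: int, c: int) -> int:
    """Behavior the designer described: majority vote of three bits."""
    return 1 if (a + b + c) >= 2 else 0


def synthesized_netlist(a: int, b: int, c: int) -> int:
    """Hypothetical tool output: the same function built from AND/OR gates."""
    return (a & b) | (a & c) | (b & c)


def broken_netlist(a: int, b: int, c: int) -> int:
    """Hypothetical faulty tool output: one AND gate was dropped."""
    return (a & b) | (a & c)


def equivalent(reference, implementation, n_inputs: int) -> bool:
    """Compare two functions over every possible input combination."""
    return all(
        reference(*bits) == implementation(*bits)
        for bits in product((0, 1), repeat=n_inputs)
    )


print(equivalent(hdl_reference, synthesized_netlist, 3))
print(equivalent(hdl_reference, broken_netlist, 3))
```

Exhaustive comparison is only feasible for a handful of inputs; it is used here because it makes the pass/fail criterion explicit.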
FPGA User Design Flow
[Figure: design flow diagram. Functional Specification → HDL (Behavioral Simulation) → Synthesis (STA, EC, and Gate Level Simulation) → Place and Route (STA and back-annotated Gate Level Simulation) → Create Configuration.]
• The user creates a design that is mapped into a manufacturer-provided FPGA.
• The flow is performed by the user with manufacturer FPGA-specific design tools.
• It looks like the ASIC design flow, but without the wait time.
HDL: Hardware description language
STA: Static timing analysis
EC: Equivalence checking
General FPGA Architecture: Fabric Containing Customizable Preexisting Logic… User Building Blocks
[Figure: FPGA integrated circuit fabric. Labeled elements: IO Block (part of the IO ring); Logic Block (combinatorial and/or sequential); Special High-Speed Connect; Block Memory; Hard IP (processors, digital clock managers, phase locked loops).]
How Do FPGAs Differ?
• Manufacturer Architecture (not all are listed):
– Configuration,
– User building blocks (combinatorial logic cells, sequential logic 
cells),
– Routing,
– Clock structures,
– Embedded mitigation, and
– Embedded intellectual property (IP); e.g., memories, complex I/O 
management, phase locked loops (PLLs), and processors.
• Manufacturer design tool environment:
– Synthesis,
– Place and Route, and
– Configuration management output.
Differences in architectures and tools will affect the final design and the design process; users should be aware of them.
FPGA Component Libraries: Basic 
Designer Building Blocks (They Differ 
per FPGA Type)
• Combinatorial logic 
(CL) blocks
– Vary in complexity.
– Vary in I/O.
• Sequential logic blocks 
(DFF)
– Uses global Clocks.
– Uses global Resets.
– May have mitigation.
User Maps the Design Logic into FPGA Preexisting Logic
[Figure: synthesis maps the hardware description language (HDL) design onto FPGA equivalent blocks: combinatorial logic maps to a combinatorial FPGA equivalent block, and DFFs map to a DFF FPGA equivalent block.]
FPGA Configuration (Storage of User 
Design Mapping)
[Figure: FPGA mapping.]
• Configuration defines the arrangement of pre-existing logic via programmable switches:
– Functionality (logic cluster), and
– Connectivity (routes).
• Programmable switch types:
– Antifuse: one time programmable (OTP),
– SRAM: reprogrammable (RP), or
– Flash: reprogrammable (RP).
FPGA Structure Categorization as Defined by NASA Goddard REAG
• SEU cross section: σSEU.
• The design σSEU is categorized into three components:
– Configuration σSEU,
– Functional logic σSEU: sequential and combinatorial logic (CL) in the data path,
– SEFI σSEU: global routes and hidden logic.
• Error types: single event upsets (SEUs) and single event functional interrupts (SEFIs). SEFIs are out of the scope of this presentation.
• SEU testing is required in order to characterize the σSEUs for each of the FPGA categories.
FPGAs and Critical Applications
Common FPGA Applications
• Controllers,
• Dataflow and interface adaptation,
• Digital signal processing (DSP),
• Software-defined radio,
• ASIC prototyping,
• Medical imaging,
• Robotic control (vision, movement, speech, etc.),
• Cryptology,
• Nuclear plant control,
• The list goes on…
Common Applications Example 1: Military and Space
[Figure: mission images. Soil Moisture Active Passive; Spacecube: International Space Station; Mars Rover; New Horizons: Pluto and Beyond.]
Common Applications Example 2: 
Terrestrial
Automotive applications that are opening up to FPGA-based solutions:
Navigation and Telematics Displays
Personnel Occupancy Detection Systems (PODS) for Next-Generation Airbags 
Blind-Spot Warning System
Engine Control Module
Lane Departure Warning System 
Adaptive Cruise Control 
Collision Avoidance System
Injector Control (especially diesel engines) 
Power Steering Control
Multi-Axis Power Seat Control
Advanced Suspension and Traction Control 
Emissions Control
Back-up Sensors 
Back-up Camera
Rear-Seat Entertainment Source MUXing 
Digital Cluster
Concerns for using FPGA Devices in 
Critical Applications
• Safety: can circuits or
humans be damaged or hurt?
• Reliability : will the device 
operate as expected?
• Availability: how often will the 
system operate as expected?
• Recoverability: if the device 
malfunctions, can the system 
come back to a working 
state?
• Trust: Will the insertion of the 
device compromise security?
Critical applications expect to avoid disaster when disaster is probable.
Sources of FPGA Failures
• Negative bias temperature instability (NBTI)
• Dielectric breakdown (DB)
• Hot carrier injection (HCI)
• Total ionizing dose (TID)
• Single event effects (SEEs)
• Environmental stress
• Poor design choices
• Lack of verification
• Electromigration (EM)
• Packaging and mounting
• Transistor switching stress
How To Protect A System from Failure
• Investigate failure modes – understand risk:
– Reliability testing (temperature, voltage, mechanical, and logic 
switching stresses).
– Radiation testing: Single event effects (SEE) and total ionizing 
dose (TID).
• Add redundancy:
– Replication with correction.
– Replication with detection. Requires recovery:
• Switch to another device,
• Try to recover state,
• Start over,
• Alert,
• Do nothing… die.
• Add filtration: e.g., Finite impulse response (FIR) filters 
or Constant false alarm rate filter (CFAR).
• Add masking: Protect system operation from failures.
Single Event Upsets (SEUs) and FPGA 
Devices
• Although there are many sources of FPGA 
malfunction, this presentation will focus on SEUs as a 
source of failure.
Implications of SEUs to FPGA 
Applications
• Ionizing particles cause upsets (SEUs) in FPGAs.
• Each FPGA type has different SEU error signatures:
– Temporary glitch (transient),
– Change of state (incorrect state machine transitions),
– Global upsets: Loss of clock or unexpected reset,
– Configuration corruption. This includes route breakage (no 
signal can get through) – can be overwhelming.
• The question is how to avoid system failure and the 
answer depends on the following:
– The system’s requirements and the definition of failure,
– The target FPGA and its surrounding circuitry susceptibility,
– Implemented fail-safe strategies,
– Reliable design practices,
– Radiation environment.
Characterizing SEUs: Radiation Testing 
and SEU Cross Sections
SEU cross sections (σSEU) characterize how many upsets will occur based on the number of ionizing particles to which the device is exposed:
$\sigma_{SEU} = \frac{\#\text{errors}}{\text{fluence}}$
Terminology:
• Flux: particles/(sec·cm^2)
• Fluence: particles/cm^2
σSEU is calculated at several LET values (particle spectrum).
Testing with a low flux is imperative for SRAM-based devices because of the complexity of the device relative to the accelerated rate of exposure.
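The cross-section definition can be worked through numerically. The following is a minimal sketch; the flux, exposure time, and error count are made-up values chosen only to show the arithmetic, and the function names are illustrative.

```python
def fluence_from_flux(flux: float, exposure_seconds: float) -> float:
    """Fluence (particles/cm^2) accumulated at a constant flux (particles/(sec*cm^2))."""
    return flux * exposure_seconds


def seu_cross_section(num_errors: int, fluence: float) -> float:
    """sigma_SEU = #errors / fluence, in cm^2 when fluence is in particles/cm^2."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return num_errors / fluence


# Hypothetical test run at a single LET value (all numbers are assumptions).
run_fluence = fluence_from_flux(flux=1.0e3, exposure_seconds=1.0e4)
sigma = seu_cross_section(num_errors=50, fluence=run_fluence)

print(f"fluence   = {run_fluence:.3e} particles/cm^2")
print(f"sigma_SEU = {sigma:.3e} cm^2")
```

Repeating the calculation at several LET values gives the points of a cross-section curve.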
SEE Go-no-Go: Single Event Hard 
Faults and Common Terminology
• Single Event Latch Up (SEL): Device latches in high 
current state:
– Has been observed in FPGA devices that are currently on the 
market.
– Some missions choose to use the devices and design around 
the SEL.
• Single Event Burnout (SEB): Device draws high 
current and burns out.
– Not observed in FPGA devices that are currently on the 
market.
• Single Event Gate Rupture (SEGR): Gate destroyed, typically in power MOSFETs.
– Not observed in FPGA devices that are currently on the market.
Goal for critical applications:
Limit the probability of system error propagation
and/or provide detection-recovery mechanisms
via fail-safe strategies.
Fail-Safe Strategies for FPGA 
Critical Applications
Differentiating Fail-Safe Strategies:
• Detection:
– Watchdog (state or logic monitoring).
– Simplistic Checking … Complex Decoding.
– Action (correction or recovery).
• Masking (does not mean correction):
– Not letting an error propagate to other logic.
– Redundancy + mitigation or detection.
– Turn off faulty path.
• Correction (error may not be masked):
– Error state (memory) is changed/fixed.
– Need feedback or new data flush cycle.
• Recovery:
– Bring system to a deterministic state.
– Might include correction.
There is a difference between error masking, correcting,
and detecting!
Availability versus Correct Operation
• What is your expected up-time versus down-time 
(availability)?
• Is correct operation well defined? Needs to be 
Unambiguous!
• Is system failure well defined? Needs to be 
Unambiguous!
• Can availability and correct operation be deterministic 
regardless of error signature?
• Availability:
– Might be defined for general operation.
– Might be only defined during critical operation.
– E.g., device operation during manned missions might be for short periods of time. However, correct operation is mandatory and availability during this window is expected to be 99.9999(999?)%.
Detection and Recovery
• Not all mitigation schemes require detection.
• Questions/Considerations:
– If your scheme requires detection:
• Can the system detect all error signatures?
• Can the system detect all error signatures fast
enough?
• Do different errors require different recovery schemes, and can the system accommodate them?
– How are you going to verify the detection and recovery?
– How much downtime will there be during recovery?
Availability is affected by detection and recovery time.
“Yes” or “We know it will work” are not good enough answers: ask how, and whether, the scheme has been verified!
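To see how detection and recovery time eat into availability, the sketch below uses the common steady-state approximation availability = uptime / (uptime + downtime). This formula and all numbers are assumptions added for illustration; they do not come from the presentation.

```python
def availability(mean_time_to_failure_s: float,
                 detection_time_s: float,
                 recovery_time_s: float) -> float:
    """Fraction of time the system operates as expected.

    Downtime per failure is modeled as detection time plus recovery time.
    """
    downtime = detection_time_s + recovery_time_s
    return mean_time_to_failure_s / (mean_time_to_failure_s + downtime)


# Hypothetical system: one upset-induced failure every 1e6 seconds on average.
slow = availability(1.0e6, detection_time_s=1.0, recovery_time_s=9.0)
fast = availability(1.0e6, detection_time_s=0.1, recovery_time_s=0.9)

print(f"slow detection/recovery: {slow:.7f}")
print(f"fast detection/recovery: {fast:.7f}")
```

Under these assumed numbers, cutting detection plus recovery time by a factor of ten reduces unavailability by roughly the same factor, which is why both times need to be verified rather than assumed.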
SEUs and FPGA Variations
• FPGA susceptibilities (error signatures) vary per 
FPGA type.
• How does a project manage and protect against
FPGA susceptibilities? (mitigation schemes will
change based on FPGA type).
• The most efficient solution will be based on 
understanding:
– SEE theory,
– FPGA SEE susceptibility (per FPGA type),
– Proven mitigation strategies per FPGA type,
– Validation and verification of implemented mitigation 
strategies, and
– Limitations of tools and/or mitigation schemes.
Redundancy Is Not Enough
• Just adding redundancy to a system is not 
enough to assume that the system is well 
protected.
• Concerns that must be addressed for a critical 
system expecting redundancy to cure all (or 
most):
– How is the redundancy implemented?
– What portions of your system are protected? Does the 
protection comply with the results from radiation 
testing?
– Is detection of malfunction required to switch to a 
redundant system or to recover?
– If detection is necessary, how quickly can the detection 
be performed and responded to?
– Is detection enough?... Does the system require 
correction?
Mitigation and FPGA Devices
• Mitigation can be:
– User inserted: part of the actual design process.
• User must verify mitigation… Complexity is a RISK!!!!!!!!
– Embedded: built into the device library cells.
• User does not verify the mitigation – manufacturer does.
• Mitigation should reduce error…
– Generally through redundancy.
– Incorrect implementation can increase error.
– Overly complex mitigation cannot be verified and 
incurs too high of a risk to implement.
Radiation Hardened (per SEU) versus 
Commercial FPGA Devices
• A radiation hardened (per SEU) FPGA is a device that 
has embedded mitigation.
• Radiation hardened FPGA devices are available to 
users. They make the design cycle much easier!
• They are considered hardened if:
– Configuration susceptibility is reduced to an acceptable rate.
– Generally, fewer than 1x10^-8 upsets per node per day.
– Be careful: with millions of nodes, this can translate into one or two configuration failures per year.
– However, if the node isn’t being used, then your circuit may not be affected by the failure.
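The scaling from a per-node rate to a per-device rate can be checked with simple arithmetic. This sketch assumes the threshold is read as a rate per node per day; the specific rate, node count, and used fraction are hypothetical values chosen for illustration.

```python
def expected_upsets_per_year(rate_per_node_day: float,
                             nodes: int,
                             used_fraction: float = 1.0) -> float:
    """Expected configuration upsets per year that land on used nodes."""
    return rate_per_node_day * nodes * used_fraction * 365.0


# Hypothetical hardened device: rate below the 1e-8 threshold, one million nodes.
all_nodes = expected_upsets_per_year(5.0e-9, 1_000_000)
half_used = expected_upsets_per_year(5.0e-9, 1_000_000, used_fraction=0.5)

print(f"all nodes used:  {all_nodes:.3f} upsets/year")
print(f"half nodes used: {half_used:.3f} upsets/year")
```

Even a very small per-node rate adds up across a large device, and only upsets in used nodes can affect the circuit.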
Radiation Hardened versus Commercial FPGA Device Geometries and Gate Count
As Geometries Get Smaller, More Gates Are Available for Mitigation
[Figure, courtesy of Synopsys: FPGA device families plotted by process geometry against logic capacity (millions, 0 to 5), with a marker for SEU hardened/harder devices. Geometries shown: 150nm, 130nm, 90nm, 65nm, 28nm, 20nm, 16nm. Devices shown: RTAX-S, RT-ProASIC, Virtex 4QV and Virtex 4, Virtex 5QV, Virtex 5, Stratix 5, Virtex-7Q, Virtex-7, Kintex UltraScale, Virtex UltraScale, Kintex UltraScale+, Virtex UltraScale+.]
FPGA Devices Listed by Configuration Type (Not All Are Included in The List): Embedded Mitigation
Manufacturer | Configuration Type | Short List of Device Families | Embedded Mitigation
Altera | SRAM | Stratix | No
Microsemi | Antifuse | RTAX, RTSXS | Clocks + DFFs (configuration is already hardened by nature)
Microsemi | Flash | ProASIC3, RTG4 | Configuration is already hardened by nature
Xilinx | SRAM | Virtex, Kintex | No
Xilinx | Hardened SRAM | Virtex V5QV | Configuration + DICE DFFs + SET filters
Go to http://radhome.gsfc.nasa.gov, manufacturer websites, and other space agency sites for more information on SEU data and total ionizing dose data.
FPGA Devices Listed by Configuration Type (Not All Are Included in The List): Susceptibility
Configuration Type | Short List of Device Families | Embedded Mitigation | Most Susceptible Components
SRAM | Stratix, Virtex, Kintex | No | Configuration
Antifuse | RTAX, RTSXS | DFFs and clocks (configuration is already hardened by nature) | Combinatorial logic (however, susceptibility is considered low)
Flash | ProASIC3, RTG4 | Configuration is already hardened by nature | ProASIC: DFFs and clocks; RTG4: clocks
Hardened SRAM | Virtex V5QV | Configuration + DICE DFFs + SET filters | Clocks. In some cases additional mitigation may be necessary for configuration and DFFs
Go to http://radhome.gsfc.nasa.gov, manufacturer 
websites, and other space agency sites for more 
information on SEU data and total ionizing dose data.
NASA and other Government Agency 
FPGA Device Selection for Critical 
Applications
• Currently, the most common FPGA devices used 
for NASA driven critical space applications are 
anti-fuse.
• This is also true for other government agencies.
• However, due to cost of implementation and 
robustness of design, SRAM-based and Flash 
FPGAs are becoming more popular.
• The usage of SRAM-based FPGA devices introduces a variety of challenges for critical operations because of their SEU susceptibility and reduced security.
Preliminary Design Considerations for 
Mitigation And Trade Space
• Does the designer need to add 
mitigation?
• Will there be compromises?
– Performance and speed,
– Power,
– Schedule,
– Mitigating the susceptible 
components?
– Reliability (working and mitigating 
as expected)?
Impact to speed, power, area, reliability, and
schedule are important questions to ask.
Determine Most Susceptible Components:
Fail-safe Strategies for Single Event 
Upsets (SEUs)
• The following slides will demonstrate commonly used 
mitigation strategies for FPGA devices.
• What you should learn:
– The differences between FPGA mitigation 
strategies.
– Strengths and weaknesses of various strategies.
– Questions to ask or considerations to make when 
evaluating mitigation schemes.
– Which mitigation schemes are best for various 
types of FPGA devices.
• This presentation covers fail-safe strategies for configuration and data-path SEUs.
Single Event Upsets and FPGA 
Configuration
Pconfiguration+P(fs)functionalLogic+PSEFI
Programmable Switch Implementation and 
SEU Susceptibility
ANTIFUSE (one time programmable)
SRAM (reprogrammable)
Configuration SEU Test Results and the REAG FPGA SEU Model
$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$
FPGA Configuration Type | REAG Model
Antifuse | $P(f_s)_{error} \propto P(f_s)_{functionalLogic} + P_{SEFI}$
SRAM (non-mitigated) | $P(f_s)_{error} \propto P_{Configuration}$
Flash | $P(f_s)_{error} \propto P(f_s)_{functionalLogic} + P_{SEFI}$
Hardened SRAM | $P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$
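The way the REAG model keeps or drops terms per configuration type can be sketched in a few lines. The term selection below follows the model as listed per configuration type; the dictionary name, function name, and input probabilities are hypothetical, and the result is a relative figure because the model is a proportionality, not an equality.

```python
# Which terms of the model are treated as significant for each configuration type.
REAG_TERMS = {
    "Antifuse": ("functional_logic", "sefi"),
    "SRAM (non-mitigated)": ("configuration",),
    "Flash": ("functional_logic", "sefi"),
    "Hardened SRAM": ("configuration", "functional_logic", "sefi"),
}


def relative_error_probability(fpga_type: str,
                               p_configuration: float,
                               p_functional_logic: float,
                               p_sefi: float) -> float:
    """Sum only the terms the model keeps for the given configuration type."""
    terms = {
        "configuration": p_configuration,
        "functional_logic": p_functional_logic,
        "sefi": p_sefi,
    }
    return sum(terms[name] for name in REAG_TERMS[fpga_type])


# Made-up inputs, used only to show which terms contribute.
for fpga_type in REAG_TERMS:
    value = relative_error_probability(fpga_type, 1.0e-3, 2.0e-4, 5.0e-5)
    print(f"{fpga_type}: {value:.2e}")
```

In practice each term would come from radiation test data for the specific device and design, at the operating frequency fs.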
What Does The Last Slide Mean?
FPGA Configuration Type | Susceptibility (Data-path: Combinatorial Logic (CL) and Flip-flops (DFFs); Global: Clocks and Resets; Configuration)
Antifuse | Configuration has been designated as hard regarding SEEs. Susceptibilities only exist in the data paths and global routes. However, global routes are hardened and have a low SEU susceptibility.
SRAM (non-mitigated) | Configuration has been designated as the most susceptible portion of circuitry. All other upsets (except for global routes) are too statistically insignificant to take into account. E.g., it is a waste of time to study data path transients; however, clock transient studies are significant.
Flash | Configuration has been designated as hard (but NOT immune) regarding SEEs. Susceptibilities also exist in the data paths and global routes (e.g., clocks and resets).
Hardened SRAM | Configuration has been designated as hardened (but NOT hard) regarding SEEs. Susceptibilities also exist in the data paths and global routes (e.g., clocks and resets).
Example: Routing Configuration Upsets in a Xilinx Virtex FPGA
[Figure: look up tables (LUTs), each with inputs I1, I2, I3, and I4, connected through a routing matrix.]
Because multiple paths can pass through the routing matrix, a routing configuration upset can be catastrophic, i.e., it can break simple mitigation.
Traditional SRAM … One Data Word at a Time
• Upsets have no effect until the address containing the upset is read out of SRAM.
• Error detection and correction (EDAC) circuits are placed after data out.
• EDAC circuits only work on one data word at a time.
Configuration SRAM is NOT Utilized the Same Way as Traditional SRAM
[Figure: array of configuration bits (B1, B2, B3, B4, …, Bi, Bi+1, Bi+2, Bi+3, …). Every used bit is visible.]
• Direct connections from configuration to user logic.
• If an upset occurs in a used configuration bit, then an upset occurs in the logic.
• We’re not dealing with data words anymore. Traditional SRAM EDAC schemes don’t quite apply to configuration SRAM.
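A small model shows why a used configuration bit acts directly on the logic. The sketch below models a 4-input look up table as 16 configuration bits addressed by the inputs; the bit ordering and the names `lut4`, `and4_config`, and `upset_config` are assumptions for illustration, not any vendor's actual layout.

```python
from typing import List


def lut4(config_bits: List[int], i1: int, i2: int, i3: int, i4: int) -> int:
    """Return the LUT output: the four inputs select one of 16 configuration bits."""
    address = (i4 << 3) | (i3 << 2) | (i2 << 1) | i1
    return config_bits[address]


# Configure the LUT as a 4-input AND: only address 15 (all inputs high) holds a 1.
and4_config = [0] * 15 + [1]

# Model an SEU: flip configuration bit 7 (i1 = i2 = i3 = 1, i4 = 0).
upset_config = list(and4_config)
upset_config[7] ^= 1

print(lut4(and4_config, 1, 1, 1, 0))   # intended AND behavior
print(lut4(upset_config, 1, 1, 1, 0))  # behavior after the configuration upset
```

No read access is needed for the flipped bit to matter: the logic function itself has changed until the bit is rewritten.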
Scrubbers: Blind versus Read-back
Blind Scrubber
• Write the golden configuration into configuration memory.
• Scrub cycle is on the order of milliseconds.
• Pros: simple; less area and power; no need for additional 
non-volatile memory; very fast (great for accelerated testing).
• Cons: the write pointer can get hit during writing and write 
bad data into the configuration; however, the probability of 
occurrence is insignificant (proven in heavy-ion SEU testing).
Read-back
• Read the configuration and calculate the correct data; if there 
is an upset, write the corrected data.
• Scrub cycle is on the order of seconds.
• Pros: only writes if there is an upset.
• Cons: additional non-volatile memory required; slow (only 
a problem for accelerated testing); takes more area and 
power; the correction scheme can break (e.g., be limited to 
detecting and correcting one upset). Consequently, upon 
an MBU it can write bad data to the configuration; this has 
been proven via heavy-ion testing.
Scrubbers: Internal versus External (1)
• Internal and external scrubbers are used to fix 
configuration bits:
– Internal scrubber: created out of hard cores that reside inside 
the FPGA device, or created out of user fabric logic blocks 
located inside the FPGA device.
– External scrubber: implemented in a separate device.
• External scrubbers are usually implemented in anti-fuse 
FPGA devices.
• Internal scrubbers are more susceptible to upsets than 
external scrubbers.
• Scrubbers that use SECDED are considered suspect
because of the increased number of multiple bit upsets
(MBUs) in configuration memory.
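The SECDED concern above can be illustrated with a minimal sketch. It assumes an extended Hamming (8,4) code as a stand-in for whatever SECDED scheme a scrubber might use; the function names `encode`, `decode`, and `flip` are hypothetical and not taken from any vendor scrubber. A single upset is corrected and a double upset is detected, but a triple upset is mistaken for a single upset and "corrected" into bad data.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) word (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]          # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]          # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]          # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    overall = 0
    for b in bits:
        overall ^= b                 # overall parity makes the word even
    return bits + [overall]


def flip(word, indices):
    """Return a copy of word with the bits at the given 0-based indices upset."""
    out = list(word)
    for i in indices:
        out[i] ^= 1
    return out


def decode(word):
    """Return (nibble or None, status) using the usual SECDED decision rule."""
    bits = list(word[:7])
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos          # XOR of the 1-based positions of set bits
    parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "clean"
    elif not parity_ok:
        # Odd number of upsets: the decoder ASSUMES exactly one.
        if syndrome:
            bits[syndrome - 1] ^= 1
        status = "corrected"
    else:
        return None, "double error detected"
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, status


word = encode(0b1010)
print(decode(flip(word, [4])))        # single upset: data recovered
print(decode(flip(word, [0, 4])))     # double upset: detected, not corrected
print(decode(flip(word, [0, 1, 3])))  # triple upset: reported "corrected", data is wrong
```

In a read-back scrubber, the third case corresponds to writing bad data back into the configuration.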
Scrubbers: Internal versus External (2)
• Internal scrubbers are usually implemented as read-
back. Remember read-back scrubbers can break 
and write bad frames into the configuration due to 
MBUs.
• Although configuration memory interleaving has 
been used in the newer SRAM-based FPGAs, a 
significant number of multiple bit upsets (MBUs) 
have been observed via Naval Research Laboratory 
(NRL) laser testing and heavy-ion testing.
Differentiate Scrubbing for Space Applications
and Scrubbing for Radiation Testing
Space Application
• Only scrub if there is 
mitigation
• Make scrubber simple to 
reduce project risk
• Do not scrub constantly – not 
necessary and not good for 
the system
• Single error correction double 
error detection (SECDED) 
scrubbers may not work well 
due to multiple bit upsets 
(MBUs)
• Blind scrubbing is the 
simplest scheme yet read-
back will also work
Accelerated SEU Testing
• We must scrub!
• Particles cannot overtake the 
scrubber – i.e., scrubber must 
be fast enough to stop fast 
accumulation of configuration 
SEUs – SCRUB CONSTANTLY
• SECDED scrubbing schemes 
do not work well during 
accelerated testing because 
of MBUs and accumulation
• Generally no time for read-
back of configuration – hence 
blind scrubber is the best fit 
for accelerated testing
Warning!
• Fixing a configuration bit does not mean that you 
have fixed the state in the functional logic path.
• In order to guarantee that the functional logic is 
in the expected state after the configuration bit is 
fixed, either the state must be restored or a reset 
must be issued.
Reliably getting to an expected state after a 
configuration-bit SEU (that affects the design’s 
functionality) requires one of the following:
– Fix configuration bit + (reset or correct DFFs) or
– Full reconfiguration.
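A minimal sketch of this warning, under stated assumptions: the "design" is a 2-bit counter whose next-state logic is a look-up table standing in for configuration memory, and `GOLDEN_LUT`, `step`, and `run` are hypothetical names used only for illustration. Scrubbing repairs the table, but the counter state stays wrong until the state itself is restored.

```python
GOLDEN_LUT = [1, 2, 3, 0]  # next-state table of a 2-bit counter (the "configuration")


def step(lut, state):
    """One clock edge: the DFFs load the LUT output for the current state."""
    return lut[state]


def run(cycles_with_upset=3, cycles_after_scrub=4, restore_state=False):
    """Return (design_state, expected_state) after an upset, a scrub, and more clocks."""
    lut = list(GOLDEN_LUT)
    state = expected = 0
    lut[1] ^= 0b10                      # configuration SEU: entry 1 changes from 2 to 0
    for _ in range(cycles_with_upset):
        state = step(lut, state)
        expected = step(GOLDEN_LUT, expected)
    lut = list(GOLDEN_LUT)              # scrub: the configuration bit is fixed
    if restore_state:
        state = expected                # correct the DFFs (or reset the whole system)
    for _ in range(cycles_after_scrub):
        state = step(lut, state)
        expected = step(GOLDEN_LUT, expected)
    return state, expected


print(run())                    # scrub only: the state still differs from the expected state
print(run(restore_state=True))  # scrub + state restore: the states agree
```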
Example: Routing Configuration
Upsets in a Xilinx Virtex FPGA
[Figure: look-up tables (LUTs), each with inputs I1 to I4, connected through the routing matrix.]
Configuration + design state must be corrected after a configuration 
SEU hit.
Single Event Upsets in an FPGA’s Functional 
Data Path and Fail-Safe Strategies
$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$
Data-path SEUs and Their Effect at the System Level
• A system implemented in an FPGA is a
cascade of sequential and combinatorial
logic.
• The occurrence of an SET or SEU does not 
definitively cause system error.
• Probability of a system error due to an
SEU depends on many factors:
– Probability of fault generation in a gate (SET or 
SEU).
– Probability of error propagation – will the SET 
or SEU force the system’s next state to be 
incorrect?
Probability of Error Propagation in a Data-Path
Upsets usually occur between clock edges. They can
cause a system-level malfunction if the SET or SEU
forces the system’s next state to be incorrect.
• Capacitive filtration: data-path capacitance can stop 
transient upset propagation; e.g.:
– Routing metal or heavy loading.
– If a transient doesn’t reach a sequential element, then it most 
likely will not cause a system upset.
• Logic masking:
– Redundancy and mitigation of paths can stop upset propagation.
– Turned off paths from gated logic can stop upset propagation.
• Temporal delay: path delays can block temporary SEUs 
from disturbing next state calculation.
Data-path SEU Susceptibility and 
Analysis: the NASA Electronic Parts 
and Packaging (NEPP) FPGA Model
Berg M., “FPGA SEE Test Guidelines”, NASA Radiation Effects 
and Analysis Group Website: 
https://nepp.nasa.gov/files/23779/FPGA_Radiation_Test_Guidelines_2012.pdf, July 2012.
Background: Synchronous Design Data 
Path – Sample and Hold
• Synchronous design components:
– Edge-triggered flip-flops (DFFs),
– Clocks and resets (global routes),
– Combinatorial logic (CL).
• All DFFs are connected to a clock.
• DFFs sample their input at the rising edge of the clock.
• CL computes between clock edges.
• Clock period and frequency: $\tau_{clk} = 1/f_s$
Designs are complex; we modularize for simplicity.
[Figure: DFFs separated by CL.]
Background: Synchronous Data Paths:
StartPoint DFFs → EndPoint DFFs
• A data path is defined as a StartPoint, via CL, to an EndPoint.
• CL and routes create delay (τdly) from StartPoints to EndPoints.
• Every data path has a unique τdly.
• τdly is calculated using Static Timing Analysis (STA) design tools.
• Every DFF has a function that determines its state.
Modularization: every DFF has a unique cone of logic.
[Figure: timing diagram over clock periods T-1, T, T+1 showing τdly relative to τclk.]
How Can a DFF Contain an Incorrect 
State from an SEU?
We make a clear distinction between DFF SEUs based on 
clock state and capture.
• DFFs have various modes of reaching a bad state due to SEUs.
• Attribute some modes to EndPoints and some to StartPoints.
DFFk cone of logic: EndPoint DFF SEUs + StartPoint DFF SEUs + CL SETs
– EndPoint DFF SEUs: DFF upsets that occur at the clock edge.
– StartPoint DFF SEUs: DFF upsets that occur between clock edges and 
are captured by EndPoints.
– CL SETs: single event transients captured by EndPoints.
Wrong function = wrong DFF state.
How Does a StartPoint SEU Get Captured 
by an EndPoint?
If DFFD flips its state at time τ after the clock edge, the EndPoint captures the upset when:
$0 < \tau < \tau_{clk} - \tau_{dly}$, or equivalently $\tau + \tau_{dly} < \tau_{clk}$
Time slack: $\tau_{clk} - \tau_{dly}$
Probability of capture: $1 - (\tau_{dly}/\tau_{clk}) = 1 - \tau_{dly} f_s$
[Figure: timing diagram over clock periods T-1, T, T+1 showing τdly and τclk.]
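The capture probability above can be evaluated with a short sketch. The function name and the unit choices (nanoseconds, MHz) are assumptions made for illustration; the formula itself is the one on the slide, 1 − τdly·fs.

```python
def startpoint_capture_probability(tau_dly_ns, f_s_mhz):
    """Probability that a StartPoint upset between clock edges reaches its EndPoint in time.

    tau_dly_ns: data-path delay from StartPoint to EndPoint, in nanoseconds.
    f_s_mhz:    system clock frequency, in MHz.
    """
    tau_clk_ns = 1e3 / f_s_mhz                 # clock period: tau_clk = 1 / f_s
    if not 0.0 <= tau_dly_ns <= tau_clk_ns:
        raise ValueError("path delay must fit within one clock period")
    return 1.0 - tau_dly_ns / tau_clk_ns       # 1 - tau_dly * f_s


# Same 5 ns path at two clock rates: the faster clock leaves less slack,
# so a StartPoint upset is less likely to be captured.
print(startpoint_capture_probability(5.0, 50.0))
print(startpoint_capture_probability(5.0, 100.0))
```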
Details of Capturing StartPoint DFFs
• SEU generation occurs in a StartPoint between rising clock 
edges (βP(fs)DFFSEU)
• StartPoint upsets can be logically masked by logic 
between the StartPoint and its EndPoint
• Design topology and temporal effects:
– Increase path delay (# of gates) – decrease probability of capture
– Increase frequency – decrease probability of capture
[Figure: factors in StartPoint capture: upset generated internally to a DFF 
between clock edges; design topology and temporal masking; design topology 
and logic masking.]
Synchronous System: CL SET Capture
[Figure: an SET of width τwidth generated in CL arrives at an EndPoint DFF.]
Details of CL SET Capture in a 
Synchronous System: P(fs)DFFSET
• SET Generation (Pgen) occurs between clock edges
• EndPoint DFF captures the SET at a clock edge
– Increase frequency – increase probability of capture
– Increase CL – increase probability of capture
[Figure: factors in P(fs)DFFSET: the SET is generated; propagation: the SET can 
propagate through the electrical medium (routes and gates) and reach the 
EndPoint; the SET is not logically masked; width of the SET relative to the 
clock period (τwidth relative to τclk).]
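The frequency trend for SET capture can be sketched as follows. The slide only states that capture depends on the width of the SET relative to the clock period; treating that ratio directly as a capture fraction (capped at 1) is a first-order assumption made here for illustration, and the function name is hypothetical.

```python
def set_capture_fraction(tau_width_ns, f_s_mhz):
    """First-order estimate of how often an SET overlaps a capturing clock edge.

    Assumes the fraction is tau_width / tau_clk = tau_width * f_s, capped at 1.
    Generation, propagation, and logic masking are not modeled.
    """
    tau_clk_ns = 1e3 / f_s_mhz
    return min(1.0, tau_width_ns / tau_clk_ns)


# A 0.5 ns transient at increasing clock rates: the capture fraction grows with frequency.
for f_mhz in (50.0, 100.0, 200.0):
    print(f_mhz, set_capture_fraction(0.5, f_mhz))
```

This is the opposite trend from StartPoint capture, which decreases as frequency increases.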
NEPP FPGA Model: Putting it All Together
– Analyzed per particle linear energy transfer (LET)
[Figure: model equation with terms labeled EndPoint, StartPoints (with logic masking), and CL.]
StartPoints and CL need to be captured by an EndPoint; hence data 
path de-rating factors exist.
Table: Component Contribution to σSEU across Frequency and Gate Count
Component     Frequency                # of Gates in Path
EndPoint      Directly Proportional    N/A
StartPoint    Inversely Proportional   Inversely Proportional
CL            Directly Proportional    Directly Proportional
Current Use of NEPP FPGA Model
Currently, the model is used to better understand 
heavy-ion SEU data. It is great for measuring mitigation scheme strength.
[Figure: model equation with terms labeled EndPoint, StartPoints (with logic masking), and CL.]
Data Path Fail-Safe Strategies
Selecting a Fail-Safe Scheme (1)
• Fail-Safe scheme selection will depend on:
– Requirements,
– FPGA device type, and
– design architecture.
• Everything depends on requirements. Do not over-
design or under-design.
– How long is the system required to operate?
– How much error can be tolerated?
– What types of errors can and cannot be tolerated?
• FPGA device type considerations:
– Does the device have embedded mitigation?
– Is the device’s configuration soft?
Selecting a Fail-Safe Scheme (2)
• Design Architecture considerations:
• Flushable systems:
– Systems that can tolerate next-state≠expected-state.
– However, if next-state≠expected-state, next-state is always 
deterministic.
– Examples of tolerable system flushing:
• Relatively frequent soft or hard system resets.
• Safe-state machines?????????
• Roll-back systems.
• Non-flushable systems:
– next-state must always equal expected-state during a given 
window of operation.
– “Must always equal” is an exaggeration: it cannot be 100% 
(no such thing); however, 99.9999% might be feasible.
– Example: while an FPGA is controlling a manned mission 
during take-off – the operation is not interruptible.
Dual Redundant Systems 
(Detection Systems)
Dual Redundancy Example
• Dual redundant systems cannot correct; they can only detect.
• Form of correction: Roll-back + dual redundancy:
– Roll-back is not a sufficient solution for systems with highly susceptible
hardware. Not bad for systems where memory is the most susceptible.
– Roll-back may not satisfy requirements for critical applications (next-state ≠
expected state).
• Alert systems must be highly reliable and verifiable.
[Figure: two copies of a complex system feed a Compare block that drives Alert 
and Recover; the copies must be synchronized. Synchronization is not 
always easy or predictable.]
Mitigation – Fail Safe Strategies That 
Do Not Require Fault Detection but 
Provide SEU Masking and/or 
Correction:
Triple Modular Redundancy (TMR)
All terminology presented is now used 
by standard commercial tools: 
Mentor Graphics and Synopsys
Triplicate and Vote
TMR Implementation
• As previously illustrated, TMR can be implemented in a 
variety of ways.
• The definition of TMR depends on what portion of the 
circuit is triplicated and where the voters are placed.
• The strongest TMR implementation will triplicate all 
data-paths and contain separate voters for each data-
path.
– However, this can be costly: area, power, and 
complexity.
– Hence a trade is performed to determine the TMR 
scheme that requires the least amount of effort and 
circuitry that will meet project requirements.
• Presentation scope: Block TMR (BTMR), Localized TMR 
(LTMR), Distributed TMR (DTMR), Global TMR (GTMR).
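All of these TMR schemes share the same voting primitive. A minimal sketch of a bitwise majority voter follows; the function name is hypothetical, and the software form is only an illustration of the logic function a hardware voter computes.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each output bit follows at least two of the inputs."""
    return (a & b) | (a & c) | (b & c)


# One upset copy is masked by the other two.
good = 0b1010
upset = good ^ 0b0100
print(bin(majority_vote(good, good, upset)))
```

What differs between BTMR, LTMR, DTMR, and GTMR is what is triplicated and where voters like this one are placed.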
Block Triple Modular Redundancy: BTMR
• Need Feedback or flushing to Correct
• Cannot apply internal correction from voted outputs
• If blocks are not regularly flushed (e.g. reset), Errors 
can accumulate – may not be an effective technique
[Figure: three copies (Copy 1, Copy 2, Copy 3) of a complex function with DFFs 
feed a voting matrix. The voter can only mask errors. 3x the error rate with 
triplication and no correction/flushing.]
BTMR Clarification: Very Important 
Point!!!!
• Adding more blocks with voting on the outputs:
– Only masks upsets.
– Does not correct upsets.
• Will not prolong normal operation!!!!!!!!
However, will mask upsets.
• Device lifetime is not extended: if a malfunction is 
expected within one day, then all devices are expected 
to malfunction within one day, because there is NO 
correction.
• However, if operation doesn’t need to be 
prolonged (i.e., flushable circuits), then BTMR can 
be great.
When BTMR is Beneficial: Examples of 
Flushable BTMR Designs
• Shift registers.
• Transmission channels: it is typical for transmission 
channels to send and reset after every sent packet.
• Lock-step microprocessors that have relaxed 
requirements such that the microprocessors can be 
reset (or power-cycled) every so often.
Transmission channel example:
[Figure: three TRANSMIT blocks feed a voter; a RESET follows each sent packet.]
If The System Is Not Flushable, Then 
BTMR May Not Provide The Expected 
Level of Mitigation
• BTMR can work well as a mitigation
scheme if the expected MTTF >> expected
window of correct operation.
• But… If the expected time to failure for one
block is less than the required full-
liveliness availability window, then BTMR
doesn’t buy you anything.
• If not thought out well, BTMR can actually
be a detriment – complexity, power, and
area, and false sense of performance.
Explanation of BTMR Strength and Weakness
using Classical Reliability Models
Reliability for 1 block: $R_{block} = e^{-\lambda t}$
Reliability for BTMR: $R_{BTMR} = 3e^{-2\lambda t} - 2e^{-3\lambda t}$
Mean time to failure for 1 block: $MTTF_{block} = 1/\lambda$
Mean time to failure for BTMR: $MTTF_{BTMR} = 5/(6\lambda) \approx 0.833/\lambda$
Overall: $MTTF_{BTMR} < MTTF_{block}$
[Figure: reliability versus time for System 1 and System 2, with SEU data.]
Operating in the marked time interval will provide a slight increase in 
reliability. However, it will provide a relatively hard design.
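The reliability expressions above can be checked numerically. This is a sketch under the same classical assumptions (constant upset rate λ per block, no correction or flushing); the function names and the trapezoidal integration used for the MTTF are illustrative choices, not part of the original material.

```python
import math


def r_block(t, lam):
    """Reliability of one block with constant failure rate lam."""
    return math.exp(-lam * t)


def r_btmr(t, lam):
    """Reliability of three voted blocks: at least two of the three still correct."""
    return 3.0 * math.exp(-2.0 * lam * t) - 2.0 * math.exp(-3.0 * lam * t)


def mttf(reliability, lam, steps=100_000):
    """Mean time to failure as the integral of R(t), by the trapezoidal rule."""
    horizon = 40.0 / lam                       # R(t) is negligible beyond this point
    dt = horizon / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(horizon, lam))
    for i in range(1, steps):
        total += reliability(i * dt, lam)
    return total * dt


lam = 1.0
t_cross = math.log(2.0) / lam                  # where R_BTMR equals R_block
print(mttf(r_block, lam), mttf(r_btmr, lam))   # approximately 1/lam and 5/(6*lam)
print(r_btmr(0.1, lam) > r_block(0.1, lam))    # early window: BTMR is better
print(r_btmr(2.0, lam) < r_block(2.0, lam))    # late window: BTMR is worse
```

The two curves cross at t = ln(2)/λ: BTMR helps only when the required window of operation ends well before that point.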
High Reliability: Full Operation 
Required for Small Window of Time
• If a small time window of operation is required 
and reliability is expected to be high, then adding 
redundancy (triple, quadruple, n-tuple, etc.) can 
be used.
[Figure: Reliability: Zoomed In]
• Early in the window, as you add correctly 
implemented redundancy, the reliability stays 
high longer.
• However!!!!!!!! This is not true as time goes on.
This is only true during time near start time (from
system flush).
High Reliability: Full Operation 
Required for Large Window of Time
• If a large time window of operation is required 
and reliability is expected to be high, then adding 
redundancy (triple, quadruple, n-tuple, etc.) 
should not be used.
• As you add more redundancy, over time, the 
reliability will drop very quickly.
• Using more redundancy (adding more blocks) will
be less reliable over time.
What Should be Done If Availability 
Needs to be Increased?
• If the blocks within the BTMR have a relatively high upset
rate with respect to the availability window, then stronger
mitigation must be implemented.
• Bring the voting/correcting inside of the modules… bring
the voting to the module DFFs.
The following slides illustrate the various forms of TMR that 
include voter insertion in the data-path.
TMR Nomenclature   Description                                              TMR Acronym
Local TMR          DFFs are triplicated                                     LTMR
Distributed TMR    DFFs and CL data-paths are triplicated                   DTMR
Global TMR         DFFs, CL data-paths, and global routes are triplicated   GTMR or XTMR
DFF: Edge-triggered flip-flop; CL: Combinatorial Logic
Describing Mitigation Effectiveness Using 
a Model
$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$
$P(f_s)_{functionalLogic}$ is composed of $P(f_s)_{DFFSEU} + P(f_s)_{SET \to SEU}$:
• $P(f_s)_{DFFSEU}$: probability that an SEU in a DFF will manifest as an error 
in the next system clock cycle.
• $P(f_s)_{SET \to SEU}$: probability that an SET in a CL gate will manifest as an 
error in the next system clock cycle.
DFF: Edge-triggered flip-flop; CL: Combinatorial Logic
Local Triple Modular Redundancy (LTMR)
$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$
$P(f_s)_{functionalLogic}$: $P(f_s)_{DFFSEU \to SEU} + P(f_s)_{SET \to SEU}$
With LTMR: P(fs)DFFSEU→SEU is reduced to 0.
[Figure: LTMR. Each DFF is triplicated and followed by a voter; the Comb Logic is not triplicated.]
• Only DFFs are triplicated. Data-paths are kept singular.
• LTMR masks upsets from DFFs and corrects DFF upsets if feedback is 
used.
• Good for devices where DFFs are most 
susceptible and configuration and CL 
susceptibility is insignificant; e.g.,
Microsemi ProASIC3.
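The difference between masking and correcting with feedback can be sketched in software. The class below is a hypothetical behavioral model, not a description of any vendor's LTMR cell: three register copies are voted, and on each clock the voted value is fed back into all three copies.

```python
def vote(a, b, c):
    """Bitwise 2-of-3 majority."""
    return (a & b) | (a & c) | (b & c)


class LtmrRegister:
    """Behavioral model of a triplicated register with a voter and feedback."""

    def __init__(self, value=0):
        self.copies = [value, value, value]

    def output(self):
        return vote(*self.copies)

    def upset(self, copy, mask):
        """Flip the masked bits in one copy (an SEU in one DFF)."""
        self.copies[copy] ^= mask

    def clock_hold(self):
        """Clock edge while holding: every copy reloads the voted value (feedback)."""
        voted = self.output()
        self.copies = [voted, voted, voted]


# With feedback: the first upset is corrected before the second one arrives.
with_feedback = LtmrRegister(0b1011)
with_feedback.upset(0, 0b0001)
with_feedback.clock_hold()
with_feedback.upset(1, 0b0001)
print(bin(with_feedback.output()))   # still the original value

# Without feedback (masking only): two upsets on the same bit accumulate and win the vote.
mask_only = LtmrRegister(0b1011)
mask_only.upset(0, 0b0001)
mask_only.upset(1, 0b0001)
print(bin(mask_only.output()))       # wrong value
```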
Adding LTMR to a Microsemi ProASIC3
Device
• Microsemi ProASIC3
– DFFs are the most 
susceptible (to 
heavy-ion SEUs) 
data-path 
components.
• Adding LTMR 
decreases design 
sensitivity to SEUs.
LET: Linear Energy Transfer; 
WSR: windowed shift register
Adding LTMR to a Microsemi ProASIC3
Device versus RTAXs Embedded LTMR
• At lower LETs, user-inserted LTMR in the 
ProASIC3 has an SEU response similar to 
that of the Microsemi RTAXs series.
• At higher LETs, clock tree 
upsets start to 
dominate and LTMR in 
the ProASIC3 is not as 
effective.
• For most critical 
applications, these 
cross-sections will 
produce acceptable 
upset rates.
LET: Linear Energy Transfer; 
WSR: windowed shift register
[Figure inset: embedded LTMR in a DFF cell, RTAXs series.]
LTMR Should Not Be Used in an 
SRAM-Based FPGA
[Figure: look-up tables (LUTs), each with inputs I1 to I4, connected through the routing matrix.]
Distributed Triple Modular Redundancy (DTMR)
$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$
$P(f_s)_{functionalLogic}$: $P(f_s)_{DFFSEU \to SEU} + P(f_s)_{SET \to SEU}$
With DTMR: Pconfiguration is low, PSEFI is minimally lowered, 
P(fs)DFFSEU→SEU is 0, and P(fs)SET→SEU is low.
[Figure: DTMR. Each Comb Logic and DFF path is triplicated, with voters after the DFFs.]
• Triple all data-paths and add voters after DFFs.
• DTMR masks upsets from configuration + DFFs + CL and corrects 
captured upsets if feedback is used.
• Good for devices where configuration or DFFs + CL are more
susceptible than project requirements allow; e.g., Xilinx and Altera
commercial FPGAs.
Global Triple Modular Redundancy (GTMR)
$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$
$P(f_s)_{functionalLogic}$: $P(f_s)_{DFFSEU \to SEU} + P(f_s)_{SET \to SEU}$
With GTMR: Pconfiguration is low, P(fs)functionalLogic is low 
(P(fs)DFFSEU→SEU low, P(fs)SET→SEU low), and PSEFI is lowered.
[Figure: GTMR. Clocks and each Comb Logic and DFF path are triplicated, with voters after the DFFs.]
• Triple all clocks and data-paths, and add voters after DFFs.
• GTMR has the same level of protection as DTMR; however, it also 
protects clock domains.
• Good for devices where configuration or DFFs + CL are more
susceptible than project requirements allow; e.g., Xilinx and Altera
commercial FPGAs.
Theoretically, GTMR Is The Strongest 
Mitigation Strategy… BUT…
• Triplicating a design and its global routes takes up a 
lot of power and area.
• Generally performed after synthesis by a tool; not 
part of RTL.
• Skew between clock domains must be minimized such 
that it is less than the feedback delay of a voter to its 
associated DFF:
– Does the FPGA contain enough low skew clock 
trees? (each clock + its synchronized reset)x3.
– Limit skew of clocks coming into the FPGA.
– Limit skew of clocks from their input pin to their 
clock tree.
• Difficult to verify.
Logic Partitioning
• With BTMR, DTMR, and GTMR, the logic needs to be 
partitioned such that the TMR domains do not 
overlap.
• This can make design implementation in 
a selected device impossible: area, 
timing, and place and route.
NASA Electronic Parts and Packaging 
Testing of the Kintex-7 FPGA: Counter 
Arrays
SEU Cross-Section Data: (NEPP)
[Figure: Kintex-7 Counter Arrays. SEU cross-section (Design/cm2, log scale) 
versus LET (MeVcm2/mg) at 5.7, 11.4, 20.6, and 41.2. 
Series: Manual DTMR, Syn DTMR, BTMR, No TMR.]
Note: Syn DTMR is from an old version of the Synopsys tool. They have
fixed the tool since.
Currently, What Are The Biggest Challenges
Regarding Mitigation Insertion?
• Tool availability… Synopsys is now available.
• Users are not selecting the correct mitigation scheme for their 
target FPGA.
• Logic partitioning is not being performed when needed.
[Table: mitigation scheme (LTMR, DTMR, GTMR) rated per FPGA type. The 
ratings were color-coded and are not recoverable; several cells are marked ?????.]
FPGA types:
– Antifuse+LTMR: Microsemi RTAX or RTSX family
– Commercial SRAM: Xilinx and Altera devices
– Commercial Flash: Microsemi ProASIC family
– Hardened SRAM: Xilinx V5QV
Legend:
– General Recommendation
– Not Recommended but may be a solution for some situations
– Will not be a good solution
User versus Embedded Mitigation
• A subset of user inserted mitigation strategies
have been presented.
• None of the strategies are 100% fail-safe.
• Depending on the project requirements, and the
target device’s SEU susceptibility, the most 
efficient mitigation strategy should be selected.
• In most cases, devices with embedded
mitigation do not require additional (user
inserted) mitigation.
Fail-Safe State Machines
Synchronous FSMs and SEUs
• A synchronous FSM utilizes 
DFFs to hold its current 
state, transitions to a next 
state controlled by a clock 
edge and combinatorial 
logic, and only accepts
inputs that have been
synchronized to the same 
clock
• FSM SEUs can occur from:
– Caught data-path SETs
– DFF SEUs
– Clock/Reset SETs
[Figure: synchronous FSM block diagram with Inputs, Clock, Current State, Next State, and Outputs.]
• A synchronous FSM is designed to deterministically 
transition through a pattern of defined states.
5-State FSM Binary Encoding Example
Example of an FSM used to control a 
peripheral device
5-State FSM with each state 
encoded as binary numbers.
An SEU can change current state and cause a
catastrophic event
[Figure: two state diagrams of the 5-state FSM (State 0 through State 4).]
How Do We Implement Fail-Safe 
FSMs?
• Question: A designer states that all FSMs
have been implemented as “safe”, what do
you expect?
• Correction? Detection? Masking?
– What does correction mean?
– All mitigation shall be defined unambiguously 
by the requirements and by the designer.
Safe State Machines
• As currently defined by design tools and by some 
designers, the term “safe” state machine is a misnomer.
• Auto-transitioning (“safe state machine”) is a reaction to 
a small subset of incorrect transitions (those into unmapped states). 
It does not correct or mask (protect) against incorrect 
transitioning.
What happens if an SEU causes a transition from “001” to “101”?
State   Mapped?
000     Yes
001     Yes
010     Yes
011     Yes
100     Yes
101     No
110     No
111     No
Safe State Machines: What happens if an 
SEU causes a transition from “001” to 
“101” ?
• As currently implemented, a “safe” state machine will 
automatically transition to a reset (or “safe” state).
• Problem: this could be detrimental to your system
Problems with Current “Safe” FSM 
Definition
• Sounds safer than it really is.
• Does not do anything for
incorrect transitions into
mapped states.
• Does not correct the state:
– Something that is supposed to 
be on will abruptly shut off.
– Other FSMs or control logic 
can become unsynchronized 
with the bad FSM; with or 
without the automated jump to 
a “safe” state.
Can Auto-transitioning Work for Your Mission?
• Auto-transitioning can work if incorrect sequencing of your FSM will not cause system failure; e.g., mathematical logic control.
• Auto-transitioning can be acceptable if it is used in conjunction with a detection flag. The detection flag must propagate to all necessary logic.
• But remember: auto-transitioning provides no protection or detection when the FSM incorrectly transitions to a mapped state.
Auto-transitioning + detection is available with computer aided design (CAD) tools.
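To see how limited the detection flag is, one can enumerate every single-bit upset of the 3-bit, 5-state encoding from the state table and count how many land in an unmapped state. The sketch below assumes the flag is raised only when the register holds an unmapped value; the function names are illustrative.

```python
MAPPED = [0b000, 0b001, 0b010, 0b011, 0b100]  # mapped states from the table
WIDTH = 3                                     # state register width in bits


def single_bit_upsets(state: int) -> list[int]:
    """All states reachable from `state` by flipping exactly one bit."""
    return [state ^ (1 << bit) for bit in range(WIDTH)]


def detection_flag(state: int) -> bool:
    """Assumed detection logic: flag is set only for unmapped states."""
    return state not in MAPPED


flagged = 0  # upsets that raise the flag (and trigger auto-transition)
silent = 0   # upsets into another mapped state: no flag, no correction
for state in MAPPED:
    for upset in single_bit_upsets(state):
        if detection_flag(upset):
            flagged += 1
        else:
            silent += 1

print(f"flagged: {flagged}, silent: {silent}")  # flagged: 5, silent: 10
```

For this particular encoding, only 5 of the 15 possible single-bit upsets are flagged. The remaining 10 move the FSM into a valid but wrong state without any indication.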
Implementing Corrective Logic for FSMs
• FPGAs with hardened configuration:
  – LTMR: Triplicate each DFF and use a majority voter.
    • The triplication + voter is treated as one DFF.
    • Encoding doesn’t change.
    • Resultant FSM has 3 times as many DFFs as the original encoding scheme.
    • Combinatorial logic (not including the voters) does not change.
  – Hamming Code-3: requires a new encoding scheme.
• FPGAs with commercial SRAM configuration: DTMR is suggested.
There are computer aided design (CAD) tools that can assist in adding all of the above mitigation strategies.
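As a rough behavioral illustration of the LTMR idea (triplicated DFFs plus a majority voter treated as one DFF), the following Python sketch models a voted state register. The class and method names are hypothetical; a real design would express this in HDL or have it inserted by a CAD tool.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


class LtmrRegister:
    """Three copies of a state register read through a majority voter."""

    def __init__(self, value: int = 0) -> None:
        self.copies = [value, value, value]

    def load(self, value: int) -> None:
        """Clock edge: all three copies capture the same next state."""
        self.copies = [value, value, value]

    def upset(self, copy: int, bit: int) -> None:
        """Model an SEU: flip one bit in one of the three copies."""
        self.copies[copy] ^= 1 << bit

    def read(self) -> int:
        """Voted output seen by the (unchanged) combinatorial logic."""
        return majority(*self.copies)


reg = LtmrRegister(0b001)
reg.upset(copy=1, bit=2)            # one copy becomes 101
print(format(reg.read(), "03b"))    # 001: the single upset is masked

reg.upset(copy=0, bit=2)            # same bit hit in a second copy
print(format(reg.read(), "03b"))    # 101: two matching upsets defeat the voter
```

The state encoding and next-state logic are untouched; only the storage is triplicated. The second case shows the limit of the scheme: the voter masks a single upset per bit, not two upsets of the same bit accumulated before the register is reloaded.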
A closer look at a base-state (state 0) and its companion-states
FSM Fault Tolerance: 5-State Conversion to a Hamming Code-3 FSM
Hamming Code-3 FSM diagram for a 5 base-state FSM: would need 5*7=35 FSM states to be represented… 6 DFFs
[Figure: FSM diagram with base states State 0 through State 4]
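The 5*7=35 count can be reproduced with a small Python sketch. The slides do not give the actual encoding, so the code below assumes one possible choice: five codewords of a shortened (6,3) Hamming code, which has minimum distance 3. Each base state then owns its codeword plus six single-bit-upset companion states, and any companion maps back to its base state.

```python
from itertools import combinations

WIDTH = 6  # number of DFFs in the state register


def encode(msg: int) -> int:
    """Assumed encoding: 3 message bits followed by 3 parity bits."""
    m1, m2, m3 = (msg >> 2) & 1, (msg >> 1) & 1, msg & 1
    p1, p2, p3 = m2 ^ m3, m1 ^ m3, m1 ^ m2
    return (msg << 3) | (p1 << 2) | (p2 << 1) | p3


def distance(a: int, b: int) -> int:
    """Hamming distance between two state codes."""
    return bin(a ^ b).count("1")


def companions(code: int) -> list[int]:
    """The six codes one bit flip away from a base-state code."""
    return [code ^ (1 << bit) for bit in range(WIDTH)]


def correct(word: int):
    """Return the base-state index within distance 1, or None if unmapped."""
    for index, code in enumerate(BASE):
        if distance(word, code) <= 1:
            return index
    return None


BASE = [encode(m) for m in range(5)]  # codes for State 0 .. State 4
min_dist = min(distance(a, b) for a, b in combinations(BASE, 2))
all_states = {s for code in BASE for s in [code, *companions(code)]}

print([format(code, "06b") for code in BASE])
print("minimum distance:", min_dist)        # minimum distance: 3
print("FSM states:", len(all_states))       # FSM states: 35
print("unused codes:", 2 ** WIDTH - len(all_states))  # unused codes: 29
```

Because the minimum distance is 3, the companion sets of different base states never overlap, which is why exactly 35 of the 64 possible register values are used. Next-state logic that treats every companion like its base state masks any single-bit upset.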
ProASIC3 Heavy-Ion FSM SEU Testing
[Figure: SEU cross-sections per FSM. Scale is log-linear.]
Some Thoughts
Concerns and Challenges of Today and Tomorrow for Mitigation Insertion
• User insertion of mitigation strategies in most FPGA devices has proven to be a challenging task because of reliability, performance, area, and power constraints.
– Difficult to synchronize across triplicated systems.
– Mitigation insertion slows down the system.
– Can’t fit a triplicated version of a design into one device.
– Power and thermal hot-spots are increased.
• The newer devices have a significant increase in gate count and lower power. This helps to accommodate area and power constraints while triplicating a design. However, this increases the challenge of module synchronization.
• Embedded mitigation has helped in the design process. However, it is proving to be an ever-increasing challenge for manufacturers.
– We (users) want embedded systems: cheaper, faster, and less power hungry.
– However, heritage has proven that for critical applications, embedded systems have provided excellent performance and reliability.
Summary
• For critical applications, mitigation may be required.
• Determine the correct mitigation scheme for your mission while incorporating given requirements:
– Understand the susceptibility of the target FPGA and how it responds to other devices.
– Investigate whether the selected mitigation strategy is compatible with the target FPGA.
– Calculate the reliability of the mitigation strategy to determine whether the final system will satisfy requirements.
– Ask the right questions regarding functional expectation, mitigation, requirement satisfaction, and verification of expectations.
• Although it is desirable from a user’s perspective to have embedded mitigation, cost seems to be driving the market towards unmitigated commercial FPGA devices. Hence, it will be necessary for users to familiarize themselves with optimal mitigation insertion and usage.
