Partial Triple Modular Redundancy: Low-Cost
Resilience for FPGAs in Space Environments
Andrew Keller and Michael Wirthlin
NSF Center for Space, High-Performance, and Resilient Computing (SHREC)
Brigham Young University
Provo, Utah 84102–4115
{andrewmkeller, wirthlin}@byu.edu

Abstract—Field programmable gate arrays (FPGAs) offer
large amounts of configurable logic for use in a wide variety of
applications, including space-based missions. Radiation-induced
configuration upsets can cause deployed designs to fail. This
work demonstrates the application of partial triple modular
redundancy (pTMR) on a commercial FPGA-based networking
system to improve overall soft-error rate (SER). Applying pTMR
to this system increased the size of the FPGA circuit by 2.8%.
Accelerated neutron testing was performed to measure the
baseline failure rate as well as the improvement with pTMR.
The measured failure rate of the unmitigated baseline system
was estimated at 135 FIT. The mitigated system had a reduced
failure rate of 22 FIT (approximately a 6x improvement). This
work demonstrates improvement in the SER of a commercial
FPGA-based design using an automated post-synthesis mitigation
technique and a small amount of additional logic resources.

I. INTRODUCTION
Field programmable gate arrays (FPGAs) are digital logic
devices that can be configured for use in a wide variety
of applications. They are used in: consumer products, wired
and wireless communication systems, data-centers, automotive products, medical devices, aerospace applications, and
space-based missions (e.g., satellites, deep-space probes, Mars rovers). To meet the demands of these applications, FPGAs
often include fixed resources for analog-to-digital sampling,
high-speed serial I/O ports, fast arithmetic, and on-chip memories. Design images can be updated remotely thanks to the
reconfigurability of SRAM-based FPGAs, which is particularly appealing to space-based missions.
Like all semiconductor devices, SRAM-based FPGAs are
susceptible to radiation-induced single-event upsets or SEUs.
When an SEU occurs within the configuration memory
(CRAM) of the device, the FPGA design may experience
unexpected behavior [1]. SEUs are soft-errors that do not
permanently damage the device; they corrupt the state of a
memory element, which can be restored via a re-write or a
power-cycle [2]. The likelihood of radiation-induced CRAM
upsets occurring in terrestrial environments is very small for
individual FPGA devices [3], but in harsh radiation environments, such as space environments, SEUs are a significant
area of concern for the reliability of an SRAM-based FPGA
design [4]. In mass deployment scenarios, the sheer number
of operational FPGAs elevates concern over SEUs.
Because FPGAs are used in applications that require high-availability, high-reliability, and functional safety, the effects
of SEUs must be addressed for both terrestrial and space-based applications. Significant contributions have been made
by FPGA vendors to lower intrinsic soft error rates, provide
mechanisms for configuration scrubbing (i.e., SEU repair),
protect on-chip user memories with built-in error correction
coding (ECC), and direct consumers towards more aggressive
SEU mitigation techniques as needed. A common suggestion
for these kinds of applications is to implement triple modular
redundancy or TMR.
TMR is an SEU mitigation technique that addresses the
effects of soft errors in FPGAs by using three redundant
copies of a design to mask a failure in any single copy. This
technique provides high-reliability, especially when combined
with configuration scrubbing as seen in published work under
this fellowship [5]. The benefits of TMR come at a high cost
in terms of increased resource utilization, degradation of maximum operating speed, and increase in power consumption [6].
Many applications needing high-reliability cannot afford the
overhead of full-TMR.
Partial TMR (pTMR) is a technique that addresses SEU
effects on FPGAs by applying TMR to a sub-set of an
FPGA design [7]. Single-event upsets that occur within logic
structures protected by TMR will be masked, reducing the SER
of the system. Unlike full-TMR, pTMR seeks to maximize the
improvement in SER while limiting the use of additional logic
resources by protecting the most important circuit structures
in a design. As part of this research, the techniques presented
in [7] are adapted to benefit a commercial FPGA-based
networking system.
In this study, the soft error rate (SER) performance of a
commercial network switch is improved using pTMR. Customers noticed infrequent network outages that were determined to be caused by radiation-induced soft errors in the
SRAM-based FPGAs of the system. Partial TMR was used to
improve the SER performance of the system. Using a modest
amount of redundancy, a significant improvement in SER
performance was observed through neutron radiation testing. A
fault injection analysis of the pTMR results is also included in
this study. Applying partial TMR with a 2.8% overhead (i.e.,
circuit size increase) lowered the failure in time (FIT) rate
(i.e., failures per billion hours of operation) of the observed
network outages by a factor of 6×, from 135 FIT to 22 FIT.

II. SOFT ERRORS IN FPGAS
FPGAs use a large amount of state to store the configuration
of an active design. This state, called the “configuration
bitstream,” determines the behavior of the circuitry for a
specific design on the FPGA fabric. This data determines
the functionality of lookup tables, routing, and the behavior
of other components within the FPGA [1]. The contents of
configuration memory (CRAM) typically remain unchanged
while a design operates on an FPGA¹.
Soft errors within CRAM are a concern because they can
alter the behavior of an operating design and result in system
failure. The SER for CRAM upsets on a single SRAM-based FPGA in terrestrial environments is relatively low. For
Xilinx 7 Series FPGAs, the CRAM upset rate in a terrestrial
environment is less than 73 FIT per megabit [3]. The number
of CRAM bits in a Xilinx Virtex 7 FPGA (xc7vx300tffg17612) is on the order of one-hundred million [9]. Thus, in a single
device, only one upset will occur every 15 years of operation
on average. The occurrence of an upset does not guarantee
system failure. In many cases, upsets occur in unused portions
of the FPGA, or the effects of the upset can otherwise be
ignored. Only a fraction of random upsets will result in system
failure. The critical sensitivity of a design is the probability that
a random upset results in a system failure.
Soft errors in terrestrial environments become a problem
when many FPGAs are involved (i.e., widespread deployment). Consider the SER of configuration upsets when one-hundred thousand SRAM-based FPGAs are deployed. Assuming FPGAs similar to the one described above, one CRAM
upset would occur every 80 minutes on average. With a critical
sensitivity of one-percent (i.e., one out of every hundred upsets
causes disruptive design behavior), a system failure would
occur once every five or six days.
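
The arithmetic behind these estimates can be sketched in a few lines of Python. The inputs are the figures quoted above (73 FIT per megabit, roughly one-hundred megabits of CRAM, and a one-percent critical sensitivity); the function and variable names are illustrative only and do not come from any tool used in this work.

```python
HOURS_PER_BILLION = 1e9  # one FIT is one event per 10^9 device-hours


def mean_hours_between_events(fit_per_mbit, cram_mbit, devices=1, sensitivity=1.0):
    """Mean time in hours between events across a fleet of identical devices.

    sensitivity=1.0 counts every CRAM upset; a smaller value counts only the
    fraction of upsets that lead to a system failure.
    """
    fleet_fit = fit_per_mbit * cram_mbit * devices * sensitivity
    return HOURS_PER_BILLION / fleet_fit


# Single device: about 15.6 years between upsets.
single_years = mean_hours_between_events(73, 100) / (24 * 365)

# Fleet of 100,000 devices: about 82 minutes between upsets.
fleet_upset_minutes = mean_hours_between_events(73, 100, devices=100_000) * 60

# Same fleet with 1% critical sensitivity: about 5.7 days between failures.
fleet_failure_days = mean_hours_between_events(
    73, 100, devices=100_000, sensitivity=0.01
) / 24

print(f"{single_years:.1f} years, {fleet_upset_minutes:.0f} minutes, "
      f"{fleet_failure_days:.1f} days")
```

Because 73 FIT per megabit is an upper bound and the CRAM size is rounded, these values are order-of-magnitude estimates rather than measurements.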
In contrast, the CRAM FIT rate of a similar FPGA (i.e.,
a Xilinx Kintex 7 xc7k325tfbg900c-1) was measured in [4]
under heavy ion irradiation (i.e., similar to space-environments)
to have a FIT per Mbit rate of 664,000. At this rate, 1.12
configuration upsets would occur per device per day on
average. With a critical sensitivity of one-percent, a system
failure using this device in a space-based mission would occur
once every 89 days on average. SEU mitigation techniques can
significantly reduce the rate of system failures.
Soft errors also have the potential to corrupt data or control sequences without raising the system’s awareness of the
error. This is known as silent data corruption (SDC) and can
result in system failure. These types of errors are particularly
threatening because they can go unnoticed for long periods of
time or result in catastrophic behavior.
1 Partial re-configuration can be used to swap out design circuitry on the
fly, but this technique is not commonly employed [8].

III. FPGA-BASED NETWORKING APPLICATION
The product used for this study is a commercial networking
switch. The type of networking switch used is designed to
unify multiple computer networks together throughout an
entire building or across a wide range of locations. Figure 1
shows a high level layout of the components in the switch
used in this study. The switch uses SRAM-based FPGAs to
process most of the network traffic. Each FPGA connects to a
group of network ports through dedicated ASIC components.
Communication between the FPGAs and ASICs is conducted
using a high-speed chip-to-chip packet transfer protocol. Each
module in a chassis is equipped with two FPGAs and their
associated network port connections. Installed modules are
governed by a system controller and are interconnected by
the rest of the switch fabric. Additional modules can be added
to the switch to expand network capacity.

Fig. 1. Network Switch System Layout (block diagram: controller CPU and additional resources; modules pairing FPGAs with port ASICs; additional modules)

The design within the FPGAs in this system is large and
complex. Table I shows the resource utilization of the FPGA
design. The targeted FPGA is a Xilinx Virtex 7 device. This
design occupies 86% of the available slices. It consumes more
than half of the registers and block memories and more than
one-third of the look up tables (LUTs). Most of the global
clock buffers and I/O pins are also being used. In this study,
the size of the design is important for two reasons. First, it
limits the amount of redundancy that can be added to the
design for SEU mitigation. Second, the larger the design is,
the more challenging it can be to debug and protect against
soft-error induced failure behavior.
Under certain soft-error conditions, this system has a specific failure mode that is noticeable to the end customer. In
some rare circumstances, an SEU can cause a silent data
corruption event in which the system fails without identifying
a failure condition. In this case, some customers have reported
short lived network outages. The duration of these outages is
not long because a switch-over event allows the system to
restore connectivity with low impact to the network. Failure
analysis shows that a silent data corruption event may cause
the interface between the FPGA and port ASIC to experience
continuous protocol errors.

TABLE I
RESOURCE UTILIZATION OF THE ORIGINAL DESIGN
Virtex 7 Device        Used             Total
Slices                 44,256 (86%)     51,000
Registers              115,619 (56%)    204,000
LUTs                   141,317 (34%)    408,000
Block Memories         429 (57%)        750
Global Clock Buffers   30 (93%)         32
I/O                    622 (88%)        700

In most cases, the customer networks were working fine
over the last six months to a year before this protocol failure
mode occurred. The upset that caused the failure mode can
be cleared by being over written with the correct value, but
the effects of the upsets may persist until a power cycle is
performed [7]. When the failure mode occurs, the only way
to remove the error is to have an operator manually power
cycle the setup, which can take ten minutes or more for a
fully loaded system. This is a challenging recovery mechanism
because it requires operator intervention and typically power
cycles are only allowed in a designated maintenance window.
Diagnosing the cause of the protocol failure mode was
challenging due to irreproducibility. When a failed system
returned for diagnosis, the reported failure mode was not reproducible. The system would pass all diagnostics and appear
to be a fully functional system. This type of irreproducibility is
typical of soft-error induced failure modes. It is very difficult
to reproduce SEU related issues such as this without extensive
radiation testing or fault injection.
Neutron radiation testing was conducted to attempt to reproduce this behavior in the product. Specifically, experiments
were performed to explore which of the devices in the system
might be the cause of this behavior. Naturally, the neutron
test focused on the FPGA and ASIC components related to
the continuous protocol error behavior. From this test, it was
discovered that the failure mode reported by the customer
could be reproduced by exposing the FPGA to an accelerated
neutron beam while the beam was isolated from the other
components (i.e., only the FPGA was in the beam path). This
evidence further strengthened the argument that this failure
mode was being caused by SEUs in deployment and is part
of the SER of the FPGA design.
In addition to radiation testing, fault injection analysis was
used to further study the failure mode behavior. Fault injection
augments data collected from radiation testing [10]. A custom
high-speed JTAG configuration manager was used to inject
faults into the CRAM memory of the FPGAs while stress tests
of real customer applications were being run on the networking
system (see Figure 5). A random fault injection campaign was
used for testing. It was found that the specific failure mode
behavior was also reproducible via fault injection.

IV. SEU MITIGATION
A typical approach for handling upsets in configuration
memory is to add configuration memory scrubbing to the system [11]. CRAM scrubbing allows corrupted design circuitry
to be repaired. In some cases, simply repairing the design
circuitry will allow errors in the system to be resolved [7].
Other situations require additional repair mechanisms.
A variety of mechanisms have been shown to adequately
address the effects of SEUs within SRAM-based FPGAs. One
effective SEU mitigation technique is TMR. This technique
replicates the design’s circuitry with three copies and then
adds a majority voter to mask a failure in any single copy.
The majority voter is often triplicated as well to avoid single
point failures. A generalized diagram of this scheme is shown in
Figure 2. When combined with configuration memory scrubbing (i.e., repair of corrupted circuitry), this technique has
demonstrated FIT rate reductions by a factor of 50-100× [12];
without CRAM scrubbing, a reduction by a factor of 15× has
been observed [13], [14]. The overall benefits of full TMR
depend on the design and mitigation scheme implementation,
but this technique has been shown to be an effective approach for
SEU mitigation.
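
The masking behavior of the majority voter can be illustrated with a short behavioral model in Python. This is a sketch only: it models a single bitwise voter rather than the triplicated voters of Figure 2, and the `increment` function is a hypothetical stand-in for the replicated logic.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant words."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x, upset_copy=None, flip_mask=0):
    """Evaluate three copies of `module` and vote on their outputs.

    If upset_copy is given, the output of that copy is corrupted by
    flip_mask to emulate an upset confined to a single copy.
    """
    outputs = [module(x) for _ in range(3)]
    if upset_copy is not None:
        outputs[upset_copy] ^= flip_mask
    return majority(*outputs)


def increment(x: int) -> int:
    """Hypothetical 8-bit logic standing in for the protected circuit."""
    return (x + 1) & 0xFF


print(tmr(increment, 41))                                # fault-free result
print(tmr(increment, 41, upset_copy=1, flip_mask=0xFF))  # one copy corrupted
```

Both calls return the same value because two copies still agree. If two copies were corrupted in the same bit position the vote would fail, which is why scrubbing is used to repair upsets before they accumulate.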
The benefits of TMR come at the cost of increased circuit
area, power consumption, and critical path delay. This technique increases area (i.e., resource utilization) by more than
three times and may add significant delays to the critical path
depending on the frequency of voter insertion throughout the
design [6]. While expensive, this technique is commonly used
in space-environments and high-energy physics experiments,
which require high levels of reliability.

Fig. 2. Triple Modular Redundancy with Majority Voter Triplicated (Modules A, B, and C, each followed by Voters A, B, and C)

In many applications, applying full TMR to a design’s
circuitry is not feasible. Full TMR cannot be applied to a
design if there is not enough area, critical path slack, or
power budget available to do so. In this network switch, full
TMR cannot be applied because the FPGA resource utilization
is already well beyond a third of the device. Further, the
tight timing constraints of this high-speed design prevent
the application of full TMR. TMR, however, can be applied
selectively to portions of the design that will benefit the most
from TMR.
Applying TMR in a partial manner allows the benefits of
TMR to be targeted towards specific failure modes while

lowering the overhead of implementation cost compared to
full TMR. Partial TMR (pTMR) ranks design components
by their critical sensitivity contribution and applies TMR to
as many of the highest ranked components as it can while
respecting the constraints presented by area, timing, and power
limitations. The desired outcome is to maximize reliability
benefits of TMR given limited resources. This allows terrestrial
and space-based applications to take advantage of the benefits
of TMR at virtually no additional cost (i.e., using left over
resources on the FPGA after the design has been completed).
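
The rank-and-select idea can be sketched as a greedy selection under a resource budget. The following Python is an illustration under stated assumptions, not the algorithm of any particular tool: the component names, sensitivity scores, and sizes are invented, only an area budget is modeled, and voter cost is ignored.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    sensitivity: float  # assumed estimate of critical sensitivity contribution
    size: int           # resources used by one copy


def select_for_ptmr(components, spare_resources):
    """Greedily pick the highest-ranked components that fit the budget.

    Triplicating a component adds two extra copies, so its cost is
    2 * size in this simplified model.
    """
    chosen, remaining = [], spare_resources
    for comp in sorted(components, key=lambda c: c.sensitivity, reverse=True):
        cost = 2 * comp.size
        if cost <= remaining:
            chosen.append(comp.name)
            remaining -= cost
    return chosen, remaining


# Hypothetical design with 200 spare resource units.
design = [
    Component("state_machine", 0.40, 50),
    Component("counter", 0.25, 30),
    Component("datapath", 0.20, 400),
    Component("debug_logic", 0.01, 20),
]
chosen, left = select_for_ptmr(design, spare_resources=200)
print(chosen, left)
```

In this toy case the large datapath is skipped because triplicating it would exceed the budget, while the small control structures are protected. Timing and power constraints would be handled by additional checks in a real flow.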
V. PARTIAL TMR
The process of applying pTMR to an FPGA design within
the network switch application followed the flowchart in
Figure 3. This process involved identifying the sub-module
that contributed the most to the failure mode, applying pTMR
to this module, and verifying that functional behavior was
preserved. This flow allows the techniques discussed in [7]
to be used in commercial designs. Scope selection lets the
user guide the application of pTMR. An iterative application
process tunes redundancy against constraints. Functional testing provides confidence that the pTMR version of the design
matches expected behavior.
A. Scope Selection
Based on the continuous failure mode behavior and prior
knowledge of the target design, design engineers identified
the “packet reader” module as being the most likely module
to cause the given failure mode if corrupted by an SEU. This
module processes the header information of each packet and
determines where the payload needs to be directed. It makes up
roughly five percent of the overall design resource utilization,
but it contributes significantly to control logic. A challenge
with the “packet reader” module is that if the position of a
packet header within a network channel stream is lost, it can
be very challenging to recover the stream. This module was
selected for pTMR application due to its importance in control
logic and because this module has the greatest likelihood of
causing the observed failure mode from a design perspective
if corrupted.
B. Automated Insertion
Partial TMR was applied to the packet reader module. BYU
EDIF tools were used to apply pTMR [15] following the flow
shown in Figure 4. A netlist of the packet reader module was
provided to the tools for analysis and pTMR insertion. The
tool ranks components by importance based on connectivity
with other components and selects which components to apply
pTMR to according to parameters set by the user. After
triplicating the selected components and inserting appropriate
voters, the tool produces a netlist of the module with a certain
level of pTMR applied to it. The produced netlist can then be
imported into the vendor's tools for implementation.
The application of pTMR is tuned through an iterative
implementation process where the produced netlist is run
through the vendor tools to make sure the resulting netlist

meets the provided constraints and design rules. If area, timing,
or power constraints are not met, or if the design fails a
design rule (e.g., by replicating a component that cannot be
replicated), that information is fed back to the user
so that the parameters on pTMR application can be adjusted.
A new netlist is generated with the updated parameters. This
process is continued until the highest level of replication can
be provided while meeting all constraints and design rules.
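
As a rough illustration, the tuning loop can be written as a search over candidate replication levels. In this work the parameters were adjusted by the user; the Python below automates that step only for the sake of the sketch. `insert_ptmr` and `implement` are hypothetical callables standing in for the pTMR tool and the vendor implementation flow.

```python
def tune_ptmr(insert_ptmr, implement, levels):
    """Return the highest replication level whose netlist passes implementation.

    insert_ptmr(level) -> netlist and implement(netlist) -> bool are
    hypothetical stand-ins; neither is a real tool API.
    """
    for level in sorted(levels, reverse=True):
        netlist = insert_ptmr(level)
        if implement(netlist):  # constraints and design rules all met
            return level, netlist
    return None, None


# Toy stand-ins: the "netlist" is a dict, and implementation passes only
# when no more than 28% of the module is replicated.
def fake_insert(level):
    return {"replicated_fraction": level}


def fake_implement(netlist):
    return netlist["replicated_fraction"] <= 0.28


level, netlist = tune_ptmr(fake_insert, fake_implement, [0.1, 0.2, 0.28, 0.4, 0.6])
print(level)
```

Each call to `implement` corresponds to a full run of the vendor tools, so in practice the number of iterations is kept small.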
Parameters for applying pTMR on the packet reader module
were adjusted until the resulting netlist met all constraints
and design rules. This included excluding proprietary IP from
replication and excluding combinational logic along critical
paths of the design so that timing could be met. This module required all combinational paths to have a propagation
delay less than the period of a 212 MHz clock. In the end,
approximately 28% of the components in the packet reader
module were replicated (e.g., flip-flops, LUTs, multiplexers)
as shown in Table II. This module makes up approximately
five percent of the overall design. Because triplication adds two
extra copies of each selected component, replicating 28% of
this module results in a 2.8% increase in overall circuit size.
While this is a relatively small amount of added redundancy,
this portion of the design is most likely the greatest contributor
to the design’s critical sensitivity if corrupted. It will in turn
yield the greatest benefit from the application of TMR. The
portions of the packet reader module that had TMR applied
to them correspond to state machines, counters, and additional
feedback logic within the module [7].

TABLE II
OVERHEAD OF PARTIAL TMR INSERTION
                       Components   Percentage
Overall Design         266,139      –
Packet Reader Module   13,647       5.13% of Overall Design
pTMR Selection         3,750        27.5% of the PR Module
Overhead               7,500        2.82% increase in overall size
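
The percentages in Table II follow directly from the component counts. A short Python check, using only the counts from the table (variable names are illustrative), makes the factor of two explicit: each selected component gains two additional copies.

```python
overall = 266_139        # components in the overall design
packet_reader = 13_647   # components in the packet reader module
selected = 3_750         # components selected for pTMR

module_share = packet_reader / overall      # share of the design in the module
selection_share = selected / packet_reader  # share of the module replicated
added = 2 * selected                        # two extra copies per component
overhead = added / overall                  # growth in overall design size

print(f"{module_share:.2%} {selection_share:.1%} {added} {overhead:.2%}")
```

Voter logic is not counted separately here, matching how the overhead row of the table is reported.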

C. Functional Behavior Verification
With the design meeting all constraints and design rules, the
next step in applying pTMR is verifying that the insertion of
TMR did not compromise the functionality of the design. The
application of pTMR should be transparent to the functionality
of the design; it should not alter the functionality of the
design in any way. This is verified for each pTMR application
formally through a logic equivalence check (LEC) using tools
such as Cadence Conformal, and empirically by running exhaustive regression tests on the design while it is active in the
working system. These verification steps provide a high level
of confidence that functional behavior was preserved
alongside the application of pTMR. This is especially important
in industry because it shows that verification efforts made
throughout the development of the design are not discarded
by the introduction of a third party tool.

Fig. 3. Partial TMR Application Flow (flowchart: Start; Scope Selection; Parameter Adjustment; pTMR Insertion; Implementation; Passed? with No and Yes branches; Formal Verification; Unit and Regression Testing; End)

Fig. 4. Partial TMR Insertion Flow (HDL; Logic Synthesis; pTMR Tool; Map, Place, and Route; FPGA bitfile)

Cadence Conformal was used to provide the formal verification of pTMR in this design. Like other LECs, Cadence
Conformal compares the circuitry of the targeted module after
applying pTMR to it against the original circuit and looks for
differences in functional behavior. Vendor supplied libraries
are used to define the behavior of components in the netlist
of the module. Conformal maps key points between the
golden and revised design and then mathematically determines
whether or not the combinational logic between key points
is equivalent. Key points include: top level ports, flip-flops,
latches, and black boxes. Instance equivalence constraints
were added to denote which key point instances are TMR
copies of each other. Non-equivalent points are diagnosable
and equivalent designs pass comparison. The packet reader
module with added pTMR passed formal verification.
As a final form of verification, exhaustive unit and regression tests were run on the design while active in a working
system. Prior to applying pTMR to the packet reader module,
design engineers developed an extensive test suite to determine
that the product meets specification and conforms to expected
functionality. These tests target specific functionality of the
design and several different use case applications to ensure
proper operation once deployed. The target design passed
the entire suite of tests while loaded in a working system.
This again boosts confidence that the design with pTMR will
function as expected in deployment. To ensure that the design
will behave correctly in deployment, all of these tests were
run without the presence of fault injection or radiation testing.
After verifying the post-pTMR design, the next step is to
evaluate the improvement in SER obtained by applying this
technique. The two techniques used to measure this improvement are fault injection and radiation testing. The primary goal
in both these testing methodologies is to observe the behavior
of the design while it operates in the presence of configuration
memory upsets. Fault injection emulates CRAM upsets by
purposefully writing bad values to memory via partial reconfiguration while the design is active [10]. Radiation testing
exposes the design to an accelerated radiation source to introduce upsets within configuration memory while the design is
active [16]. From the collected statistics, the SEU sensitivity of
the design towards a specific failure mode can be determined.
By comparing the SEU sensitivity of the design with and
without pTMR applied to it, the reliability improvement factor
can be determined. Both fault injection and neutron radiation
testing were used to evaluate the reliability benefits of the
applied pTMR on the network switch design.

VI. FAULT INJECTION TESTING
A random fault injection campaign was conducted on this
system. The setup used for fault injection testing of the design
is shown in Figure 5. By randomly sampling faults, the
sensitivity of the entire design can be estimated [17]. Injected
faults were tested for failure sensitivity using the test flow
described in [10]. First, the system is brought into a known
good working state either through a power cycle or a soft reset.
Second, a fault is introduced to the system through partial
reconfiguration. Third, the system is monitored for failure behavior such as having a module become unresponsive, having
network traffic slow, or having network traffic stop. Monitoring
continues for a certain period of time (e.g., 1 ms) to allow any
introduced errors to propagate through the system. Finally, the
injected fault is removed to simulate configuration scrubbing.
Testing continues by returning to step one. It is assumed that
if the system is functioning correctly after injecting a fault,
a power cycle or soft reset of the system is not necessary.
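
The test flow above can be summarized as a simple loop. The Python sketch below runs against a toy system model; `ToySystem` and its methods are hypothetical and do not represent the JTAG configuration manager or the switch used in this study.

```python
import random


def run_campaign(num_faults, cram_bits, system, seed=0):
    """Random fault injection loop; returns the number of observed failures."""
    rng = random.Random(seed)
    failures = 0
    system.reset()                         # step 1: known good state
    for _ in range(num_faults):
        bit = rng.randrange(cram_bits)     # randomly sampled CRAM bit
        system.flip_bit(bit)               # step 2: inject the fault
        failed = system.observe_failure()  # step 3: monitor for failure
        system.flip_bit(bit)               # step 4: remove fault (scrubbing)
        if failed:
            failures += 1
            system.reset()                 # reset only after a failure
    return failures


class ToySystem:
    """Toy model in which upsetting any 'critical' bit causes a failure."""

    def __init__(self, critical_bits):
        self.critical = set(critical_bits)
        self.flipped = set()
        self.resets = 0

    def reset(self):
        self.flipped.clear()
        self.resets += 1

    def flip_bit(self, bit):
        self.flipped ^= {bit}  # toggle the bit in or out of the upset set

    def observe_failure(self):
        return bool(self.flipped & self.critical)


toy = ToySystem(critical_bits=range(10))  # 1% of a 1,000-bit toy CRAM
failures = run_campaign(num_faults=5_000, cram_bits=1_000, system=toy)
print(f"estimated sensitivity: {failures / 5_000:.2%}")
```

The toy model has no persistent state, so it cannot reproduce failures that remain after the fault is removed; in the real system those are the cases that distinguish the baseline results from the results with scrubbing.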

Fig. 5. Fault Injection Setup (JTAG configuration manager with JTAG connection; traffic generator connection; loopback connection)

Table III shows collected fault injection data. Sensitivity
of the design is determined by dividing the total number of
failures by the number of injected faults. 95% confidence
intervals were calculated using the standard deviation of the
binomial distribution [10]. FIT rate was approximated by
multiplying the total number of CRAM bits in the device [9]
by sampled sensitivity and scaling the product by the CRAM
FIT per megabit rate [3].

TABLE III
FAULT INJECTION TESTING RESULTS
Version                Baseline        Non-TMR w/ Scrubbing   pTMR w/ Scrubbing
Injected Faults        49,376          49,376                 35,264
Total Failures         590             397                    75
Sensitivity            1.19%           0.80%                  0.21%
(95% Conf. Interval)   (1.1%, 1.3%)    (.73%, .88%)           (.16%, .26%)
Improvement            1.0×            1.48×                  5.62×
(95% Conf. Interval)   –               (1.2×, 1.8×)           (4.2×, 7.8×)
Approx. FIT            67              45                     12
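
The sensitivity and interval columns of Table III can be reproduced from the injected fault and failure counts. The Python sketch below assumes the normal approximation to the binomial distribution with z = 1.96, which is consistent with the reported intervals; the function name is illustrative. The FIT column is not recomputed because it depends on the exact CRAM bit count from [9].

```python
import math


def sensitivity_with_ci(failures, injected, z=1.96):
    """Point estimate and normal-approximation 95% interval for sensitivity."""
    p = failures / injected
    half_width = z * math.sqrt(p * (1 - p) / injected)
    return p, p - half_width, p + half_width


results = {
    "Baseline": (590, 49_376),
    "Non-TMR w/ Scrubbing": (397, 49_376),
    "pTMR w/ Scrubbing": (75, 35_264),
}

baseline_p, _, _ = sensitivity_with_ci(*results["Baseline"])
for name, (failures, injected) in results.items():
    p, lo, hi = sensitivity_with_ci(failures, injected)
    print(f"{name}: {p:.2%} ({lo:.2%}, {hi:.2%}), "
          f"improvement {baseline_p / p:.2f}x")
```

For the pTMR version this gives a sensitivity of about 0.21% and an improvement of about 5.62× over the baseline.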

Fault injection demonstrates a 5.6× improvement in reliability over the unmitigated design with pTMR increasing the size of
the design by 2.8%. Note that the number of injected faults between the
baseline and non-TMR with scrubbing mode is the same. The
non-TMR with scrubbing test mode uses the same data as the
baseline but filters out the failures that were resolved when the
injected fault was removed via configuration scrubbing. This
data suggests that the protection offered by the partial TMR
scheme significantly reduces the FIT rate for this severe failure
mode using only a modest amount of additional resources.
This information also provides feedback which can be used to
improve the pTMR selection process. Fault injection is used
to augment radiation testing results. Similar results are to be
expected in neutron radiation testing.
VII. NEUTRON RADIATION TESTING
The commonly accepted standard for testing terrestrial
cosmic ray-induced soft errors in semiconductor devices is
through the use of accelerated neutron and proton radiation
sources [10], [2]. A spallation neutron source with a wide energy spectrum similar to that found in terrestrial environments
is located at the TRIUMF Neutron Facility. The high neutron
flux available at this facility allows devices to be tested at
accelerated rates [2]. Radiation testing is costly, but provides
important insight into the expected reliability of a design.
Neutron radiation testing was conducted on the network
switch design in December of 2017 at the TRIUMF Neutron
Facility on the BL1B beam path [18]. FPGAs within the
product were aligned perpendicular to the one and one-half
inch collimated beam. Two modules were placed in the beam
to accelerate data acquisition. The distance of each FPGA from
the beam source was recorded and the flux of the beam was
degraded in analysis to compensate for the distance [2]. Both
the baseline version of the design (i.e., without any pTMR
applied to it) and the pTMR version of the design were tested
for the protocol failure mode SEU sensitivity.
Dynamic testing of the FPGA design within the network switch
was conducted. This consists of bringing both modules online
with the same FPGA design, activating heavy traffic stress
tests with stimulus provided by an external traffic generator,
opening the beam stop with everything working, observing the
product’s behavior for the protocol failure events, recording
the fluence-to-failure, and restarting the experiment to collect
additional samples. A failed module is only restarted after the
beam stop is closed (i.e., other experiments continue until

the beam run ends). Data collected from this approach is
used to determine the FIT rate of the product for a particular
failure mode. In turn, the improvement from applying pTMR
is determined by comparing the FIT rate of the design with
and without added pTMR.
Table IV shows the collected data from the radiation test.
Total fluence is the number of neutrons per cm² that passed
through the target FPGA design while active throughout the
duration of the test. Total failures is the number of observed
failures for a specific failure mode, which in this case is
the protocol error that causes network outages. FIT rate is
calculated from the neutron cross section of the targeted
failure mode to match the expected FIT rate of the design
in the presence of a terrestrial neutron flux at sea level in
New York. The neutron flux at sea level in New York is
accepted to be 13 neutrons per cm² per hour [2]. Confidence intervals of
95% were calculated using common practices [16]. Reliability
improvement represents the FIT decrease factor of the pTMR
version of the design compared to the baseline design.

TABLE IV
NEUTRON RADIATION TESTING RESULTS
Version                  Baseline      pTMR
Total Fluence (n/cm²)    2.89E+08      6.01E+08
Total Failures           3             1
FIT Rate                 135           22
(95% Conf. Interval)     (27, 395)     (2, 121)
Improvement                            6×
(95% Conf. Interval)                   (0.2×, 183×)
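
The FIT rates in Table IV can be recovered from the fluence and failure counts. The Python sketch below takes the reference flux as 13 neutrons per cm² per hour, an interpretation that reproduces the reported values; the function name is illustrative, and confidence intervals are not recomputed.

```python
NYC_FLUX = 13.0  # assumed reference flux, neutrons per cm^2 per hour


def fit_rate(failures, fluence, flux=NYC_FLUX):
    """FIT (failures per 10^9 hours) from accelerated beam test data."""
    cross_section = failures / fluence  # cm^2 per device
    return cross_section * flux * 1e9


baseline_fit = fit_rate(3, 2.89e8)  # about 135 FIT
ptmr_fit = fit_rate(1, 6.01e8)      # about 22 FIT
improvement = baseline_fit / ptmr_fit

print(f"{baseline_fit:.0f} FIT, {ptmr_fit:.0f} FIT, {improvement:.1f}x")
```

With only three and one observed failures, the point estimates carry wide uncertainty, which is reflected in the confidence intervals of the table.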

The data in Table IV shows that the mean severe-failure mode
(i.e., network outage) FIT rate was reduced by a factor of 6×.
Scaling the fluence exposure of both versions of the design to
ground level in New York, the baseline and pTMR versions
of the design received the equivalent of 250 and 525 years
respectively of terrestrial radiation. With that level of radiation
exposure, the protocol failure mode occurred three times in
the baseline and once in the pTMR version of the design. The
calculated mean FIT rates show a significant improvement
in the SER, which was achieved using a modest amount of
additional logic resources on the FPGA for this particular
design.
Although this data represents centuries of testing, there is
still some overlap in the 95% confidence intervals between
the two versions of the design as shown in Figure 6. A
large amount of radiation test data is required to
reduce these error bars. Assuming the measured cross
section is correct, three times the amount of collected data
would be required to remove the overlap from the confidence
intervals. Both versions of the design spent about 3 hours
in the beam while running on two different setups. Running
these experiments three times longer was not possible due to
obligations to other experiments during the same beam test.

Fig. 6. 95% Confidence Interval Overlap, FIT Rate with Error Bars Shown (horizontal axis: FIT Rate, 0 to 150; series: Baseline, pTMR)

VIII. CONCLUSION
A soft-error induced failure mode was observed in a commercial FPGA-based network switch. Failure analysis determined that the failure mode caused continuous protocol errors
on the high-speed chip-to-chip interface between the FPGA
and ASIC components in the design. When this failure mode
occurred, the only way to recover from it was to power cycle
the network setup. In many cases of reported failure, the
system had been functioning correctly for six months to a
year. Fault injection and radiation testing were conducted on
the original design to verify that this issue was in fact related
to soft errors. Both forms of testing confirmed the failure mode
to be an SER issue.
Using a small amount of pTMR, the SER performance of
an FPGA-based commercial networking system was improved.
The scope of partial TMR application was limited to a critical
module. An automated partial TMR tool was used to iteratively
insert pTMR into the design until the maximum level of
redundancy was added while still meeting timing and other
constraints. The circuit size was increased by 2.8% overall.
Functional testing was run against the resulting pTMR design
to ensure proper functionality. Fault injection demonstrated a
5.6× improvement, suggesting that pTMR is protecting the
circuit. With additional analysis, the scope of application could
be adjusted to yield even higher benefit. Neutron radiation
testing estimates the FIT rate of the baseline design version
(i.e., without pTMR) to be 135 FIT and that of the pTMR
version to be 22 FIT, a 6× improvement achieved with a small
amount of pTMR.
REFERENCES
[1] M. Ceschia et al., “Identification and classification of single-event upsets
in the configuration memory of SRAM-based FPGAs,” IEEE Trans.
Nucl. Sci., vol. 50, no. 6, pp. 2088–2094, Dec. 2003.
[2] Measurement and reporting of alpha particle and terrestrial cosmic
ray-induced soft errors in semiconductor devices, JEDEC Solid
State Technology Association Std. 89A, 2006. [Online]. Available:
https://www.jedec.org/sites/default/files/docs/JESD89A.pdf
[3] Device Reliability Report, Xilinx Inc. [Online]. Available:
https://www.xilinx.com/support/documentation/user_guides/ug116.pdf
[4] D. S. Lee, M. Wirthlin, G. Swift, and A. C. Le, “Single-event
characterization of the 28 nm Xilinx Kintex-7 field-programmable gate array under
heavy ion irradiation,” in 2014 IEEE Radiation Effects Data Workshop
(REDW), Jul. 2014, pp. 1–5.
[5] A. M. Keller, T. A. Whiting, K. B. Sawyer, and M. J. Wirthlin,
“Dynamic SEU sensitivity of designs on two 28-nm SRAM-based FPGA
architectures,” IEEE Trans. Nucl. Sci., no. 1, pp. 280–287, Jan.
[6] J. M. Johnson and M. Wirthlin, “Voter insertion algorithms for
FPGA designs using triple modular redundancy,” in Proc. 18th Annu.
ACM/SIGDA Int. Symp. Field Programmable Gate Arrays, ser. FPGA
’10. New York, NY, USA: ACM, 2010, pp. 249–258.
[7] B. Pratt, M. Caffrey, P. Graham, K. Morgan, and M. Wirthlin,
“Improving FPGA design robustness with partial TMR,” in 2006 IEEE
International Reliability Physics Symposium Proceedings, Mar. 2006, pp.
226–232.
[8] E. J. McDonald, “Runtime FPGA partial reconfiguration,” in 2008 IEEE
Aerospace Conference, Mar. 2008, pp. 1–7.
[9] 7 Series FPGAs Configuration User Guide, Xilinx Inc. [Online].
Available: https://www.xilinx.com/support/documentation/user_guides/ug470_7Series_Config.pdf
[10] H. M. Quinn, D. A. Black, W. H. Robinson, and S. P. Buchner,
“Fault simulation and emulation tools to augment radiation-hardness
assurance testing,” IEEE Trans. Nucl. Sci.,
vol. 60, no. 3, pp. 2119–2142, Jun. 2013. [Online]. Available:
http://ieeexplore.ieee.org/document/6519339/
[11] M. Berg, C. Poivey, D. Petrick, D. Espinosa, A. Lesea, K. A. LaBel,
M. Friendlich, H. Kim, and A. Phan, “Effectiveness of internal versus
external SEU scrubbing mitigation strategies in a Xilinx FPGA: Design,
test, and analysis,” IEEE Trans. Nucl. Sci., vol. 55, no. 4,
pp. 2259–2266, Aug. 2008.
[12] A. M. Keller and M. J. Wirthlin, “Benefits of complementary SEU
mitigation for the LEON3 soft processor on SRAM-based FPGAs,”
IEEE Trans. Nucl. Sci., vol. 64, no. 1, pp. 519–528, Jan. 2017.
[13] L. Sterpone and M. Violante, “Analysis of the robustness of the TMR
architecture in SRAM-based FPGAs,” IEEE Trans. Nucl. Sci.,
vol. 52, no. 5, pp. 1545–1549, Oct. 2005.
[14] H. Quinn et al., “Using benchmarks for radiation testing of
microprocessors and FPGAs,” IEEE Trans. Nucl. Sci., vol. 62,
no. 6, pp. 2547–2554, Dec. 2015.
[15] B. Pratt, M. Caffrey, J. F. Carroll, P. Graham, K. Morgan, and
M. Wirthlin, “Fine-grain SEU mitigation for FPGAs using partial TMR,”
IEEE Trans. Nucl. Sci., no. 4, pp. 2274–2280, Aug.
[16] H. Quinn, “Challenges in testing complex systems,” IEEE Trans.
Nucl. Sci., vol. 61, no. 2, pp. 766–786, Apr. 2014. [Online].
Available: http://ieeexplore.ieee.org/document/6786369/
[17] P. Ramachandran et al., “Statistical fault injection,” in 2008 IEEE Int.
Conf. Dependable Syst. and Networks With FTCS and DCC (DSN), Jun.
2008, pp. 122–127.
[18] E. W. Blackmore and M. Trinczek, “Intensity upgrade to the TRIUMF
500 MeV large-area neutron beam,” in 2014 IEEE Radiation Effects
Data Workshop (REDW), Jul. 2014, pp. 1–5.

