# Masking and Detecting Radiation-Induced Errors in SRAM-Based FPGAs Through Partial Circuit Replication

Andrew M. Keller and Michael J. Wirthlin NSF Center for Space, High-Performance, and Resilient Computing (SHREC) Department of Electrical and Computer Engineering Brigham Young University Provo, Utah 84602 {andrewmkeller, wirthlin}@byu.edu

Abstract-Radiation found in terrestrial and space environments can induce errors in SRAM-based FPGAs. Replication of circuitry can be used to mask and detect these errors to improve reliability or availability. This work advances the understanding and implementation of partial circuit replication in SRAM-based FPGAs. Partial circuit replication is the replication of a subset of the components in a circuit. A reliability model is presented that evaluates the reliability benefit of partial circuit replication. The model suggests that the attainable reliability benefit is limited by the portion of the circuit that is not replicated. A partial triple modular redundancy case study is also presented that evaluates several different selection algorithms. Random selection was found to be ineffective, whereas maximizing protected routes while minimizing inserted voters provided a high return, reducing failure likelihood by 20% with only 9% coverage. A final study applied duplication with compare to an FPGA-based networking system to detect persistent silent network disruptions. A coverage of 29% was able to detect 45% of these failures in neutron radiation testing.

#### I. INTRODUCTION

Radiation found in space and terrestrial environments can induce errors that corrupt circuits implemented on SRAM-based FPGAs [1]. Radiation can cause soft errors in static random access memory (SRAM) cells by inverting their stored value [2]. SRAM-based field programmable gate arrays (FPGAs) use a large array of these memory cells to store device configuration. When this configuration memory (CRAM) becomes corrupted, the circuit implemented by the FPGA may also become corrupt, which may result in circuit failure.

Failures are errors that cannot be tolerated, and errors are caused by faults. When radiation inverts the value stored in a memory cell, it introduces a fault. A fault is a condition that could cause a circuit to fail [3]. Faults can cause errors, which are observable deviations from expected behavior. Errors can cause failures when their behavior falls outside of the functional specification of the circuit or system in which the circuit resides. Some errors are tolerable while others are not.

Several techniques have been developed to mask and detect radiation-induced errors in SRAM-based FPGA designs. In particular, circuit redundancy has been used for this purpose. The basic idea is that errors can be detected or masked by comparing the outputs of identical circuits. Errors are detected through the use of two copies, and errors are masked through the use of three or more copies. The use of two copies for detecting errors is known as duplication with compare (DWC) [4], and the use of three copies for masking errors is known as triple modular redundancy (TMR) [5].

Figure 1 shows a basic implementation of TMR. Here, circuit components are triplicated and voters are inserted that propagate the dominant value received from the three copies. Propagating the dominant value masks errors within a single circuit copy. So long as two or more copies and the corresponding voter are error free, the resulting output of the circuit will also be error free. The voters themselves can also be replicated where possible to avoid single points in the circuit where errors can affect the output of the whole circuit [6].



Fig. 1. Spatial TMR with Triplicated Voters
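The voting step described above can be sketched in software as a bitwise two-of-three majority function. This is only an illustrative model of the voter logic, not the hardware implementation; the function name `majority_vote` and the example bit patterns are assumptions made for this sketch.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by
    at least two of the three circuit copies."""
    return (a & b) | (a & c) | (b & c)


# Illustrative values: one copy suffers a single flipped bit.
golden = 0b1011
upset_copy = golden ^ 0b0100

# The two error-free copies outvote the corrupted copy.
print(bin(majority_vote(golden, upset_copy, golden)))
```

If two copies are corrupted in the same bit position, the majority is wrong, which is why the masking guarantee holds only while at least two copies remain error free.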

Figure 2 shows a basic DWC implementation. Errors are detected by comparing the outputs of the two identical circuits. Discrepancies are captured in a subsequent register. The detection logic is also duplicated so that errors that originate in the detection logic can be filtered out [4]. In [4], DWC was applied to all circuit components in a set of circuits implemented on an SRAM-based FPGA, and DWC was able to accurately detect 99.9% of all SEU-induced circuit failures.
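A rough software analogy of duplicated detection logic is sketched below. It assumes that a disagreement between the two comparators indicates an upset in the detection logic itself, while agreement on a mismatch indicates an error in one of the circuit copies. That interpretation, the function names, and the `stuck_detector` knob are assumptions made for illustration and are not taken from [4].

```python
def dwc_flags(copy_a, copy_b, stuck_detector=None):
    """Return the error flags of two duplicated comparators.

    stuck_detector (0 or 1) is a hypothetical knob that emulates an upset
    inside one comparator by inverting its flag.
    """
    flags = [copy_a != copy_b, copy_a != copy_b]
    if stuck_detector is not None:
        flags[stuck_detector] = not flags[stuck_detector]
    return tuple(flags)


def classify(flags):
    """Interpret the pair of flags (assumed convention for this sketch)."""
    if flags[0] and flags[1]:
        return "circuit error"   # both comparators see a mismatch
    if flags[0] != flags[1]:
        return "detector error"  # the comparators disagree with each other
    return "no error"


print(classify(dwc_flags(5, 5)))                    # matching copies
print(classify(dwc_flags(5, 4)))                    # one copy corrupted
print(classify(dwc_flags(5, 5, stuck_detector=1)))  # comparator upset
```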

Both TMR and DWC are effective, but not every application can afford the overhead that these techniques require. Full DWC and full TMR require at least two and three times as many resources to implement, respectively. Beyond increasing resource utilization, these techniques can negatively impact timing closure and power consumption. Some resources, such as multi-gigabit transceivers, clock managers, and analog-to-digital converters, cannot be replicated within a single chip.

Fig. 2. DWC using Redundant Detection Logic

In situations where full circuit replication is infeasible, partial circuit replication can be used. After a circuit is implemented on an SRAM-based FPGA, a significant amount of resources often remains unused. The leftover resources can be dedicated to replicating portions of the implemented circuit. By replicating portions of the circuit, a portion of radiation-induced errors can be detected or masked (depending on the replication scheme). This allows any remaining resources to be dedicated to the improvement of a circuit's reliability or availability.

The studies presented in this work explore the use of partial circuit replication for masking and detecting radiation-induced errors in SRAM-based FPGAs. Three studies are included. The first study looks at adapting existing reliability models to more accurately reflect the benefits available from partial circuit replication. The second study explores several selection algorithms for the application of partial TMR. The final study examines the use of partial DWC for reducing the likelihood of undetected failures in a commercial FPGA-based networking system. Theoretical benefits and practical applications are covered to improve the state-of-the-art understanding and implementation of these two powerful error masking and detection techniques: partial TMR and partial DWC.

#### II. METRICS

In order to understand the concern that radiation presents, the likelihood of radiation-induced failure must be measured. Measuring this likelihood provides quantitative data. These data, in turn, provide context for concern. It is also through measurement that the effectiveness of various approaches for lowering the likelihood can be evaluated. Thus, addressing the concerns of radiation-induced failure depends heavily upon measurement.

All measurement approaches and metrics share a common theme: quantifying the probability of failure or success. The likelihood of radiation-induced failure hinges on many different factors and varies greatly between circuits and environments. The goal of this section is to provide a unified view of available measurement techniques and metrics. This understanding greatly aids the subsequent discussion.

The likelihood of failure is presented in terms of reliability and availability. Reliability is the probability of no failure within a given operating period, and availability is the probability that a circuit is operating correctly at any point in time [7]. Availability takes into consideration reliability and the amount of time it takes to perform a repair when a failure occurs. Availability improves when reliability improves or when the time it takes to make a repair is reduced.
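The relationship between availability, reliability, and repair time can be illustrated with the standard steady-state expression, availability = MTTF / (MTTF + MTTR), where MTTR is the mean time to repair. The expression is a textbook result rather than something derived in this paper, and the numbers below are illustrative assumptions.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the circuit is operating."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Illustrative values only: a longer MTTF or a shorter repair time
# both push availability toward one.
print(availability(1000.0, 1.0))
print(availability(2000.0, 1.0))  # improved reliability
print(availability(1000.0, 0.1))  # faster repair
```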

One way that the reliability of a circuit can be estimated is through fault injection. The type of fault injection referred to here is the purposeful introduction of faults into a circuit to emulate, or mimic, the effects of radiation in the circuit [8]. This provides a discrete sampling of the circuit's radiation response and can be used to estimate the reliability of the circuit. The reliability of a non-redundant simplex circuit is typically modeled using the exponential distribution [7]. Thus, the reliability of the circuit as a function of time in days is

$$R(t) = \exp(-\lambda t) \tag{1}$$

where $\lambda$ is the constant failure rate of the circuit.

Often, the average time to failure is more useful for comparison than the time-dependent reliability function. The mean time to failure (MTTF) can be obtained from the reliability function through integration as shown in Equation 2.

$$\text{MTTF} = \int_0^\infty R(t)\,dt \tag{2}$$

It also can be estimated by averaging the time to failure of several independent experiments [9]. MTTF is reported in units of time and can be scaled to any appropriate timescale.
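As a quick check of these two routes to the MTTF, the sketch below integrates the exponential reliability function numerically and also averages simulated times to failure. For a constant failure rate both should approach $1/\lambda$. The failure rate, the truncation of the integral at 40 mean lifetimes, and the use of simulated samples in place of real experiments are all assumptions of this example.

```python
import math
import random


def reliability(t: float, failure_rate: float) -> float:
    """Exponential reliability: probability of no failure up to time t."""
    return math.exp(-failure_rate * t)


def mttf_by_integration(failure_rate: float, steps: int = 200_000) -> float:
    """Trapezoidal approximation of the integral of R(t).

    The infinite upper limit is truncated at 40 mean lifetimes, where
    R(t) is negligibly small.
    """
    t_end = 40.0 / failure_rate
    dt = t_end / steps
    total = 0.5 * (reliability(0.0, failure_rate) + reliability(t_end, failure_rate))
    for i in range(1, steps):
        total += reliability(i * dt, failure_rate)
    return total * dt


def mttf_from_experiments(times_to_failure) -> float:
    """Average time to failure over independent experiments."""
    return sum(times_to_failure) / len(times_to_failure)


failure_rate = 0.01  # assumed failures per day
rng = random.Random(0)
simulated = [rng.expovariate(failure_rate) for _ in range(20_000)]

print(mttf_by_integration(failure_rate))  # close to 1 / failure_rate
print(mttf_from_experiments(simulated))   # sample estimate of the same value
```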

A metric related to MTTF is failures in time or FIT. This metric is heavily adopted by industry. It is the number of failures per billion hours of operation [10]. The relationship between the two metrics is shown in Equation 3.

$$\text{FIT} = \frac{10^9 \text{ hours}}{\text{MTTF in hours}} \tag{3}$$

For comparison, an MTTF of 1000 years equates to approximately 114 FIT. The FIT of memory arrays is typically reported in terms of FIT per Mbit ($10^6$ bits) and is normalized to a reference environment such as New York City at sea level [11].
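The conversion between the two metrics is simple arithmetic, sketched below. The helper names are chosen for this example, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 24 * 365  # assumes a 365-day year


def fit_from_mttf_hours(mttf_hours: float) -> float:
    """Failures per billion hours of operation."""
    return 1e9 / mttf_hours


def mttf_hours_from_fit(fit: float) -> float:
    """Inverse conversion: mean time to failure in hours."""
    return 1e9 / fit


fit = fit_from_mttf_hours(1000 * HOURS_PER_YEAR)
print(round(fit))  # an MTTF of 1000 years is roughly 114 FIT
```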

A gold standard for radiation hardness assurance is accelerated radiation testing [8], [11], [12]. Accelerated radiation testing measures the likelihood of failure in terms of cross section, which can be converted to MTTF. Cross section is a hypothetical area (measured in cm$^2$) that would result in failure should an energetic particle pass through it [2]. Flux is the number of particles that pass through one cm$^2$ per second. Equation 4 shows the conversion from cross section to MTTF.

$$\text{MTTF} = \frac{1}{\text{Flux} \times \text{Cross Section}} \tag{4}$$

This conversion applies readily to a high-energy neutron cross section, but a more sophisticated approach is needed for heavy-ion cross sections and other forms of radiation testing [13].
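A minimal sketch of the neutron case follows. It assumes that the failure rate is the product of flux and cross section, so that the MTTF is the reciprocal of that product. The cross section is an arbitrary illustrative value, and the flux of about 13 neutrons per cm$^2$ per hour is an assumed placeholder of the kind often quoted for high-energy neutrons at New York City sea level, not a value taken from this paper.

```python
SECONDS_PER_HOUR = 3600.0


def mttf_seconds(cross_section_cm2: float, flux_per_cm2_s: float) -> float:
    """MTTF as the reciprocal of the failure rate (flux times cross section)."""
    return 1.0 / (flux_per_cm2_s * cross_section_cm2)


# Assumed, illustrative inputs.
cross_section = 1.0e-9          # cm^2
flux = 13.0 / SECONDS_PER_HOUR  # neutrons per cm^2 per second

mttf_hours = mttf_seconds(cross_section, flux) / SECONDS_PER_HOUR
fit = 1e9 / mttf_hours

print(mttf_hours)  # mean time to failure in hours
print(fit)         # the same likelihood expressed in FIT
```

Note the units: with flux in particles per cm$^2$ per second and cross section in cm$^2$, the product is a rate in failures per second.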

#### III. PARTIAL CIRCUIT REPLICATION MODELING

Reliability models can be used to evaluate the effectiveness of partial circuit replication. Effectiveness is evaluated in terms of improvement in circuit reliability or availability. Ultimately, the goal is to improve a circuit's reliability or availability as much as possible within a given set of constraints. Reliability models provide a theoretical threshold for the amount of improvement that can be expected, and they also lend guidance to the application of these mitigation techniques.

A reliability model based on Markov chains has already been developed for the application of full TMR with repair [9], but the model lacks consideration for partial circuit replication. Full TMR with repair applies TMR to all components in a circuit and incorporates some kind of repair mechanism that is able to fix a copy when it becomes corrupt. From this model, the benefits and drawbacks of full TMR with repair can be understood, but this classic model is incomplete. It does not compensate for errors that can instantly compromise multiple copies and it does not consider situations where only a fraction of the original circuit is replicated. These shortcomings make the model inadequate for understanding the full benefits and drawbacks of partial circuit replication.

A simple yet significant modification can be made to the aforementioned model to capture the behavior of partial circuit replication. The modification is the addition of a direct path from any functional state to the failure state (see the dashed edges in Figure 3). This modification is equivalent to adding a simplex (non-replicated) component as a dependency. For the system to function correctly, the replicated portion *and* the non-replicated portion must both be functional. This modification accounts for failure modes that compromise multiple copies, and it accounts for portions of a circuit that are not replicated.



Fig. 3. Modified Markov Chain of TMR with Repair

The Markov chain shown in Figure 3 implements the modified TMR with repair reliability model. The three states represent a TMR system where all three copies are functioning correctly $(S_0)$, where only two of the three copies are functioning correctly $(S_1)$, and where two or more of the copies are not functioning correctly $(S_2)$. The failure rate of a single copy is $\lambda$, the repair rate for a single copy is $\mu$, and the additional system-wide failure rate is $c$. From $S_0$ to $S_1$ there is a $3\lambda$ rate of transition because any one of the three working copies could fail. Transitions from $S_1$ to $S_0$ have an occurrence rate of $\mu$, representing the act of repairing a failed copy. The transition rate from $S_1$ to $S_2$ is $2\lambda$ because only two functional copies remain. The remaining transitions (from $S_0$ and $S_1$ to $S_2$) have a transition rate of $c$ and represent events that cause multiple copies to fail simultaneously or cause a non-replicated portion of the circuit to fail.

From this model, the following reliability and MTTF equations are derived:

$$R(t) = \frac{(5\lambda + \mu + \sigma_1) \exp\left(-\frac{t(2c + 5\lambda + \mu - \sigma_1)}{2}\right)}{2\sigma_1} - \frac{(5\lambda + \mu - \sigma_1) \exp\left(-\frac{t(2c + 5\lambda + \mu + \sigma_1)}{2}\right)}{2\sigma_1} \tag{5}$$

where

$$\sigma_1 = \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}$$

$$MTTF = \frac{c + 5\lambda + \mu}{c^2 + 5c\lambda + c\mu + 6\lambda^2} \tag{6}$$

Through variable substitution, these equations can be used to model different applications. Without substitution, these equations model a full TMR system with repair and additional susceptibility to common cause failure (CCF), where a single event simultaneously compromises multiple copies. Substituting $\rho\lambda$ for $\lambda$ and $(1 - \rho)\lambda + c$ for $c$ introduces a portion variable, $\rho$, which reflects the fraction of the original circuit that has been replicated. $\rho$ takes on a value from zero to one. A value of one removes the substitution. The substitution is omitted from the above equations to keep their presentation simple, but it provides a more complete view.
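The closed-form reliability and MTTF expressions, together with the $\rho$ substitution, can be evaluated directly. The sketch below is one way to do so; the function names and the numeric rates are assumptions made for illustration. A guard handles the degenerate case where both the per-copy failure rate and the repair rate are zero, in which the model collapses to a simplex circuit.

```python
import math


def substitute(lam: float, c: float, rho: float):
    """Partial replication: a fraction rho of the failure rate is replicated,
    and the remainder is added to the non-redundant rate c."""
    return rho * lam, (1.0 - rho) * lam + c


def reliability(t: float, lam: float, mu: float, c: float) -> float:
    """Reliability of the modified TMR-with-repair Markov model."""
    sigma = math.sqrt(lam**2 + 10.0 * lam * mu + mu**2)
    if sigma == 0.0:  # lam == mu == 0: only the simplex path remains
        return math.exp(-c * t)
    a = 5.0 * lam + mu
    slow = (a + sigma) * math.exp(-t * (2.0 * c + a - sigma) / 2.0)
    fast = (a - sigma) * math.exp(-t * (2.0 * c + a + sigma) / 2.0)
    return (slow - fast) / (2.0 * sigma)


def mttf(lam: float, mu: float, c: float) -> float:
    """Mean time to failure of the same model."""
    return (c + 5.0 * lam + mu) / (c**2 + 5.0 * c * lam + c * mu + 6.0 * lam**2)


# Assumed rates: one failure per unit time per copy, repair 40x faster.
lam, mu = 1.0, 40.0
for rho in (0.0, 0.5, 0.9, 1.0):
    lam_r, c_r = substitute(lam, 0.0, rho)
    print(rho, reliability(0.5, lam_r, mu, c_r), mttf(lam_r, mu, c_r))
```

With $\rho = 0$ the result matches a simplex circuit, and with $\mu = 0$, $c = 0$, and $\rho = 1$ it matches the familiar TMR-without-repair expression $3e^{-2\lambda t} - 2e^{-3\lambda t}$.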

Figure 4 presents the classic TMR reliability graph that compares the reliability over time of a circuit in three different configurations: TMR with repair, TMR without repair, and in simplex. All three of the plotted functions can be derived from Equation 5. This plot helps illuminate the relationships between variables. In a system without repair, adjusting the portion of the circuit that is replicated causes the reliability function to morph between that of a simplex circuit and that of a fully replicated circuit. Adding repair to a fully replicated circuit causes the reliability function to morph towards that of the TMR with repair function where the repair rate is $40\times$ greater than the failure rate of a single copy. As the repair rate approaches instantaneous repair (or an infinite repair rate), the reliability function continues upward as it approaches a constant one (or no possibility of failure).

Fig. 4. Classic TMR Reliability Graph

The c component is an instantaneous or non-redundant failure rate. Including only c components reduces Equation 5 down to the reliability of a simplex circuit:

$$\lim_{\lambda \to 0^+} R(t) = \exp(-ct) \tag{7}$$

Adding a *c* component on top of TMR with repair effectively multiplies the TMR with repair reliability function by that of a scaled simplex circuit. Doing so imposes a hard limit on the improvement that increasing the repair rate can render.

The maximum obtainable MTTF can be found by taking the limit of the MTTF as the repair rate,  $\mu$ , approaches infinity.

$$\lim_{\mu \to \infty} \text{MTTF} = \frac{1}{c} \tag{8}$$

Note that the maximum obtainable MTTF depends only on the c component. This is significant. It means that the maximum obtainable MTTF is limited by the amount of the circuit that is not replicated, and it means that the severity of limitation is extreme unless c is very small. The limit on MTTF improvement for partial circuit replication is the inverse of one minus the portion of the circuit that is replicated,  $\rho$ .

$$\lim_{\mu \to \infty} \text{MTTF}_{\text{Imp.}} = \lim_{\mu \to \infty} \frac{\text{MTTF}}{\text{MTTF}_{\text{Simplex}}} = \frac{1}{1 - \rho} \tag{9}$$

Accordingly, if TMR is applied to 50% of the components in a target circuit (a $\rho$ of 0.5), then only a $2\times$ improvement can be expected. If 90% of the circuit is replicated, then a $10\times$ maximum improvement can be expected. At 99% coverage, a $100\times$ maximum improvement can be expected, and so forth. This outcome assumes that all components contribute equally to the original failure rate, and it does not consider the increase in $c$ that results from the addition of reduction voters. It is an illustrative example of the amount of coverage required for orders of magnitude improvement.
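The bound can also be checked numerically by evaluating the model's MTTF with the $\rho$ substitution at increasing repair rates. The sketch below does this under the same simplifying assumptions as the text (equal component contributions, no added voter overhead, and no common cause failure beyond the non-replicated portion); the function names and repair rates are illustrative.

```python
def mttf(lam: float, mu: float, c: float) -> float:
    """MTTF of the modified TMR-with-repair Markov model."""
    return (c + 5.0 * lam + mu) / (c**2 + 5.0 * c * lam + c * mu + 6.0 * lam**2)


def mttf_improvement(rho: float, mu: float, lam: float = 1.0) -> float:
    """MTTF relative to the simplex circuit for a replicated portion rho."""
    lam_replicated = rho * lam
    c_unreplicated = (1.0 - rho) * lam
    simplex_mttf = 1.0 / lam
    return mttf(lam_replicated, mu, c_unreplicated) / simplex_mttf


def improvement_limit(rho: float) -> float:
    """Upper bound on improvement as the repair rate grows without limit."""
    return 1.0 / (1.0 - rho)  # requires rho < 1


for rho in (0.5, 0.9, 0.99):
    finite = [round(mttf_improvement(rho, mu), 3) for mu in (0.0, 40.0, 4000.0)]
    print(rho, finite, improvement_limit(rho))
```

For every finite repair rate the improvement stays below the bound, which makes clear that faster repair alone cannot compensate for low coverage.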

The subsequent study included in this paper expands on this outcome by considering both the marginal contribution of individual components to the failure rate and the increase in $c$ that results from inserting reduction voters. It may be possible to disproportionately reduce the failure rate by replicating subsets of the circuit that are more important than others. The next study explores this possibility and demonstrates that it can be achieved. That demonstration still agrees with the model, but it is important to note that the coverage portion, $\rho$, relates to the coverage of the original failure rate, not the ratio of replicated components to the total number of components.

While this section focuses primarily on modeling partial TMR, many of the concepts and takeaways discussed are applicable to partial circuit replication in general. Modeling for partial DWC requires additional modifications. Partial DWC does not improve reliability (more errors occur, half of which are false positives due to the replication), but it can improve availability by alerting the circuit to errors that would otherwise go unnoticed. Thus, availability models are needed. The main takeaway from such models is the same as for partial TMR: the availability improvement of the circuit is hard limited by the portion of the circuit that is *not* replicated.

#### IV. PARTIAL TMR CASE STUDY

The effectiveness of partial circuit replication is greatly influenced by component selection. Some selections are likely to be more effective than others for a number of different reasons. These reasons include the number of routes replicated, the importance of the protected subset, and the amount of supporting logic required, such as reduction voters (where signals transition from a triplicated region to a simplex region). Variation in the circuit's placement and routing implementation can also affect the effectiveness of partial circuit replication.

The study presented here examines the impact of various selection algorithms on the effectiveness of partial circuit replication for improving the reliability of a circuit implemented on an SRAM-based FPGA. Effectiveness was determined by measuring the reduction in neutron cross section rendered by the application of partial TMR using the various selection algorithms. The study examines four types of selection algorithms: combinational versus sequential logic, random, maximizing protected routes while minimizing the insertion of reduction voters, and feedback-based component relationships.

In this study, the objective of applying partial TMR to a circuit implemented on an SRAM-based FPGA is to improve the *overall* reliability of the entire circuit. This is different than using partial TMR to reduce the likelihood of persistent errors [14], [15] or using partial TMR to reduce the likelihood of a specific failure mode. This study looks at the application of partial TMR for reducing the likelihood of *any* error.

When partial TMR is applied to a circuit implemented on an SRAM-based FPGA, some of the primitive components used by the circuit are triplicated and some are not. Primitive components vary from vendor to vendor. They are the basic building blocks of a circuit and map to resources available on the device. Primitive components include I/O resources, buffers, clock managers, programmable lookup tables, registers, memory blocks, and arithmetic units. Connections between primitive components are made through a sea of programmable interconnects and these routes are replicated based on which components are selected for replication.

A simplified diagram of partial TMR is shown in Figure 5. Connections between replicated components are considered protected routes (or protected edges). Voters that transition a signal from a replicated region to a non-replicated region are considered to be reduction voters. Any non-replicated source that drives replicated sinks is considered a non-TMR source, and any non-replicated sink that is driven by a reduction voter is considered a non-TMR sink. The occurrence counts of these attributes in a partially replicated circuit are used as metrics for the comparison of component selections.
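The attribute counts described above can be sketched for a netlist modeled as a list of directed connections. The representation of the netlist as `(driver, sink)` pairs, the component names, and the rule of counting one reduction voter per replicated driver that feeds at least one non-replicated sink are assumptions of this example rather than details taken from the study's tooling.

```python
def partial_tmr_metrics(edges, replicated):
    """Count partial TMR attributes for a selection of replicated components.

    edges:      iterable of (driver, sink) component-name pairs
    replicated: set of component names selected for triplication
    """
    protected_edges = 0
    voter_drivers = set()    # replicated drivers that need a reduction voter
    non_tmr_sources = set()  # non-replicated drivers of replicated sinks
    non_tmr_sinks = set()    # non-replicated sinks fed by a reduction voter
    for driver, sink in edges:
        driver_rep = driver in replicated
        sink_rep = sink in replicated
        if driver_rep and sink_rep:
            protected_edges += 1
        elif driver_rep:
            voter_drivers.add(driver)
            non_tmr_sinks.add(sink)
        elif sink_rep:
            non_tmr_sources.add(driver)
    return {
        "protected_edges": protected_edges,
        "reduction_voters": len(voter_drivers),
        "non_tmr_sources": len(non_tmr_sources),
        "non_tmr_sinks": len(non_tmr_sinks),
    }


# Hypothetical five-component netlist.
EDGES = [("in", "a"), ("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]
print(partial_tmr_metrics(EDGES, {"a", "b", "c"}))
```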



Fig. 5. Simplified Diagram of Partial TMR.

The benefit of partial circuit replication for improving circuit reliability comes from replicating components in such a way that errors are prevented from propagating beyond the replicated boundary within the circuit. Replicating portions of the circuit reduces the likelihood of radiation-induced failure, but the insertion of additional logic to transition from a replicated region to a non-replicated region (the use of a reduction voter) increases the likelihood of radiation-induced failure. Thus, for partial TMR to be effective, the likelihood of radiation-induced failure must be decreased more through replication than it is increased through the insertion of voters.

It is hypothesized that maximizing the number of protected routes and minimizing the number of inserted voters will provide the greatest amount of benefit for a given level of partial TMR. The amount of partial TMR applied is determined by the percent of components that are replicated. It is thought that a significant portion of CRAM bits in an FPGA design are dedicated to the support of programmable interconnect points. Thus, a large portion of the radiation-induced failure likelihood may be due to upsets of CRAM bits associated with routing. Within a single circuit, the number of protected routes can vary greatly based on which components are selected for replication. Thus, one of the tested selection algorithms is based on the principle of maximizing the number of protected routes and minimizing the number of inserted voters.
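To make the selection principle concrete, the sketch below exhaustively searches a tiny netlist for the fixed-size selection that maximizes protected edges minus reduction voters. It is a brute-force stand-in for illustration only and is not the formulation used in the study: the equal weighting of the two terms, the one-voter-per-driver counting rule, and the example netlist are all assumptions, and exhaustive search is practical only for very small designs.

```python
from itertools import combinations


def score(edges, selection):
    """Protected edges minus reduction voters (assumed equal weighting)."""
    protected = sum(1 for d, s in edges if d in selection and s in selection)
    voters = len({d for d, s in edges if d in selection and s not in selection})
    return protected - voters


def best_selection(components, edges, budget):
    """Try every selection of `budget` components and keep the best score."""
    candidates = combinations(sorted(components), budget)
    best = max(candidates, key=lambda sel: score(edges, set(sel)))
    return set(best)


# Hypothetical netlist: a and b form a tight loop, the rest form a ring.
COMPONENTS = {"a", "b", "c", "d", "e"}
EDGES = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "e"), ("e", "a")]

chosen = best_selection(COMPONENTS, EDGES, budget=2)
print(sorted(chosen), score(EDGES, chosen))
```

In this example the search picks the two components that feed each other, since that pair protects two routes at the cost of a single voter.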

##### A. Neutron Radiation Test

The neutron cross sections of several circuit variations were measured in October 2019 at the Los Alamos Neutron Science Center (LANSCE) (see Figure 6). LANSCE is a spallation neutron source. The Irradiation of Chips and Electronics (ICE) instruments have an energy spectrum that is similar to a scaled ground spectrum [11]. This makes these instruments suitable for measuring the average neutron cross section of an observable event for a device deployed in terrestrial or high-altitude applications. Different partial TMR selection schemes were applied to the same circuit implemented in an SRAM-based FPGA. The cross section of any failure in the circuit was measured for each selection (including a baseline without any replication). These measurements are then used along with the previously mentioned attribute counts to evaluate and compare the effectiveness of each selection.



Fig. 6. Neutron Radiation Test Setup

Neutron irradiation was selected for this study (as opposed to proton or heavy ion) because this type of radiation is an important part of evaluating the soft error characteristics of FPGAs in terrestrial environments [11]. The techniques developed in this study will likely be used in large-scale deployments of circuits in terrestrial environments. Thus, neutron irradiation is an apt form of radiation to use.

Figure 6 displays the setup of the test experiment. Here, the test circuit is loaded onto five design under test boards. The boards in the experiment are Nexys Video development boards, which utilize Artix-7 200T FPGAs. The FPGAs are aligned perpendicular to the 2-inch collimated neutron beam so that the beam passes directly through the devices. The neutron flux at each device was derated according to the distance of the device from the neutron source. The test circuit is provided stimulus and monitored by development boards placed outside of the neutron beam. A custom JTAG configuration manager (not shown) monitors the occurrence of radiation-induced upsets and orchestrates the flow of the test.

The test circuit has 256 instances of a group of interdependent state machines known collectively as the "B13". The B13 comes from the ITC'99 benchmark suite [16] and was originally control circuitry for communicating with a weather sensor. This design has been used in several FPGA reliability studies [17]. A total of 36 design variants were tested, including the baseline circuit. A subset of the results is included here for brevity. Over the course of the entire test, a total of 3,052 radiation-induced upsets were observed under a total fluence of $1.43 \times 10^{12}$ n/cm$^2$.

TABLE I. NEUTRON RADIATION TEST RESULTS

| Design        | Cross Section (cm$^2$, 95% Conf.)  | Cov. | Red. | Ret.  | Edg. | Vot. | S/S    |
|---------------|------------------------------------|------|------|-------|------|------|--------|
| Baseline      | $(2.6 \pm 0.3) \times 10^{-9}$     | 0%   | 0%   | -     | 0    | 0    | 0/0    |
| All           | $(3.8 \pm 1.3) \times 10^{-10}$    | 100% | 85%  | 0.85  | 234  | 9    | 10/9   |
| Combinational | $(1.4 \pm 0.4) \times 10^{-9}$     | 44%  | 44%  | 1.01  | 17   | 37   | 55/54  |
| Registers     | $(3.9 \pm 1.0) \times 10^{-9}$     | 56%  | -52% | -0.93 | 24   | 54   | 45/148 |
| Random 9%     | $(3.5 \pm 0.7) \times 10^{-9}$     | 9%   | -37% | -4.06 | 3    | 9    | 18/27  |
| Random 20%    | $(2.6 \pm 0.6) \times 10^{-9}$     | 20%  | -2%  | -0.12 | 9    | 18   | 31/45  |
| Random 38%    | $(2.7 \pm 0.6) \times 10^{-9}$     | 38%  | -3%  | -0.09 | 32   | 28   | 43/58  |
| Random 50%    | $(2.5 \pm 0.5) \times 10^{-9}$     | 50%  | 4%   | 0.08  | 60   | 37   | 35/70  |
| Random 75%    | $(2.6 \pm 0.6) \times 10^{-9}$     | 75%  | 0%   | -0.01 | 134  | 39   | 28/49  |
| ILP 9%        | $(2.1 \pm 0.5) \times 10^{-9}$     | 9%   | 20%  | 2.27  | 20   | 3    | 4/6    |
| ILP 20%       | $(1.8 \pm 0.4) \times 10^{-9}$     | 20%  | 31%  | 1.56  | 60   | 3    | 4/5    |
| ILP 38%       | $(2.0 \pm 0.5) \times 10^{-9}$     | 38%  | 24%  | 0.64  | 101  | 2    | 12/4   |
| ILP 50%       | $(1.6 \pm 0.4) \times 10^{-9}$     | 50%  | 40%  | 0.80  | 135  | 7    | 6/15   |
| ILP 75%       | $(8.5 \pm 2.7) \times 10^{-10}$    | 75%  | 67%  | 0.89  | 191  | 4    | 11/5   |
| SCC Largest   | $(1.4 \pm 0.3) \times 10^{-9}$     | 69%  | 46%  | 0.67  | 163  | 24   | 2/31   |
| SCC Output    | $(2.4 \pm 0.5) \times 10^{-9}$     | 31%  | 8%   | 0.26  | 40   | 9    | 34/9   |
| TF Level 1    | $(1.4 \pm 0.3) \times 10^{-9}$     | 64%  | 46%  | 0.72  | 149  | 18   | 8/21   |
| TF Level 2    | $(2.4 \pm 0.4) \times 10^{-9}$     | 28%  | 7%   | 0.24  | 40   | 14   | 11/18  |

##### B. Results

The results from the radiation test are shown in Table I. The leftmost column identifies each design variant by the applied selection algorithm and accompanying level of coverage. The next column gives the measured neutron cross section with a 95% confidence interval. The remaining columns detail attributes of each variant for comparison: coverage (Cov.), reduction in cross section relative to the baseline (Red.), return, or reduction divided by coverage (Ret.), number of protected edges (Edg.), number of inserted reduction voters (Vot.), and the number of non-TMR sources and sinks (S/S).

The variants labeled "Baseline" and "All" provide important reference points for comparing all of the remaining variants. Baseline has no replication and is the reference point used to determine how far the cross section is reduced (or expanded) in subsequent variants. Replicating all components except I/O ports reduces the cross section by 85% (to 15% of its original size). This is very likely the smallest the cross section will be, since subsequent variants replicate only a subset of components.
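The reduction and return attributes follow directly from the measured cross sections, as the sketch below shows for the "All" variant using the rounded values printed in Table I. The helper names are assumptions of this example. Because the table entries are rounded, recomputing the attributes for other variants from the printed cross sections can differ from the tabulated values by a few percentage points.

```python
def reduction(baseline_cs: float, variant_cs: float) -> float:
    """Fractional reduction in cross section relative to the baseline.
    Negative values mean the cross section grew."""
    return 1.0 - variant_cs / baseline_cs


def return_on_coverage(red: float, coverage: float) -> float:
    """Reduction obtained per unit of coverage."""
    return red / coverage


baseline_cs = 2.6e-9  # cm^2, Baseline row of Table I
all_cs = 3.8e-10      # cm^2, All row of Table I

red = reduction(baseline_cs, all_cs)
print(round(red * 100))                        # percent reduction
print(round(return_on_coverage(red, 1.0), 2))  # return at 100% coverage
```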

The next set of variants compared the selection of all combinational circuit elements against the selection of all sequential circuit elements. The combinational selection surprisingly reduced the cross section by 44% even though it required nearly as many reduction voters as the number of components replicated. The combinational selection included all lookup tables in the design and provided a decent return of 1.01. The sequential selection (all registers) yielded a surprisingly poor result by proportionally increasing the cross section nearly as much as the portion of the design replicated (a 52% increase for a 56% coverage). Further study is needed to understand the implications of these results.

The following three sets of circuit variants present the findings from random selection, integer linear programming (ILP) selection targeting maximum edge protection and minimum reduction voter insertion, and feedback-based selection algorithms, respectively. Random selection proved ineffective. It either increased the cross section or decreased it slightly. This outcome shows that methodical selection is a must. The ILP selection showed favorable results supporting the original hypothesis. Here, the original cross section was reduced by 20% through replicating only 9% of circuit components. That is a return of 2.27, which is the largest return experienced by any variant. The remaining ILP variants had sizable returns as well. The final set of variants base their selection on feedback found in the design. Feedback occurs whenever an output signal from a component propagates to that same component's inputs. The returns of these variants were not as large as those of the ILP set, but they were far greater than the returns of the random set, suggesting favorable benefit from leveraging feedback relationships in component selection.
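The definition of feedback given above can be turned into a simple reachability test: a component is in feedback if following connections from its output eventually leads back to its own input. The sketch below applies that test to a netlist modeled as `(driver, sink)` pairs. The netlist, the names, and the per-component search are assumptions for illustration; the study's actual feedback-based selection algorithms are not reproduced here, and a strongly connected component algorithm would be a more efficient choice for large designs.

```python
def components_in_feedback(edges):
    """Return the components whose output can propagate back to their own input."""
    graph = {}
    for driver, sink in edges:
        graph.setdefault(driver, set()).add(sink)
        graph.setdefault(sink, set())

    in_feedback = set()
    for start in graph:
        stack = list(graph[start])  # everything directly driven by `start`
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:       # found a path back to the start
                in_feedback.add(start)
                break
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node])
    return in_feedback


# Hypothetical netlist: a -> b -> c -> a is a loop, f drives itself,
# and d, e are feed-forward only.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e"), ("f", "f")]
print(sorted(components_in_feedback(EDGES)))
```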

This study presented several different selection algorithms for implementing partial TMR in a circuit implemented on an SRAM-based FPGA. The objective in presenting these algorithms was to test the hypothesis that greater benefit from partial TMR can be gained by methodically selecting portions of the circuit to replicate, specifically by selecting areas that will protect the greatest number of routes and require a minimum number of reduction voters. Random selection was found to be ineffective. Replicating all combinational logic components significantly reduced the cross section, whereas replicating all sequential logic components significantly increased the cross section. Replicating subsets that maximize protected routes and minimize the number of reduction voters provided the greatest return of 2.27 for only 9% replication. Feedback-based selections also demonstrated promising results.

#### V. PARTIAL DWC CASE STUDY

Another study was conducted that centered on applying DWC in a partial manner to an FPGA-based networking system. Terrestrial radiation, which originates from galactic cosmic rays that enter the atmosphere or from contaminants in the packaging material, can cause high-performance networking systems to fail. Radiation-induced failure in terrestrially deployed network systems is very unlikely, but its increased occurrence in large-scale deployments [18] brings attention to this issue. DWC is used in this study to improve a network system's awareness of radiation-induced persistent silent network disruptions.

Radiation-induced upsets in CRAM can disrupt the flow of network traffic in an FPGA-based networking system by compromising the functionality of sub-components in the system. Generally speaking, a network switch contains multiple independent streams of data, buffers, and control logic for arbitration based on packet information. Each of these constructs is implemented using primitive resources available on the device. When radiation corrupts the configuration of the device, the proper functionality of these constructs can become compromised. Many high-level protocols are put in place to detect and respond to undesirable behavior, but some effects of radiation-induced upsets can remain undetected. One of the most severe failure modes is the undetected loss of traffic flow without recovery.

Partial DWC was applied in this study to regions of the network system circuitry that were evaluated as more critical based on targeted fault injection results. Ultimately, partial DWC was applied to 29% of the network system's FPGA-implemented circuit. Neutron radiation testing revealed that this design variant was able to detect 45% of all persistent silent network disruptions, which would otherwise go unnoticed. Another design variant, with DWC applied to 8% of circuit components, was able to detect 31% of persistent silent network disruptions. The following subsections cover the network system and test setup, critical region evaluation, DWC implementation, and the subsequent neutron beam tests with accompanying results.

#### A. Network System and Test Setup

The network system included in this study is a campus backbone switch. This type of networking device typically combines the networks of entire buildings across a wide area, as opposed to connecting individual computers together in a single building or room. An enormous amount of data can be processed through these systems. The studied system is built for modular expansion. The chassis furnishes 16 networking ports and each additional modular expansion furnishes an additional 16 ports. Each group of 8 ports is paired with an interface ASIC and an SRAM-based FPGA (a Virtex 7 330T). The FPGA is responsible for a considerable portion of network data processing and plays an essential role in connecting the paired ports to the system and other available ports as a whole.

To mitigate network connectivity loss, these network systems are often configured with system-level redundancy and switchover capability. In the unlikely event that connectivity loss is sustained in a single system, the system-level redundancy with switchover capability provides an alternate path that can be used to complete the connection. Under this configuration, persistent silent network loss in a single system will have minimal impact. Higher level protocols will be able to route traffic to the alternate path. This study uses DWC to improve system awareness of persistent network loss to enable faster repair times.

Figure 7 shows a diagram of the test setup used in fault injection and neutron radiation testing. An external traffic generator is used to generate network data to be used as stimulus. The traffic generator also monitors the successful delivery of those packets. Packets are transmitted and received bidirectionally from each port connected to the traffic generator. The 16 ports on the modular network board under test are configured to forward received traffic to an adjacent port such that received data loops through each port until it reaches its final destination. It is important to note that this configuration causes traffic to travel through the backplane of the device. This exercises more of the FPGA logic and improves the coverage of the test. The FPGA on the board is connected via JTAG to a custom configuration manager (JCM) that monitors for radiation-induced upsets. The JCM can also inject faults as needed for fault injection testing. The host computer orchestrates the flow of the test and can monitor and control both the traffic generator and the modular network board. The network board is monitored and controlled through a console connection.



Fig. 7. Test Setup for Fault Injection and Radiation Testing

The presented test setup is used in tandem with a test flow to collect data for evaluating the effectiveness of the applied technique. Figure 8 presents the test flow used for fault injection testing, which resembles the flow of a typical fault injection campaign [8]. First, the system is brought into a working state. Then a fault is injected into the CRAM of the FPGA that implements the circuit under test. A set of diagnostic checks is then made to determine whether the system is still working correctly. If it is, the injected fault is repaired and a subsequent fault is injected to continue the test. If the system is not behaving correctly, the error is reported, the system is rebooted to bring it back into a working state, and a subsequent fault is injected to continue the test. This loop continues until sufficient data has been collected.



Fig. 8. Fault Injection Test Flow
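
The loop in Figure 8 can be sketched in a few lines. The sketch below is only an illustration of the control flow: `SimulatedDevice` is a hypothetical stand-in for the FPGA, the JCM, and the traffic generator diagnostics, and none of its method names correspond to a real tool interface.

```python
import random
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SimulatedDevice:
    """Hypothetical stand-in for the FPGA, JCM, and diagnostic checks."""
    critical_bits: Set[int]
    flipped: Set[int] = field(default_factory=set)
    reboots: int = 0

    def inject_fault(self, bit: int) -> None:
        self.flipped.add(bit)

    def repair_fault(self, bit: int) -> None:
        self.flipped.discard(bit)

    def diagnostics_pass(self) -> bool:
        # The system is healthy if no critical bit is currently upset.
        return not (self.flipped & self.critical_bits)

    def reboot(self) -> None:
        self.flipped.clear()
        self.reboots += 1


def run_campaign(device: SimulatedDevice, candidate_bits: List[int],
                 num_injections: int, seed: int = 0) -> List[int]:
    """Inject faults one at a time and return the bits that caused failures."""
    rng = random.Random(seed)
    failures: List[int] = []
    for _ in range(num_injections):
        bit = rng.choice(candidate_bits)
        device.inject_fault(bit)
        if device.diagnostics_pass():
            device.repair_fault(bit)  # system still works: repair and continue
        else:
            failures.append(bit)      # report the error
            device.reboot()           # bring the system back to a working state
    return failures


device = SimulatedDevice(critical_bits={3, 7})
failures = run_campaign(device, candidate_bits=list(range(100)), num_injections=1000)
sensitivity = len(failures) / 1000
print(f"failures={len(failures)} sensitivity={sensitivity:.1%}")
```

Because each fault is repaired or cleared by a reboot before the next injection, every injection is an independent sample, which is what allows the failure fraction to be read as a sensitivity estimate.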

A similar test flow was used for neutron radiation testing. In neutron radiation testing, faults are not purposefully injected. Instead, the neutron fluence is measured from the beginning of a test run to a failure event. This fluence-to-failure measurement is made for several samples of the same event, and that data is then used to estimate the neutron cross section of the event.

#### B. Critical Region Evaluation

It is hypothesized that some areas in a circuit are more important than others when it comes to selecting which areas of a circuit should be protected by partial DWC. SEUs can occur anywhere in the device, but if they occur in areas where the underlying circuitry has the greatest responsibility for the proper functionality of the system, then a failure outcome may be more likely. This hypothesis suggests that the failure rate of a circuit can be unevenly distributed across the components in the circuit. To maximize the error detection return (percent of errors detected over percent of circuit replicated), selection priority should be given to high-risk regions.

Critical regions were evaluated for their proportional contribution to the overall failure rate through fault injection. Vendor tools were used to specify which subset of CRAM bits were potentially used by different regions of the design. These potentially used bits are known as essential bits [19]. Fault injection was used to evaluate four different regions: the circuit as a whole, the packet reader (PR) module, the traffic manager (TM) module, and the Interlaken (Inter) module. Essential bits for subregions were collected by removing all other regions from the placed and routed design and then regenerating the essential bits for the remaining placed and routed subregion. Only the essential bits that appeared in both the original list and the regenerated list were used, which filters out minor changes in peripheral routing that were necessary for bitstream generation. It is important to note that an upset in an essential bit does not guarantee a failure; it only indicates that the underlying circuit may be affected by an upset in the bit.

Table II presents the results from the fault injection campaigns. The first column presents the regions tested. The second column designates the number of essential bits in each region. Columns three and four show the number of random fault injections in the region and the number of observed failures, respectively (randomly injected faults were tested independently of each other). The fifth column shows the sensitivity approximated by population sampling (failures over faults injected). The final column presents the estimated number of critical bits in the sampled region. Critical bits are bits that, if upset, will cause failure with a high likelihood [19].

TABLE II FAULT INJECTION RESULTS WITHIN A TARGET REGION

| Region          | Essential<br>Bits | Faults<br>Injected | Failures | Sensitivity | Critical<br>Bits |
|-----------------|-------------------|--------------------|----------|-------------|------------------|
| Whole<br>Design | 27.1M             | 29624              | 360      | 1.2%        | 325.2K           |
| PR              | 1.7M              | 3628               | 104      | 2.9%        | 49.3K            |
| TM              | 2.1M              | 23402              | 467      | 2.0%        | 42.0K            |
| Inter           | 4.9M              | 19627              | 435      | 2.2%        | 107.8K           |
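
The last two columns of Table II follow from the first three. The sketch below recomputes them from the tabulated inputs. It assumes, based on the tabulated values, that the critical bit estimate is the essential bit count scaled by the sensitivity after rounding to one decimal place; the function names are illustrative only.

```python
# (essential bits, faults injected, observed failures) copied from Table II
table_ii = {
    "Whole Design": (27.1e6, 29624, 360),
    "PR": (1.7e6, 3628, 104),
    "TM": (2.1e6, 23402, 467),
    "Inter": (4.9e6, 19627, 435),
}


def sensitivity_percent(faults_injected: int, failures: int) -> float:
    """Failures over faults injected, as a percentage rounded to one decimal."""
    return round(100.0 * failures / faults_injected, 1)


def estimated_critical_bits(essential_bits: float, sens_percent: float) -> float:
    """Scale the essential bit count by the sampled sensitivity."""
    return essential_bits * sens_percent / 100.0


results = {}
for region, (essential, injected, failed) in table_ii.items():
    sens = sensitivity_percent(injected, failed)
    results[region] = (sens, estimated_critical_bits(essential, sens))
    print(f"{region:12s} sensitivity={sens:.1f}%  "
          f"critical bits={results[region][1] / 1e3:.1f}K")
```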

The estimated number of critical bits in the sampled regions can be used to estimate the distribution of the critical bits among the regions of the circuit. Table III shows the distribution of critical bits based on the resource utilization and the number of estimated critical bits in each region. The first and second columns show the evaluated critical region and the percentage of the whole circuit that its resource utilization makes up, respectively. The "All 3" region combines the PR, TM, and Inter regions of the design, which are mutually exclusive. The "Other" region covers the combination of regions that are not included in the first three regions. Column three presents the percentage of critical bits that reside in each region. A 95% confidence interval is included with each percentage. The final column presents the relative critical bit density of each subregion of the circuit.

TABLE III DISTRIBUTION OF CRITICAL BITS

| Region | Percent of<br>Overall Circuit | Percent of<br>Critical Bits | Sensitivity compared<br>to whole circuit |
|--------|-------------------------------|-----------------------------|------------------------------------------|
| PR     | 6.1%                          | 14.4±4.7%                   | 2.36×                                    |
| TM     | 7.7%                          | 12.6±2.7%                   | 1.63×                                    |
| Inter  | 18%                           | 32.9±7.1%                   | 1.83×                                    |
| All 3  | 31.8%                         | 59.9±14.5%                  | 1.88×                                    |
| Other  | 68.2%                         | 40.1±14.5%                  | 0.59×                                    |

From this evaluation it can be seen that the PR region has the most critical bits per resource utilization, with a factor of $2.36 \times$. The TM and Inter regions also have elevated sensitivities compared to the whole design, and the remaining portions of the circuit have a lower sensitivity than the circuit as a whole. This evaluation suggests that applying DWC to all three of the elevated regions would likely be able to detect about 60% of persistent silent network disruptions while requiring duplication of only 32% of the circuit, which is a favorable outcome.
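
The relative density in the last column of Table III is the share of critical bits divided by the share of the circuit. The short sketch below recomputes it from the tabulated percentages and ranks the regions. The results agree with the table to within rounding of the published percentages; the function name is illustrative.

```python
# (percent of overall circuit, percent of critical bits) copied from Table III
table_iii = {
    "PR": (6.1, 14.4),
    "TM": (7.7, 12.6),
    "Inter": (18.0, 32.9),
    "All 3": (31.8, 59.9),
    "Other": (68.2, 40.1),
}


def relative_density(pct_circuit: float, pct_critical: float) -> float:
    """Critical bit share per unit of circuit share (1.0 means average)."""
    return pct_critical / pct_circuit


densities = {region: relative_density(*row) for region, row in table_iii.items()}

# Regions with a density above 1.0 are better-than-average DWC candidates.
ranked = sorted(densities, key=densities.get, reverse=True)
for region in ranked:
    print(f"{region:6s} {densities[region]:.2f}x")
```

For the "All 3" row, the same ratio is also the expected error detection return: roughly 60% of critical bits for roughly 32% of the circuit.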

#### C. Implementation

Partial DWC was applied to this circuit using a custom electronic design automation (EDA) tool. The basic flow of a circuit through this tool is shown in Figure 9. The circuit originates in hardware description language (HDL) source files. These are converted through logic synthesis into a netlist. A netlist details the circuit's components and connectivity. This netlist is then supplied to the custom EDA tool. Here, the user defined portions of the circuit are replicated and supporting connections and detectors are added. The EDA tool then produces an updated netlist that is supplied to vendor tools for technology mapping, placement, and routing. A resultant bitstream is then made available for deployment.



Fig. 9. Partial DWC Insertion Flow
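
The netlist transformation step can be pictured with a toy model. The sketch below is not the custom EDA tool described above. It assumes a hypothetical netlist format (a dictionary mapping each component to the components that drive its inputs), a `_dup` naming convention for replicas, and a single comparator per boundary signal rather than the detector pairs reported for the actual designs.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical format: component name -> names of components driving its inputs
Netlist = Dict[str, List[str]]


def insert_partial_dwc(netlist: Netlist,
                       selected: Set[str]) -> Tuple[Netlist, List[Tuple[str, str]]]:
    """Duplicate the selected components and list the signal pairs to compare."""
    updated: Netlist = {name: list(drivers) for name, drivers in netlist.items()}

    # Each duplicate reads from duplicated drivers where they exist and from
    # the original driver where the edge enters the duplicated region.
    for name in selected:
        updated[name + "_dup"] = [
            d + "_dup" if d in selected else d for d in netlist[name]
        ]

    # A detector compares original and duplicate wherever a selected
    # component drives logic outside the duplicated region.
    detectors: List[Tuple[str, str]] = []
    for name in sorted(selected):
        drives_outside = any(
            name in drivers and sink not in selected
            for sink, drivers in netlist.items()
        )
        if drives_outside:
            detectors.append((name, name + "_dup"))
    return updated, detectors


# Hypothetical four-stage pipeline with the first two stages selected.
pipeline = {"in0": [], "a": ["in0"], "b": ["a"], "c": ["b"], "out0": ["c"]}
new_netlist, detectors = insert_partial_dwc(pipeline, {"a", "b"})
print(new_netlist["b_dup"], detectors)
```

Placing comparators only where signals leave the duplicated region keeps the detector count proportional to the boundary of the selection rather than to its size.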

Table IV shows the resource utilization of the respective variations of the circuit. Three variations of the circuit are examined in this study. First, the baseline without any added redundancy is used as a reference point for comparison. Second, a variant was made that duplicates large portions of all three of the studied critical regions (PR, TM, and Inter; about 90%). Finally, a third variant duplicated paths between strongly connected components within the studied critical regions. The numbers of slices, registers, lookup tables (LUTs), and block memories (BRAMs) are shown. The percentage of components replicated and the number of required detector pairs are also shown. Note that, based on the accompanying percentage of resource utilization (in parentheses), it is not possible to duplicate the entire design.

TABLE IV PARTIAL DWC RESOURCE UTILIZATION

| Virtex 7 330T  | Baseline      | Partial DWC<br>PR/TM/INTER | Partial DWC<br>Between SCC |
|----------------|---------------|----------------------------|----------------------------|
| Slices         | 40,826 (80%)  | 49,016 (96%)               | 44,814 (88%)               |
| Registers      | 136,766 (34%) | 180,516 (44%)              | 152,106 (37%)              |
| LUTs           | 99,165 (49%)  | 134,639 (66%)              | 113,278 (56%)              |
| BRAMs          | 457.5 (61%)   | 569 (76%)                  | 477 (64%)                  |
| DWC Coverage   | 0%            | 29%                        | 8%                         |
| Detector Pairs | 0             | 2,627                      | 1,687                      |

#### D. Neutron Beam Test and Results

Figure 10 shows the setup used for neutron testing of the network system at the ChipIR experiment at the ISIS neutron source of the Rutherford Appleton Laboratory in the United Kingdom in March 2019. One of the FPGAs in the system was positioned perpendicular to the neutron flight path such that the 2 inch collimated beam passed directly through the device. The distance from the source to the targeted FPGA was used to derate the neutron flux at the point where the beam intercepts the chip. Over a 36 hour period of testing, a total fluence of $4.5 \times 10^9$ n cm<sup>-2</sup> was collected towards the results.



Fig. 10. Accelerated Neutron Beam Test Setup at ChipIR

Table V shows the accuracy of partial DWC in neutron testing for detecting otherwise undetectable persistent silent network disruptions. The total number of observed upsets (SEUs) is shown along with the total number of observed failures and the number of failures that were detected by partial DWC. Partial DWC applied to all three critical regions was able to detect 45% of all observed failures, and partial DWC applied to critical structures within those regions was able to detect 31% of all observed failures. It is important to note that more than 50% of the detections reported by both versions were false detections, meaning that an actual failure did not occur. This is expected behavior. Since only one of the replicas in a DWC scheme drives the I/O of the circuit, any error in the redundant copy will be flagged as a detection even though the error will not affect the final output of the circuit.

TABLE V PARTIAL DWC ACCURACY IN NEUTRON TESTING

| Design            | Baseline | Partial DWC<br>PR/TM/INTER | Partial DWC<br>Between SCC |
|-------------------|----------|----------------------------|----------------------------|
| SEUs              | 459      | 675                        | 1024                       |
| Total Failures    | 11       | 20                         | 32                         |
| Detected Failures | 0        | 9                          | 10                         |
| Accuracy          | 0%       | 45%                        | 31%                        |
| False Detection   | 0%       | 61%                        | 58%                        |
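
The accuracy row of Table V, and the error detection return defined in Section V-B (percent of errors detected over percent of circuit replicated), can be recomputed from the tabulated counts. In the sketch below the coverage figures come from Table IV; the return values are derived here for illustration and are not reported in the tables.

```python
# (DWC coverage %, total failures, detected failures) from Tables IV and V
variants = {
    "PR/TM/INTER": (29.0, 20, 9),
    "Between SCC": (8.0, 32, 10),
}


def detection_accuracy(total_failures: int, detected_failures: int) -> float:
    """Percentage of observed failures that partial DWC flagged."""
    return 100.0 * detected_failures / total_failures


def detection_return(accuracy_pct: float, coverage_pct: float) -> float:
    """Percent of errors detected per percent of circuit replicated."""
    return accuracy_pct / coverage_pct


summary = {}
for name, (coverage, total, detected) in variants.items():
    acc = detection_accuracy(total, detected)
    summary[name] = (acc, detection_return(acc, coverage))
    print(f"{name:12s} accuracy={acc:.0f}%  return={summary[name][1]:.2f}")
```

Under this definition, the smaller "Between SCC" selection has the higher return even though it detects a smaller share of failures.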

Table VI shows the results from the neutron test. The number of persistent network disruptions that went undetected is listed in the second row. The fluence observed during the respective tests is shown in the third row. The estimated cross section with 95% confidence intervals is shown in the fourth row, and the fifth row presents the FIT of undetected failures (persistent silent network disruptions) adjusted to a New York City reference neutron flux [11]. The confidence intervals of the measurements overlap. At face value, the FIT of undetected failures is reduced considerably through the application of partial DWC.

TABLE VI NEUTRON TEST RESULTS

| Design                       | Baseline                                    | Partial DWC<br>PR/TM/INTER                  | Partial DWC<br>Between SCC                  |
|------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
| Undetected<br>Failures       | 11                                          | 11                                          | 22                                          |
| Fluence                      | 9.37E+8 n/cm<sup>2</sup>                    | 1.35E+9 n/cm<sup>2</sup>                    | 2.17E+9 n/cm<sup>2</sup>                    |
| Cross Section<br>(95% conf.) | 1.17E-8 cm<sup>2</sup><br>(5.7E-9, 2.1E-8)  | 8.17E-9 cm<sup>2</sup><br>(3.3E-9, 1.3E-8)  | 1.01E-8 cm<sup>2</sup><br>(5.9E-9, 1.4E-8)  |
| FIT<br>(95% conf.)           | 152 (74, 273)                               | 106 (43, 169)                               | 132 (77, 187)                               |
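
The point estimates in Table VI can be approximated with two short formulas: the cross section is the number of events divided by the fluence, and the FIT is the cross section multiplied by a reference flux and by 10<sup>9</sup> hours. The sketch below assumes a New York City reference flux of 13 n/(cm<sup>2</sup> h), a value chosen because it is consistent with the tabulated FIT figures. It does not attempt to reproduce the confidence intervals, and the results match the table only to within rounding of the reported fluences.

```python
# Assumed reference neutron flux for New York City, in n/(cm^2 h).
NYC_FLUX_PER_CM2_HR = 13.0
HOURS_PER_FIT = 1e9  # FIT counts failures per billion device hours

# (undetected failures, fluence in n/cm^2) copied from Table VI
table_vi = {
    "Baseline": (11, 9.37e8),
    "PR/TM/INTER": (11, 1.35e9),
    "Between SCC": (22, 2.17e9),
}


def cross_section(events: int, fluence: float) -> float:
    """Point estimate of the event cross section in cm^2."""
    return events / fluence


def fit_rate(sigma_cm2: float, flux: float = NYC_FLUX_PER_CM2_HR) -> float:
    """Failures per 1e9 hours at the given reference flux."""
    return sigma_cm2 * flux * HOURS_PER_FIT


estimates = {}
for design, (events, fluence) in table_vi.items():
    sigma = cross_section(events, fluence)
    estimates[design] = (sigma, fit_rate(sigma))
    print(f"{design:12s} sigma={sigma:.2e} cm^2  FIT={estimates[design][1]:.0f}")
```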

In this study, DWC was applied to critical regions of a commercial FPGA-based campus backbone switch. The distribution of the circuit's failure rate was evaluated through fault injection. Some areas were found to be more sensitive than others, such as the PR module, which was $2.36 \times$ more likely (Table III) to result in failure should a random upset occur in this region of the circuit. DWC was applied to all three evaluated regions. A coverage of 29% was able to detect 45% of all otherwise undetectable failures, and a coverage of 8% (a subset within the evaluated regions) was able to detect 31% of all otherwise undetectable failures. The results suggest significant improvement in the system's ability to detect these otherwise undetectable errors through the use of partial DWC.

#### VI. CONCLUSION

Three partial circuit replication studies are presented in this work. The first study examines the reliability model of partial circuit replication (more specifically, partial TMR). This study finds that the reliability benefit of partial circuit replication is inversely related to the portion of the original failure rate that is protected through partial circuit replication. Benefit is greatest when a very large portion of the original failure rate is protected. The second study examines several selection algorithms for the application of partial TMR. This study finds that random selection does not improve reliability. Maximizing protected routes and minimizing the insertion of reduction voters was able to reduce the original failure rate by 20% while replicating only 9% of the circuit. The final study used partial circuit replication to improve the detection of radiation-induced network outages in an SRAM-based FPGA networking application. Partial DWC of 29% of circuit components was able to detect 45% of persistent network outages, and partial DWC of 8% of circuit components was able to detect 31% of outages, as demonstrated through neutron radiation testing. All in all, these three studies demonstrate some of the benefits of partial circuit replication and advance the state-of-the-art understanding and implementation of partial TMR and partial DWC.

#### ACKNOWLEDGMENTS

This work was supported by LANSCE under proposal NS-2019-8294-A and by ChipIR at the ISIS neutron source of the Rutherford Appleton Laboratory (UK) under proposal 1900120.

#### REFERENCES

- [1] M. Ceschia, M. Violante, M. Reorda, A. Paccagnella, P. Bernardi, M. Rebaudengo, D. Bortolato, M. Bellato, P. Zambolin, and A. Candelori, "Identification and classification of single-event upsets in the configuration memory of SRAM-Based FPGAs," *IEEE Transactions on Nuclear Science*, vol. 50, no. 6, pp. 2088–2094, dec 2003. [Online]. Available: http://ieeexplore.ieee.org/document/1263846/
- [2] N. Battezzati, L. Sterpone, and M. Violante, "Reconfigurable Field Programmable Gate Arrays: Failure Modes and Analysis," in *Reconfigurable Field Programmable Gate Arrays for Mission-Critical Applications*. Springer, 2011, pp. 37–83.
- [3] P. G. Neumann, *Computer-related risks*. Addison-Wesley Professional, 1994.
- [4] D. McMurtrey, "Using Duplication with Compare for On-line Error Detection in FPGA-based Designs," Ph.D. dissertation, Brigham Young University, dec 2006. [Online]. Available: https://scholarsarchive.byu.edu/etd/1094
- [5] S. A. Elkind and D. P. Siewiorek, "Reliability techniques," in *Reliable Computer Systems: Design and Evaluation*, 3rd ed. A K Peters, Ltd., 1998, pp. 335–336.
- [6] J. M. Johnson and M. Wirthlin, "Voter insertion algorithms for FPGA designs using triple modular redundancy," in *Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays*, ser. FPGA '10. New York, NY, USA: ACM, 2010, pp. 249–258. [Online]. Available: http://doi.acm.org/10.1145/1723112.1723154
- [7] M. L. Shooman, *Reliability of computer systems and networks: fault tolerance, analysis, and design*. John Wiley & Sons, 2003.
- [8] H. M. Quinn, D. A. Black, W. H. Robinson, and S. P. Buchner, "Fault simulation and emulation tools to augment radiation-hardness assurance testing," *IEEE Transactions on Nuclear Science*, vol. 60, no. 3, pp. 2119–2142, jun 2013. [Online]. Available: http://ieeexplore.ieee.org/document/6519339/
- [9] S. McConnel and D. P. Siewiorek, "Evaluation criteria," in *Reliable Computer Systems: Design and Evaluation*, 3rd ed. A K Peters, Ltd., 1998, pp. 335–336.
- [10] A. L. Silburt, A. Evans, I. Perryman, S. Wen, and D. Alexandrescu, "Design for soft error resiliency in internet core routers," *IEEE Transactions on Nuclear Science*, vol. 56, no. 6, pp. 3551–3555, dec 2009. [Online]. Available: http://ieeexplore.ieee.org/document/5341362/
- [11] Measurement and reporting of alpha particle and terrestrial cosmic ray-induced soft errors in semiconductor devices, JEDEC Solid State Technology Association Std. 89A, 2006. [Online]. Available: https://www.jedec.org/sites/default/files/docs/JESD89A.pdf
- [12] H. Quinn, "Challenges in testing complex systems," *IEEE Transactions on Nuclear Science*, vol. 61, no. 2, pp. 766–786, apr 2014. [Online]. Available: http://ieeexplore.ieee.org/document/6786369/
- [13] A. J. Tylka, J. H. Adams, P. R. Boberg, B. Brownstein, W. F. Dietrich, E. O. Flueckiger, E. L. Petersen, M. A. Shea, D. F. Smart, and E. C. Smith, "Creme96: A revision of the cosmic ray effects on microelectronics code," *IEEE Transactions on Nuclear Science*, vol. 44, no. 6, pp. 2150–2160, Dec 1997.
- [14] K. Morgan *et al.*, "SEU-induced persistent error propagation in FPGAs," *IEEE Trans. Nucl. Sci.*, vol. 52, no. 6, pp. 2438–2445, dec 2005.
- [15] B. Pratt *et al.*, "Fine-grain SEU mitigation for FPGAs using partial TMR," *IEEE Trans. Nucl. Sci.*, vol. 55, no. 4, pp. 2274–2280, aug 2008.
- [16] F. Corno *et al.*, "RT-level ITC'99 benchmarks and first ATPG results," *IEEE Des. Test. Comput.*, vol. 17, no. 3, pp. 44–53, Jul 2000.
- [17] H. Quinn *et al.*, "Using benchmarks for radiation testing of microprocessors and FPGAs," *IEEE Trans. Nucl. Sci.*, vol. 62, no. 6, pp. 2547–2554, Dec 2015.
- [18] H. Quinn and P. Graham, "Terrestrial-based radiation upsets: a cautionary tale," in *13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'05)*, 2005, pp. 193–202.
- [19] R. Le, "Soft error mitigation using prioritized essential bits," Xilinx XAPP538 (v1.0), 2012.