Abstract-A variety of fault emulation systems have been created to study the effect of single-event effects (SEEs) in static random access memory (SRAM) based field-programmable gate arrays (FPGAs). These systems are useful for augmenting radiation-hardness assurance (RHA) methodologies by verifying the effectiveness of mitigation techniques, understanding error signatures and failure modes in FPGAs, and estimating failure rates. For radiation effects researchers, it is important that these systems properly emulate how SEEs manifest in FPGAs. If the fault emulation system does not mimic the radiation environment, the system will generate erroneous data and incorrect predictions of the behavior of the FPGA in a radiation environment. Validation determines whether the emulated faults are reasonable analogs to the radiation-induced faults. In this paper we present methods for validating fault emulation systems and provide several examples of validated FPGA fault emulation systems.
I. INTRODUCTION

IN RECENT years many organizations have been studying methods for integrating SRAM-based FPGAs into space missions. Many of these organizations have studied how SEEs manifest in FPGAs. The most common SEE for Xilinx FPGAs is the single-event upset (SEU). SEUs are possible in the configuration memory, the user memory, and the embedded cores of SRAM-based FPGAs [1]-[4]. SEUs in the configuration memory cause changes to the programmable logic and routing. SEUs in the user memory (internal block memory and user flip-flops) change the variables and state of the system and may lead to erroneous results or incorrect functionality. Finally, the embedded hard IP cores typically contain registers, and SEUs could cause changes to the operation and behavior of these cores.
The error signatures caused by SEUs in an FPGA can be quite complex. The effects of SEUs are often not observable until they have propagated throughout the system. Further, user circuits can logically mask some of these faults. RHA methodologies can be useful for studying these error signatures. The standard for testing error signatures is to use accelerated radiation testing. Many organizations have been leveraging fault injection techniques to obtain similar information. Fault injection covers two main disciplines:
• Fault Simulation: analytical methods for analyzing how faults affect the FPGA, and
• Fault Emulation: hardware-based methods for inserting faults into the FPGA.
Due to space limitations, this paper focuses only on fault emulation. There are many advantages to using fault emulation for RHA, including lower cost, rapid testing time, and the ability to customize the experiment. In the United States, accelerated radiation testing can cost between $500 and $1,500 per hour and is affected by the limited availability of facilities. For fault emulation testing, the only cost is the hardware of the emulation system and the labor to create and run the system. Once a fault emulation system has been designed, it can be replicated as many times as necessary to meet the needs of an organization. By controlling the type and location of faults algorithmically, test designers can develop a variety of custom experiments and test scenarios. Examples of such custom experiments include the testing of multiple-independent upsets (MIUs) and multiple-cell upsets (MCUs), non-uniform SEU upsets, design mitigation testing, and targeted SEUs into specific regions of the device.
There are limitations to fault emulation systems, including the inability to test for SEE cross sections, the inability to inject faults into all locations of the component, and the inflexibility of some systems. Fault emulation does not provide a measurement of the SEU bit cross sections. SEE testing is still necessary to determine the SEE cross sections of the component. Fault emulation is also not able to inject into state hidden from the user. While the hidden circuitry is less sensitive to these types of failures than SEUs, it does mean that fault emulation systems cannot cover the entire fault space for FPGAs. Finally, many fault emulation systems are designed to inject specific types of faults, and emulating all of the failure modes can be difficult or time consuming.
0018-9499 © 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/ redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
In spite of these limitations, FPGA fault emulation has proved to be very useful, especially for SRAM-based FPGAs. Fault emulation is useful for these FPGAs because the primary FPGA fault mechanism is the occurrence of SEUs in the configuration memory, user memory, or flip-flops of the device. As the configuration memory determines the operation of the circuit programmed on the device, SEUs in this memory cause disruptions in the circuit behavior that are easy to detect and monitor. These SEUs can be emulated by inserting upsets directly into the component and observing the behavior of the system. SRAM-based FPGA fault emulation has been successfully used in the following situations:
1) Verify/test mitigation techniques: Mitigation techniques can be difficult to apply effectively. Furthermore, designing new mitigation techniques can be time consuming. Fault emulation can help designers and researchers determine whether mitigation is increasing the robustness of the design.
2) Understand error signatures and failure modes: Because the users can test the FPGA uniformly, they can capture the system's output at each failure location. This output can be used to determine how the system fails and how these failures could affect the overall system.
3) Failure rate estimation: Not all of the SEEs on the FPGA lead to observable output errors, especially as much of the configuration memory and logic is not used. Fault emulation testing can help determine the number of configuration memory bits for a specific design that are sensitive to output errors. This number can be used along with orbit upset rates to provide preliminary design failure estimates in a particular environment.
An important step in designing a fault emulation system is the validation of the system. Validation determines whether the emulated faults are reasonable analogs to the radiation-induced faults. As fault emulation systems can create a number of different emulation scenarios, such as MCUs or MIUs, it is important that the validation process can mimic these same conditions.
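The failure rate estimation use case above reduces to simple arithmetic once a sensitive-bit count is available. The following is a minimal sketch, assuming the per-bit upset rate is uniform across the configuration memory; the function name and all numeric values are hypothetical and chosen only for illustration.

```python
def design_failure_rate(sensitive_bits: int, upset_rate_per_bit_day: float) -> float:
    """Expected design failures per day.

    Assumes every sensitive bit upsets independently at the same per-bit rate.
    """
    return sensitive_bits * upset_rate_per_bit_day


# Hypothetical inputs: a sensitive-bit count from a fault emulation campaign
# and an orbit-specific per-bit upset rate (upsets/bit/day).
sensitive_bits = 250_000
upset_rate = 1.0e-8

failures_per_day = design_failure_rate(sensitive_bits, upset_rate)
mean_days_between_failures = 1.0 / failures_per_day
print(f"{failures_per_day:.4g} failures/day, "
      f"about {mean_days_between_failures:.0f} days between failures")
```

Such an estimate is only preliminary: it inherits any error in the sensitive-bit count, which is why validation of the emulation system matters.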
In this paper we review FPGA-based fault emulation systems. We start with background on current FPGA-based fault emulation systems in Sections II and III. In Section IV we discuss methods for validating FPGA-based fault emulation systems. In Section V we discuss the underpinnings of statistical confidence intervals for fault emulation and validation. In Section VI we present methods and results from previously validated systems.
II. DESIGNING FAULT EMULATION SYSTEMS
In this section we review the fundamental aspects of FPGA fault emulation design and summarize several existing fault emulation systems. A more detailed handling of this topic can be found in [6] . It should be noted that FPGAs are used extensively to prototype hardware systems and to provide fault emulation in these prototype systems. These prototyping fault emulation systems will not be discussed in this paper. This paper will also not explore fault emulation of Single Event Functional Interrupts (SEFIs), as most organizations do not emulate SEFIs.
In this paper, we focus only on FPGA-based fault emulation systems that are being used to estimate the effect of radiation-induced errors on the FPGA itself.
Fig. 2. This diagram of the Autonomous Emulation System provides a useful reference for the different aspects of fault emulation system design [7].
We start by examining a basic software algorithm and hardware design for a fault emulation system. An algorithm is shown in Fig. 1 and generically addresses the fault emulation process. A visual overview of a typical hardware system is shown in Fig. 2 , which shows the component under test (CUT) and the Emulation Controller that controls the fault emulation process. There are a few aspects that every fault emulation system needs:
• Access to the internal memory where faults are inserted,
• Ability to stimulate and execute the circuit,
• Ability to determine output errors, and
• Ability to clear errors.
In this section, we discuss each of these aspects in detail.
A. Injecting Faults
The most important mechanism of an FPGA fault emulation system is its ability to inject faults into the component. For FPGAs there are often a number of methods for accessing the configuration memory. One of the most powerful tools in fault emulation today is the Joint Test Action Group (JTAG) boundary scan port [8]-[11], as nearly all manufacturers implement some portion of the JTAG standard. While slower than other configuration mechanisms, JTAG provides a serial interface to the configuration memory that can be used by a variety of tools and software.
Fig. 3. Fault emulation platform from Politecnico di Torino [12].
Most FPGA manufacturers also provide other configuration ports that can be used for higher speed fault emulation. Xilinx has an external parallel configuration port, called the SelectMAP port, that can insert faults faster than JTAG. Starting in the Virtex-II, Xilinx also provides an internal configuration port, called the Internal Configuration Access Port (ICAP) [12], that can also be used to access the configuration memory from within the device. A basic fault emulation intellectual property (IP) core is available from Xilinx [13] to modify the configuration memory and implement fault emulation. Similarly, Altera provides a fault emulation core for Stratix-V components [14]-[16].
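Whichever port is used, injecting a configuration SEU amounts to a read-modify-write of one configuration frame: read the frame back, invert one bit, and write the frame to the device. A minimal sketch of the modify step follows. The frame is modeled as a plain byte buffer, the bit ordering within a byte is an assumption, and the port-specific readback and write operations are deliberately omitted because they depend on the device and interface.

```python
def flip_bit(frame: bytearray, bit_index: int) -> None:
    """Invert one bit of a frame buffer in place.

    Assumes bit 0 is the least significant bit of byte 0 (illustrative ordering).
    """
    byte_index, bit_in_byte = divmod(bit_index, 8)
    frame[byte_index] ^= 1 << bit_in_byte


# A hypothetical 4-byte frame, all zeros.
frame = bytearray(4)
original = bytes(frame)

flip_bit(frame, 10)          # emulate the SEU
upset = bytes(frame)

flip_bit(frame, 10)          # flipping the same bit again repairs it
repaired = bytes(frame)
```

Because the flip is its own inverse, the same routine serves for both injecting and removing an emulated upset.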
In recent years, the use of the ICAP port has become more common, as it makes it easier to use evaluation boards. These ICAP-based fault emulation tools often only use one FPGA for the entire system, including the test circuit, SEU controller and experiment controller (input data, functional monitoring). Fig. 3, from [12], shows a Virtex-II Pro ICAP system that is designed to be executed with only one FPGA.
Emulating faults in the user memory and embedded cores is more of a challenge than emulating faults in the configuration memory. Accessing user memory while the circuit is executing may corrupt data and it might be necessary to modify the design so there is a secondary data input into latches, as is done in [17] - [19] . It is possible to emulate faulty input or output data by inserting SEUs in the user flip-flops attached to the embedded cores' ports. Inserting faults into the embedded cores is less developed at this stage.
The mechanism for injecting faults affects two different aspects of test space coverage: fault model coverage and timing coverage. Most FPGA fault emulation systems are designed to cover only one fault model, which means that only the fault locations for one fault model are emulated at a time. In comparison, radiation can broadly sample from all of the failure models. To avoid having disjoint data sets, it is best that one of the data sets covers a large portion of the fault model. Sometimes good coverage of the fault model can be difficult, especially when the number of fault locations is high and/or the emulation process is slow. We will discuss the coverage issue in the next section when we talk about statistical properties of fault emulation.
Coverage of timing can be a complex issue in fault emulation. [6] covers this topic solely from an operational point of view, where the timing of the fault with the computation is explored. There are other timing issues, especially with single-event transients (SETs). For SETs it is important that the insertion points cover the entire clock cycle, that the SETs use a pulse length that mimics radiation, and that the SETs are easily overwritten. In comparison, SEUs in SRAM FPGAs can be inserted on clock edges and should persist for several clock cycles.
B. System Stimulation
Once the fault is inserted, the circuit needs to be tested using input stimulus. The input vector set grows exponentially with the size of the input bus, which makes it difficult to cover completely. In addition, some of the fault locations have input-sensitive errors, which can only be triggered with specific input vectors. Ideally, the input stimulus will be crafted to exercise all areas of the circuit and provide full coverage of the various corner cases in the circuit operation. Three techniques from software engineering that could be used for this purpose are random testing [20], structural testing [21] and statistical testing [22]. Random testing, which uses pseudo-random input sequences, is most commonly used as the input stimulus when more carefully crafted test vectors are not available. The greater the coverage of the input stimulus on the circuit operation, the greater the probability that input-sensitive faults impacting circuit operation will be detected.
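Pseudo-random stimulus is attractive partly because it is reproducible: the same seed yields the same vectors for both the faulted run and the reference run. As a hedged illustration, the sketch below generates test vectors with a 16-bit Fibonacci linear-feedback shift register (taps 16, 14, 13, 11). The choice of LFSR, the seed, and the function names are assumptions for this example and are not taken from any of the surveyed systems.

```python
from typing import Iterator, List


def lfsr16(seed: int = 0xACE1) -> Iterator[int]:
    """Yield successive states of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be nonzero, otherwise the LFSR locks up")
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state


def stimulus(num_vectors: int, width: int, seed: int = 0xACE1) -> List[int]:
    """Return num_vectors pseudo-random input vectors of the given width (<= 16 bits)."""
    if not 1 <= width <= 16:
        raise ValueError("this sketch supports widths of 1 to 16 bits")
    mask = (1 << width) - 1
    gen = lfsr16(seed)
    return [next(gen) & mask for _ in range(num_vectors)]


vectors = stimulus(num_vectors=8, width=8)
```

Random vectors sample the input space rather than cover it, so input-sensitive faults that need a specific vector can still be missed.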
The input stimulus is usually provided through external pins on the FPGA. Ideally, the fault emulation system provides a sufficient number of pins to support the full size of the test vector. If input/output pins are used, the design and placement of these pins, especially clocks and resets, are usually fixed. The number of pins provided by some fault emulation systems is limited and requires a modification in the design to fit within the system constraints. Such modifications may lead to uncertainty in the fault emulation results. It is also possible to provide input stimulus through boundary scan or internally handling the input and output vectors (i.e., through internal memories).
C. Detect Errors
An essential step in fault emulation is the detection of system errors, and FPGA fault emulation systems use a variety of methods to detect them. One common method is to provide a "golden" reference circuit in the system. This golden circuit implements the same circuit as the FPGA under test but is not subject to fault emulation. The golden circuit is often implemented in either the same FPGA as the circuit under test or a secondary FPGA. The outputs of the golden circuit are compared with those of the design under test, and any deviation in output is identified as a system error. Another method involves the comparison of the circuit outputs against a set of predetermined correct output values. A monitoring circuit compares the outputs of the design under test against this table of predetermined output vectors.
An important attribute of an FPGA fault emulation system is the amount of time that the FPGA system executes while the fault is present within the FPGA. The longer the system executes the more time the error has to propagate to the output pins to be detected. For short fault execution times, it is possible that the fault causes an error but the error is not observed. Fault emulation systems must manage the trade-off between the amount of time to emulate an individual fault and the total time of a fault emulation campaign.
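The trade-off between per-fault execution time and total campaign time can be made concrete with a back-of-the-envelope calculation. The sketch below assumes each fault costs a fixed injection time, a run time, and a repair time. All names and numbers are hypothetical.

```python
def campaign_seconds(num_faults: int, inject_s: float, run_s: float, repair_s: float) -> float:
    """Total campaign time when every fault costs inject + run + repair seconds."""
    return num_faults * (inject_s + run_s + repair_s)


NUM_FAULTS = 5_000_000  # hypothetical number of configuration bits to test

# Short run per fault: fast campaign, but slow-propagating errors may be missed.
short_run = campaign_seconds(NUM_FAULTS, inject_s=50e-6, run_s=100e-6, repair_s=50e-6)

# Long run per fault: more time for errors to reach the outputs, much longer campaign.
long_run = campaign_seconds(NUM_FAULTS, inject_s=50e-6, run_s=10e-3, repair_s=50e-6)

print(f"short: {short_run / 60:.1f} min, long: {long_run / 3600:.1f} h")
```

With these illustrative numbers the campaign grows from roughly 17 minutes to roughly 14 hours, which shows why the execution time per fault is usually tuned rather than maximized.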
D. Clear Errors
Finally, most fault emulation tools not only remove the emulated SEU before injecting the next fault, but also return the circuit to a known good state before injecting the next fault. Even if no output errors have been detected by an individual fault emulation, it is still possible that a latent error exists in the system. It is possible that the next emulated SEU could cause the latent error to manifest, which could cause problems in fault attribution. The exception is for systems that inject MIUs, where SEUs are removed at either set intervals or once a fault manifests. The easiest way to return the circuit to a known good state is to reset all state of the FPGA (flip-flops, memories, etc.) to its initial value. This process can be completed easily by connecting all of the flip-flops to the same reset line that is driven by the same output pin.
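The four aspects discussed in this section (inject, stimulate, detect, clear) combine into a simple loop. The sketch below is a generic illustration run against a toy software model, not a reproduction of the algorithm in Fig. 1. The `ToyDevice` class, its methods, and the rule that any flipped "sensitive" bit corrupts every output are all assumptions made so the example is self-contained.

```python
from typing import Iterable, List


class ToyDevice:
    """Toy stand-in for a component under test with a few sensitive bits."""

    def __init__(self, num_bits: int, sensitive: Iterable[int]) -> None:
        self.golden = [0] * num_bits
        self.config = list(self.golden)
        self.sensitive = set(sensitive)

    def flip(self, bit: int) -> None:
        self.config[bit] ^= 1

    def run(self, vectors: List[int]) -> List[int]:
        # Outputs are corrupted whenever a sensitive bit differs from golden.
        bad = any(self.config[b] != self.golden[b] for b in self.sensitive)
        return [v ^ 1 if bad else v for v in vectors]

    def reset(self) -> None:
        # A real system would reset flip-flops and memories here.
        pass


def run_campaign(device: ToyDevice, locations: Iterable[int], vectors: List[int]) -> List[int]:
    """Return the fault locations that produced an output error."""
    expected = device.run(vectors)          # fault-free reference outputs
    failing = []
    for bit in locations:
        device.flip(bit)                    # inject the emulated SEU
        observed = device.run(vectors)      # stimulate the circuit
        if observed != expected:            # detect output errors
            failing.append(bit)
        device.flip(bit)                    # remove the emulated SEU
        device.reset()                      # return to a known good state
    return failing


device = ToyDevice(num_bits=64, sensitive=[3, 17, 42])
failing_bits = run_campaign(device, range(64), vectors=[0, 1, 2, 3])
sensitivity = len(failing_bits) / 64
```

Removing the fault and resetting after every iteration is what allows each output error to be attributed to exactly one fault location.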
III. SURVEY OF EXISTING SYSTEMS AND WORK
Fault emulation as a technique has been used on a number of different types of systems [23] - [29] . One of the first surveys of the field was written in 1997 [30] . The paper covers all of the basics of non-architecture-specific fault emulation techniques and several of the early tools for fault emulation. This paper provides an overview of injecting faults into hardware through opening/bridging signals, SEUs, stuck-at faults, and altering current/power. The paper also covers injecting faults through the software by corrupting stored data, communications and software.
While there are many possible uses for fault emulation, in this paper we are particularly interested in fault emulation systems that accurately mimic radiation effects in SRAM-based FPGAs. As there are a number of different systems that exist at this point, we provide highlights of some of the existing systems. We attempted to create a taxonomy of fault emulation systems, but an obvious taxonomy does not exist. For this paper, we will split the discussion into two broad categories: software versus hardware techniques for inserting the faults. We also use this space to introduce a few existing systems that we will discuss in further detail later in the paper. We conclude this section with a few novel fault emulation techniques that we think might be useful in the future.
A. Software-based Techniques
Software-based techniques focus on methods to insert the faults through modifications of the user circuit through the hardware description language (HDL), synthesis byproducts or circuit instrumentation to insert the faults. One of the earliest software-based tools targeted the Xilinx XC4010XL FPGAs through the use of the JBits application program interface (API) released by Xilinx [31] . This API allowed for circuit design and configuration file manipulation through Java. This fault emulation tool inserted faults by manipulating the configuration files. In [32] the authors use a Java library called Static Mapping Library (SML) to do fault emulation. SML allows for small design changes by translating fault locations into the circuit design. NETFI [33] injects SEUs or SETs into the FPGA circuit using the netlist. The netlist is then synthesized, uploaded to the FPGA and executed. The system uses a two-FPGA system, where the two FPGAs communicate over Ethernet. The system presented in [34] uses the HDL testbench to control test vectors and provide the error-free output. It also uses Xilinx's Chipscope logic analyzer for monitoring the reliability statistics and for enhanced monitoring of the circuit's internal signals. Shift registers are used to shift in faulty data to the flip-flops.
Because many of the Altera FPGAs lack partial reconfiguration, most attempts to emulate faults in Altera FPGAs have had to leverage software techniques [35] , [36] . [35] provides an early technique for emulating SEUs in an Altera SRAM-based Flex10K200 FPGA. Without partial reconfiguration techniques available for this FPGA, the authors had to manipulate the entire configuration file to insert SEUs. While time consuming, this technique was the only one available for Altera FPGAs at the time. Fault emulation tools have been designed for newer Altera FPGAs [37] . As partial reconfiguration is available in the most recent Altera FPGAs, developing hardware-based techniques is now possible.
Part of the reason why software-based techniques have not been more popular is that often these methods are slower than hardware-based techniques. For example, in the JBits-based method [31] new bitstreams have to be created and uploaded for each SEU that is injected, which is much slower than overwriting a frame on an executing FPGA. Because of these limitations, many software-based techniques focus on randomly sampling the FPGA instead of providing a census of the FPGA. There are advantages to software-based techniques. In particular, the errors are tied directly to the design. Not only should this minimize the number of faults that need to be emulated, but each fault can be easily tied back to the design. In this way, the circuit designers can easily determine which part of the circuit needs mitigation, unlike in hardware-based techniques where determining the affected part of the circuit is time consuming.
B. Hardware-based Techniques
There are a number of systems that leverage hardware-based techniques for fault emulation [5], [8], [38]-[47]. Many of the older systems were focused on injecting single SEUs into the FPGA. There are several newer systems that emulate MIUs [48], [49] and MCUs [43], [44], [50]. Finally, as the field has become more mature, some researchers have tried to make FPGA fault emulation more efficient [51].
Fig. 4. The SLAAC-1V conceptual design highlights the role of the three FPGAs labeled PE0-PE2. PE0 is used to control the experiment. PE1 is the FPGA that is corrupted with emulated SEUs. PE2 is the FPGA that is not corrupted. PE1 and PE2 are operated in lockstep to determine whether the emulated SEUs cause incorrect output [5].
1) The SLAAC1-V and Follow-on Systems:
One of the earliest systems is the SLAAC1-V SEU Emulator [5] designed by Los Alamos National Laboratory (LANL) and Brigham Young University (BYU). The SLAAC-1V fault emulation system has three Xilinx Virtex 1000 FPGAs, as shown in Fig. 4 [5]. This system uses the algorithm shown in Fig. 1. It is also designed to have a "golden" FPGA to compare results to. The SLAAC-1V system was used in a number of research projects over the years and was ported to the Cibola Flight System hardware for pre-deployment fault emulation [52].
This system was the basis of several other fault emulation systems designed by LANL and BYU. Many of the LANL systems used two FPGAs, where the functionality of PE0 and PE2 were combined into one FPGA. One of these systems was used in the domain crossing error (DCE) study that looked at the limitations of mitigation [50] . This system, shown in Fig. 5 , injected both single-bit upsets and MCUs. The cabling shown in the figure is used to keep the inputs, outputs, resets and clocks in synchronization. LANL also implemented a two-FPGA fault emulation system into the Mission Response Module (MRM) payload [53] . The MRM fault emulation system included an improvement to speed up the fault emulation system. We present results from validating these three systems later in this paper.
2) UFRGS MIU System: In 2014, University of Rio Grande do Sul (UFRGS) published fault emulation and accelerated test results using their fault emulation platform [49] . This work validated the use of N-modular redundancy (NMR) for masking the effects of SEUs on FPGAs. Fig. 6 shows the components of their fault emulation platform, which uses the Xilinx ICAP-based scrubber/fault emulation core for single-FPGA systems. Most notably, this system was designed specifically for studies on the effect of MIUs and the system will inject faults until a functional error manifests. We will discuss the validation of this system later in the paper.
C. SET Fault Emulation Systems
SET fault emulation has become more important to Xilinx-based FPGA designers with the advent of the Virtex-5QV, which leverages dual interlocked cell (DICE) latches to reduce the SEU sensitivity in the configuration memory. With fewer SEUs, the SETs in the Virtex-5QV have become apparent. Some organizations have also designed fault emulation systems that emulate SETs in FPGAs [7], [54]-[56]. In [57], electronic pulses are inserted in the circuit using gate-level instrumentation of the circuit. The instrumentation circuit is connected both to the input pads and the circuit, so that an external signal generator can be used to drive an input into the instrumentation circuit, which can translate the signal into a pulse that is driven into the user circuit. While originally used as an FPGA-based prototyping system to emulate radiation-hardened by design ASICs on a Xilinx FPGA, the FT-UNSHADES emulator can also be used to inject SEUs in flip-flops [17], [18], [58]-[63]. Other fault emulation systems that inject faults into user memory have since been developed [45].
D. Novel Fault Emulation Systems
We present three unique systems that were recently designed. These three systems exhibit novel techniques not previously seen in fault emulation platforms, and these techniques will likely be areas of further study in the coming years.
The first one is a fault emulation system from Politecnico di Torino. In [64], the authors present a fault emulation system that works with algorithms that use dynamic partial reconfiguration to change the algorithm based on execution parameters. These systems can be challenging to perform fault emulation on, because the system's algorithm could change non-deterministically. This system can help designers who are looking to deploy dynamic partial reconfiguration in space.
The second system is one that focuses on soft-core processors, which are processors that have been instantiated in the reconfigurable fabric. As FPGAs are traditionally not well suited for implementing complicated decisions, soft-core processors have been used to implement the control logic. These processors have complex radiation sensitivities, as the processor will be affected by SEUs in the configuration memory, SEUs in the program and SEUs in the intermediate data products. In [65] , the authors determine the critical registers and variables in a soft-core processor that is implemented in an FPGA. This system allows designers to determine which parts of the processor and the software codes are most in need of mitigation.
The third system is a traditional use of FPGAs as a prototyping system. Recently, [47] has used fault emulation on a Virtex-5 to emulate a new FPGA architecture based on a fault-tolerant FPGA computation cell that can detect all unidirectional errors. In this situation, fault emulation was performed on computation cells implemented on the Virtex-5. These experiments allowed the computation cells to be tested before the FPGA architecture exists. This type of method could be extremely useful in the future for the development of more advanced, fault-tolerant FPGA architectures.
IV. VALIDATION TECHNIQUES
Currently, the most commonly used methodology for validating fault emulation results is to perform accelerated radiation testing. In this method, the test results from fault emulation testing and accelerated testing are compared. While possibly more expensive than other methods, it is often the most straightforward method and might be able to leverage already existing hardware from the fault emulation system. Another method for validation is to compare on-orbit results to fault injection estimates. While on-orbit results are the most accurate, this method requires a deployed system in orbit and the upset data collected from on-orbit operation is very small when compared to fault emulation.
It is necessary to control the number of faults during the radiation tests so that the test properly mimics the fault emulation system. Therefore, the radiation environment and the experiment conditions need to mimic the fault emulation environment. There are two factors that control whether the experiment is run properly: event rate and scrub rate. The event rate is the rate at which the specific SEE that is being monitored occurs, such as the SEU rate. The scrub rate is the rate at which the contents of the configuration memory are restored to their correct values (i.e., "scrubbed").
The event rate is:

$$R(L, s) = \Phi \, \sigma(L, s)$$

where $L$ is the linear energy transfer (LET) for heavy ions or Energy for protons, $\Phi$ is the flux, $\sigma$ is the cross section, and $s$ is the size of the event. From this equation, it is possible to find the LET/Energy-dependent upset rate of a range of SEU sizes.
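As a rough numerical illustration of the event rate, the sketch below assumes the rate for each event size is the product of the beam flux and the device cross section for that size at a fixed LET or energy. The function name, the units, and every value are hypothetical and are not measured data.

```python
from typing import Dict


def event_rates(flux: float, cross_sections: Dict[int, float]) -> Dict[int, float]:
    """Events per second for each event size.

    flux: particles / (cm^2 * s)
    cross_sections: device cross section in cm^2, keyed by event size in bits,
                    all taken at one LET or energy.
    """
    return {size: flux * sigma for size, sigma in cross_sections.items()}


# Hypothetical beam and device values.
flux = 1.0e3
cross_sections = {1: 2.0e-3, 2: 1.0e-4}   # single-bit and two-bit events

rates = event_rates(flux, cross_sections)
for size, rate in sorted(rates.items()):
    print(f"{size}-bit events: {rate:.3g} per second")
```

Lowering the flux is the usual experimental lever when the resulting event rate is too high for the scrubber to keep up.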
The scrub rate depends on the scrubbing architecture and can vary widely depending on the scrubbing methods employed. Often the scrubber uses external hardware so the scrubber is not affected by the radiation environment. There are two bounding cases for scrub rate: an SEU occurring in the next memory location to check and an SEU occurring in the last memory location checked. In the first case, the SEU will be detected and corrected quickly. In the second case, it will take one full scrub cycle of the component to detect and correct. Because of these situations, SEUs are not guaranteed to be detected and removed until two complete scrub cycles have been completed.
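The two bounding cases can be illustrated with a toy model of a scrubber that checks frames one at a time in a fixed circular order. This is a sketch under that assumption only. Real scrubbers differ in granularity and in how detection and correction are split, and the function name is hypothetical.

```python
def frames_until_checked(seu_frame: int, next_frame: int, num_frames: int) -> int:
    """Number of frames the scrubber checks before it reaches the upset frame.

    next_frame is the frame the scrubber will check next; frames are visited
    in increasing order and wrap around after num_frames - 1.
    """
    return (seu_frame - next_frame) % num_frames


NUM_FRAMES = 100          # hypothetical device size in frames
scrub_pointer = 5         # the scrubber is about to check frame 5

best_case = frames_until_checked(5, scrub_pointer, NUM_FRAMES)   # upset in the next frame
worst_case = frames_until_checked(4, scrub_pointer, NUM_FRAMES)  # upset in the frame just checked
```

In this model the best case is reached immediately and the worst case waits for almost a full pass through the device, which is the spread that the choice of scrub rate has to accommodate.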
The upset rate and the scrub rate need to be balanced to mimic the fault emulation results. If the fault emulation system is designed to study the effect of single-bit upsets, then the upset rate needs to be no more than two times the scrub rate. While generally fast scrubbing can be helpful, the scrub rate needs to also be slow enough to allow errors to propagate to the outputs. If the fault emulation system is designed to study MIUs, the scrub rate might need to be reduced to allow faults to accumulate or the scrubber turned off until a functional error is detected. If these conditions are not met, the fault emulation system and the radiation test might be measuring two different phenomena.
In some test systems, the hardware involved in scrubbing might be different than the hardware detecting output errors. That means there could be two separate logs with information about SEU locations and output error information. In these situations, it is necessary to align these two data sets during the validation process. The easiest process of aligning the two data sets is to use time stamps. Even still the SEU might trail or lead the output error by several clock cycles. The most effective approach is to look at several SEU locations before and after the output error. This "window" of SEU locations can then be compared to fault emulation results to determine if any of these SEU locations caused an output error in fault emulation. An example of this technique is shown in Fig. 7 .
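The windowing idea can be sketched as follows: find where the output error falls in the time-sorted SEU log, take a few SEUs on either side, and check which of them were flagged as sensitive by fault emulation. The log format, the window size `k`, and the function names are assumptions made for illustration, and the sketch is not a reconstruction of Fig. 7.

```python
from bisect import bisect_left
from typing import List, Set, Tuple

SeuLog = List[Tuple[float, str]]   # (timestamp, location), sorted by timestamp


def seu_window(seu_log: SeuLog, error_time: float, k: int) -> List[str]:
    """Locations of up to k SEUs before and k SEUs at or after the output error."""
    times = [t for t, _ in seu_log]
    idx = bisect_left(times, error_time)
    return [loc for _, loc in seu_log[max(0, idx - k): idx + k]]


def candidate_causes(seu_log: SeuLog, error_time: float, k: int,
                     emulation_sensitive: Set[str]) -> List[str]:
    """SEUs in the window that fault emulation also flagged as causing errors."""
    return [loc for loc in seu_window(seu_log, error_time, k)
            if loc in emulation_sensitive]


# Hypothetical scrubber log and fault emulation result.
log = [(10.0, "A"), (20.0, "B"), (30.0, "C"), (40.0, "D"), (50.0, "E")]
window = seu_window(log, error_time=32.0, k=1)
causes = candidate_causes(log, error_time=32.0, k=1, emulation_sensitive={"C"})
```

An output error with no candidate cause in its window is a disagreement between the radiation test and the emulation results and is worth investigating on its own.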
Even in the most controlled experiments it is hard to completely replicate the fault emulation system in the test fixture used for accelerated testing or in the deployed system. The algorithmic control of injecting and removing errors allows the designers to determine how many stimuli are run, how many errors are inserted, and when errors are removed for each emulation. In high radiation environments, these situations are harder to control. Even if you attempt to control the number of SEUs that are injected by the particle accelerator, Poisson statistics inform us that sometimes that number is higher or lower than expected. Furthermore, the test fixture and the deployed system do not have the precise algorithmic control of removing SEUs. Unlike fault emulation software, the scrubber does not know where the SEU is and has to determine where it is before correcting it, which means some SEUs are removed quickly while others linger. These differences will always create challenges in the validation process. In these circumstances, confidence intervals can be helpful to bound the error on the collected data set.
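The Poisson point can be quantified. If upsets arrive as a Poisson process with a known mean number per scrub interval, the probability that an interval contains two or more upsets follows directly. The sketch below assumes that model, and the mean value used is hypothetical.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events when the expected number is mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_at_least(k: int, mean: float) -> float:
    """Probability of k or more events."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(k))


# Hypothetical: on average 0.1 upsets per scrub interval.
mean_upsets = 0.1
p_multiple = prob_at_least(2, mean_upsets)
print(f"P(two or more upsets in one interval) = {p_multiple:.4f}")
```

Even with a mean well below one upset per interval, a small fraction of intervals contains multiple upsets, so a test intended to study single-bit upsets still collects some multi-upset data.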
V. STATISTICAL VALIDATION OF FAULT EMULATION
Fault emulation is inherently a random activity and statistical analysis must be performed as part of the fault emulation validation process. A typical approach for statistical validation involves the following three steps: first, estimate the sensitivity of a design using fault emulation; second, estimate the sensitivity of the same design with radiation testing; and third, compare the results from both fault emulation and radiation testing. If the measured sensitivity falls within an acceptable range of the estimate from the fault emulation, there is greater confidence that the fault emulation system properly estimates the sensitivity of a design in a radiation test beam. This section will describe the statistical implications of all three steps of this validation process.
The estimated sensitivity for the design that is being measured in fault emulation and radiation testing is a function of the design. This sensitivity is closely related to both the utilization of the FPGA resources by the design and the properties of the design. For unmitigated circuits that fully utilize the FPGA resources, we have generally found that circuits have a sensitivity of 1-20%. For circuits that use a fraction of the FPGA resources or heavily mitigated circuits, the sensitivity could be quite low. A circuit that only uses 5% of the FPGA resources can have a sensitivity less than 0.1% and we have tested mitigated circuits with sensitivities as low as 0.01%. Circuits with very small sensitivity are more difficult to validate and require a higher number of injected faults for adequate confidence. While validating low sensitivity designs is expensive and time consuming, the need to understand faults in mitigated designs is critical for many space programs.
A. Estimating Design Sensitivity Using Fault Emulation
In most fault emulation scenarios, a number of faults are injected in the system and the outcome of each experiment is a binary result: the injected fault either causes a system error or it does not. The goal of a fault emulation campaign is to estimate the probability that a random fault injected into the system causes the system to fail. The confidence of this estimate, however, depends in large measure on the number of faults that are emulated. In some fault emulation systems, like the SLAAC1-V, every configuration bit of the FPGA is upset and tested (a census). From a statistical perspective, this approach is excessive, but comprehensive fault emulation was necessary among the early approaches to build confidence in an unproven technique.
For some fault emulation systems, it is not possible to test all configuration bits and configuration sampling must take place. When a sample is taken, the sample size must be carefully chosen based on the population size (number of configuration bits) and desired confidence interval. This section will present a technique for measuring the confidence interval of a fault emulation sensitivity estimate.
An individual fault emulation experiment can be represented by a Bernoulli trial with exactly two outcomes: "success" (meaning that the emulation caused the system to fail) or "failure" (meaning that the emulation had no effect on the system). This can be modeled by the Bernoulli random variable, $X$, with only two outcomes: $X = 1$ (success) and $X = 0$ (failure).
The goal of fault emulation is to estimate $p$, the probability of success, by performing multiple trials (i.e., fault emulation experiments). The probability that $k$ "successes" are observed with $n$ trials can be represented with a binomial random variable with the following probability mass function:
$$P(K = k) = \binom{n}{k}\, p^{k} (1-p)^{n-k} \qquad (1)$$
The maximum likelihood estimator, $\hat{p}$, for a binomial distribution is
$$\hat{p} = \frac{k}{n} \qquad (2)$$
and the variance of the maximum likelihood estimator is
$$\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n} \qquad (3)$$
For cases where $k = 0$ (i.e., no system errors detected), an upper bound on the design sensitivity can be made by assigning $k = 1$. As the number of trials, $n$, increases, the probability distribution function of the binomial random variable can be approximated by the normal distribution due to the central limit theorem. The normal distribution is useful for computing confidence intervals of the estimate of $p$. For a normal random variable, $Y$, with parameters $\mu$ and $\sigma^2$, the random variable $Z$ has a unit normal distribution:
$$Z = \frac{Y - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$
This property can be used to determine the confidence interval. For a unit normal distribution, 95% of random outcomes will occur in the interval $[-1.96, 1.96]$. This results in a 95% confidence interval of $p$:
$$\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;\le\; p \;\le\; \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
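As a minimal sketch of this calculation, the function below computes the maximum likelihood estimate $k/n$ and its normal-approximation 95% confidence interval, where $k$ is the number of injected faults that caused a system error and $n$ is the number of injected faults. The function name and the example counts are assumptions made for illustration, and the normal approximation is only reasonable when $n$ is large and $k$ is not close to zero.

```python
import math

Z_95 = 1.96  # unit normal quantile for a two-sided 95% interval

def sensitivity_ci(k, n, z=Z_95):
    """Return (p_hat, lower, upper) for k system errors in n injected faults."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    p_hat = k / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    # Clamp to [0, 1] because a probability cannot leave that range.
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical campaign: 10,000 injected faults, 1,000 of which caused errors.
p_hat, lower, upper = sensitivity_ci(1_000, 10_000)
print(f"sensitivity = {p_hat:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```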
The size of the confidence interval depends heavily on the number of trials and decreases significantly as the number of trials increases. This concept is shown in Fig. 8(a) for a range of design sensitivities from 0.01-20%. This is especially important for low-utilization or low-sensitivity designs (i.e., small $p$), as millions of trials are needed to get a sufficiently small confidence interval. It should be noted that, for these low-sensitivity designs, the estimate is significantly higher than the actual sensitivity until nearly 1,000,000 trials are completed.
Fig. 7. Example of how to use fault emulation with accelerated radiation test results and how to window data for validation [5].
One way to compare the relative size of the confidence intervals is to compute the coefficient of variation, or $c_v$. The coefficient of variation is used to measure the dispersion of a distribution; a large coefficient of variation results in a larger confidence interval. For the binomial distribution, the coefficient of variation of the estimator is:
$$c_v = \frac{\sqrt{\mathrm{Var}(\hat{p})}}{p} = \sqrt{\frac{1-p}{np}} \qquad (4)$$
This result suggests that for a fixed number of trials, $n$, the coefficient of variation will be higher for circuit designs with a lower sensitivity (i.e., lower $p$). Fig. 8(b) shows how the coefficient of variation changes for the same Monte Carlo trial shown in Fig. 8(a). As an example, this figure shows that the coefficient of variation does not fall below 0.01 until tens of thousands of trials for the most sensitive design (20%). In comparison, the least sensitive design (0.01%) does not reach a coefficient of variation of 0.01 even after 10,000,000 trials. Circuit designs with lower sensitivity (i.e., mitigated designs or designs with low utilization) will need to be tested with more trials (i.e., greater $n$) than designs with a higher sensitivity to achieve the same coefficient of variation.
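The relationship between sensitivity and the required number of trials can be checked numerically. The sketch below assumes that the coefficient of variation of the sensitivity estimate is the standard deviation of the binomial estimator divided by its mean, $\sqrt{(1-p)/(np)}$, and solves that expression for $n$. The function names are illustrative.

```python
import math

def coefficient_of_variation(p, n):
    """Coefficient of variation of the estimate k/n for sensitivity p and n trials."""
    return math.sqrt((1.0 - p) / (n * p))

def trials_for_cv(p, target_cv):
    """Smallest number of trials whose coefficient of variation is <= target_cv."""
    return math.ceil((1.0 - p) / (p * target_cv ** 2))

# Sensitivities spanning the range discussed in the text (20% down to 0.01%).
for p in (0.20, 0.01, 0.0001):
    n_needed = trials_for_cv(p, 0.01)
    print(f"p = {p:<7} trials for c_v <= 0.01: about {n_needed:.2e}")
```

Under this assumption the 20% design needs on the order of tens of thousands of trials, while the 0.01% design needs on the order of one hundred million, which agrees with the trend described for Fig. 8(b).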
B. Estimating Design Sensitivity Using Radiation Sources
Estimating the design sensitivity using radiation sources is relatively straightforward. A design is configured onto an FPGA and a radiation source is applied to the design while it operates. During the test, the operational faults in the design are monitored and the upsets in the configuration memory are logged. A simple design sensitivity estimate can be obtained by
$$\hat{p}_{\mathrm{rad}} = \frac{\text{number of observed output errors}}{\text{number of logged configuration upsets}}$$
Ideally, if the coefficient of variation from fault emulation is low, then radiation testing should be straightforward. There are a number of factors that can make this process challenging, though. These factors include:
• Handling fault emulation results that are under-sampled,
• Determining the number of upsets needed to validate the sensitivity measured by fault emulation, and
• Categorizing results that do not match the fault emulation tests.
As shown in the previous section, if the fault emulation results have a large coefficient of variation, then it is possible that the actual sensitivity of the design is much lower than the estimate. That means that in radiation testing fewer sensitive bits might be found and it might be necessary to run the test longer than expected. If the fault emulation results have a small coefficient of variation, though, this result can be used to estimate how many errors are needed in radiation testing so that the amount of testing can be minimized. Finally, one of the hardest issues with radiation testing is that the radiation will broadly sample from all of the fault models. That means SEFIs, SETs, and faults in parts of the component untested by fault emulation will occur in the data set. It might be necessary to modify the test plan or separate the data set into different categories for processing. In this section we will discuss how to estimate the number of faults needed during the radiation test and ways to categorize the data.
The fault emulation results provide a basic rate for the faults. For example, if fault emulation results found that 10% of the SEUs caused output errors, then one would expect approximately 1 output error for every 10 SEUs in the radiation test. Using the probability of occurrence and Equation (2), it is possible to predict the expected number of errors caused by the fault emulation conditions. Table I shows the expected results with confidence intervals for design sensitivities of 0.01-20% for a range of SEU counts. We chose this range of sensitivities because these values are common for unmitigated and mitigated circuits. The calculations are also based on an FPGA with 50 Mb of configuration memory, so the SEU counts range from a very light sampling of the component to the entire component. The expected value is determined by multiplying the sensitivity of the design by the number of SEUs. The confidence intervals are the typical 95% confidence intervals used for radiation testing, which use the Poisson distribution for counts smaller than 50 and the normal distribution for larger counts.
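The expected-count calculation can be sketched as follows. This is one plausible reading of the procedure, not a reproduction of how Table I was generated: for expected counts below 50 the sketch takes the 2.5% and 97.5% quantiles of a Poisson distribution with that mean, and for larger counts it uses the mean plus or minus 1.96 standard deviations. The function names and the two example cases are assumptions.

```python
import math

def poisson_quantile(mean, q):
    """Smallest count c such that P(X <= c) >= q for X ~ Poisson(mean)."""
    pmf = math.exp(-mean)  # P(X = 0)
    cdf = pmf
    c = 0
    while cdf < q:
        c += 1
        pmf *= mean / c    # P(X = c) from P(X = c - 1)
        cdf += pmf
    return c

def expected_errors(sensitivity, n_seus, z=1.96):
    """Return (expected, lower, upper) output-error counts for n_seus upsets."""
    mean = sensitivity * n_seus
    if mean < 50:
        # Small counts: quantiles of the Poisson distribution.
        return mean, poisson_quantile(mean, 0.025), poisson_quantile(mean, 0.975)
    # Large counts: normal approximation with variance equal to the mean.
    half_width = z * math.sqrt(mean)
    return mean, mean - half_width, mean + half_width

# Hypothetical cases: a 10% sensitive design after 100 SEUs and a 20% design
# after 500 SEUs.
small = expected_errors(0.10, 100)
large = expected_errors(0.20, 500)
print("10% design, 100 SEUs:", small)
print("20% design, 500 SEUs:", large)
```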
From these results, it is clear that with very few SEUs, low sensitivity designs will have wider confidence intervals than high sensitivity designs. As the sensitivity increases, widely covering the fault model is less necessary. For example, if 20% of the design is sensitive to output errors, confirming the trend of the results will be possible with light coverage of the fault model and statistical significance will be reached in the first 500 events.
All of these results and estimates are based on ideal scenarios. Sometimes the radiation results will diverge from the fault emulation results. Often the biggest cause of the divergence is that the fault emulation system measures one type of fault, while many different types of faults occur during radiation testing. For example, if the fault emulation system was designed to measure the sensitivity to single-bit upsets, then MIUs and MCUs could increase the error rate during radiation testing. Some failure modes are difficult to measure in fault emulation, such as SEUs in the user memory, which could lead to higher than anticipated error rates during testing. In these cases, the radiation data will need to be divided into categories, such as single-bit upsets, MIUs, MCUs, and unknown. It is then possible to develop estimates of the sensitivity based on these categories. Finally, under-sampled estimates of the sensitivity in fault emulation can lead to a measured radiation sensitivity many orders of magnitude lower than the fault emulation results.
C. Comparing Results from Both Estimates
Once both sets of tests are completed, it is possible to compare the results to determine whether the two estimates are consistent. There are several ways to complete the comparison, depending on how the two data sets were gathered. In this section we will provide guidance for comparing a sampled radiation test to a census fault emulation test, comparing two sampled tests, and less quantitative methods.
Most commonly, people use qualitative methods for comparing the data sets, and a variety of such methods are currently in use. One method the authors have used often is to compare fault locations. In these cases, fault emulation testing has provided a census survey of the population and radiation testing has sampled the same population. For single-upset data, either single-bit upsets or MCUs, it is possible to compare the lists of faults that triggered errors in the two tests to see how much they overlap. We present three case studies that use this method. Another method that is frequently used is to compare the two estimated sensitivities and determine if the results are "reasonably close." This method is often used to compare two sampled estimates of the sensitivity.
Quantitative methods could also be useful and remove some of the questions about the validation of the system. If the design was sampled in both tests, then it is possible to use the t-distribution to measure the differences between the two samples. If one of the tests was able to survey the population, it is possible to compare the differences in the results using the chi-squared distribution. Both of these methods, if used, would add to the rigor of the results being presented.
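One way to apply the chi-squared distribution when fault emulation provides a census is a goodness-of-fit test with two categories (SEUs that caused an output error and SEUs that did not). The sketch below is an illustration under that assumption and is not a procedure taken from the case studies. With two categories there is one degree of freedom, so the p-value can be written with the complementary error function from the standard library. The counts in the example are hypothetical.

```python
import math

def chi_squared_gof(k_observed, n_seus, p_census):
    """Compare k_observed errors in n_seus radiation upsets with a census sensitivity.

    Returns (chi2, p_value) for a two-category goodness-of-fit test (1 degree
    of freedom). Assumes 0 < p_census < 1 and expected counts that are not tiny.
    """
    expected_err = n_seus * p_census
    expected_ok = n_seus * (1.0 - p_census)
    observed_ok = n_seus - k_observed
    chi2 = ((k_observed - expected_err) ** 2 / expected_err
            + (observed_ok - expected_ok) ** 2 / expected_ok)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical test: census sensitivity of 10%, and 112 output errors observed
# in 1,000 radiation-induced SEUs.
chi2, p_value = chi_squared_gof(112, 1_000, 0.10)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}")
```

A large p-value means the radiation sample is statistically consistent with the census sensitivity; a small one suggests that the fault model, the coverage, or the categorization of the radiation data should be revisited.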
Output errors that cannot be correlated with fault emulation results are considered part of the "unvalidated" response of the system. When using this method of validation, it is possible to report which portion of the accelerated radiation test correlates with fault emulation and which portion does not. In our own past work, we have shown that some of our FPGA emulation tools were valid to 99% for unmitigated circuits and 88% for mitigated circuits [67].
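Reporting the validated and unvalidated portions reduces to a set comparison between the fault locations that caused output errors in the beam and the sensitive locations found by fault emulation. The sketch below is a minimal illustration. The bit addresses are made up, and the counts are chosen only so that 14 of 16 locations correlate.

```python
def validation_fraction(radiation_error_bits, emulation_sensitive_bits):
    """Return (fraction_validated, unvalidated_bits) for a radiation test.

    radiation_error_bits: fault locations tied to output errors in the beam.
    emulation_sensitive_bits: locations that caused errors in fault emulation.
    """
    observed = set(radiation_error_bits)
    if not observed:
        raise ValueError("no radiation-induced output errors to validate")
    correlated = observed & set(emulation_sensitive_bits)
    return len(correlated) / len(observed), sorted(observed - correlated)

# Hypothetical addresses: 16 error-causing locations seen in the beam, 14 of
# which appear in the fault emulation results.
beam_bits = list(range(100, 116))
emulated_bits = set(range(0, 114))

fraction, unvalidated = validation_fraction(beam_bits, emulated_bits)
print(f"validated: {fraction:.1%}, unvalidated locations: {unvalidated}")
```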
VI. VALIDATION EXAMPLES
In this section we present several examples of validated fault emulation systems. These examples cover several of the systems discussed previously, including the SLAAC1-V, the Virtex-II fault emulation system used for the DCE study, the UFRGS system, and the MRM system.
A. SLAAC1-V Validation Using Fault Locations
As long as the user circuit that is being tested is the same one tested in fault emulation, the fault locations from fault emulation can be used to disambiguate the accelerator test results. While this method can usually help a designer correlate output errors with fault emulation results, some output errors defy correlation. For cases that are hard to correlate, it is possible to "play back" part or all of the test in the fault emulation tool, where the SEU log is used to inject faults in specific locations and in a particular order.
Two projects validated the results of this fault emulation system against radiation test data. The first project validated the estimated sensitivity of three different designs in a proton radiation test at the Crocker Nuclear Laboratory at UC Davis [66] . The second involved the validation of partial TMR with 65 MeV protons at the Indiana University Cyclotron Facility (IUCF) [68] . The results of both fault emulation and radiation testing from the first validation effort are summarized in Table II . This experiment measured the sensitivity of the design through fault emulation and then tested each of the designs in a radiation beam for sensitivity. The method described in Section V-A was used to estimate the 95% confidence interval of both the fault emulation and the radiation test results. All configuration bits were tested in fault emulation and a very tight 95% confidence interval was estimated for the design sensitivity. Radiation testing was performed with far fewer trials resulting in a larger confidence interval. The fault emulation result, however, falls well within the radiation testing confidence range suggesting that the fault emulation was successful in estimating the radiation design sensitivity. Further guidance on confidence intervals for these radiation tests can be found in [69] .
B. DCE Validation with Fault Locations
When validating the DCE fault emulation results, we inadvertently created a worst-case scenario for validation. In this project we were trying to confirm that some SBUs and MCUs could cause the design to go into an unrecoverable failure state, where the voter would vote in the bad data. We were constrained by two problems: a very low-occurrence event and the need to keep the flux low to mimic the fault emulation environment. Table III shows how low the probabilities are for DCEs, MCUs in proton, and MCUs in Xe. In addition, to mimic the fault emulation setup, we needed to guarantee no more than two upsets per second, so that MIUs did not trigger DCE-like symptoms. Therefore, we needed to test in proton with very low flux.
Because the probability distributions for DCEs and MCUs are independent, the probability of a DCE occurring due to an MCU is
$$P(\mathrm{DCE}_i \cap \mathrm{MCU}_i) = P(\mathrm{DCE}_i)\, P(\mathrm{MCU}_i) \qquad (5)$$
We can use this equation to determine the approximate number of upsets, $N_i$, needed to trigger a single DCE:
$$N_i \approx \frac{1}{P(\mathrm{DCE}_i)\, P(\mathrm{MCU}_i)} \qquad (6)$$
where $i$ is the size of the event. This equation tells us that it would take at least 10,137 SEUs in proton and 12,658 SEUs in Xe to trigger one 1-bit DCE for this design. To confirm all 285 1-bit DCEs in this design, millions of SEUs would need to be injected at a rate of two SEUs per second. After several hours of proton testing, we were able to induce 16 1-bit DCEs, of which 88% were known fault locations from fault emulation. In the end, the use of confidence intervals is the best approach to low-statistics situations.
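The scale of the problem can be made concrete with the numbers quoted above. The sketch below is simple arithmetic and assumes that confirming every 1-bit DCE needs at least one "SEUs per DCE" batch per location. That is a naive lower bound, since it ignores repeated hits on locations that were already confirmed. The constant and function names are illustrative.

```python
SEUS_PER_DCE_PROTON = 10_137  # from the text: SEUs to trigger one 1-bit DCE
ONE_BIT_DCES = 285            # from the text: 1-bit DCEs in this design
MAX_SEU_RATE = 2.0            # from the text: at most two upsets per second

def beam_time_seconds(n_seus, seu_rate=MAX_SEU_RATE):
    """Minimum beam time needed to accumulate n_seus at the given upset rate."""
    return n_seus / seu_rate

# One 1-bit DCE in proton.
hours_for_one = beam_time_seconds(SEUS_PER_DCE_PROTON) / 3600.0

# Naive lower bound for confirming all 285 1-bit DCEs.
seus_for_all = ONE_BIT_DCES * SEUS_PER_DCE_PROTON
days_for_all = beam_time_seconds(seus_for_all) / 86_400.0

print(f"one DCE:  {SEUS_PER_DCE_PROTON:,} SEUs, at least {hours_for_one:.1f} h of beam")
print(f"all DCEs: {seus_for_all:,} SEUs, at least {days_for_all:.1f} days of beam")
```

Even under this optimistic assumption, a census of the 1-bit DCEs would take weeks of continuous beam time, which is why only a small sample could be confirmed.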
C. UFRGS Tool Validation Without Fault Locations
Correlating results to fault locations can be time consuming and is often only useful for fault models where only one error (even one MCU) exists in the system at a time. For systems that study MIUs or MCUs, using fault locations is not practical, as covering all possible combinations of MIUs or MCUs is intractable. For such a situation, methods that look at ratios of faults to output errors might be more useful. For example, consider a specific design in which 10% of the emulated SEUs cause output errors. If this system properly mimics the fault model for the FPGA and provides complete coverage of the system, one would expect with some confidence that around 10% of the radiation-induced SEUs would cause output errors. Similarly, fault emulation can tell the designers that the mitigated circuit will fail by the time 100 SEUs have accumulated. The number of SEUs in the system can be counted after the system fails in radiation testing to determine if 100 SEUs is the correct value. If the fault emulation and the accelerated testing are testing different parts, it might be necessary to scale the ratio based on the different sizes of the parts.
In 2014, the Federal University of Rio Grande do Sul (UFRGS) published fault emulation and accelerated test results using their fault emulation platform [49]. This work validated the use of N-modular redundancy (NMR) for masking the effects of SEUs on FPGAs. Results from their fault emulation campaigns are shown in Fig. 9(a) and the accelerated test results are shown in Fig. 9(b). The two sets of results are clearly consistent with each other. The trend of requiring more SEUs to cause multiple domains to fail is consistent between both sets of results. Furthermore, except in the case of four domains, the accelerated test results are a subset of the fault emulation results. These results show that the fault emulation system is accurately predicting the mechanism of MIUs in the system.
D. MRM Validation with on-orbit Data
The MRM payload was deployed into low Earth orbit (LEO) in 2011 and collected both SEU data and output behavior. This unique capability allows the on-orbit SEU behavior to be compared with fault emulation predictions that were carried out before the payload was deployed. This application ran for over a year after the system was deployed.
The MRM system has four Xilinx Virtex-4 FPGAs (two XQR4VLX200 and two XQR4VSX55) in two separate units. Both fast and slow fault emulation campaigns were completed in situ on one application, the technology readiness level (TRL) application, before deployment using the MRM payload hardware. MRM is designed to operate through SEUs, as the predicted SEU rate per day was very high. Therefore, the behavior of the payload is similar to the fault emulation system. The fault emulation tests show that there are some single points of failure remaining in the application, but no persistent cross section [70]. Fig. 10 shows a comparison of fault locations in the XQR4VSX55 FPGA that cause output errors. While twice as many sensitive bits were found using the slow fault technique, the locations of the sensitive bits found on the FPGA are consistent.
A total of 6,994 SEUs occurred during the operation of the TRL application. Seven output errors that were coincident with SEUs in the FPGAs were also observed during the execution of the TRL application. Three of the output errors were caused by MIUs, one was caused by an MCU, and three were caused by SBUs. While these three SBUs mimic the fault emulation conditions, their locations are not in the set of known sensitive locations. Table V shows the breakdown of the expected and actual number of output errors by unit and FPGA, which overlap in most cases. Furthermore, while fault emulation predicted that 99% of all SEUs would be masked, the deployed system masked 99.999% of all upsets. No persistent errors occurred during this time period. Given these facts, we suspect that the fault emulation was under-sampled and that further testing would have provided better results.
VII. CONCLUSION
In this paper we have presented several methods for validating FPGA-based fault emulation systems that mimic the effect of SEEs on FPGAs. These techniques have compared both accelerated test results and deployed results against fault emulation. Statistical techniques for determining the confidence of these results were also presented. Several case studies were presented that showed that properly designed fault emulation systems can be used to model some radiation results with acceptable levels of uncertainty.
As more fault emulation systems are validated and demonstrated, it will be possible to compare fault emulation systems against each other. Such comparisons can be used to identify problems in a fault emulation system or insights that one fault emulation system provides over another. Further, fault emulation systems that have been validated against a radiation source may also be used to validate new fault emulation systems. This method would be particularly useful for institutions that are moving from older FPGA systems to newer FPGA systems. Comparing fault locations would be difficult, as many have likely moved to new locations in the new FPGA. The number of fault locations that cause output errors, however, is likely to be similar, as are the affected areas of the design.
Ideally, in-orbit results of FPGA upset behavior can be used to validate fault emulation models. As more FPGAs are deployed in harsh environments like space, the accuracy and confidence of fault emulation systems will increase. As our ability to accurately model the effect of upsets within FPGAs improves, fault emulation will increasingly be used to aid in FPGA validation efforts. Such systems, coupled with radiation testing, will facilitate the deployment of FPGAs in harsh environments like space.
