Problems with terrestrial-based neutron radiation from cosmic rays have become more commonplace. While the incidence rate from neutron radiation is lower than from space-based radiation, physics, system design and system locations have combined to make systems increasingly vulnerable to terrestrial radiation. FPGA systems are particularly sensitive to neutron radiation, as the FPGAs, microprocessors and memory are all sensitive to upsets. We are interested in reconfigurable supercomputers, which need to be highly reliable and highly available despite being very sensitive to radiation. In this paper, we estimate the error rate for FPGAs, memory, and microprocessors so that predictions for the sensitivity of the Cray XD1 reconfigurable supercomputer can be made. We also present possible mitigation methods that are appropriate for neutron radiation upset rates.
Introduction
Mitigating neutron radiation effects is often overlooked when building large scale, terrestrial-based computer systems. In recent years, several cases of cosmic ray upsets (soft errors) in microprocessor systems have been reported [16, 15]. The Q cluster and System X, which were the second and third fastest supercomputers on the November 2003 Top 500 Supercomputer Sites list [1], both experienced fatal soft errors presumably caused by cosmic ray induced neutrons. System X was tested during a Coronal Mass Ejection, which led one architect to joke that they "felt like [they] had not only built the world's third fastest supercomputer, but also one of the world's best cosmic ray detectors" [14]. Over a 10 week period from late April 2004 through early July 2004, the Q cluster experienced an average of 26.1 CPU failures per week [16]. Fatal soft errors deriving from cosmic ray induced neutrons are one example of CPU failures, so this number provides an upper bound for the average weekly number of fatal soft errors experienced by Q over the same period.
Unfortunately, SRAM-based FPGA systems are not exempt from neutron radiation problems. All aspects of FPGA systems are vulnerable to terrestrial-based radiation. Besides the microprocessors, the FPGAs and memory are affected. Memory upsets are the most common side effect of radiation interference. SRAM-based FPGA systems are susceptible to both soft error induced state and configuration changes. Both Xilinx and Altera FPGAs have been shown to upset in terrestrial-based systems [11, 6]. Even ECC-protected memory has issues with neutron radiation [3]. All of these components need to be analyzed to determine the total cross-section (the area sensitive to neutron upsets). We are interested in highly available and reliable reconfigurable supercomputers, which might have thousands of processors, thousands of FPGAs, and many gigabytes of RAM. These systems have very large cross-sections and have noticeable soft error rates (SERs). This paper focuses on estimating the cross-sections for memory, FPGAs and microprocessors, so that the cross-section for large scale systems can be analyzed.
Many designers assumed that shrinking transistor sizes and better manufacturing processes would make soft errors less prevalent, but the opposite has been true. The increase can be explained by a combination of three factors: physics, system design, and system location. First, there are physical issues when transistors shrink. Smaller transistors are upset by smaller changes in charge, so shrinking the transistor size makes them more sensitive. Beyond physics, larger systems and more complex components keep cross-sections increasing. At the microscopic level of system design, such as memory chips, components become more complex each generation. Density, size and geometry combine to make component cross-sections grow [7]. At the macroscopic scale, large multiprocessor and multi-FPGA systems are becoming common. Increasing the number of devices increases the system cross-section. The third issue is the location of the system. Airborne multiprocessor systems are not uncommon, such as the multiprocessor SAR implementation that was designed for a Boeing 707 [9], and airborne FPGA-based radar systems are being researched [10]. High altitude military and commercial aircraft systems encounter more than 300 times the neutron radiation of a sea-level system. With all of these factors, we believe problems with atmospheric radiation will become more severe as designers and manufacturers push the limits of system complexity and use.
The expense of soft errors in terrestrial systems is not straightforward. On one hand, a company's goodwill can be devalued by unmitigated or unshielded soft errors in their manufactured systems. Sun Microsystems received negative press from the way they handled soft errors in their high-end servers [15]. On the other hand, since the upset rates are lower than in space-based systems, the cost of shielding or mitigating might be very high as well.
System designers who specialize in highly available, highly reliable systems, such as reconfigurable supercomputers, need to mitigate or shield neutron radiation. Shielding systems from neutron radiation is a challenge, as neutrons can bore through five feet of concrete. Shielding for neutron radiation is usually done with meters of concrete, rock, dirt, and water. IBM tested their DRAM chips in an "underground vault, shielded by 50 feet of rock" [5] to be able to test their components without cosmic ray interference. Since shielding is difficult, mitigation methods are needed. Current mitigation methods, such as Triple Modular Redundancy (TMR), are designed for larger incidence rates than the SERs from neutron radiation, which makes the increase in power, speed, and area too expensive. Low-impact mitigation methods that guarantee reliability without sacrificing speed, power or area are needed. Especially attractive are methods that can be tuned to match the expected SER for a given location and system. Ideally, these methods will be researched now while the problem only affects the very high-end large-scale computing systems and not every high-end commodity server. Section 3 of this paper presents several options for low-impact mitigation.
In this paper we explore estimating cross-sections and mitigating soft errors. First, the cross-sections for FPGAs, memory chips, and microprocessors are estimated in Section 2. From these estimates we gauge the SER of the Cray XD1 reconfigurable supercomputer. In Section 3 low-impact mitigation methods are presented. The paper concludes with future work.
Estimates
In this section estimates for the SER for a variety of hardware components for nine locations are presented.
The hardware components that we analyze are FPGAs, memory chips and microprocessors. As neutron flux is location-dependent, we estimate the performance of these hardware components for a variety of locations, as listed in Table 1 . We used a variety of methods to determine the neutron flux, including the JEDEC Standard [12] , a Xilinx talk on neutron testing [11] , and the cosmic ray intensities from IBM [28, 8] . Note that there are several high altitude aircraft options, such as the U-2 and the ER-2, which operate in the 55,000-70,000 feet range [22, 17] . These aircraft operate at an altitude where neutron flux peaks and represent the worst case scenario for these estimates. 
General Trends in Estimating SER
Before continuing, we note some general trends in these estimates. SER is proportional to both device size and flux. Almost all hardware components have problems with soft errors at high altitudes or in large systems. Therefore, we expect the upset rate to be significantly worse when we increase the altitude or system size. The point of this paper is not to beat up on vendors but to look at these components realistically so that system designers can build reliable, available systems.
We estimate the SER of systems from the SER of a reference system. The equation for determining SER in the reference system is:

$$\mathrm{SER} = \mathrm{flux} \times \sigma_{dev} \qquad (1)$$

where $\sigma_{dev}$ is the device cross-section. In this paper, there are three types of cross-sections: system ($\sigma_{sys}$), device, and bit ($\sigma_{bit}$). Equation 1 holds for all three types of cross-sections. We assume the difference in SER between two locations is the ratio of the fluxes.
We also assume that the difference in SER between two systems is the ratio of the system sizes, as the cross-section tends to not change much between device families [11]. If sys1 and loc1 are the system size and location of the reference system, then the SER of another system with size sys2 and location loc2 is estimated as:

$$\mathrm{SER}_{sys2,loc2} = \mathrm{SER}_{sys1,loc1} \times \frac{\mathrm{flux}_{loc2}}{\mathrm{flux}_{loc1}} \times \frac{sys2}{sys1} \qquad (2)$$

In Equation 2 the ratio between the two fluxes scales the SER from one location to the next. The difference in two systems is determined by:

$$\frac{sys2}{sys1} = \frac{dev\_size_{2} \times num\_devs_{2}}{dev\_size_{1} \times num\_devs_{1}} \qquad (3)$$

where dev_size is the number of memory bits in each device, and num_devs is the number of devices in the system. These equations provide order of magnitude estimates for the soft error rate from a limited amount of experimental and theoretical soft error data and the model of estimating flux. Besides scaling for flux and system size, there are several factors that might necessitate lowering (derating) or raising (uprating) the SER. The SER for a system is dependent on what the system is calculating. If only a fraction of the system is being used, then the SER for the system should be derated. Uprating might be needed in cases where the estimates from older technology are used to predict the SER for newer technology. Finally, when the estimates are based on accelerator data, the SER from atmospheric conditions is often lower. These factors are discussed below.
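The scaling described above (a flux ratio multiplied by a system-size ratio) can be sketched in a few lines of Python. This is a minimal illustration that assumes SER scales linearly with both flux and total bit count; the function and parameter names are hypothetical, and the reference SER and flux values in the usage lines are placeholders rather than measured data. Only the per-device bit count (1.96E9 bits over 100 devices) comes from the text.

```python
def system_size(dev_size_bits, num_devs):
    """Total number of sensitive bits: bits per device times device count."""
    return dev_size_bits * num_devs


def scale_ser(ser_ref, flux_ref, flux_new, size_ref, size_new):
    """Scale a reference SER by the flux ratio and the system-size ratio."""
    return ser_ref * (flux_new / flux_ref) * (size_new / size_ref)


# Placeholder reference system: 1 upset per 1000 hours, arbitrary flux units.
ser_ref = 1.0 / 1000.0
ref_size = system_size(19_600_000, 100)   # 100 devices
big_size = system_size(19_600_000, 500)   # 500 devices, same location

ser_big = scale_ser(ser_ref, flux_ref=1.0, flux_new=1.0,
                    size_ref=ref_size, size_new=big_size)
mttu_big_hours = 1.0 / ser_big            # MTTU is the reciprocal of SER
print(mttu_big_hours)
```

With the flux held fixed, five times as many devices gives one fifth of the reference MTTU, which is the behavior the estimates in the following sections rely on.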
The SER is design- and data-dependent. The bit cross-section values for FPGAs and memory chips we present assume that the entire component is used and that each bit is important. In reality, most designs do not use the entire device and some upsets are masked by the design. Therefore, the estimated SERs might be overstated for any given design. The general "rule of thumb" is to derate by a factor of 5-20 for system utilization.
The transistor size also plays an important role in the SER. Most of the estimates in this paper were made with one transistor size. Smaller transistor sizes are more likely to upset which might make the SERs larger than we estimated. In our ongoing proton radiation research, which mimics neutron radiation at high energies, we found that the Virtex II is 30 times more likely to have multi-bit upsets (MBUs) than the Virtex I [13, 20] . Therefore, we expect single-bit and multi-bit upsets to increase as the transistor size shrinks, but the uprating factor is still under research.
Finally, the data in this paper is from both accelerator and atmospheric tests. Atmospheric testing is ideal but time consuming. Accelerator results tend to report larger SERs than atmospheric testing, so the results need to be derated. Xilinx has researched the derating factor between accelerator and atmospheric tests. Their result is the "Rosetta Factor" of 1.5 [11].
There are several factors that can affect the SER for an actual system. These estimates are guidelines to the worst possible scenario. The actual SER for a system should be determined by testing the design with the expected data.
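The derating factors discussed above can be applied as simple divisors on a worst-case SER. The sketch below assumes the utilization factor (5-20) and the accelerator-to-atmospheric factor (1.5) are independent and therefore multiply; that independence, the function name, and the placeholder worst-case SER are assumptions made for illustration.

```python
def derate_ser(ser, utilization_factor=1.0, accelerator_factor=1.0):
    """Lower a worst-case SER by the given derating factors (each >= 1)."""
    if utilization_factor < 1.0 or accelerator_factor < 1.0:
        raise ValueError("derating factors must be >= 1")
    return ser / (utilization_factor * accelerator_factor)


worst_case = 1.0e-3  # placeholder worst-case SER in upsets per hour

# Bracket the estimate with the two ends of the 5-20 rule of thumb.
optimistic = derate_ser(worst_case, utilization_factor=20, accelerator_factor=1.5)
pessimistic = derate_ser(worst_case, utilization_factor=5, accelerator_factor=1.5)
print(optimistic, pessimistic)
```

Reporting both ends of the range keeps the order-of-magnitude character of the estimate visible instead of implying a single precise value.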
FPGAs
In this section, we present an analysis of Xilinx, Actel, and Altera FPGA chips for soft errors. These estimates are based on three sets of previously published test results from Xilinx, Altera, and iRoC Technologies [11, 6, 21]. The Rosetta NSEU Test is an ongoing Xilinx test that does atmospheric testing of Xilinx chips. The Virtex-II Rosetta system is 100 XC2V6000 chips arranged in a 10x10 matrix for a total of 1.96E9 configuration bits. There are four of these Rosetta systems operating at different altitudes, as shown in Table 2. The next set of results presented are from the iRoC Technologies test commissioned by Actel for full spectrum neutron beam analysis. In this test, several FPGA chips (AX1000, APA1000, XC2V3000, XC3S1000, EP1C20) were tested at the Los Alamos Neutron Science Center (LANSCE) neutron accelerator. We correlate the iRoC Technologies test results with the Rosetta test results. The Altera test presents data for the SER of Altera FPGA devices.
Xilinx Rosetta Test
In the Rosetta data there is a strong correlation between flux and the number of upsets. When the upsets for the four locations are normalized to upsets/year, the relationship between flux and upsets/year for the Rosetta system is:
Equation 4 determines the SER for a year and can be divided by 8760 hours/year to determine the SER per hour. Assuming the average bit cross-section for the Rosetta system is stable, we can scale flux and cross-sections using Equation 2 to determine how different locations and system configurations are affected by soft errors. First, we increase flux. Since altitude has a first order effect on flux, we expect the SER to increase with altitude. For the locations in Table 1 that are not in Table 2, we estimate the MTTU for a Rosetta system in Table 3 using Equation 4 and an average bit cross-section of 2.02E-14. At high terrestrial altitudes, the Rosetta system upsets 21.3 times more than at sea level. High altitude aircraft levels upset 327.8 times more than at sea level. The estimated results from the ratios for the four tested locations are consistent with the actual test results for these locations. Therefore, we should be able to use the ratio of the fluxes to estimate the SER for any given altitude using the sea-level data. While flux is bounded by location and is at its maximum at 60,000 feet, system cross-sections are not bounded and can increase in two dimensions. Either the individual device size or the number of devices can increase to make the cross-section larger. The results of increasing both dimensions are presented below.

Table 4. MTTU (hours) for Different Systems
First, we increase the system cross-section by increasing the number of devices in the Rosetta system. Using the average bit cross-section of 2.02E-14, we calculated the MTTU for 500 and 1000 device Rosetta systems for all of the locations in Table 1. These results, shown in Table 4, scale by the product of the flux and the bit cross-section ratios. As expected, the MTTU for the 500 device San Jose system decreased by a factor of 5 from the 100 device system. Similarly, the Mauna Kea 500 device Rosetta system has an MTTU that is 74 times smaller than the San Jose 100 device Rosetta system, which approximates the value of the flux ratio multiplied by the device number ratio.
Next we increased the individual device sizes in the Rosetta system. Table 4 shows the estimates for using the 4VFX140 in 100, 500 and 1000 device Rosetta systems. The estimated number of configuration bits for the 4VFX140 device is 40,678,656 bits, which was calculated by subtracting the number of bits devoted to block RAM from the bitstream length [24, 25]. Because the 4VFX140 device is roughly twice the size of the XC2V6000, the MTTU decreases by roughly half. Therefore, the 1000 XC2V6000 device Rosetta system is expected to upset every 32 minutes at 60,000 feet, and the 1000 4VFX140 device Rosetta system is expected to upset every 15.6 minutes at the same altitude.
Chip        Soft Errors (FITs)
AX1000      < 0.082
APA1000     < 0.038
XC2V3000    8680
XC3S1000    1240
EP1C20      N/A
iRoC Technology Test
The iRoC Technology test most similar to the Rosetta test is their full spectrum test done at LANSCE. The full spectrum neutron beam at LANSCE simulates the range of energies seen in terrestrial radiation. The LANSCE test includes five chips: AX1000, APA1000, XC2V3000, XC3S1000, and EP1C20. We are particularly interested in their SEU testing shown in Table 5. The two Actel chips (AX1000 and APA1000) are not expected to experience SEUs, since the configuration is not stored in volatile memory. The remaining chips are SRAM-based and are expected to experience soft errors. The Altera chip (EP1C20) did not have readback ability, so the number of soft errors is unknown. For our analysis, we looked at the SER for individual chips and for 100 device Rosetta-like systems. Finally, we extrapolate back to the 100 device XC2V6000 Rosetta system so that the two tests can be compared. The LANSCE data in Table 5 is in failures per 1E9 hours (FITs). First we convert from FITs to MTTU (hours). Next, we extrapolate the sea-level results to the other eight locations using Equation 2. The MTTU data for all of the locations and the Actel and Xilinx chips are shown in Table 6.
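The FIT-to-MTTU conversion mentioned above follows directly from the definition of a FIT as one failure per 1E9 device hours. A minimal sketch, using the two SRAM-based Xilinx entries from Table 5 (the function names are illustrative):

```python
HOURS_PER_FIT_WINDOW = 1e9  # one FIT is one failure per 1E9 hours


def fit_to_mttu_hours(fit):
    """Mean time to upset, in hours, for a failure rate given in FITs."""
    return HOURS_PER_FIT_WINDOW / fit


def mttu_hours_to_fit(mttu_hours):
    """Inverse conversion: failure rate in FITs for an MTTU in hours."""
    return HOURS_PER_FIT_WINDOW / mttu_hours


table5_fits = {"XC2V3000": 8680.0, "XC3S1000": 1240.0}
for chip, fit in table5_fits.items():
    print(chip, round(fit_to_mttu_hours(fit)), "hours")
```

The Actel entries are reported only as upper bounds on the FIT rate, so the same conversion would give lower bounds on their MTTU rather than point estimates.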
Table 7. MTTU (hours) for Rosetta-like Systems (iRoC)
The results of Rosetta-like systems are in Table 7 . These results are the Table 6 results derated by both system size (100) and the Rosetta factor (1.5). To be able to compare the iRoC test to the Rosetta test, we estimate the SER for a 100 device XC2V6000 Rosetta system from the 100 device XC2V3000 Rosetta-like system. The XC2V3000 device is smaller than the XC2V6000 device by a factor of 2.2, so the MTTU was derated by 2.2. These results indicate an MTTU that is 2.37 times smaller than the Rosetta data, but show a strong correlation between the two tests.
Altera Test
Altera Corporation presented soft error test results for four of their SRAM FPGAs (EP1C6, EP1C20, EP1S25, EP1S80) for San Jose in 2004 [6]. These results are shown in Table 8 in FITs. For our analysis, we converted these results to MTTU in hours, as shown in Table 9, so that the results could be compared with the other two tests. Next we extrapolated the sea-level data to the other eight locations using Equation 2. Finally, we extrapolated the data from a one chip system to a 100 device Rosetta-like system.
Table 9. MTTU (hours) for Single Chips (Altera)

With the Rosetta system data we can compare these results to the Xilinx data from the Rosetta and iRoC Technologies tests. On the surface, the MTTU in hours is similar to that of the Rosetta data. In particular, the 100 device EP1S80 Rosetta system upsets on the same scale as the original 100 device XC2V6000 Rosetta system. The bit cross-section is calculated using Equation 1, and these results can be fully compared to the Xilinx data. Table 10 shows the bit cross-sections for the four Altera chips. The Altera bit cross-sections are slightly smaller than the Xilinx bit cross-sections, but of the same magnitude. Therefore, Altera SRAM FPGAs might be more resistant to soft errors than Xilinx SRAM FPGAs but only by a very small margin.
Chip      Bit Cross-Section
EP1C6     1.49E-14
EP1C20    1.42E-14
EP1S25    1.72E-14
EP1S80    1.75E-14

Table 10. Bit Cross-Section for Altera Chips
Conclusion
From these results, we are able to estimate the order of magnitude of the soft error rate of FPGAs using reference systems. Using Equation 2 we can estimate the soft error rate of a given SRAM FPGA from a reference SRAM FPGA as long as the flux and cross-sections are known.
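To make the role of the bit cross-section concrete, the sketch below treats SER as the product of flux, bit cross-section, and bit count, and rearranges that product to recover a bit cross-section from a measured SER. The product form is the assumption here. The bit count and bit cross-section are the Rosetta values quoted earlier; the flux value is a placeholder chosen only for illustration and is not taken from the tables in this paper, and the units (neutrons per unit area per hour, area per bit) are assumed to be mutually consistent.

```python
def ser_from_cross_section(sigma_bit, flux, num_bits):
    """SER (upsets per hour) as flux times bit cross-section times bits."""
    return flux * sigma_bit * num_bits


def bit_cross_section(ser_per_hour, flux, num_bits):
    """Rearranged form: bit cross-section recovered from a measured SER."""
    return ser_per_hour / (flux * num_bits)


ASSUMED_FLUX = 14.0           # placeholder flux, NOT a value from this paper
ROSETTA_BITS = 1.96e9         # configuration bits in the 100 device system
ROSETTA_SIGMA_BIT = 2.02e-14  # average bit cross-section quoted in the text

ser = ser_from_cross_section(ROSETTA_SIGMA_BIT, ASSUMED_FLUX, ROSETTA_BITS)
print(1.0 / ser)  # MTTU in hours under the assumed flux
```

Because the flux is a placeholder, the printed MTTU only indicates the order of magnitude; the point of the sketch is the round trip between SER and bit cross-section.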
Memory Cells
For estimating memory's MTTU, we use data from two different tests: IBM's sea-level tests of three different types of memory and iRoC Technology's multiple altitude atmospheric tests [5, 17]. The iRoC Technology data, shown in Table 11, exhibits how the upset rate per bit increases several orders of magnitude from sea level to 65,000 feet for SRAM and DRAM, which appear to be without ECC protection. While these upset rates seem very small, memory systems often have $10^6$ to $10^9$ bits. Table 11 shows that 1 GB of memory experiences 125 upsets every hour at 65,000 feet. Therefore, soft errors are more likely to manifest in the memory system than in FPGAs. Tables 11 and 12 are correlated: the parity DRAM estimates are within an order of magnitude of the atmospheric tests.

Table 11. iRoC Technology Memory tests
These estimates show a drastic increase in the number of upsets above sea level for parity DRAM. One gigabyte of RAM is not unreasonable; many commodity machines have 1 GB of RAM. Servers regularly have an order of magnitude more than 1 GB DRAM. Therefore, ECC- or Chipkill-protected RAM should be used in a server- or supercomputer-grade computing system. ECC-protected RAM is significantly more expensive, slower, and needs more power than non-parity RAM. Consequently, many system designers still use non-parity RAM.
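The memory figures above scale linearly with capacity, which a short sketch makes explicit. It starts from the quoted rate of 125 upsets per hour for 1 GB at 65,000 feet and assumes 1 GB means 2^30 bytes; that convention and the function name are assumptions for illustration.

```python
BITS_PER_GB = 8 * 2**30            # assumes 1 GB = 2**30 bytes

UPSETS_PER_HOUR_PER_GB = 125.0     # quoted for unprotected memory at 65,000 feet
rate_per_bit = UPSETS_PER_HOUR_PER_GB / BITS_PER_GB


def upsets_per_hour(gigabytes):
    """Expected upsets per hour for a memory of the given size."""
    return rate_per_bit * gigabytes * BITS_PER_GB


one_gb = upsets_per_hour(1)
ten_gb = upsets_per_hour(10)       # a server with an order of magnitude more RAM
seconds_between_upsets = 3600.0 / one_gb
print(one_gb, ten_gb, seconds_between_upsets)
```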
Microprocessors
Soft errors in microprocessors have been important for manufacturers for several years now. Since microprocessors are both extensively manufactured and often need to be highly reliable, unmitigated soft errors can be very expensive. Most server-grade microprocessors have ECC-protected L2 and L3 caches and parity-enabled L1 caches, as well as cache scrubbing. Cache scrubbing in either hardware or software should drastically reduce the number of failures due to soft errors.
Cosmic ray interference is still present in microprocessor systems, though. Many current and most old microprocessors do not have protected caches. Due to the financial advantage of using commodity parts, many large scale microprocessor systems use older, cheaper parts, which usually have unprotected L2 caches. Despite its small size, an L2 cache is still susceptible to soft errors. An L2 cache is usually between 512 KB and 2 MB of SRAM. Table 11 shows that the MTTU for a 1 MB unprotected L2 is nearly 1370 years at sea level. At 60,000 feet 1 MB of unprotected L2 cache upsets every four years. As with FPGAs, the more microprocessors in a system, the larger the cross-section and the smaller the MTTU. Table 14 shows the MTTU for Q Cluster sized systems. These numbers show that a cluster this size would expect to upset every 61 days at sea level and every 4.5 hours at 60,000 feet. ECC-protected memory upsets are rarer: one upset per 35 million years at sea level and one per 107,000 years at 60,000 feet for 1 MB. For Q Cluster sized systems, the MTTU is 4270 years at sea level and 13 years at 60,000 feet. Therefore, for large clusters, only microprocessors with ECC-protected caches should be used.
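The cluster figures above are the per-cache MTTU divided by the number of processors. The sketch below reproduces the sea-level numbers under one assumption that is not stated in the text: a processor count of 8192 for a Q Cluster sized system, chosen because it is consistent with the quoted results. The 365.25 days per year conversion is likewise an assumption.

```python
def cluster_mttu(device_mttu, num_devices):
    """MTTU of a group of identical devices, in the same unit as device_mttu."""
    return device_mttu / num_devices


Q_PROCESSORS = 8192      # assumed processor count, not given in the text
DAYS_PER_YEAR = 365.25   # assumed conversion

# Unprotected 1 MB L2 at sea level: 1370 years per cache.
unprotected_sea_days = cluster_mttu(1370.0, Q_PROCESSORS) * DAYS_PER_YEAR

# ECC-protected 1 MB at sea level: 35 million years per cache.
ecc_sea_years = cluster_mttu(35e6, Q_PROCESSORS)

print(round(unprotected_sea_days), round(ecc_sea_years))
```

Under these assumptions the unprotected cluster lands near the quoted 61 days and the ECC-protected cluster near the quoted 4270 years.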
The second problem is that the caches are not the only memory in a microprocessor. Approximately 10% of the memory cells in the SPARC v8 microprocessor are in register files [18]. Register files are often not parity-enabled or ECC-protected, which means that an upset to a register file, while rare, can be catastrophic. Without parity, register upsets silently and undetectably corrupt register data which might contain

Table 15. FPGA Parts in the XD1
Large Scale Systems: The Cray XD1 Reconfigurable Supercomputer
The Cray XD1 system is a reconfigurable supercomputer with both standard microprocessors and FPGAs connected through a high speed interconnect. One chassis has 12 Opteron microprocessors, 26 Xilinx FPGA chips (outlined in Table 15) and 24 GB of ECC-protected RAM. Assuming a worst case scenario for the Opterons, where the MBUs don't get scrubbed out of the 1 MB L2 cache, we estimated the MTTU for the XD1 system, as shown in Table 16. To determine the FPGA upset rate, the number of configuration bits is needed; these were determined by subtracting the number of block RAM bits from the configuration bitstream sizes for consistency with the previous estimates [23, 26]. The total number of configuration bits for all of the FPGA parts is 243,626,752 bits per chassis, which is an order of magnitude smaller than the Rosetta system. The MTTU for the entire system was predicted by adding together the data for the FPGA, memory and microprocessor components. Finally, we scaled the MTTU to multi-chassis systems in Table 17. We scaled these systems by the number of racks, where 12 chassis make one rack. The last column of data is a 57 rack XD1 Cray system that has roughly the same number of microprocessors as the Q Cluster.
These results show that the FPGAs and memory have nearly the same effect on the MTTU, even though the ECC-protected RAM has roughly 100 times more bits than the FPGAs. Therefore, both the memory and the FPGAs need to mitigate soft errors. The larger scale system estimates in Table 17 indicate that a Q-sized XD1 cluster will upset every 92 minutes in Los Alamos, whereas the Q Cluster upsets approximately every 8 hours. With almost 18,000 FPGAs and 16,500 GB of ECC-protected memory, a Q-sized XD1 cluster pushes the system design boundaries. Quite possibly these numbers will need to be derated for incomplete system utilization. This result indicates that soft error mitigation is needed to make large reconfigurable supercomputing clusters viable.
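Two steps in this section lend themselves to a short sketch: combining component estimates into a system MTTU, and scaling part counts by racks. The sketch assumes that "adding together the data" means summing the component upset rates (reciprocals of MTTU), which is the usual treatment for independent upset sources; the component MTTUs in the usage lines are placeholders, while the per-chassis part counts come from the text.

```python
CHASSIS_PER_RACK = 12
OPTERONS_PER_CHASSIS = 12
FPGAS_PER_CHASSIS = 26
RAM_GB_PER_CHASSIS = 24


def combined_mttu(component_mttus):
    """System MTTU from component MTTUs, assuming upset rates simply add."""
    total_rate = sum(1.0 / mttu for mttu in component_mttus)
    return 1.0 / total_rate


def xd1_totals(racks):
    """Part counts for an XD1 installation with the given number of racks."""
    chassis = racks * CHASSIS_PER_RACK
    return {
        "opterons": chassis * OPTERONS_PER_CHASSIS,
        "fpgas": chassis * FPGAS_PER_CHASSIS,
        "ram_gb": chassis * RAM_GB_PER_CHASSIS,
    }


# Placeholder component MTTUs (hours) for FPGAs, memory, and processors.
print(combined_mttu([100.0, 100.0, 1000.0]))
print(xd1_totals(57))
```

For 57 racks the counts come out at 8208 Opterons, 17,784 FPGAs and 16,416 GB of RAM, in line with the rounded figures quoted above.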
Summary
While the error rates caused by terrestrial-based neutron flux might be significantly smaller than the error rates for space-based radiation, the number of possible upsets and the cost of these upsets are not insignificant. Therefore, reliability methods targeted at mitigating upsets that are expected to occur in the hourly or daily time frames are needed.
Low-Impact Mitigation Schemes
Due to the low SER from neutron radiation, low-power and low-area mitigation methods are needed. There are many common sense approaches to system design that can be employed, such as using ECC-protected RAM and microprocessors. Using either the host's processor or an embedded processor for mitigation, instead of the FPGA, can reduce the amount of power and area needed. Likewise, mitigation methods that can be tuned to match the SER can lower the amount of power needed to detect/mitigate upsets.
This section presents several possible low-impact mitigation methods. These processes are broken into two broad categories: support logic methods and partial configuration methods. Support logic scenarios are implemented at the design level. Partial configuration methods focus on using the device's readback support to determine the current state of the device, and then use either the FPGA or a processor for detection/correction processing.
Support Logic Methods
Two support logic methods are presented in this paper. The first is a Xilinx design for the Virtex II Pro devices and uses one of the embedded processors. The second method is selective TMR, which automatically applies TMR to the critical areas of a design.
SEU Controller
Xilinx has provided a solution for soft error detection and correction for the Virtex II Pro FPGA family called the SEU controller module [4]. This module uses the ICAP interface and a PowerPC 405 core. The module takes between 0.10 and 0.96 seconds to completely scan an entire device, depending on device size. The frequency of the readback is user determined, so power consumption can be reduced by matching detection/correction cycles with SERs.
Since this module is implemented on the FPGA fabric it is susceptible to soft errors itself. For an XC2VP50 device, the module takes up 0.7% of the device area. Xilinx determined that the module has a mean time between failures (MTBF) of nearly 28,600 years [4]. The module was designed to detect any soft errors originating from the SEU controller module.
Selective Triple Modular Redundancy
TMR is commonly used in space-based applications to mitigate soft errors. A full TMR implementation of a design increases the power and area consumption by a factor of three. Experimentation has been done on applying TMR selectively to designs [19] . In selective TMR, the design is analyzed for critical gates. All critical gates are triplicated and the necessary voters are inserted. This method is useful when only a small percentage of gates are critical. For circuits with a high percentage of critical gates, the hardened design approximates a full TMR implementation.
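The voting step and the area trade-off of selective TMR can be illustrated in software. The sketch below shows a bitwise 2-of-3 majority voter and a rough area factor that counts two extra copies of each critical gate and ignores the cost of the voters themselves; ignoring voter cost is a simplifying assumption, and neither function is taken from the selective TMR tool cited above.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer words."""
    return (a & b) | (a & c) | (b & c)


def area_factor(critical_fraction):
    """Area relative to the unhardened design, ignoring voter cost."""
    if not 0.0 <= critical_fraction <= 1.0:
        raise ValueError("critical_fraction must be between 0 and 1")
    return 1.0 + 2.0 * critical_fraction


# One of the three copies has two flipped bits; the voter masks them.
good = 0b1010
upset = 0b0110
print(bin(majority_vote(good, good, upset)))

# A design with 10% critical gates versus a fully critical design.
print(area_factor(0.10), area_factor(1.0))
```

When every gate is critical the factor reaches 3, which matches the full TMR case described above.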
Partial Configuration Methods
Readback allows the designer to access the device's current configuration state, which can be analyzed for configuration upsets. There are limitations to readback, such as not being able to write to block RAM and LUT RAM during readback without data corruption. Therefore, processing during readback is risky. Since the SER is very low, halting computation temporarily to perform readback is not very invasive. Below we present three different methods for doing detection and correction based on readback. The first method, single frame correction, uses CRC frame checks to detect upsets in the readback. The next method, processor-based detection of critical upsets, reduces power consumption by using the host processor. Finally, scrubbing is a method of reconfiguring the FPGA in anticipation of an upset. These methods are covered below.
Single Frame Correction
To implement single frame correction, readback supplies the current state of the frames and CRC frame checks are used as part of the detection/correction phase. Detection/correction computations are done in parallel with readback so that the entire process takes no longer than readback. Readback execution times are dependent upon the number of frames in the device, the number of bits per frame and the interface used for readback. For an XC2VP100 with 3500 frames and 9792 bits/frame, readback takes approximately 90 ms on the SelectMap interface and 1 s on the JTAG interface. For the Virtex 4 family, Xilinx has included ECC logic that calculates whether there has been an upset in the configuration data while doing a readback [24].
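The detect-then-correct flow described above can be mimicked in software. The following sketch keeps one reference CRC per frame, flags frames whose readback CRC differs, and rewrites only those frames from a golden copy. It is a stand-in for illustration: Python's `zlib.crc32` is used in place of the device's own frame check, the all-zero frames are toy data, and the function names are hypothetical.

```python
import zlib

FRAME_BYTES = 9792 // 8  # one 9792-bit frame, as quoted for the XC2VP100


def frame_crcs(frames):
    """Compute one reference CRC per configuration frame."""
    return [zlib.crc32(bytes(frame)) for frame in frames]


def find_upset_frames(golden_crcs, readback_frames):
    """Return indices of frames whose CRC no longer matches the reference."""
    return [i for i, frame in enumerate(readback_frames)
            if zlib.crc32(bytes(frame)) != golden_crcs[i]]


def repair_frames(readback_frames, golden_frames, bad_indices):
    """Overwrite only the frames flagged as upset with their golden copies."""
    for i in bad_indices:
        readback_frames[i] = bytearray(golden_frames[i])
    return readback_frames


golden = [bytes(FRAME_BYTES) for _ in range(4)]   # toy all-zero bitstream
readback = [bytearray(frame) for frame in golden]
readback[2][10] ^= 0x01                           # inject a single-bit upset

bad = find_upset_frames(frame_crcs(golden), readback)
repair_frames(readback, golden, bad)
print(bad)
```

Checking per frame keeps the repair local: only the frame that failed its check is rewritten, rather than the whole configuration.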
Single frame correction for the entire chip must be faster than the SER. For example, assuming single frame correction for an XC2VP100 processes in parallel with readback, the SER cannot exceed 11 upsets per second. For neutron radiation, the SER is several orders of magnitude lower than the rate at which single frame correction can run. Therefore, power can be conserved by running the process intermittently.
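The readback times and the resulting ceiling on the tolerable upset rate can be approximated from the frame geometry. The sketch below assumes an 8-bit SelectMap bus clocked at 50 MHz and a 1-bit JTAG port clocked at 33 MHz; these interface widths and clock rates are assumptions chosen for illustration, not values given in the text, and real readback adds command overhead that is ignored here.

```python
def readback_seconds(frames, bits_per_frame, bus_width_bits, clock_hz):
    """Idealized readback time: total bits over interface throughput."""
    return frames * bits_per_frame / (bus_width_bits * clock_hz)


FRAMES = 3500
BITS_PER_FRAME = 9792

# Assumed interface parameters (not from the text).
selectmap_s = readback_seconds(FRAMES, BITS_PER_FRAME, bus_width_bits=8,
                               clock_hz=50e6)
jtag_s = readback_seconds(FRAMES, BITS_PER_FRAME, bus_width_bits=1,
                          clock_hz=33e6)

# Using the quoted 90 ms figure, correction keeps up with at most this many
# upsets per second when it runs in parallel with readback.
max_upsets_per_second = 1.0 / 0.090

print(selectmap_s, jtag_s, max_upsets_per_second)
```

Under these assumed clocks the idealized times fall close to the quoted 90 ms and 1 s, and the 90 ms figure gives the ceiling of about 11 upsets per second.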
Processor-based Detection of Critical Upsets
This detection/correction scheme maintains a database of bad bits for comparing readback data. The host processor determines if the configuration has been upset while the FPGA is computing. The database entries are marked as either critical or sub-critical, as with selective TMR. Once upsets are prioritized as being critical or sub-critical, a repair cycle can be triggered in the appropriate time frame. With sub-critical upsets, computation continues to the next checkpoint before repairing, whereas critical upsets trigger an immediate repair cycle. As with the single frame correction scheme, the entire detection/correction process is faster than the rate of upset, so power conservation can be explored.

An important aspect of processor-based detection is a method to communicate from the FPGA to the host processor that a detection process needs to be started. Exception-based processing schemes allow the FPGA to raise an exception or an interrupt when an upset is suspected or detected. In this scenario, the FPGA processes normally until an upset either occurs or has likely occurred. For instance, voters can be used to raise an exception when inputs don't match. Likewise, readback can be used to raise exceptions. This scenario could help integrate FPGAs into an XD1-style system, as exceptions and interrupts are commonly used in many types of computer systems to alert the processor of issues with the peripheral devices.
Exception-based FPGA processing in the presence of neutron radiation upsets allows the designer to be more reactive than proactive when mitigating upsets. The detection/correction process would not be executed until needed.
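The host-side decision described in this section can be sketched as a comparison of readback data against a golden copy followed by a lookup of the flipped bits. In the sketch, configuration data is modeled as a Python integer used as a bit vector and the database is a set of bit positions marked critical; both representations, along with the function names and the returned action labels, are assumptions made for illustration.

```python
def flipped_bits(golden, readback):
    """Return the bit positions where two configuration words differ."""
    diff = golden ^ readback
    positions = []
    index = 0
    while diff:
        if diff & 1:
            positions.append(index)
        diff >>= 1
        index += 1
    return positions


def plan_repair(golden, readback, critical_bits):
    """Decide when to repair, based on which upset bits are marked critical."""
    upsets = flipped_bits(golden, readback)
    if not upsets:
        return "no_action", upsets
    if any(bit in critical_bits for bit in upsets):
        return "repair_now", upsets
    return "repair_at_checkpoint", upsets


critical_bits = {3, 7}   # hypothetical database entries
golden = 0b0000_0000

print(plan_repair(golden, 0b0000_0100, critical_bits))  # sub-critical upset
print(plan_repair(golden, 0b1000_0000, critical_bits))  # critical upset
```

In an exception-based arrangement, a function like `plan_repair` would run only after the FPGA signals the host, so the comparison costs nothing while no upset is suspected.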
Scrubbing with Cyclic Redundancy Check
Scrubbing reloads the CLB Frames from the bitstream [2]. When combined with TMR, the failure rate due to neutron radiation induced bitstream upsets is zero as long as the SER is low [27]. For this scheme, support circuitry is needed to generate the memory map addresses and control the memory and SelectMap interfaces. Scrubbing is a more lightweight process than single frame correction, as there is no detection or correction. Scrubbing can also be run while the device is processing data. There is some risk with using scrubbing, though, in the rare but possible case that the SelectMap interface is corrupted by a soft error. The scrub rate is dependent on the SER. Xilinx recommends scrubbing at a rate 10 times greater than the upset rate to guarantee that no more than one upset will be present between scrubs [2]. For an SER of 13 upsets per week (or an upset every 12.92 hours), this would mean scrubbing the device roughly every 80 minutes. One possibility to save power and to increase the probability of catching all of the upsets would be to perform scrubbing in conjunction with either single frame correction or selective TMR.
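The scrub interval in the example above follows from dividing the mean time between upsets by the recommended factor of 10. A minimal sketch of that arithmetic (the function name and default argument are illustrative):

```python
HOURS_PER_WEEK = 168.0


def scrub_interval_hours(upsets_per_week, scrub_factor=10.0):
    """Scrub period when scrubbing scrub_factor times faster than upsets."""
    mean_hours_between_upsets = HOURS_PER_WEEK / upsets_per_week
    return mean_hours_between_upsets / scrub_factor


interval_hours = scrub_interval_hours(13)
interval_minutes = interval_hours * 60.0
print(round(interval_minutes, 1))
```

The result is a little under 78 minutes, which the text rounds to 80 minutes. Because the function takes the upset rate as a parameter, the same calculation can be repeated for any location or system size estimated in Section 2.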
Future Work and Conclusions
In conclusion, we expect mitigation of soft errors to become more important in future system design due to decreasing transistor size and increasing complexity of systems. Redundancy methods that take into account the infrequent nature of these upsets are needed so that the errors can be mitigated without severely impacting system performance. This paper has presented a few possible low-impact mitigation methods that use support logic and/or partial reconfiguration.
In the future, we will be stepping back from the mitigation methods to focus on the reliability of mitigation. Currently, modeling the reliability of large circuits is very time consuming and complex. We are focusing on building a system that determines the combinational reliability of FPGA implementations of circuits using EDIF and a library of reliability calculations for FPGA cells. We would like to easily and accurately model reliability of large circuit systems on particular FPGA fabrics so that we can compare the relative reliability of different architectures.
We are also interested in more traditional reliability research on how inserting scrubbing or detection/correction methods into FPGA designs can be modeled using Markov methods. Using scrubbing with TMR virtually eliminates soft errors in the presence of radiation, but currently there is no way to empirically determine scrub rates. Scrub rates are based on "rule of thumb" calculations derived from the upset rate. We would like to determine a design's scrub rate using traditional reliability modeling methods.
