Abstract-A comparison of two scrubbing mitigation schemes for Xilinx Field Programmable Gate Array (FPGA) devices is presented. The design of the scrubbers is briefly discussed along with an examination of mitigation limitations. Heavy Ion data are then presented and analyzed.
INTRODUCTION
FPGAs are versatile devices that allow a function to be implemented by mapping it into the FPGA's pre-existing logic (resources).
The mapping is referred to as its configuration. FPGA devices can be sensitive to faults in both their configuration and their pre-existing logic. Hardened electronics and mitigation strategies have been developed to address both concerns. Some FPGA vendors implement hardened device structures (routing networks, clocks, flip-flops, and configuration). Such devices generally have system functional speeds up to 350 MHz and are one-time configurable [1]. Other FPGA vendors are creating commercial devices with advantages such as lower cost, lower core voltage, system speeds reaching 550 MHz, reconfigurability, and increased resource availability [2] [3]. However, inserting these Commercial Off-The-Shelf (COTS) devices into a project requiring fault tolerance may necessitate the inclusion of circuitry to mitigate configuration and logic errors and thus increase the complexity of the overall design.
Because of the harsh environment of space and its effects on electronic devices, the aerospace community has traditionally followed a conservative design methodology, in which devices selected for flight (referred to as SEU-hardened) are optimized for operation in mild to harsh radiation environments [1] [10]. Unfortunately, current anti-fuse-based FPGA parts are expensive and leave no room for mistakes or changes since the anti-fuse is one-time programmable. This has led space-system architects and researchers to investigate alternative approaches to design. One aspect of this investigation has been to determine how flight projects can capitalize on the capabilities of cutting-edge COTS FPGA devices despite their greater susceptibility to upsets.
One category of COTS devices being considered is that of reconfigurable SRAM based FPGAs.
Xilinx (Virtex-II, Virtex-4, and Virtex-5) SRAM-based FPGAs are prime reconfigurable devices in aerospace research. Such devices can be reprogrammed because their configuration is stored in SRAM (vs. fixed configuration types such as anti-fuses). The implementation advantages are decreased cost, increased speed, increased resource availability, and agility. A caveat is that the configuration memory is not radiation-hardened and is susceptible to faults [4] [5] [9]. SRAM-based FPGAs can incur upsets in their configuration and in their functional logic paths [7] [8] (flip-flops, SLICE Look-Up Table (LUT) transistors, counter logic, etc.). However, the configuration memory has the prevailing error cross-section, being susceptible terrestrially and in even the mildest space radiation environments [5].
Proposed SRAM-based mitigation techniques for configuration vs. functional logic are very different and are not easy to implement.
The purpose of configuration mitigation is to reduce error duration and to prevent upset accumulation within configuration memory. The mitigation requires the system to have the ability to write correct configuration data (referred to as the configuration bitstream). In order to correct faulty bits within the configuration memory, there must be a port of access. Xilinx SRAM-based devices contain two available interfaces for configuration memory access: JTAG (serial) and SelectMAP (parallel). These ports can be used for both configuration and correction. Due to the faster speed of the parallel interface, mitigation schemes tend to utilize the SelectMAP mode. Configuration mitigation can be categorized as follows:
Category 1 Reconfiguration: Apply a full configuration to the FPGA (periodically or upon detection of an error state). This requires the following steps to be taken: a) System operation comes to a complete stop. Clocks are stopped and the FPGA becomes "un-configured". Care must be taken to protect connected board-level components. b) The original bit-stream is loaded into the configuration memory. c) The system is brought back to an operational state. Returning to the previous state of operation can only be accomplished by having redundant devices or a monitored system (e.g. a host monitor that can control the state of the reconfigured FPGA).
Category 2 Scrubbing: Scrubbing is transparent with regard to functional operation (i.e. the system does not come down during scrubbing). It is a process of writing over portions of the configuration memory in a way that does not disrupt operation. The configuration bits that are corruptible during scrubbing and can interrupt operation are (nomenclature is specific to Xilinx Inc.; reference [3] for further detail): SRL16s (16-bit shift registers), distributed RAM, and BRAM (block RAM). To avoid corrupting these bits the scrubber must [9]: a) Set the "GLUTMASK_B" bit in the Xilinx "CTL" register to a logic 0 in order to protect SRL16s and distributed RAM bits. b) Be designed so that it does not write into utilized BRAM blocks. SRL16s, distributed RAM, and BRAM can be protected from the scrubber; however, they are not protected from faults. If the circuitry utilizes these resources, additional mitigation beyond the scrubber should be included in the design.
The caveat is that some portions of configuration memory, hidden internal circuitry, and interface controls are not able to be written ("scrubbed") and are therefore still susceptible to upsets (i.e. Power On Reset (POR), Digital Clock Managers (DCMs), and the SelectMAP interface).
Upon faults to inaccessible configuration space or upon uncorrectable functional logic upsets, the system will have to come down and a category 1 reconfiguration will have to be performed. This paper will focus on category 2 (scrubbing) configuration-upset mitigation schemes. A comparison of implementation, performance and error cross-sections between a self-contained category 2 scrubber developed by Xilinx [4] vs. an external category 2 scrubber developed by the NASA/GSFC Radiation Effects and Analysis Group is presented. The research was initiated by the NASA/GSFC "Space-Cube" flight project [16] . All investigations and analysis concerning the Xilinx Internal scrubber are specific to the proposed Intellectual Property (IP) Core usage in the Xilinx user guide "Single Event Upset (SEU) Detection and Correction using Virtex 4 Devices" [4] because this is the actual architecture implemented in Space-Cube. The REAG external scrubber developed by NASA/GSFC was utilized as a reference point to determine if advantages in implementing the internal scrubber exist.
I. SELF-CONTAINED XILINX IP CORE SCRUBBER IMPLEMENTATION
A. General Description
The top-level block diagram of the Xilinx internal scrubber is illustrated in Figure 1. It is an IP core available from Xilinx and is named SEU_cntlr. Detailed operation and control are described in [4]. The hardware for the SEU Controller block comprises several sub-modules as illustrated in Figure 2. During the scrubbing process, the configuration bit-stream is constructed using the SEU_cntlr sub-modules and sent to the configuration memory via the Internal Configuration Access Port (ICAP). ICAP can be accessed only by circuitry internal to the Xilinx device. It was developed so that the Virtex internal circuitry can access the SelectMAP I/O ports (traditionally only accessible externally). The "Frame_ECC" block is responsible for error detection and correction (EDAC). It uses a Hamming code Single Error Correct and Double Error Detect (SECDED) scheme [4].
B. Xilinx Internal Scrubber: SEU Correction Operational Flow
The configuration memory space is divided into frames containing a corresponding parity syndrome (also stored in configuration memory) [4] .
Once the SEU controller commences operation, frames are read one at a time and a SECDED ECC check is performed [4]. If a single error within a frame is detected, the SEU_DETECT flag (Figure 1) goes high; after the correction is calculated and the corrected frame is written into configuration memory, SEU_DETECT goes low. If there is a double-bit error, the ECC circuitry raises the SCAN_ERROR flag (see Figure 1).
C. SEU Correction and Detection Failure Modes
The SEU_cntlr Controller Core (see Figure 2) is a processor (Xilinx PicoBlaze IP core). Its correction time is on the order of 100 ms. The Xilinx internal scrubber has two general modes of failure: 1) Multiple-bit faults cannot be corrected and have the potential to cause invalid bit-streams to be written to the configuration memory. 2) The IP core does not contain functional mitigation and is mapped utilizing configuration memory. Although it is scrubbed (it scrubs itself, except for its BRAM), an error in the configuration memory or an error in the functional logic of the scrubber can cause the scrubber to malfunction. A scrubber malfunction in this mode can also cause corrupted configuration bits to be written to configuration memory. The following are some specific examples of malfunction [4] due to uncorrectable multiple-bit configuration hits and functional errors within the internal scrubber: 1) The ICAP interface can become inoperable. 2) ECC correction can malfunction. 3) BRAM utilized by the IP core can become corrupted. 4) The PicoBlaze processor (Controller Core) can malfunction or become dead-locked. Malfunctions can be detrimental to the point of writing entire frames of incorrect bit-streams. Multiple frames containing a large number of faults can cause internal contention or I/O damage (i.e. an output changing to an input) and thus can potentially cause damaging board-level current spikes.
D. Investigating Internal Scrubber Integrity via Fault Injection
The SEU_cntlr IP core contains a fault injection option. This option was used to investigate the effectiveness of the double error detection circuitry. First, single-bit errors (one bit per frame) were injected into each frame. All were corrected. When double-bit errors were injected within one frame, all were detected and not corrected. Several variations of multiple-bit (greater than 1 error) injections were investigated by varying the separation of bit addresses (thus varying error patterns within memory). It was observed that all odd-numbered multiple-bit injections were reported as corrected (SCAN_ERROR). Because the SECDED circuitry cannot correct MBUs, this indicates that incorrect frames of configuration can be written into configuration memory when an odd number of bit errors exists within a frame (thus errors can compound over time because of the inability to correct). It was also observed that 4-bit and 8-bit error injections go undetected. The possibility of writing incorrect frames (or of having corrupted frames remain undetected) must be taken into account when analyzing error cross-sections while utilizing the Xilinx SEU controller. Table 1 contains a summary of fault injection results.
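The SECDED behavior reported for the internal scrubber (single errors corrected, double errors detected, larger error patterns miscorrected or missed) can be reproduced qualitatively with a toy code. The sketch below is only an illustration under stated assumptions: it uses a textbook 8-bit extended Hamming word rather than the Virtex-4 Frame_ECC logic, and the names `encode`, `decode`, `inject`, and `survey` are hypothetical helpers, not part of the SEU_cntlr IP core.

```python
from itertools import combinations

def encode(data):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) word."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4                     # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                     # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4                     # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]   # Hamming positions 1..7
    return word + [sum(word) % 2]         # position 8: overall parity bit

def decode(word):
    """Return (status, word_after_decoder_action)."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos - 1]:
            syndrome ^= pos               # XOR of the positions holding a 1
    odd_parity = sum(w) % 2               # 1 means an odd number of flips
    if syndrome == 0 and not odd_parity:
        return "clean", w
    if odd_parity:
        # The decoder assumes exactly one flipped bit and "repairs" it.
        w[(syndrome or 8) - 1] ^= 1
        return "corrected", w
    return "double_error", w              # even flips, non-zero syndrome

def inject(word, positions):
    """Flip the bits at the given 0-based indices (toy fault injection)."""
    w = list(word)
    for p in positions:
        w[p] ^= 1
    return w

golden = encode([1, 0, 1, 1])

def survey(n_errors):
    """Tally decoder outcomes over every n_errors-bit error pattern."""
    tally = {}
    for positions in combinations(range(8), n_errors):
        status, fixed = decode(inject(golden, positions))
        if status == "corrected" and fixed != golden:
            status = "miscorrected"       # decoder wrote back a wrong word
        tally[status] = tally.get(status, 0) + 1
    return tally

for n in (1, 2, 3, 4):
    print(n, survey(n))
```

In this toy code every single-bit pattern is corrected, every double-bit pattern is flagged, every 3-bit pattern is reported as corrected while the written-back word is wrong, and 14 of the 70 possible 4-bit patterns pass as clean. The real frame code is far longer, so the counts differ, but the failure mechanisms are the same ones the fault injection exposed.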
A. General Description
In order to perform general radiation testing of SRAM-based devices, REAG developed a scrubber that was internal to the group's Low Cost Digital Tester (LCDT) [6] [14]. Specifically for testing purposes, the premise was:
1. To emulate space flight missions that would implement an external (to the Xilinx device) scrubber. 2. To isolate the various DUT irradiation responses (such as configuration bit errors and functional faults) from faults that could occur when using an internal scrubber. Figure 3 illustrates the high-level topology of the LCDT, scrubber, and the SRAM-based Device Under Test (DUT). Particularly for Space-Cube project radiation test analysis, the REAG external scrubber was used as a reference point for scrubbing methodology response investigation. It was also considered as a potential alternative approach to configuration memory mitigation for future generations of the Space-Cube project.
B. REAG External Scrubber and Space Flight Implementation
If implemented in a space flight mission, the REAG external scrubber would be implemented in a hardened device. The hardened device can be an ASIC or anti-fuse FPGA.
It is generally assumed that the major advantage of using an internal scrubber is that it does not require additional hardware external to the Xilinx device (saving board area and complexity). However, even without scrubbing, a system with SRAM-based FPGAs already requires additional circuitry in order to implement configuration (and re-configuration). The following are the minimal hardware requirements for (re)configuration: 1) Memory: stores the configuration bit-stream. Usually flash components. 2) Configuration Controller: reads the bit-stream from memory and transfers it to the FPGA configuration memory (through either the JTAG or SelectMAP interface [9]). The controller can be implemented in an ASIC, FPGA, or processor. It is assumed that a flight project will implement the controller in hardened circuitry.
The REAG external scrubber was developed so that if implemented in a flight system, it would be able to reutilize the existing configuration hardware (with slight modifications).
The configuration bit-stream comprises a series of commands followed by the actual configuration mapping data. The configuration controller is expanded to contain a scrubbing mode. While scrubbing, the controller discards the command portion of the original configuration bit-stream and creates its own in order to avoid bringing down the FPGA (and to over-write the interface registers to place them in the appropriate modes for scrubbing). Following the command segment, the original bit-stream that contains the configuration data is ported to the FPGA via the SelectMAP interface. Only a minute extension to the configuration controller is therefore needed to implement a hardened scrubber.
The REAG external scrubber does not use configuration read-back and ECC circuitry in order to correct (contrary to the Xilinx internal scrubber). Instead, the external scrubber (while in correction mode) only performs writes to the configuration memory. The golden (modified) configuration bit-stream is periodically written to the configuration memory regardless of potential faults as long as the faults do not disrupt configuration interface control. The periodicity of configuration writes is user-programmable.
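The write-only approach can be pictured with a small software model. This is a minimal sketch under stated assumptions, not the REAG controller: configuration memory is modeled as a list of byte frames, and `blind_scrub`, `flip_bit`, and `skip_frames` are hypothetical names introduced only for illustration (the skipped frame stands in for a region the scrubber must not overwrite, such as utilized BRAM).

```python
def flip_bit(frame: bytes, bit: int) -> bytes:
    """Return a copy of frame with one bit inverted (models an upset)."""
    buf = bytearray(frame)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)

def blind_scrub(config_memory, golden, skip_frames=frozenset()):
    """Overwrite each frame with its golden copy; no read-back, no ECC."""
    for idx, frame in enumerate(golden):
        if idx in skip_frames:        # region the scrubber must leave alone
            continue
        config_memory[idx] = frame

# Golden image: 16 frames of 8 bytes each (sizes are arbitrary here).
golden = [bytes([i] * 8) for i in range(16)]
memory = list(golden)

# A three-bit upset in frame 3 and a single-bit upset in frame 7.
memory[3] = flip_bit(flip_bit(flip_bit(memory[3], 0), 9), 17)
memory[7] = flip_bit(memory[7], 5)

blind_scrub(memory, golden, skip_frames={7})
print(memory[3] == golden[3], memory[7] == golden[7])
```

Because the pass never inspects what it overwrites, the number of flipped bits in a frame is irrelevant to the repair; the skipped frame stays corrupted, which mirrors the point that protected resources still need their own mitigation.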
C. External Scrubber Integrity
Concerning only the SRAM-based Xilinx device, MBUs and potential internal scrubber faults are no longer a factor. It is important to note that the Xilinx SelectMAP interface and configuration control registers are still susceptible. However, their circuit area is small and has a relatively low probability of being upset. The hardened device that controls scrubbing is also susceptible. However, (depending on the hardened device) its sensitivity level will be lower (on the order of 1E-4 faults/day [10] [12]).
III. RADIATION TESTING
Heavy-Ion SEE radiation testing was performed on both the Xilinx IP Core internal scrubber and the REAG external scrubber. The Device Under Test (DUT) was the Xilinx Virtex-4 LX25.
SEE tests were performed at the TAMU cyclotron with 25 MeV/AMU Argon. At normal incidence the LET in the active part of the die was 5.7 MeV·cm²/mg. At 45 degrees incidence, the LET was 8.1 MeV·cm²/mg. The following sections describe the selected design that was implemented in the Virtex-4 LX25, the test methodology, and the tester utilized to control and analyze DUT functionality.
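The two LET values are consistent with the cosine-law approximation commonly used in heavy-ion testing, in which tilting the device lengthens the ion path through the sensitive layer. The paper does not state the formula, so the short check below is an assumption about how the 45-degree value relates to the normal-incidence value; `effective_let` is a hypothetical helper name.

```python
import math

def effective_let(let_normal, tilt_deg):
    """Effective LET at a tilt angle, using LET_eff = LET / cos(theta)."""
    return let_normal / math.cos(math.radians(tilt_deg))

let_45 = effective_let(5.7, 45.0)   # normal-incidence LET in MeV·cm²/mg
print(round(let_45, 1))             # rounds to the 8.1 quoted in the text
```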
A. Implemented Design
NASA-GSFC REAG tests many types of FPGA devices. To perform a direct comparison among FPGA error cross-sections, the windowed output shift register architecture was implemented as the Design Under Test (DUT) [4].
TABLE 2: SUMMARY OF XILINX VIRTEX-4 LX25 DEVICE LOGIC UTILIZATION AND DISTRIBUTION FOR THE 6 WINDOWED SHIFT REGISTER IMPLEMENTATIONS
TABLE 3: SUMMARY OF XILINX VIRTEX-4 LX25 DEVICE LOGIC UTILIZATION AND DISTRIBUTION FOR THE 6 WINDOWED SHIFT REGISTER IMPLEMENTATIONS INCLUDING THE XILINX INTERNAL SCRUBBER
Table 2 and Table 3 contain brief summaries of resource utilization for the shift register implementation without the Xilinx internal scrubber and with the internal scrubber, respectively.
Inserting the internal scrubber increases the resource utilization by between 1% and 2%. This increase is negligible and thus eases the insertion process for the designer during the logic mapping and place-and-route phases of implementation.
All shift registers were run at 100 MHz. All 6 of the shift register strings were contained in one FPGA device. The 6 shift registers were implemented as one design without the Xilinx IP Core internal scrubber and as another design containing the Xilinx IP core internal scrubber.
B. The Tester
A daughter board was constructed for the Virtex-4 LX25 Xilinx devices.
The LX25 board interfaced to the NASA/GSFC Low Cost Digital Tester (LCDT) [8] through high-speed micro-strip connectors.
The LCDT provided clock, data, and reset inputs to the DUT. The LCDT was also responsible for capturing the outputs from the DUT, analyzing the DUT response, and reporting errors to the user-Host computer (refer to Figure 5 ).
The following were some of the tester requirements: a) Process commands from the host. Commands were used to set test parameters, start tests, and halt tests. b) Supply variable DUT clock signals ranging from 1 MHz to 100 MHz. c) Supply the following DUT data input patterns upon request: Static 0, Static 1, and alternating 1's and 0's (checker-board). d) Supply a reset to the DUT. e) Capture DUT outputs at every shift clock cycle (shift clock was 1/4 the speed of the DUT system clock). See Figure 5 and please reference [10] for more details on the windowed shift register architecture and test methodology. f) Determine if DUT output was in error. If in error, the tester sent the DUT outputs and timestamp to the host computer for further processing. g) Have the ability to configure and scrub the DUT through the SelectMAP interface. h) Scrub configuration memory with a frequency of at least 25 MHz and have the ability to halt scrubbing. i) Perform scrubbing only on the architectures that do not contain the Xilinx internal scrubber. j) Control the inputs and monitor the outputs of the Xilinx internal scrubber for the associated DUT architectures and test runs. k) Read back configuration memory during and after each radiation test.
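$$\sigma = \frac{NE}{TFL} \qquad (1)$$
where $\sigma$ is the error cross-section, NE is the number of error events, and TFL is the total reported fluence.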
Because possible error events can be masked during a burst period, an adjustment to the reported fluence must be made for accurate cross section calculations.
The proposed modification to equation (1) handles a burst error as if it were one event. The adjustment to the formula replaces the total reported fluence (TFL) with the difference between TFL and the fluence accumulated while in burst (TB*FLUX).
$$\sigma = \frac{NE}{TFL - (TB \times FLUX)} \qquad (2)$$
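The burst adjustment described above is simple arithmetic, and a small helper makes the effect visible. This is a minimal sketch: the function name `cross_section` and the sample numbers are hypothetical and are not taken from the test data; with zero time in burst the function reduces to equation (1).

```python
def cross_section(num_events, total_fluence, time_in_burst=0.0, flux=0.0):
    """Per-device cross-section (cm²) with the burst-masked fluence removed.

    num_events    -- NE, number of error events (a burst counts as one)
    total_fluence -- TFL, total reported fluence in particles/cm²
    time_in_burst -- TB, seconds the DUT output spent in burst error
    flux          -- FLUX, particles/(cm²·s)
    """
    effective_fluence = total_fluence - time_in_burst * flux
    if effective_fluence <= 0:
        raise ValueError("burst time exceeds the reported exposure")
    return num_events / effective_fluence

# Hypothetical run: 20 events, 1.0E5 particles/cm², 200 s in burst at
# 100 particles/(cm²·s), so 2.0E4 particles/cm² arrived while masked.
unadjusted = cross_section(20, 1.0e5)
adjusted = cross_section(20, 1.0e5, time_in_burst=200.0, flux=100.0)
print(unadjusted, adjusted)
```

Removing the masked fluence shrinks the denominator, so the adjusted cross-section is always at least as large as the unadjusted one.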
The DUT system clock supplied to the shift registers was equal to 100 MHz. The system clock was always active during irradiation (classified as dynamic testing). Because a simple shift register architecture was implemented and was shifted at each clock cycle, every DFF resource utilized within the design was observable during testing. The probability of incurring a fault is dependent on both the configuration memory and the device logic.
While calculating an error cross section per bit, care must be taken because the upset probabilities of configuration memory bits and of DFFs (logic bits) are not the same, and logic faults are generally masked by the configuration errors. Simply normalizing the error count by total bit count does not take this phenomenon into account.
Because of the potential discrepancy, the error cross-sections are calculated per device. In this case, errors can only be masked by bursts and are not masked due to complex functionality. There exist two categories of errors: 1) Burst: Errors occurring for a long period of time, generally due to configuration bit faults. The burst of errors will remain active until the configuration bit is corrected (depends on scrubbing methodology) and the design has been functionally corrected. 2) Single Point: A single point fault is generally due to an upset within the functional logic. The design chosen for this experiment reflects single point faults as an error occurring for one DUT system clock cycle (or a small number of clock cycles). When the output of the DUT is stuck in a burst state, other potential errors will be masked. Due to the long duration of functional error, a true cross-section cannot be accurately calculated by the traditional method of number of events divided by fluence (see equation (1)).
Tests were run until a fluence between 1.0E5 and 1.0E6 was reached or the functionality was observed to be unrecoverable from error. The external scrubber operated at 25 MHz. The scrubber cycled through configuration memory approximately every 40 ms (time is specific to the Virtex-4 LX25). Correction was accomplished when the scrubber reached a bit that was in error. The average time in functional burst error was calculated to be approximately 16 ms. Tests were run with low flux (approximately 100 particles/(cm²·second)) in order to not overwhelm the scrubbing mitigation. The Device Malfunctional cross-section was calculated using equation (2).
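The distinction between burst and single point errors can be sketched as a post-processing step over the tester's per-cycle error flags. This is only an illustration: the paper does not give the tester's actual classification rule, so the function name `classify_error_events` and the `burst_threshold_cycles` cutoff (8 cycles here) are assumptions chosen for the example.

```python
def classify_error_events(error_flags, clock_hz, burst_threshold_cycles=8):
    """Group consecutive error cycles into events and total the burst time.

    error_flags -- iterable of booleans, one per captured clock cycle
    clock_hz    -- capture clock frequency in Hz
    Returns (events, time_in_burst_seconds); each event is (kind, cycles).
    """
    events = []
    run = 0
    # A trailing False closes any run still open at the end of the capture.
    for flag in list(error_flags) + [False]:
        if flag:
            run += 1
        elif run:
            kind = "burst" if run > burst_threshold_cycles else "single_point"
            events.append((kind, run))
            run = 0
    burst_cycles = sum(n for kind, n in events if kind == "burst")
    return events, burst_cycles / clock_hz

flags = [False, True, False, False] + [True] * 20 + [False, True, True, False]
events, time_in_burst = classify_error_events(flags, clock_hz=100e6)
print(events, time_in_burst)
```

Each run of consecutive error cycles counts as one event, which matches the treatment of a burst as a single event, and the accumulated burst time is the TB term used in equation (2).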
The internal scrubber operated at 100 MHz. The flux was approximately 100 particles/(cm²·second). The algorithm for correction is complex: a frame is read, SECDED is calculated, and if the frame is in error, the processor portion of the scrubber (PicoBlaze) writes to the configuration memory. The correction write via the PicoBlaze took approximately 100 ms. Therefore, time in functional burst was longer than with the external scrubber because time to correct was on the order of hundreds of milliseconds (time in burst includes read-back). The Device Malfunctional cross-section was calculated using equation (2).
D. Cross Section Analysis
The performance (error cross-section) of the external and internal scrubbers was relatively similar. This cross-section does not reflect time to unrecoverable failure; it only reflects recoverable malfunction during operational time. Accordingly, although the cross-sections did not differ greatly in value, the external scrubbing was always recoverable without the need for a reset or power cycle (specifically for the shift register DUT design), whereas the internal scrubbing was never recoverable, i.e. faults occurred that were uncorrectable and a reconfiguration was always necessary. Such results suggest that the internal scrubber consistently reached a state where it could no longer correct (either the scrubber circuitry getting hit or MBUs occurring).
E. Resource Analysis
Resource analysis was performed by reading back the configuration memory after every irradiation test. The NASA/GSFC external scrubber had the best performance, as expected (because it is not dependent on SECDED circuitry and it is hardened). The design that used external scrubbing incurred zero interconnect errors (resources not including BRAM) at the end of each test run as illustrated in Figure 3. The design containing the internal scrubber would generally finish with approximately 100 to 500 interconnect errors. This was due to the fact that unrecoverable functionality with the internal scrubber tests was always reached before the target fluence of 1E5 was obtained. If the circuit is in an unrecoverable state, this generally means that the FPGA has become un-configured. These results suggest that the internal scrubber incurred faults and thus corrupted configuration memory (or ceased to function).
IV. SUMMARY AND CONCLUSIONS
Space flight projects are investigating the insertion of COTS FPGAs into complex systems in order to take advantage of their state-of-the-art features.
Currently, system reconfigurability is one of the major benefits of commercial devices. To properly utilize COTS devices in highly reliable systems, mitigation must be designed into system hardware that can correct configuration and functional faults. The work presented in this paper focused on SRAM-based FPGA configuration mitigation. Two Xilinx configuration memory mitigation schemes (scrubbers) have been presented: one developed by the device manufacturer, Xilinx (self-contained, internal scrubber), and one developed by NASA/GSFC REAG (external scrubber).
The Xilinx IP-Core scrubber uses read-back in conjunction with SECDED ECC to detect and correct errors.
It is implemented internally to the Xilinx device. This scheme is inherently limited; however, it may still be effective for more benign space radiation applications. Its limitations were investigated via fault injection, which revealed false correction reports and missed error detection (in the case of MBUs in configuration memory). Further research concerning the Xilinx internal scrubber was performed with SEE radiation testing.
Reading the configuration memory after irradiation suggested that the internal scrubber corrupted configuration memory. A reconfiguration was necessary after every radiation test when the DUT was utilizing the internal scrubbing.
The REAG external scrubber uses a golden configuration bit-stream with no frame by frame read-back and no ECC circuitry.
It simply periodically writes the correct configuration bit-stream to configuration memory. Because no SEE test resulted in an unconfigured design, the REAG design demonstrated improved performance over the Xilinx internal scrubber.
Both scrubbing mitigation techniques increased average time to configuration corruption. The internal scrubber met the reliability requirements of Space-Cube when used in conjunction with a category 1 reconfiguration scheme. Although the external scrubber had dramatically better SEE performance results, the internal scrubber's advantage was that the IP core was readily available and easy to implement.
