Abstract-This paper describes an FPGA configuration scrubbing approach for Xilinx 7-Series FPGAs that combines the high-speed internal scrubbing available within these devices with an external scrubber. The internal scrubbing unit continuously monitors the frames of the FPGA configuration memory and corrects single-bit frame errors; it is also used to detect multi-bit frame errors. Multi-bit upsets are repaired by means of a secondary scrubbing mechanism that is primarily external to the FPGA fabric. This Xilinx 7-Series hybrid configuration scrubbing architecture scans 25,636,224 bits of the XC7Z020 device in several microseconds and detects upsets within 8 ms and then corrects most multi-cell upsets in under an additional 6 ms. This configuration scrubber was validated with configuration fault injection and neutron radiation testing.
I. INTRODUCTION
SRAM Field Programmable Gate Arrays (FPGAs) are flexible semiconductor devices that allow a variety of applications to be implemented in hardware logic that can later be upgraded or modified after deployment in the field through device reconfiguration. To support in-field reconfiguration, SRAM FPGAs contain a large array of static memory cells that define the operation of the look-up tables, routing switches, block memories, and I/O resources. This memory, called the configuration memory, can be changed by loading new configuration data into the device through JTAG or another configuration interface.
The static memory cells used to define the configuration memory (CRAM) are susceptible to radiation effects, particularly when used in aerospace applications involving harsh radiation environments [1]. High-energy particles may strike a transistor within a CRAM cell and change its logical state. Known as Single Event Upsets (SEUs), these upsets within the configuration memory may disrupt proper functionality of the FPGA design and cause undesirable behavior. Since FPGAs include millions of CRAM cells, addressing CRAM upsets is a vital concern when using FPGAs in high-radiation environments.
This paper presents a technique for configuration scrubbing for the Xilinx 7-Series FPGA families that provides much higher speed than external configuration scrubbers and is able to repair all upset types. This technique combines two forms of scrubbing: internal scrubbing using the Readback CRC feature of the 7-Series FPGA families and external scrubbing through JTAG or another configuration interface. The internal Readback CRC provides high-speed repair for single-bit upsets and high-speed detection for multi-bit upsets. The external scrubber provides slower, yet robust repair for multi-bit upsets. This scrubber was verified with both fault injection and radiation testing in a high-energy neutron source.
II. CONFIGURATION SCRUBBING
Configuration scrubbing involves periodically "repairing" the contents of the configuration memory that may have changed through ionizing radiation. Configuration scrubbing typically involves reading the CRAM memory, detecting CRAM upsets, and using dynamic partial reconfiguration to repair these upsets [2]. Configuration scrubbing on Xilinx 7-Series FPGAs does not interrupt normal FPGA functionality and may operate continuously if needed.
Configuration scrubbing can be implemented in a number of different ways including dedicated hardware circuits, software running on a nearby processor, and internal FPGA logic. The collection of hardware and software components used to implement configuration scrubbing is often called the "scrubber" and is responsible for managing the proper state of the FPGA during the operation of the system. In addition, some configuration scrubbers are also responsible for detecting and responding to single-event functional interrupts or SEFIs. A variety of different configuration scrubber architectures have been created that provide a differing set of trade-offs, including size, power consumption, speed, and robustness ([3]-[7]).
A common approach for configuration scrubbing is to supplement the FPGA system with a custom external circuit that reads and writes the configuration memory. As shown in Figure 1 (a), an external circuit is attached to the configuration interface of the FPGA (SelectMAP in this case) and issues the appropriate commands to perform configuration readback or configuration writing. This external circuit can be implemented in a technology that is less susceptible to SEUs such as an antifuse or FLASH-based FPGA. This configuration architecture may contain a memory external to the FPGA that holds the "golden" configuration data. The configuration controller will perform a readback on the FPGA and compare the readback data with the "golden" configuration data. When the readback data differs from the "golden", an SEU is identified and a configuration scrub is performed.
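The readback-and-compare loop described above can be sketched in a few lines. This is only an illustrative model: the device is represented as an in-memory list of frames, and the function name `external_scrub_pass` and the four-frame "bitstream" are assumptions made for the example, not part of any vendor interface.

```python
FRAME_BYTES = 101 * 4  # one 7-Series frame: 101 words of 32 bits


def external_scrub_pass(device_frames, golden_frames):
    """Read back every frame, compare it with the golden copy, and
    rewrite any frame that differs. Returns the repaired frame indices."""
    repaired = []
    for addr, golden in enumerate(golden_frames):
        if device_frames[addr] != golden:      # models readback + compare
            device_frames[addr] = golden       # models a single-frame rewrite
            repaired.append(addr)
    return repaired


# Hypothetical four-frame configuration with one injected bit flip.
golden = [bytes([i]) * FRAME_BYTES for i in range(4)]
device = list(golden)
corrupted = bytearray(device[2])
corrupted[10] ^= 0x01                          # flip one configuration bit
device[2] = bytes(corrupted)

repaired_frames = external_scrub_pass(device, golden)
print(repaired_frames)
```

Because every frame must be read and compared on every pass, the detection time of this style of scrubber grows with the size of the configuration memory.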
Another style of configuration scrubbing uses internal resources to perform scrubbing. Some FPGAs contain internal configuration ports such as the Internal Configuration Access Port (ICAP) or, on devices such as the Zynq family SoCs, the Processor Configuration Access Port (PCAP) [8]. These ports allow the device to perform configuration readback and configuration scrubbing from within the FPGA [9]. The advantage of internal scrubbing is that it uses fewer external resources (and is therefore much less costly) and is usually faster due to its close physical proximity to the configuration memory. The disadvantage is that the internal scrubbing circuit is susceptible to radiation effects and may suffer from more SEFI failures than an external scrubber.
The Soft Error Mitigation (SEM) IP core implements a style of internal configuration scrubbing similar to the approach described in this paper [10]. This core, provided by Xilinx, exploits the internal scrubbing feature of the FPGA and provides the ability to augment the internal scrubbing feature with other scrubbing functionality through the ICAP. Implemented as logic within the FPGA, this core provides reporting, fault injection, and a number of scrubbing modes that are accessible through a serial UART interface.
The scrubber described in this paper uses a combination of the external scrubbing and internal scrubbing styles described above. It is called a "hybrid" scrubber since it combines an internal scrubber with an external scrubber to take advantage of the benefits of both approaches. The internal scrubber provides high-speed scrubbing that repairs single-bit upsets and detects most multi-bit upsets. The external scrubber uses limited hardware to repair those upsets that cannot be repaired by the internal scrubber. Although slower than the internal scrubber, it can detect and repair all upset types.
III. XILINX 7-SERIES CONFIGURATION AND ERROR DETECTION
To understand the hybrid scrubbing approach described in this paper, it is necessary to understand several important details about the configuration mechanism of the 7-Series FPGA devices [11]. The atomic unit of configuration is a configuration frame made up of 101 × 32-bit words (3,232 bits). Each configuration frame defines the operation of the FPGA at a specific tile region within the device. Each configuration frame is assigned a unique frame address (FRAD) and device configuration involves writing configuration frame data to specific frame addresses. These frames are divided into a number of regions or "blocks", each having a specific configuration purpose. For instance, block 0 frames define the logic and routing of the device and block 1 frames are used to configure the Block RAM and its contents. Two additional blocks, block 2 and block 3, are also present within the device but the purpose of their contents is not known. Because the contents of the BRAM memories change during operation, only block 0 frames typically undergo configuration scrubbing.
Configuration of the FPGA device occurs by setting an internal frame address register (FAR), indicating the number of frames to write, and then writing frame data into the device configuration port. The configuration frames can be configured all at once (through full configuration) by sending multiple frames in sequential frame addresses or can be configured individually by writing a single frame at a single frame address. Configuration frames can also be read back in a similar manner using the same configuration interface.
The data integrity of an FPGA configuration memory is essential and all FPGA manufacturers provide a variety of strategies for preserving this integrity. The 7-Series FPGA family employs several complementary techniques for ensuring the correctness of the configuration data. These techniques include a global cyclic-redundancy check (CRC), a frame-based error correction code (ECC) word, and the Readback CRC scan. The purpose of each of these techniques is described below.
A. Global CRC
Cyclic-redundancy check (CRC) codes are frequently used for detecting errors within large blocks of data. CRC codes are useful for detecting multiple errors within a data block and are relatively easy to compute in hardware. FPGAs use CRCs to check the validity of a configuration bitstream during device configuration and readback. For the 7-Series FPGAs, a single 32-bit CRC word is provided with each complete bitstream to validate the full bitstream during configuration and during operation through device readback. The global CRC can be used to prevent a faulty configuration bitstream from being loaded in an unconfigured FPGA.
The advantage of the global CRC is that it is relatively robust at detecting errors anywhere within the bitstream and can detect multi-bit upsets more effectively than other common or localized error detection strategies. The disadvantage of the global CRC is that it takes a relatively long time to compute: the entire bitstream must be read before the global CRC value can be computed and compared. This increases the detection time of configuration errors. The hybrid scrubber described in this paper uses the global configuration memory CRC as the final line of defense for error detection. Any errors not caught in or introduced by other error detection/correction mechanisms will be caught by the global CRC [12].
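The trade-off of a global CRC can be illustrated with a short sketch. It uses Python's `zlib.crc32` purely as a stand-in 32-bit CRC; the polynomial and bit ordering used by the device are not assumed to match, and the eight-frame "bitstream" is hypothetical.

```python
import zlib

FRAME_BYTES = 101 * 4  # 101 words x 32 bits per frame


def global_crc(frames):
    """Running 32-bit CRC over every frame, in address order."""
    crc = 0
    for frame in frames:
        crc = zlib.crc32(frame, crc)
    return crc


golden = [bytes([i]) * FRAME_BYTES for i in range(8)]
expected_crc = global_crc(golden)

# Inject a single bit flip into frame 5.
upset = list(golden)
damaged = bytearray(upset[5])
damaged[200] ^= 0x10
upset[5] = bytes(damaged)

crc_error = global_crc(upset) != expected_crc
print(crc_error)
```

The comparison reports that something changed, but the single 32-bit result carries no frame address, and it is only available after every frame has been read.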
B. Frame ECC
In addition to the global CRC, the 7-Series FPGA family provides local error detection and correction at the frame level [11]. Each 101-word configuration frame contains a 32-bit word containing the error correction code (ECC) word to aid in the detection and correction of errors within the configuration frame. The error correction strategy used for these frames is a conventional single-error correction, double-error detection (SECDED) Hamming code.
As configuration frames are written to or read from the device configuration interface, an ECC syndrome is computed for the frame of the configuration data. If no errors are found, this syndrome is zero. If single-bit errors are found, the syndrome is non-zero and the syndrome result can be used to determine the type and location of the error. The results from this syndrome calculation are available to FPGA designs through the FRAME_ECCE2 primitive. This primitive provides the frame location of the syndrome computation, the location of the error (for single-bit errors), and an indication of single versus double-bit errors.
The primary benefit of the internal frame ECC is that its output signals provide a window into the scrubbing status of the Readback CRC mechanism (described in the next section). The frame ECC is the component that allows an external scrubber to know where errors have occurred and also is useful for reporting the location of single-bit errors.
A limitation of the SECDED code used in the frame ECC is that odd, multi-bit errors in a frame will alias as a single-bit error. The SECDED will interpret these odd, multi-bit errors as a single-bit error and will try to "repair" this upset bit causing an additional error to be introduced into the frame.
The challenge of odd multi-bit errors in the same frame is seen in Figure 2. The top example demonstrates a configuration frame with no errors. The second example demonstrates a configuration frame with three adjacent errors. A syndrome is computed for this corrupted configuration frame and identifies a configuration bit unrelated to the three-bit upset. Although the probability of three upsets in the same frame due to the same ionizing particle is low, such upsets have been observed with heavy ion testing [13], [14].
As shown in Figure 2, the syndrome is used to "repair" a single bit in the frame of 3,232 bits and thus satisfy the SECDED code. This "repair" introduces another error into the frame, resulting in a configuration frame with four errors that is subsequently not detectable by the SECDED code. Although not detected within the frame by the SECDED code, the presence of these four errors will be detected by the global CRC mechanism at the end of the internal scrubbing cycle as described in the previous section.
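The aliasing effect can be reproduced with a textbook extended Hamming model. This is a minimal sketch under stated assumptions: bit positions are 1-based codeword indices, the syndrome of an error pattern is the XOR of the flipped positions, and the overall parity flag is the error count modulo two. The actual bit-to-syndrome mapping of the 7-Series frame ECC is not assumed to match, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor


def syndrome_of(error_positions):
    """Syndrome and overall-parity flag produced by a set of flipped bits."""
    syndrome = reduce(xor, error_positions, 0)
    parity_odd = len(error_positions) % 2 == 1
    return syndrome, parity_odd


def secded_repair(error_positions):
    """Apply the SECDED 'repair' rule and return the flipped bits that remain."""
    errors = set(error_positions)
    syndrome, parity_odd = syndrome_of(errors)
    if parity_odd and syndrome != 0:
        errors ^= {syndrome}  # flip the bit that the syndrome points at
    return errors


single = secded_repair({57})             # a true single-bit upset
triple = secded_repair({100, 101, 102})  # three adjacent upsets

print(sorted(single))
print(sorted(triple))
print(syndrome_of(triple))
```

In this model the single-bit upset is removed, while the three-bit upset gains a fourth flipped bit whose position satisfies the code, so the frame afterwards shows a zero syndrome and even parity.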
C. Readback CRC
The 7-Series FPGAs also contain an internal built-in hardware mechanism for performing continuous readback during normal device operation through a mechanism called the "Readback CRC". Dedicated circuitry within the device performs readback sequentially and computes the ECC syndrome of each frame. Single-bit errors can be corrected and most multi-bit upsets are detected. In addition, the Readback CRC is the circuit that computes the global CRC of the bitstream while reading the frames sequentially. The final CRC value computed after reading back all frames is compared against a pre-computed CRC value to identify any error conditions that are not found through the individual frame ECCs. Because the Readback CRC is implemented in dedicated circuitry, it can operate much more quickly than traditional external readback mechanisms which often require additional FPGA resources.
IV. HYBRID SCRUBBING ARCHITECTURE
The hybrid scrubbing architecture presented in this paper combines two separate configuration upset repair mechanisms to achieve higher performance and robustness. The internal 7-Series Readback CRC mechanism is used to correct single-bit upsets (SBUs) and detect the presence of multi-cell upsets (MCUs). An external scrubber is then used to correct the MCUs that are identified but not correctable by the Readback CRC. The two mechanisms combined take advantage of the Readback CRC's high-speed performance, while maintaining the complete robustness of an external readback scrubber.
Figure 3 provides a high-level diagram of the structure of this hybrid scrubbing architecture. The inner core of the scrubber is the internal FPGA Readback CRC repair mechanism. This inner core is responsible for continuous readback of the configuration memory to correct single-bit errors and detect multi-bit errors. The outer core is responsible for watching the error events generated by the FRAME_ECCE2 primitive and deciding when an external scrubbing event must occur. When an error event occurs that cannot be fixed by the internal Readback CRC core, the outer core will access a configuration port to repair the error.
The outer core can be implemented in a number of different ways depending on the type of device used and the requirements of the scrubber. No matter how the outer scrubber is implemented, it will contain the following four components: the FRAME_ECCE2 primitive, an Error Event FIFO, Scrubbing Logic, and memory to store the golden bitfile. Part of the outer scrubber must be implemented in the programmable fabric of the FPGA, including the FRAME_ECCE2 primitive and the Error Event FIFO. The FRAME_ECCE2 primitive is instantiated within the FPGA to provide access to the error conditions identified by the internal Readback CRC. A FIFO is also added within the programmable logic to capture all error conditions indicated by the FRAME_ECCE2 primitive. The FIFO stores multiple sequential error events because the ordering and number of error events provide essential information necessary when choosing the proper error response.
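A software view of the Error Event FIFO might look like the following sketch. The record fields mirror the signals named in this paper (ECCERROR, ECCERRORSINGLE, CRCERROR, the error frame address, and the syndrome); the class names, the depth of 16, and the overflow policy are assumptions made for illustration only.

```python
from collections import deque
from typing import NamedTuple, Optional


class ErrorEvent(NamedTuple):
    ecc_error: bool         # frame-level ECC error reported
    ecc_error_single: bool  # True when the ECC reports a single (odd) error
    crc_error: bool         # global CRC mismatch reported
    frame_address: int      # frame address captured with the event
    syndrome: int           # raw syndrome value captured with the event


class ErrorEventFifo:
    """Bounded first-in, first-out store that preserves event order."""

    def __init__(self, depth: int = 16):  # depth is a hypothetical choice
        self._events = deque()
        self._depth = depth
        self.overflowed = False

    def push(self, event: ErrorEvent) -> bool:
        if len(self._events) >= self._depth:
            self.overflowed = True  # newest event is dropped and flagged
            return False
        self._events.append(event)
        return True

    def pop(self) -> Optional[ErrorEvent]:
        return self._events.popleft() if self._events else None

    def __len__(self) -> int:
        return len(self._events)


fifo = ErrorEventFifo(depth=2)
first = ErrorEvent(True, True, False, frame_address=0x120, syndrome=103)
second = ErrorEvent(True, True, False, frame_address=0x120, syndrome=0)
fifo.push(first)
fifo.push(second)
dropped = not fifo.push(first)  # third push exceeds the depth
print(len(fifo), dropped, fifo.pop() == first)
```

Keeping the events in arrival order matters because the scrubbing logic distinguishes error types by how many events a frame generates and in what sequence.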
The outer scrubber also contains a memory to store a golden copy of the configuration bitfile. A golden copy of the configuration bitfile is needed to repair the configuration memory when multi-bit upsets occur within a frame or when global CRC errors are identified. A variety of memory technologies could be used to store the golden bitfile including DRAM, SRAM, or non-volatile memory such as FLASH.
The heart of the outer scrubber is the "scrubbing logic". The purpose of the scrubbing logic is to monitor the error events in the Error Event FIFO and to issue an external scrubbing activity when the error signatures in the FIFO indicate an external scrubbing event is necessary. The scrubbing logic can be implemented in a variety of ways such as a software program running in a processor, a hardware circuit within the FPGA, or a hardware circuit outside the FPGA. The details of the error response in the scrubbing logic to these error events are described in the following section.
V. OUTER SCRUBBER DECISION PROCESS
The primary purpose of the outer scrubber is to monitor error events captured by the FIFO and decide when an outer scrubbing activity is necessary. The steps used by the hybrid scrubber are summarized in Figure 4. The decision logic is complex and must respond to a variety of unusual error conditions and situations for proper correction. Each of the important correction steps within this flow chart is described below.
After initializing the scrubber (Step 1), the outer scrubber continuously monitors the error signals generated by the FRAME_ECCE2 module. In particular, the outer scrubber monitors the CRC and ECC error signals (Step 2) to identify global CRC configuration errors or individual frame ECC errors. In most cases, there are no errors and the scrubber remains in this cycle waiting for an error condition. If there is an error, the details of the error event are placed in the error FIFO and the outer scrubber must characterize the error to provide the appropriate response. The following sub-sections describe the primary error types and the outer scrubber response.
A. Even-Numbered Errors
If the ECCERROR signal on the FRAME_ECCE2 is asserted, the Readback CRC hardware has identified an error within a specific frame of the configuration memory. The first check is to determine whether the error was an even or odd upset using the ECCERRORSINGLE output. If an even upset is detected, then the SECDED code cannot repair the upset and the outer scrubber must intervene. Fortunately, the frame address of the error is saved by the FRAME_ECCE2 module and the outer scrubber can repair the upset at the known upset frame location.
When the Readback CRC identifies an even upset, it will continue to scan and compute the global CRC of the existing configuration bitstream. Since even upsets are uncorrectable by the inner scrubber, the global CRCERROR signal will be asserted indicating a corrupt bitstream. This error can be repaired by writing the corresponding "golden" configuration frame back into the device (Step 3) at the frame address specified by the EFAR signal. If details of the even error are needed, a readback is performed on the frame and the golden frame is compared against the error frame to identify exactly how many errors have occurred and the location of the errors.
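When the details of an even error are wanted, the comparison between the golden frame and the readback frame amounts to an XOR of corresponding words. The sketch below assumes each frame is held as a list of 101 unsigned 32-bit integers; the function name `diff_frame` is hypothetical.

```python
WORDS_PER_FRAME = 101


def diff_frame(golden_words, readback_words):
    """Return (word_index, bit_index) for every bit that differs."""
    flipped = []
    for word_index, (g, r) in enumerate(zip(golden_words, readback_words)):
        delta = (g ^ r) & 0xFFFFFFFF
        bit_index = 0
        while delta:
            if delta & 1:
                flipped.append((word_index, bit_index))
            delta >>= 1
            bit_index += 1
    return flipped


golden_frame = [0] * WORDS_PER_FRAME
readback_frame = list(golden_frame)
readback_frame[5] ^= (1 << 3) | (1 << 4)  # a two-bit (even) upset in word 5

upset_bits = diff_frame(golden_frame, readback_frame)
print(upset_bits)
```

The length of the returned list gives the number of upset bits in the frame, and the entries give their locations for logging.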
B. Odd-Numbered Errors
If the FRAME_ECCE2 module indicates an odd error, the response is not as simple as with even errors. There are a variety of unusual error conditions that can generate the "odd error" signal. As described earlier, all odd errors will appear as single-bit errors with the SECDED code used by the Readback CRC. For single-bit errors this is not a problem as the actual single-bit error will be corrected. However, for most location combinations of multi-bit odd errors (3, 5, 7, etc.), the single-bit "repair" will actually introduce an additional error in the system (see Figure 2). The outer scrubbing system must identify this situation and provide a proper repair response.
There are two known types of odd multi-bit error conditions that must be handled: "Out-of-Bounds" errors and "In-Bounds" errors. "Out-of-Bounds" errors are those in which the configuration bit that is being artificially corrupted (through the "repair" process) does not exist in the configuration memory. The SECDED code used by the FRAME_ECCE2 module contains a 13-bit syndrome which can repair any single bit within a set of $2^{(13-1)} - 1$, or 4,095, bits. Since the configuration frame contains only 3,232 configuration bits (101 words), a syndrome may be generated during a multi-bit odd upset detection that falls outside of these 3,232 bits as shown in Figure 5. In this "Out-of-Bounds" case, no change to the configuration frame actually occurs since the bit does not exist. Odd "Out-of-Bounds" multi-bit upsets are relatively easy to detect since the syndrome generated by the FRAME_ECCE2 module can be checked against the acceptable ranges of the 3,232 valid configuration bits (bits 0 through 3,231 of the frame). If the syndrome is outside of this range, the outer scrubber can immediately perform a configuration scrub on the affected frame (Step 3).
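The "Out-of-Bounds" check can be sketched with the same kind of simplified Hamming model, in which the aliased position of a multi-bit odd upset is the XOR of the flipped positions. Note the assumptions: positions here are 1-based codeword indices (so the valid range is 1 to 3,232 rather than bits 0 through 3,231), and the real syndrome encoding of the FRAME_ECCE2 primitive is not assumed to match this model.

```python
from functools import reduce
from operator import xor

FRAME_BITS = 101 * 32                    # 3,232 bits that exist in a frame
ADDRESSABLE_BITS = 2 ** (13 - 1) - 1     # positions a 13-bit SECDED can name


def aliased_position(error_positions):
    """Position a SECDED decoder would 'repair' for an odd multi-bit upset."""
    return reduce(xor, error_positions, 0)


def is_out_of_bounds(position):
    """True when the decoded position does not exist in the frame."""
    return not 1 <= position <= FRAME_BITS


in_bounds_case = aliased_position({100, 101, 102})
out_of_bounds_case = aliased_position({512, 1024, 2048})

print(ADDRESSABLE_BITS, ADDRESSABLE_BITS - FRAME_BITS)
print(in_bounds_case, is_out_of_bounds(in_bounds_case))
print(out_of_bounds_case, is_out_of_bounds(out_of_bounds_case))
```

In this model 863 of the 4,095 addressable positions fall outside the frame, and an upset decoded to any of them can be sent to a single-frame scrub without waiting for the global CRC.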
If the syndrome generated by the odd error is "In-Bounds", then there are three possible classifications of the error: single-bit upsets, multi-bit "In-Bounds" upsets, and masked bit upsets. If exactly two error signatures are seen in the FIFO, then the error is either a single-bit error or a multi-bit odd "In-Bounds" upset. The first of the error entries in the FIFO is the ECC event indicating that an odd error was identified. The second error event in the FIFO indicates that the error was repaired. No more error events are generated since a repair was made that satisfies the ECC syndrome logic.
If the error is a multi-bit odd "In-Bounds" upset, then the Readback CRC injects an error into the frame (see Figure 2) and the next computation of the global CRC will reflect the presence of these upsets. The CRCERROR signal indicates to the outer scrubber that the entire frame must be scrubbed. If the CRC error signal is not asserted, then the error was a single-bit upset properly repaired by the Readback CRC system and no outer scrubber activity is needed.
If more than two identical error signatures are seen in the error FIFO, then the error is caused by a "masked bit" upset. Masked bits within a configuration memory are those bits that change during normal operation (such as user flip-flops and embedded LUT RAM). Because these bits change during normal operation, they are not used to compute the error syndrome of a frame (nor do they contribute to the global CRC). Although direct upsets to these bits may cause problems to a user circuit, they do not impact the computation of the frame error syndrome. However, it is possible that a multi-bit, odd upset will generate a syndrome that attempts to "repair" one of these masked bits. Since such bits cannot be repaired, the Readback CRC will continuously try to "repair" this bit and never complete a full internal scrub. When the outer scrubber detects this situation, it performs a scrub on the entire frame to correct the upsets and allows the Readback CRC to continue through the entire configuration memory.
C. CRC-Only Errors
The vast majority of the errors are identified and repaired using the ECCERROR signals of the FRAME_ECCE2. These errors are repaired relatively quickly: single-bit errors are repaired by the internal Readback CRC and multi-bit errors within a frame are repaired by scrubbing a single frame with the outer scrubber. Occasionally, the CRCERROR will be asserted without a corresponding ECCERROR, suggesting that something is wrong somewhere within the configuration memory. Although the global CRC is good at detecting errors, it provides no insight into the location of the error. Since the location of the error is not known, the entire bitstream must be scrubbed (Step 4). This is a very time-consuming process that involves writing every frame of the bitstream into the FPGA device (though it is not the same as a full reconfiguration). This full device scrub is the outer line of defense for errors and, although costly to perform, is essential for repairing all identifiable errors.
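The classification described in Sections V-A through V-C can be condensed into a small decision function. This is a simplified sketch rather than the flow chart of Figure 4: the event record, the `Action` names, and the convention that the decoded bit index is valid in the range 0 to 3,231 are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum

FRAME_BITS = 3232


class Action(Enum):
    NONE = "no outer-scrubber action"
    SCRUB_FRAME = "rewrite one golden frame"
    SCRUB_ALL = "rewrite every frame"


@dataclass(frozen=True)
class EccEvent:
    frame: int         # frame address reported with the event
    single: bool       # True for an odd ("single") error, False for even
    syndrome_bit: int  # bit index decoded from the syndrome (hypothetical)


def decide(events, crc_error):
    """Choose a response from the ECC events of one frame and the CRC flag."""
    if not events:
        # CRC-only error: location unknown, so everything is rewritten.
        return Action.SCRUB_ALL if crc_error else Action.NONE
    first = events[0]
    if not first.single:
        return Action.SCRUB_FRAME  # even error: ECC cannot repair it
    if not 0 <= first.syndrome_bit < FRAME_BITS:
        return Action.SCRUB_FRAME  # odd, out-of-bounds
    if len(events) > 2 and all(e == first for e in events):
        return Action.SCRUB_FRAME  # repeated signature: masked bit
    # Exactly detect + repair: the CRC separates SBU from odd in-bounds MBU.
    return Action.SCRUB_FRAME if crc_error else Action.NONE


odd = EccEvent(frame=0x40, single=True, syndrome_bit=103)
print(decide([odd, odd], crc_error=False).name)
print(decide([odd, odd], crc_error=True).name)
print(decide([], crc_error=True).name)
```

Only the last case, a CRC error with no ECC event, leads to the costly full-device scrub; every other path touches at most one frame.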
D. Single-Event Functional Interrupts (SEFI)
A common function of some configuration scrubbing implementations is to identify the presence of single-event functional interrupts or SEFIs. A SEFI is a system-wide fault within the device that prevents the device from operating properly (i.e., a "functional" interrupt). Known SEFIs for FPGAs include the power-on reset SEFI, frame address SEFI, and the SelectMAP SEFI. Although the sensitive cross-section of SEFIs is very small, their impact on the system is significant and methods for detecting and responding to their occurrence are important. The scrubbing architecture described in this paper does not detect or respond to SEFIs. It is possible that radiation-induced upsets may occur within this scrubbing architecture that cause unrecoverable SEFIs. Future work will investigate the sensitive cross-section of these SEFIs and mechanisms to respond to them.
VI. ZYNQ HYBRID SCRUBBER IMPLEMENTATION
The hybrid scrubbing approach described above was implemented in the Xilinx Zynq programmable SoC that incorporates the 7-Series FPGA fabric onto an ARM-based SoC. As described earlier, the inner scrubber is implemented using the internal Readback CRC mechanism found on all 7-Series FPGA devices. The outer scrubber is implemented with software running on the ARM processors within the Zynq device and outer configuration activities are performed using the "PCAP" configuration port found on all Zynq devices. The golden configuration memory used by the outer scrubber to repair upsets is located within the Zynq global DRAM system memory. The PCAP allows the ARM processor cores to directly write or read configuration data using DMA accesses [8]. The software that comprises the outer scrubber monitors the FRAME_ECCE2 module and implements the error classification and repair strategy described in Figure 4.
The specific device used was the XC7Z020 which contains 7,692 block 0 configuration frames and a total of 25,636,224 configuration bits. 3 Other Zynq devices have different numbers of frames and thus will have correspondingly different configuration times. The timing performance of the scrubber was measured with benchmarks and the effectiveness of the scrubber was validated with several radiation tests [12] .
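The relationship between the frame count and the bit count quoted above can be checked with a few lines of arithmetic. All constants are taken from the text; the variable names are illustrative.

```python
WORDS_PER_FRAME = 101
BITS_PER_WORD = 32
BLOCK0_FRAMES = 7_692        # block 0 frames on the XC7Z020
TOTAL_BITS = 25_636_224      # total configuration bits scanned

frame_bits = WORDS_PER_FRAME * BITS_PER_WORD
block0_bits = BLOCK0_FRAMES * frame_bits
extra_bits = TOTAL_BITS - block0_bits
extra_frames, remainder = divmod(extra_bits, frame_bits)

print(frame_bits, block0_bits, extra_bits, extra_frames, remainder)
```

The block 0 frames account for 24,860,544 bits, and the remaining 775,680 bits correspond to exactly 240 further frames, which is consistent with the scan covering frames beyond block 0.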
A. Performance
The performance of the scrubber can be evaluated using the following upper bound timing measures: worst case detection time ($T_D$), single-frame correction time ($T_C$), scrubbing overhead ($T_O$), and the total scrubbing time ($T_S$). The scrubbing overhead is the extra time needed by the software to implement the scrubbing logic. A variety of faults were artificially injected into the FPGA fabric to measure the scrubber performance. These metrics were measured both in software by using timestamps between functions and in hardware by counting clock cycles between error events. Table I shows the timing measurements for the hybrid scrubber for several different upset types. These measurements were made on the XC7Z020 device.
The single-bit upsets (upset type #1), which are the vast majority of upsets seen in a radiation environment, are the fastest to detect and repair. These are repaired with the dedicated internal Readback CRC in several microseconds without any intervention from the outer scrubbing system. The internal Readback CRC can scan the entire device relatively quickly (in 8.02 ms).
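The quoted scan time implies an effective readback rate that can be estimated directly. The sketch below uses only the 25,636,224-bit total and the 8.02 ms scan time from the text, and assumes the scan time is spread evenly across frames, which is a simplification.

```python
TOTAL_BITS = 25_636_224
FRAME_BITS = 101 * 32
SCAN_TIME_S = 8.02e-3        # full-device Readback CRC scan time

frames_scanned = TOTAL_BITS // FRAME_BITS
bits_per_second = TOTAL_BITS / SCAN_TIME_S
seconds_per_frame = SCAN_TIME_S / frames_scanned

print(frames_scanned)
print(f"{bits_per_second / 1e9:.2f} Gbit/s")
print(f"{seconds_per_frame * 1e6:.2f} us per frame")
```

Under that assumption the internal scan processes roughly one frame per microsecond, which is consistent with single-bit upsets being handled on a microsecond time scale once the scan reaches the affected frame.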
Upset types #2-#7 are upsets that require intervention by the outer scrubber. Detecting these upsets can take longer, as there are a variety of decisions that must be made and inputs obtained (see the sequence described by Figure 4). These upsets require the configuration of an entire frame through the external configuration interface. In some cases, multiple Readback CRC scans through the device are needed to fully detect and characterize the error.
The CRC-Only upsets are those upsets that cause a CRC error and whose error FIFO entry provides no feedback on the location of the upset. In this case, a full configuration of all configuration frames is required to repair the upsets. If the actual locations of the upsets are needed for logging or SEU data collection, a full readback is needed before the full configuration. Full readback and full configuration together require almost two seconds of configuration time and are the most costly repair operation performed by this scrubbing approach. Fortunately, this error condition occurs very infrequently and thus does not contribute considerably to the average scrubbing time.
3 The additional configuration bits come from scrubbing block 2 and block 3 frames.
B. Radiation Testing
This hybrid scrubber was used in a neutron radiation test at the Los Alamos Neutron Science Center (LANSCE) [15]. The test was performed in the Weapons Neutron Research (WNR) facility in a beam which has a wide neutron energy spectrum from 0.1 MeV to 600 MeV using the Target Flight Path 30L (ICE House). The hybrid scrubber was implemented on a Zynq System-on-Chip (SoC) XC7Z020 using the PCAP interface to perform configuration scrubbing. Table II summarizes the results of the hybrid scrubber's execution during the test. This table indicates the number of upsets observed in upset frames and the number of occurrences of each upset type. As expected, the majority of upsets (89%) were single-bit upsets within a single frame and were corrected quickly by the inner scrubber. Many of the single frame upsets reported in Table II were actually two-bit upsets caused by the same event but interleaved between different frames [16].
Although most upsets were single frame upsets, there were a large number of multi-bit upsets (11%). All of these upsets required repair from the outer scrubber using the logic summarized in Figure 4. All even-bit errors (377) were identified by querying the ECCERRORSINGLE signal as described in Section V-A. The odd multi-bit errors (138) were repaired using the technique described in Section V-B. No odd multi-bit "out-of-bounds" or "masked bit" upsets were observed, suggesting that all multi-bit errors were "in-bounds". Although these rare error conditions were not observed in this radiation test, they have been observed in fault injection experiments and other radiation tests.
Unlike many scrubber approaches that continuously perform a full readback, this hybrid scrubber performed a full device readback a relatively small number of times (only 69 times during the 45-hour test). This allowed the scrubber to operate with far fewer memory reads and configuration port accesses. In this experiment, no global CRC full reconfigurations were required.
VII. CONCLUSION
This paper presents a hybrid scrubbing architecture for 7-Series FPGAs. This architecture harnesses the performance advantages offered by the Readback CRC mechanism while simultaneously compensating for its limitations with an external readback scrubber to achieve the desirable combination of high performance and robustness. Explanations of the unique upset types that this hybrid scrubbing architecture handles were also given. This work presented the results of testing the hybrid scrubber in a radiation beam, which validated its effectiveness. This hybrid scrubber will be used in several scheduled satellite missions for scrubbing the FPGA logic within the Zynq programmable SoC. Future work will include porting this hybrid scrubbing architecture to the recently released UltraScale and UltraScale+ FPGA families.
