Abstract--This study examined high-current events observed in Xilinx Field-Programmable Gate Arrays irradiated with heavy ions. A probable cause and proposed changes to the test methodology to prevent these high-current events are described.
activating over-current protection (OCP) circuitry and making the problem difficult to diagnose.
In December 2015, an investigation into these high-current events was performed at Lawrence Berkeley Laboratory's 88-inch cyclotron using the 16 MeV cocktail. This paper describes the test approach and setup, proposes a configuration Single-Event Functional Interrupt (SEFI) mode that causes the high-current event, and outlines an approach to prevent these events on VCCINT.
II. TEST DESCRIPTION

A. FPGA Device Under Test
A 20 nm Xilinx Kintex UltraScale FPGA was selected as the candidate device-under-test (DUT) for this experiment, since the test conditions and methodology required to recreate high-current events were readily available from [2]. The specific part utilized for the DUT was the XCKU040-2FBVA1156E, which is a mid-range, extended temperature-grade, flip-chip Kintex UltraScale device [6] attached to a commercial Xilinx KCU105 evaluation board.
UltraScale devices operate with a nominal 0.95 V main core voltage (VCCINT), an auxiliary voltage of 1.8 V (VCCAUX), and programmable I/O pins at voltages from 1.2 V up to 3.3 V [3]. The configuration memory in these parts consists of an array of highly robust CMOS configuration latches that behave similarly to a static random-access memory (SRAM) [4]. This configuration memory controls the behavior of the various internal components and the programmable interconnect.
To prepare the DUTs for irradiation, the package lid was removed from the device and the silicon substrate was thinned to 74 μm. A picture of the DUT fixture is shown in Fig. 1. A measurement of the DUT thickness for one of the boards is shown in Fig. 2.
An Analysis of High-Current Events Observed on Xilinx 7-Series and UltraScale Field-Programmable Gate Arrays
As mentioned in the overview, this experiment was performed to investigate the cause of a high-current event on the core voltage rail (VCCINT) of Xilinx Virtex-7 and UltraScale FPGA families [3]. These events are characterized by a sudden and continual rise in operating current (over 2 A) occurring at a rate of over 1 A per second (and usually much faster, often over the course of several milliseconds). As an example, when this event was first observed, a static test design that nominally draws 0.75 A on VCCINT jumped to over 10 A in under 100 ms.
During the initial studies that observed these events, the power supplies were often not able to deliver enough current once the event occurred, causing significant voltage droop and interfering with device operation. Thus, it was difficult to determine whether these events were caused by latch-up or some other phenomenon. To mitigate this, the KCU105-based DUT board was modified to decouple the VCCINT and VCCAUX rails from the on-board regulators. These rails were then fed from an external power supply capable of delivering 20 A, which helped to avoid exceeding the supply's capacity and tripping OCP when the event occurred. The current was logged at about 13 Hz; a fast logging rate was necessary to observe the rate of current increase during these events and allowed exploration of the cause when they occurred.
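As a rough illustration of how a logged current trace could be screened for such events, the following Python sketch applies the two criteria quoted in this section (a rise of more than 2 A at a rate above 1 A per second) to samples taken at about 13 Hz. It is not the logging software used in the experiment. Reading "over 2 A" as a rise above a known baseline, averaging the slope over a four-sample window, and the function name `find_event_start` are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

SAMPLE_RATE_HZ = 13.0           # approximate logging rate reported in the text
RISE_THRESHOLD_A = 2.0          # assumption: "over 2 A" read as rise above baseline
SLOPE_THRESHOLD_A_PER_S = 1.0   # events rise faster than 1 A per second
WINDOW_SAMPLES = 4              # assumption: slope averaged over 4 samples


def find_event_start(
    samples: List[float], baseline_a: float
) -> Optional[Tuple[int, float]]:
    """Return (sample index, slope in A/s) of the first sample that meets
    both criteria, or None if the trace contains no such sample."""
    dt = WINDOW_SAMPLES / SAMPLE_RATE_HZ
    for i in range(WINDOW_SAMPLES, len(samples)):
        slope = (samples[i] - samples[i - WINDOW_SAMPLES]) / dt
        rise = samples[i] - baseline_a
        if rise > RISE_THRESHOLD_A and slope > SLOPE_THRESHOLD_A_PER_S:
            return i, slope
    return None


if __name__ == "__main__":
    # Synthetic trace: 2 s at 0.75 A, then a jump to 10 A within one sample,
    # loosely modeled on the 0.75 A to 10 A example in the text.
    trace = [0.75] * 26 + [10.0] * 13
    print(find_event_start(trace, baseline_a=0.75))
```

A slow thermal drift of a few hundred milliamps over several seconds does not satisfy either criterion, which is the distinction the fast logging rate is meant to make visible.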
The remaining power rails on the board were powered by the KCU105 on-board regulators, which utilize a single 12 V input. The temperature of the part was monitored both through the embedded Xilinx Analog-to-Digital Converter (XADC) and by a thermocouple that was placed near the device substrate.
B. Configuration Monitor
A JTAG-based configuration monitor (JCM), developed by Brigham Young University and based on the Xilinx Zynq programmable SoC device, served as the configuration monitor for this experiment. This board connected to the KCU105 through JTAG and was responsible for performing initial configuration, periodic readback of the configuration memory, and configuration scrubbing to repair upsets in the FPGA configuration memory state.
C. Particle Beam Properties
The Kintex UltraScale DUTs were irradiated in vacuum at the Lawrence Berkeley Laboratory (LBL) 88-inch Cyclotron. All irradiation was performed using 16 MeV krypton at normal incidence, providing a LET of 40.15 MeV·cm²/mg. A picture of the test setup is shown in Fig. 4.
D. Test Procedure
The FPGA was configured with a bitfile that consisted of a static, unclocked design containing multiple flip-flop-based shift registers and instantiating all of the BlockRAM in the device. Prior to irradiating the device, the configuration memory scrubber was started on the JCM, which checked the integrity of the device configuration memory and re-sent the bitfile when an upset configuration memory cell was detected.
UltraScale testing was conducted at nominal voltage biases. The device was tested in vacuum, which also caused the temperature of the device to rise above normal ambient levels. Efforts were made to keep the device temperature below 100 degrees Celsius, and a test run was not started until the device cooled below 90 degrees Celsius.
Readback of the configuration memory state occurred frequently (on average, every 15 seconds) during irradiation. When a high-current event was observed, the beam was immediately stopped in order to more thoroughly investigate the event.
A general sequence of events that comprised each test run is as follows:
1. Start power logging on the power supplies.
2. Power on the DUT.
3. Command the JCM to configure the DUT with the bitfile containing the static FF/BRAM test design.
4. Command the JCM to begin readback and scrubbing of the FPGA configuration memory.
5. Begin irradiation of the DUT with 16 MeV krypton.
6. Stop irradiation when the power supply current begins ramping (indicating that the high-current event is occurring), or when upset configuration cells cause too much contention in the device (observed during high-flux runs where the scrubber is not fast enough to correct upsets caused by the beam). A run was also terminated early if the device temperature reached the threshold of 110 degrees Celsius.
7. Stop configuration readback and scrubbing.
8. Stop power logging on the power supplies.
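The start and stop conditions in this procedure can be summarized in a short Python sketch. It is only an illustration of the decision logic described above, not the operator tooling used at the beam line. The function names, the returned reason strings, and the reuse of the 1 A per second figure as the definition of "ramping" are assumptions.

```python
from typing import Optional

START_MAX_TEMP_C = 90.0    # a run was not started until the device cooled below this
ABORT_TEMP_C = 110.0       # early-termination temperature threshold
RAMP_SLOPE_A_PER_S = 1.0   # assumption: "ramping" means faster than 1 A per second


def may_start_run(temp_c: float) -> bool:
    """A run may begin only once the device has cooled below 90 degrees C."""
    return temp_c < START_MAX_TEMP_C


def stop_reason(
    slope_a_per_s: float, temp_c: float, scrubber_keeping_up: bool
) -> Optional[str]:
    """Return why irradiation should stop, or None to keep the beam on."""
    if slope_a_per_s > RAMP_SLOPE_A_PER_S:
        return "current ramp"
    if not scrubber_keeping_up:
        return "contention from uncorrected upsets"
    if temp_c >= ABORT_TEMP_C:
        return "over temperature"
    return None


if __name__ == "__main__":
    print(may_start_run(85.0))
    print(stop_reason(slope_a_per_s=4.0, temp_c=95.0, scrubber_keeping_up=True))
```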
III. RESULTS
A. Test Results
Multiple captures of the high-current event were obtained over the course of the test on the Kintex UltraScale DUT. There were a number of interesting observations in the initial investigation of this phenomenon:
• The current draw observed for each high-current event was not always the same.
• Smaller current events (that did not trip OCP) could be cleared through the hardware master reset (PROG). Note that this contradicts the original findings in [3]; however, in [3], the power supply OCP had already tripped, which likely activated brown-out circuitry that prevented the event from responding to PROG in that investigation.
• The rate at which the current rise would occur was directly proportional to the rate at which the device was being scrubbed. In other words, if the JCM scrubbing speed was slowed down by a factor of 5, the corresponding current rise during a high-current event slowed by a factor of 5 as well.
In particular, it was this last observation that seemed to indicate that this event was not actually a phenomenon directly caused by single-event effects (like single-event latch-up).
In order to more closely investigate the scrubber's role in the high-current event, the scrubbing speed was lowered significantly until the current slope for these events was observable at roughly 10 A per second. Upon seeing the start of a high-current event, the scrubber was stopped as quickly as possible. As soon as the scrubber was stopped, the current rise stopped as well. Based on these observations, the hypothesis for the cause of the current events seen on 7-series and UltraScale is an SEU or SET in the configuration controller that causes incoming scrubber data to be written to the wrong location in the device.
There are at least two reasons this may be occurring. The first could be an SEU in the Frame Address Register (FAR), which holds the memory address that is being written when configuration scrubbing occurs. When scrubbing multiple frames, the FAR is automatically incremented as configuration data is written into the device. An upset in one of the FAR registers could cause configuration data to be written to the wrong location in memory.
The second could be an SET on the clock line for the boundary-scan cells that feed the configuration SRAM cells. The SET would cause the data to jump ahead of the location where it should be written, which likewise causes the configuration data to be written into the wrong location.
Regardless of the cause, the writing of configuration memory into the incorrect location causes significant contention current as drivers are connected to each other within the chip, and is responsible for the quick rise in current. This is akin to a "Scrub SEFI" seen on previous generation Xilinx families [7] . To further confirm this theory, a few runs were performed without scrubbing, and this rapid current rise was not seen (although, it should be mentioned that the test time for these runs was limited, as SEU-induced configuration upsets caused a great deal of contention current as well over time).
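The effect of an address upset during a multi-frame scrub can be illustrated with a toy Python model. This is a sketch under strong simplifying assumptions, not a model of the actual configuration controller. The FAR is treated as a plain linear index, although the real register has structured fields. Frames are short strings, and writes to addresses outside the array are simply dropped. The function names are hypothetical.

```python
from typing import List, Optional


def blind_scrub(
    golden: List[str], upset_at: Optional[int] = None, flipped_bit: int = 0
) -> List[str]:
    """Write every golden frame starting at address 0 with an auto-incrementing
    address; optionally flip one address bit just before frame `upset_at` is
    written. Returns the resulting device memory."""
    memory = list(golden)  # assume the device starts out correctly configured
    far = 0
    for i, frame in enumerate(golden):
        if i == upset_at:
            far ^= 1 << flipped_bit   # single-bit upset in the address
        if 0 <= far < len(memory):    # assumption: out-of-range writes are dropped
            memory[far] = frame
        far += 1                      # auto-increment after each frame
    return memory


def count_bad_frames(memory: List[str], golden: List[str]) -> int:
    """Number of frames whose contents differ from the golden bitfile."""
    return sum(m != g for m, g in zip(memory, golden))


if __name__ == "__main__":
    golden = [f"frame{i}" for i in range(16)]
    print(count_bad_frames(blind_scrub(golden), golden))
    print(count_bad_frames(blind_scrub(golden, upset_at=2, flipped_bit=0), golden))
```

In this model a single flipped address bit early in the scrub shifts every later frame by one position, so a device that was fully correct beforehand ends up with 13 of its 16 frames holding the wrong data. That is the qualitative behavior the hypothesis relies on: the longer the scrub keeps streaming after the upset, the more frames are corrupted.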
An annotated graph of current over time for one of the scrubbing-enabled runs with a high-current event is shown above in Fig. 5. Note that the scrubber was slowed down considerably below its maximum speed (by a factor of approximately 5) in order to slow down the rate of current increase when an event occurred. The beam run can be broken down into the following events:
0. The device is powered on but not configured. This is the lowest current state of the test. For this particular run, due to the device operating at high temperature (around 90 degrees Celsius), the current in this state was nominally 2.5 A.
1. The device was configured with a bitfile containing the static test design. The device is now operational and the DONE pin is high. The current rose to 3 A once the device started up, because the configured design draws more power than the unconfigured device; the added power raises the temperature, which in turn slightly increases the current over time.
2. Irradiation with 16 MeV krypton begins.
3. As configuration upsets occur, internal drivers are connected together, which causes contention and a significant rise in current. The scrubber would normally be able to correct these errors in the configuration memory, but because the scrubber was slowed down significantly for this run, it cannot keep up with the rate of upsets.
4. Either the FAR is hit during a scrub cycle or an SET occurs on the clock line feeding configuration data; the net result is that invalid configuration data is now being written into the device (a.k.a. "Scrub SEFI"). This writing of incorrect data causes a massive amount of contention, resulting in a high core current draw. In this case, the current rose over 6 A in approximately 1.5 seconds.
5. The test operator observed the high current rise and manually halted the scrubber to prevent it from writing any further bad configuration data.
6. Irradiation was stopped to allow for investigation.
7. Though there are no new configuration upsets occurring in this section of the current trace, the internal contention present in the device causes the device temperature to rise significantly (now well over 100 degrees Celsius), and the core current continues to rise correspondingly.
8. The scrubber was reset and restarted in an attempt to fix the bad configuration frames that it caused earlier in step 4, and to fix the accumulated upsets that occurred after the scrubber was stopped in steps 5 & 6.
9. The scrub is complete, and all configuration data is now back to normal.
10. The temperature of the device starts dropping as there is no longer any contention within the device. Current draw is still at 6.5 A, but if the device were allowed to cool for long enough, it would be expected to match the current before irradiation began at step 2.
11. As an experimental measure, the PROG (reset) pin was asserted to clear the device configuration and halt device operation. The current drop reflects the fact that the device is being held in global reset by the configuration state machine, awaiting a new bitfile. The current is at about 4.3 A but again would be expected to drop to the levels observed at step 0 if the device were allowed to cool to the temperature at which the experiment started.
B. Mitigating the Scrub SEFI
Since the high-current contention is created by writing multiple frames into the wrong memory location, the effects of the event can be mostly mitigated by a more careful approach to scrubbing the design in the configuration memory. As mentioned in Section III(A), this event may be caused by either an SEU in the FAR or an SET on the clock connected to the boundary-scan registers that feed the configuration SRAM cells. Unfortunately, there is no way to mitigate either of these effects directly. However, the scrubbing methodology can limit the extent of the damage caused by scrubbing data into the wrong location.
The configuration memory space in Xilinx FPGAs is divided into configuration words, or "frames," which comprise the smallest block of configuration memory that can be written in one operation. These words are 3,232 bits in length for 7-series devices and 3,936 bits in length for the UltraScale family.
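As a quick consistency check on these frame sizes, the short Python sketch below converts them into 32-bit units. The 32-bit width is taken from the Xilinx configuration user guides rather than from this paper, and the helper name `frame_words` is hypothetical.

```python
WORD_BITS = 32  # assumption: width of one configuration data word

# Frame lengths in bits, as quoted in the text.
FRAME_BITS = {"7-series": 3232, "UltraScale": 3936}


def frame_words(family: str) -> int:
    """Number of 32-bit words in one configuration frame."""
    words, remainder = divmod(FRAME_BITS[family], WORD_BITS)
    if remainder:
        raise ValueError(f"{family} frame is not a whole number of words")
    return words


if __name__ == "__main__":
    for family, bits in FRAME_BITS.items():
        print(f"{family}: {bits} bits = {frame_words(family)} x {WORD_BITS}-bit words")
```

Both sizes divide evenly, giving 101 words per frame for 7-series and 123 words per frame for UltraScale.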
When scrubbing the configuration memory, the easiest (albeit not recommended) method is to set the FAR to the base address of the configuration memory space, then to write in the entire bitfile. The FAR will automatically increment after each frame, and the end result will be that all configuration errors in the entire device will be corrected. However, the long scrub time makes the device susceptible to an SEU in the FAR or an SET on the configuration data clock, which would induce the scrub SEFI. In fact, all known instances of rapid high-current events observed in these devices so far have been caused by scrubbers implemented with this scrub methodology.
The solution and best way to avoid the scrub SEFI is to scrub frames individually, rather than perform full scrubs that would blindly dump all of the bitfile contents back into the device in a single operation. Thus, when a frame with an upset is detected, the corresponding frame from the bitfile (and only that frame) should be loaded into the part; following the single-frame scrub, a readback of that frame can be performed to verify the data was written back into the correct location and that the frame scrub has corrected the errors in that particular frame [7] .
The proposed methodology is as follows:
1. The FAR should be written with the address of the frame that contains an SEU in its configuration data.
2. The FAR should be read to ensure that the address matches the expected address that was written.
3. If the FAR matches, then the configuration frame should be written into the device.
4. The FAR should be read out one last time, to ensure that the FAR was not affected by an SEU during the write.
5. A readback of the frame should be performed to verify the contents of the frame were written correctly.
6. Note that if an SEU occurs in the FAR, the original frame that was going to be corrected and the frame at the address pointed to by the upset FAR may both need to be corrected.
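The steps above can be sketched in Python against a stand-in for the configuration interface. This is a minimal illustration of the control flow, not the JCM implementation. The `FakeConfigPort` class, its method names, and the fault-injection hook are hypothetical. The model also omits the FAR auto-increment after a frame write, so step 4 compares against the original address.

```python
from typing import List, Optional


class FakeConfigPort:
    """Hypothetical in-memory stand-in for a configuration interface."""

    def __init__(self, memory: List[bytes]) -> None:
        self.memory = memory
        self.far = 0
        # Fault injection: if set, this FAR bit flips during the next write.
        self.flip_far_bit_on_write: Optional[int] = None

    def write_far(self, addr: int) -> None:
        self.far = addr

    def read_far(self) -> int:
        return self.far

    def write_frame(self, data: bytes) -> None:
        if self.flip_far_bit_on_write is not None:
            self.far ^= 1 << self.flip_far_bit_on_write
            self.flip_far_bit_on_write = None
        self.memory[self.far] = data

    def read_frame(self, addr: int) -> bytes:
        return self.memory[addr]


def scrub_frame(port: FakeConfigPort, addr: int, golden: List[bytes]) -> List[int]:
    """Scrub one frame; return addresses that still need repair (empty if none)."""
    port.write_far(addr)                        # step 1: load the frame address
    if port.read_far() != addr:                 # step 2: verify it before writing
        return [addr]
    port.write_frame(golden[addr])              # step 3: write only this frame
    suspects: List[int] = []
    far_after = port.read_far()                 # step 4: re-check the FAR
    if far_after != addr:
        suspects.append(far_after)              # step 6: misdirected target frame
    if port.read_frame(addr) != golden[addr]:   # step 5: read back and compare
        suspects.append(addr)
    return suspects


if __name__ == "__main__":
    golden = [bytes([i]) * 4 for i in range(8)]
    memory = list(golden)
    memory[2] = b"\xff" * 4                     # simulated configuration upset
    port = FakeConfigPort(memory)
    port.flip_far_bit_on_write = 2              # inject a FAR upset during the write
    todo = scrub_frame(port, 2, golden)
    print(sorted(todo))
    for frame_addr in todo:
        scrub_frame(port, frame_addr, golden)
    print(port.memory == golden)
```

With the injected upset, the write lands on frame 6 instead of frame 2, so both addresses are reported and repaired on the follow-up pass. The damage is bounded to one misdirected frame rather than the remainder of the bitfile.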
It should be noted that this effect is limited to scrubbing only. When initial configuration of the device occurs, CRCs for each configuration word or "frame" are calculated and compared to the CRC that is embedded in the configuration bitstream. A mismatch of CRCs causes an error and the configuration state machine will prevent the device from starting up. However, this CRC is not checked when the configuration memory is written during device operation (such as when scrubbing to correct SEUs).
IV. CONCLUSION
The high-current events previously reported on the core (VCCINT) voltage rails of 7-series and UltraScale FPGA parts are hypothesized to be caused by a scrubbing methodology in which an SEU or SET affects the configuration engine. This paper proposes that a frame-based scrubbing methodology will prevent these high-current events from occurring altogether by scrubbing frames one at a time, rather than scrubbing the entire device in a single operation. As no other high-current modes exist in either 7-series or UltraScale, the validation of this claim will be a solid step forward in enabling the use of these parts for future space missions.
