Abstract-
I. INTRODUCTION
One of the features of the Xilinx Virtex-5 FX series of Field Programmable Gate Arrays (FPGAs) is the availability of multiple Multi-Gigabit Transceivers (MGTs) in the device array. MGTs are embedded SerDes (Serializer/Deserializer) devices that give the FPGA designer a built-in solution for applications requiring gigabit bandwidths to move data on and off the FPGA. The capabilities of MGTs, and of SerDes devices in general, have defined the infrastructure interface data bandwidth requirements of terrestrial electronic systems, particularly for communications, and such capabilities are also desired in space-bound applications, whether scientific, commercial, or military.
The space-grade Virtex-5QV MGTs are thus being studied in an effort to give FPGA designers the information needed to use the MGTs with confidence in the harshness of the space environment. Understanding the functional response of the MGT elements as they are subjected to Single Event Effects (SEE) is fundamental to the development of efficient, space-grade MGT systems, which are required to operate reliably and predictably. Evidence of the space community's interest is the different characterization efforts that have been undertaken and disclosed during the last half-dozen years, providing varied results [1]-[4].
The goal of the work described here was to irradiate MGTs with accelerator heavy ions, identify the different upset manifestations observed, determine their impact, and establish whether any stimulus is required to recover complete functionality. This investigative effort was carried out with the intent of generating test data products that could be easily transposed to the perspective of an MGT-system designer. In this text, we first describe the custom instrumentation developed for these tests, then report on and briefly discuss the results of the characterization measurements, and finally illustrate a simple analysis example, using the data obtained, of the effective bit-error rate (BER) incurred by a typical MGT channel operating in a geosynchronous (GEO) orbit.
II. INSTRUMENTATION
Based on the nature of the MGT transmit and receive mechanisms, two basic types of events can be expected: events with an almost instantaneous effect and subsequent self-recovery of the MGT channel, and events that upset the MGT function in a way that requires some sort of user or protocol activity to fully recover the communication channel. All observed SEEs were categorized into these two types; the first type is referred to as a Bit-Error (BE) Event, and the second type as a Loss-of-Link (LOL) Event.
A test system was developed that could capture both BE and LOL events on multiple MGT test channels. The system was devised so that data describing the characteristics of each LOL event observed, such as the event's duration, start time, and expected and erroneous data words, could be assembled and collected. In parallel with the LOL event detection and capture scheme, the system also counted the total number of Bits-in-Error (BiE) of BE events. Test hardware was assembled with two FPGAs: a space-grade prototype Virtex-5 FX130 and a commercial-grade Virtex-5 FX130T. The Virtex-5 architecture provides MGTs as full-duplex instances arranged in one column of the device array; each instance is referred to as a Gigabit Transceiver (GTX) Tile, with two Transmit (TX) and Receive (RX) MGT Channels, or Lanes, per tile. The device to be irradiated, the space-grade prototype, was the device under test (DUT).
To isolate the SEE susceptibility of the MGT's TX and RX functions, half of the MGT Tiles in the column of the DUT FPGA were hidden behind an ion mask, while the other half were exposed, as shown in Fig. 1. Then, by interleaving the parallel paths between an exposed and a hidden MGT tile, the effects on RX and TX lane pairs were separated into individual MGT test channels. This is shown in Fig. 2, which illustrates the MGT channel configuration of four of the test channels and the unique TX or RX element exposed for each of them.
A simple, custom synchronization and Bit Error Rate Test (BERT) protocol was developed specifically for these tests, based on previous investigations [2]. The protocol ran independently on each test channel, alternating a Pseudo-Random Bit Sequence (PRBS) data pattern with synchronization characters. Upon detection of a LOL event in a test channel, the protocol activated a series of recovery stimulus signals and recorded the stimulus that fully recovered the channel. The recovery stimulus signals exercised were the collection of lane-level resets made available in the MGT primitive by Xilinx [5]. The recovery timeline upon LOL detection is shown in Fig. 3. If none of the stimuli applied was able to recover the channel, the irradiation beam was halted and more intrusive stimuli were applied manually. These stimuli included performing tile-level resets and power-downs (affecting all the channels of the tile), scrubbing the tile's Dynamic Reconfiguration Port (DRP), resetting the Digital Clock Manager (DCM), and, ultimately, initiating a complete device reconfiguration.
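The escalation described above can be pictured as a simple ladder: apply each stimulus in turn and record the first one after which the link is up again. The following is only a minimal sketch of that idea under stated assumptions. The stimulus names (`reset_a` to `reset_d`), the `FakeChannel` model, and the function `recover_channel` are hypothetical and do not correspond to the actual reset sequence of Fig. 3 or to any Xilinx port name.

```python
from typing import Callable, Optional, Sequence


def recover_channel(
    stimuli: Sequence[str],
    apply_stimulus: Callable[[str], None],
    link_is_up: Callable[[], bool],
) -> Optional[str]:
    """Apply stimuli in order and return the first one that restores the link."""
    for name in stimuli:
        apply_stimulus(name)
        if link_is_up():
            return name  # this is the stimulus that gets recorded
    return None  # nothing worked: halt the beam and intervene manually


class FakeChannel:
    """Toy stand-in for a test channel that only one stimulus can cure."""

    def __init__(self, cure: str) -> None:
        self.cure = cure
        self.up = False
        self.applied = []

    def apply(self, name: str) -> None:
        self.applied.append(name)
        if name == self.cure:
            self.up = True

    def is_up(self) -> bool:
        return self.up


# Hypothetical ladder, ordered from least to most intrusive.
LADDER = ["reset_a", "reset_b", "reset_c", "reset_d"]

chan = FakeChannel(cure="reset_c")
recovered_by = recover_channel(LADDER, chan.apply, chan.is_up)
print(recovered_by, chan.applied)
```

Ordering the ladder from least to most intrusive means the recorded stimulus is also the mildest one that sufficed, which is what makes the per-stimulus event counts meaningful.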
A Service FPGA was developed with the capability of generating and verifying 12 test channels of BERT MGT data at 3.125 Gbps. The test system was set up to continuously scrub the configuration image of the test FPGA design at a rate of several times per second.
III. SINGLE EVENT EFFECT (SEE) CHARACTERIZATION
Heavy ion irradiation took place at the Texas A&M University Cyclotron in the summer of 2010. All MGT SEE manifestations were categorized as Bit Error Events (BE) or Loss-of-Link Events (LOL) for each TX and RX test channel. Cross-section plots for all the events are shown in the Appendix; the Weibull fit parameters are listed in Table I.
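For readers who want to evaluate a fitted curve at a given LET, the sketch below implements the four-parameter Weibull form conventionally used for SEE cross-section fits, sigma(LET) = sigma_sat * (1 - exp(-((LET - L0) / W)^s)) above the onset L0 and zero below it. That the fits reported here follow exactly this parameterization is an assumption, and the function name and all numeric values are hypothetical placeholders, not measured data.

```python
import math


def weibull_cross_section(let, sigma_sat, let_onset, width, shape):
    """Four-parameter Weibull cross-section (same units as sigma_sat).

    let       : effective LET of the ion
    sigma_sat : saturation (limiting) cross-section
    let_onset : onset LET, below which the cross-section is zero
    width     : Weibull width parameter W
    shape     : Weibull shape exponent s
    """
    if let <= let_onset:
        return 0.0
    x = (let - let_onset) / width
    return sigma_sat * (1.0 - math.exp(-(x ** shape)))


# Placeholder parameters for illustration only (not from Table I).
sigma = weibull_cross_section(
    let=12.0, sigma_sat=1e-6, let_onset=2.0, width=10.0, shape=1.5
)
print(sigma)
```

At one width above onset the curve reaches 1 - 1/e (about 63%) of the saturation value regardless of the shape exponent, which is a quick sanity check on any set of fit parameters.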
A. Bit Error (BE) Events
Bit Errors (BE) are quasi-instantaneous, recovering just as quickly as they occur and leaving behind one or more bits in error, with no other impact on the MGT functionality. Cross-Section 1 shows the TX and RX Bit-Error events measured per channel, while Fig. 4 shows how the average BiE per event increases with the incident particle's deposited energy, as well as how TX events manifest more BiE per event than RX events.
To differentiate between BE and LOL events, only continuous errors in the test protocol's PRBS data pattern that exceeded a pre-specified amount of time were considered LOL events. Different LOL time metrics were applied, ranging from 100 ns to 13.1 µs, but no significant variations in the LOL and BE counts were noted, indicating that the duration of BE events is much shorter than the smallest metric used, 100 ns. This is corroborated by the average BiE observed for each LET, which was never more than two bytes in any case, for either RX or TX channels.
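The BE/LOL discrimination rule above can be sketched as a run-length classifier over the compared data words: consecutive mismatching words form one event, and the event is a LOL only if its duration exceeds the chosen time metric. This is a simplified illustration, not the test system's actual logic. The function name, the event dictionary layout, and the 50 ns word period used in the example are assumptions.

```python
def classify_error_runs(expected, received, word_period_ns, lol_threshold_ns):
    """Group consecutive mismatching words into events labeled BE or LOL."""
    events = []
    run_start = None  # index of the first errored word in the current run
    run_bits = 0      # bits in error accumulated over the current run

    # A trailing matching pair (0, 0) closes a run that reaches the end.
    pairs = list(zip(expected, received)) + [(0, 0)]
    for i, (exp, rec) in enumerate(pairs):
        diff = exp ^ rec
        if diff:
            if run_start is None:
                run_start, run_bits = i, 0
            run_bits += bin(diff).count("1")  # popcount of flipped bits
        elif run_start is not None:
            duration_ns = (i - run_start) * word_period_ns
            events.append({
                "kind": "LOL" if duration_ns > lol_threshold_ns else "BE",
                "start_ns": run_start * word_period_ns,
                "duration_ns": duration_ns,
                "bits_in_error": run_bits,
            })
            run_start = None
    return events


# Ten 8-bit words; one isolated bit flip and one four-word corrupted run.
expected = [0xAA] * 10
received = list(expected)
received[2] ^= 0x01
for k in range(5, 9):
    received[k] ^= 0xFF

events = classify_error_runs(expected, received,
                             word_period_ns=50, lol_threshold_ns=100)
for ev in events:
    print(ev)
```

Re-running the same captured data with several thresholds, as was done with metrics from 100 ns to 13.1 µs, only requires changing `lol_threshold_ns`; stable BE and LOL counts across thresholds indicate that the two populations are well separated in duration.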
B. Loss-of-Link Events (LOL)
LOL events were classified by the mechanism required to recover the affected channel's functionality. They comprise events that required re-synchronization to recover (Re-Sync LOL), events that required a channel re-initialization (Re-Init LOL), and events that recovered only after more intrusive or cumbersome mechanisms were applied, such as tile-level resets and scrubbing of the Dynamic Reconfiguration Port (DRP) image.
1) Recovery by Channel Re-Synchronization
The more benign LOL events are those that are recovered by simply re-synchronizing the communication channel. This is typically done periodically in serial communication protocols by sending Synchronization (Sync) characters between user-data packets. The Sync characters are defined by the 8-bit/10-bit encoding standard used by the MGTs. In the test, upon LOL detection, the test channel was given 512 ms to recover functionality while transmitting only Sync characters, after which recovery stimuli were applied. Re-Sync events were observed and captured for both TX and RX channels, both demonstrating susceptibility, as shown in Cross-Sections 2 and 3.
Analyzing the duration of each Re-Sync event yielded the common duration signatures of the events. Duration centers peaked between 1 and 2 µs, at 170 µs, and at 200 µs; many events recovered within the first 60 µs but, as shown in Fig. 5, they peaked in repeatability at around 1.26 µs. Each of the common duration centers more than likely represents a particular SEE source mechanism in the MGT circuitry. Certain TX mechanisms have a signature that is not apparent for RX channels, and vice versa, as is shown in Fig. 6, where TX channels have a failure center at 170 µs, while RX channels have it at 200 µs. When the incident particle's deposited energy was increased, duration centers began appearing at around 60 and 80 µs, as shown in the 3-dimensional plot in Fig. 7. For RX channels, some Re-Sync events were observed that were of random duration (no repetition) and lasted between hundreds of milliseconds and 8 seconds (Cross-Section 3). These unique Re-Sync LOL events reflect the susceptibility of the more particular elements of the MGT, such as the RX Clock Data Recovery (CDR) engine, which have been known to take a long time to recover from SEEs [6], [7]. CDR capabilities are needed by the MGT to recover a sampling clock from the 0-to-1 transitions of the incoming signal. The recovered clock is then used to execute the RX serial-to-parallel conversion.
2) Recovery by Channel Re-Initialization
These types of events, referred to as Re-Init LOL events in Cross-Sections 2 and 3, require a lane-level or channel-level reset to recover. The test protocol activated the sequence shown in Fig. 3, which was chosen because each subsequent reset step resets a higher level of circuitry within the MGT tile and link [5]. Very few lane Re-Init events were observed for the TX channels, as can be seen in Cross-Section 2, where less than 0.2% of all TX LOL events required a re-initialization-like action. RX channels, on the other hand, demonstrated a higher likelihood of requiring Re-Init activity, as seen in Cross-Section 3, where approximately 4% of LOL events required Re-Init. The detailed cross-sections showing which particular reset was required to recover the RX channel are given in Cross-Section 4.
3) Recovery by MGT Tile-Level Reset
An MGT Tile Reset is a much more intrusive action than the lane-level resets. Tile-level resets affect the entire tile, including the common Phase-Locked Loop (PLL) and the two transceivers, disabling all functionality until recovered. An MGT Tile Reset, which is referred to as "GTX_RESET" in the Xilinx primitive, places all circuitry of the MGT in the power-up state and takes milliseconds to complete [5]. Throughout our testing, RX channels experienced several events requiring GTX Tile resets, but only one such event was observed for TX channels (less than 0.6% of all GTX Reset events). Cross-Section 5 shows the measured cross-section of the Tile (GTX) Reset events.
4) Recovery by DRP Image Scrub
The Virtex-5 architecture equips each GTX Tile with a Dynamic Reconfiguration Port, which, as the name implies, provides a dynamic reconfiguration ability for certain features of the MGT. Transmission clock rates, reference clock sources, and protocol control and alignment characters, including channel bonding and clock correction settings, can all be changed dynamically, without altering the FPGA's configuration image.
In the test configuration utilized, the MGT tile's DRP bits could be scrubbed by enabling the Global Look-Up Table (GLUT) feature of the device configuration scrub sequence, executed by modifying one of the ongoing scrub cycles. Several events of this type were observed for RX channels, and none were observed for TX channels. The measured DRP Scrub Event cross-section is shown in Cross-Section 5, where these events demonstrate a significantly higher LET threshold than any of the other events. This is expected of the bits that make up the DRP configuration image, since they are related to the hardened configuration cells of the Virtex-5QV FPGA.
C. Loss of Link Manifestations on Multiple Channels
For each LOL event, data were collected indicating its precise start time and duration, using a sample period of 10 ns. The final test system provided 12 MGT test channels arranged over 4 tiles, and for 2 of those tiles, both transceiver lanes (Lane 0 TX and RX, and Lane 1 TX and RX) were fully monitored. By noticing coincident start times for a LOL on two, three, or four of the different channels of the same tile, events that affected multiple channels can be extracted. It is statistically very improbable that two different particle strikes would produce this signature, since the overall event count rate of the accelerator testing is relatively low (2 or 3 events per second in the worst of cases).
A cross-section per tile was obtained from analyzing the data, demonstrating that a LOL manifestation on a single channel is just as likely as one that affects multiple channels, which is particularly true for TX channels. Cross-Section 6 shows the cross-section of these events, plotted against the LOL cross-sections for events unique to single TX and RX channels. Also plotted in Cross-Section 6 are the LOL manifestations observed across multiple tiles; these were very uncommon, and it should be noted that the cross-section shown is per tile.
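The coincidence extraction described above amounts to grouping LOL records of the same tile whose start times fall within one sample period of each other. The sketch below shows one way this could be done. The record layout `(tile, channel, start_ns)`, the channel labels, and the function name are assumptions made for illustration; they are not the format of the actual test data.

```python
from collections import defaultdict


def find_multichannel_events(lol_records, tolerance_ns=10):
    """Return (tile, start_ns, channels) for LOLs that start together.

    lol_records  : iterable of (tile, channel, start_ns) tuples
    tolerance_ns : maximum spread of start times within one group
    """
    by_tile = defaultdict(list)
    for tile, channel, start_ns in lol_records:
        by_tile[tile].append((start_ns, channel))

    groups = []

    def flush(tile, cluster):
        channels = sorted({ch for _, ch in cluster})
        if len(channels) >= 2:  # keep only events seen on several channels
            groups.append((tile, cluster[0][0], channels))

    for tile, items in by_tile.items():
        items.sort()
        cluster = [items[0]]
        for item in items[1:]:
            # Compare against the first record of the cluster so the
            # total spread of a group never exceeds the tolerance.
            if item[0] - cluster[0][0] <= tolerance_ns:
                cluster.append(item)
            else:
                flush(tile, cluster)
                cluster = [item]
        flush(tile, cluster)
    return groups


records = [
    (0, "TX0", 1_000), (0, "RX0", 1_000), (0, "TX1", 1_010),
    (0, "RX1", 500_000),
    (1, "TX0", 1_000), (1, "RX0", 2_000_000),
]
multi = find_multichannel_events(records)
print(multi)
```

As a rough check of the improbability argument, at 3 events per second the expected number of independent events starting within a given 10 ns window is about 3 x 10^-8, so coincident start times on one tile are very unlikely to be caused by two separate strikes.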
It can be deduced then that multiple channel events in the same tile are more than likely due to upsets in the common circuitry shared by both of the MGT tile's lanes. A very important aspect of the Virtex-5 GTX Tile is the common PLL across all the channels, which provides the serial rate clock to the MGTs. The test design instantiated common circuitry across tiles as well, in the form of a shared reference clock, which is input and buffered by one MGT tile and distributed to the other MGT tiles implemented. 
D. Characterization Summary
The list of all SEE manifestations that were characterized, and their Weibull curve-fit parameters, are shown in Table I. All of the critical functions of the MGT tiles have been subjected to accelerator ions, including the SerDes elements, the shared PLL, the data encoders and decoders, the internal parallel data buffers and data pipeline of both the RX and TX elements, the RX CDR mechanisms, and so on. The effects observed and measured in this testing encompass all of the possible manifestations generated by the mechanisms and sub-circuits that drive the MGT tile's functionality.
The stimulus necessary to recover from all the SEEs observed can be actuated when performing a protocol-driven channel re-initialization sequence. Protocols can actuate the lane-level or tile-level resets, as well as the periodic transmission of synchronization characters needed to keep the channel synchronized. A GTX Tile DRP scrub is probably not part of a protocol's suite of activities, but a protocol can be made to initiate one if necessary. Thus it is likely that the re-initialization sequences of standard off-the-shelf protocols can recover all events except DRP scrub LOL events. If a certain recovery mechanism is not actuated by the protocol, the protocol can be modified to do so by customizing the core that implements it.
IV. SYSTEM IMPACT: BIT-ERROR RATE (BER) ANALYSIS EXAMPLE
To estimate the BER incurred by an MGT communication system as a result of SEEs, we must first understand how each type of event contributes to the BER, and then sum the contributions to obtain the total effective BER. For each type of SEE there is an amount of time associated with the event: the total time to detect and, if applicable, to correct the event. During this total duration, it has to be assumed that every bit transmitted by the channel could be in error; thus the number of corrupted bits caused by each event is the product of the effective data rate of the system and the total down time of the event.
Upset probability rates were predicted for each manifestation using the on-orbit event prediction tool CRÈME-MC [8]. GEO orbit parameters were chosen since spacecraft in that orbit receive no protection from the magnetosphere. Elaborating on the BER expression, a number of effective bit errors per day can now be obtained for each event.
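$$\mathrm{BER} = \frac{\text{no. bits in error}}{\text{total no. bits}} = \frac{P_{SEU}\, N_{susc}\, N_{err}\, K}{N_{bits\text{-}day}}$$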
where $P_{SEU}$ is the daily SEE upset rate probability for each event type; $N_{susc}$ is the number of elements susceptible to the event, in this case channels or tiles; $N_{err}$ is the total number of bits in error as an outcome of the SEE; $N_{bits\text{-}day}$ is the effective maximum number of bits transmitted in one day by our case-study MGT link; and $K$ is the susceptibility duty-cycle factor of each event. The susceptibility duty cycle is essentially the probability that the event coincides with the time during which the pertinent elements are susceptible. The case study used in the BER analysis consists of two FPGAs communicating via a pair of MGT RX/TX links in a GEO orbit. The BER of the MGTs used in the communication link is only one of the contributors to the total BER of this communication interface. For the total BER of the interface, we must include the contributions from the communication core and from any upstream or downstream design partitions of the FPGA that provide and collect the data transmitted. The operational assumptions used for the case study variables are listed below:
This estimate is conservative and could be driven down by performing a more sophisticated analysis and ensuring that event recovery by the protocol is expedited as much as possible. The SEE with the most impact on the effective BER is the long, random-duration Re-Sync event suffered by RX channels. For this analysis, a worst-case time duration of 2.64 seconds was used, which is the average duration of all the events measured during testing. Hardening the particular upset mechanism responsible for this event would have the most impact in increasing the SEE tolerance of the Virtex-5 MGTs, reducing the BER by more than an order of magnitude.
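The bookkeeping of this section can be sketched in a few lines, assuming that each event type contributes the product of its daily upset rate, the number of susceptible elements, the bits lost per event, and the duty-cycle factor, divided by the bits sent per day. Only the 2.64 s down time and the two-byte bound on bits in error per BE event come from the text. The data rate, the upset rates, the element counts, and the function names are hypothetical placeholders, not results of this work.

```python
SECONDS_PER_DAY = 86_400


def bits_in_error(data_rate_bps, down_time_s):
    """Bits assumed corrupted while a channel is down (rate x down time)."""
    return data_rate_bps * down_time_s


def event_ber(p_seu_per_day, n_susc, n_err, k, n_bits_day):
    """BER contribution of one event type."""
    return p_seu_per_day * n_susc * n_err * k / n_bits_day


# Hypothetical effective payload rate of the case-study link.
DATA_RATE_BPS = 2.5e9
N_BITS_DAY = DATA_RATE_BPS * SECONDS_PER_DAY

# (name, P_SEU per day, N_susc, N_err, K); rates and counts are placeholders.
events = [
    ("long Re-Sync (RX)", 1e-3, 2, bits_in_error(DATA_RATE_BPS, 2.64), 1.0),
    ("bit error", 1e-1, 4, 16, 1.0),  # at most two bytes in error per event
]

contributions = [event_ber(p, n, e, k, N_BITS_DAY) for _, p, n, e, k in events]
total_ber = sum(contributions)

for (name, *_), ber in zip(events, contributions):
    print(f"{name}: {ber:.3e}")
print(f"total: {total_ber:.3e}")
```

Even with placeholder rates, the structure shows why a long-duration event dominates: its bits lost per event scale with the data rate times seconds of down time, whereas a bit-error event loses only a handful of bits.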
V. CONCLUSIONS
The SEE susceptibility of the Virtex-5QV FPGA MGT devices has been classified and quantified through heavy ion accelerator testing in a way that can assist designers in understanding the impact of energetic particles on an MGT-based communication system. The effective BER impact is demonstrated through analysis to be not very significant, assuming the communication protocol's physical-layer re-initialization and re-synchronization procedures actuate the appropriate recovery mechanisms when required. Further mitigation is possible through commonly used embedded error correction in the user data, or by making use of a protocol with "guaranteed delivery". SEEs observed on multiple channels must be taken into account when evaluating the SEE impact on channel-bonded and link-redundant systems.
VI. APPENDIX: SEE CROSS-SECTIONS
Cross-Section 1. Bit Error Events.
Special mention goes to the contributions made by XRTC members Boeing Space Systems, Los Alamos National Laboratories and Sandia National Laboratories, whose insights have been helpful, even instrumental, and also to SEAKR Engineering Inc., for contributing resources to assist in the production of ion masks and the collection of data.
