Abstract: Field Programmable Gate Arrays (FPGAs) are successful platforms for implementing communication systems, owing to their potentially high throughput and low development costs. In such systems, however, investigating resilience to soft errors is crucial in many scenarios, such as when facing stringent dependability constraints or when operating in radiation-harsh environments. Therefore, in this paper we present a fault injection platform that specifically targets FPGA-based communication systems. The platform runs partially on the device and partially on a host computer, aiming at both flexibility and high performance. A Reed-Solomon decoder is used as a case study to validate the platform, showing its ability to measure metrics that are relevant both to critical systems (e.g., fault sensitivity) and to communication systems in general (e.g., bit error rate).
I. INTRODUCTION
Field Programmable Gate Arrays (FPGAs) are reconfigurable devices widely used to implement communication systems, since they combine the advantages of reconfigurability with high-throughput processing of data streams.
In aerospace applications, such as communication satellites, failures occur frequently due to the increased flux of ionizing particles, which can induce single event upsets (SEUs). The configuration memory of SRAM-based FPGAs is especially sensitive to SEUs [1], making the increasing size and density of this memory in newer devices a major concern. An SEU in an FPGA can flip one or more configuration memory bits, which may modify the device functionality and generate incorrect results in an unpredictable way [2]. Therefore, the use of FPGAs in data communication applications that require high reliability must be carefully evaluated by studying the impact of configuration faults on system reliability and on other metrics relevant to such systems, so that effective, low-cost fault tolerance mechanisms can be developed.
Fault injection in the FPGA configuration memory, by flipping configuration bits, is a widely used method to emulate radiation-induced SEUs, to evaluate the resilience of computing systems and to validate fault tolerance mechanisms [3]-[7]. Compared to ground radiation testing, this approach offers important advantages, such as faster fault injection and greater controllability. Moreover, properly performed configuration fault injection has been validated against ground radiation testing [3]. Fault injection is therefore a complementary approach to radiation testing, allowing alternatives for improving the resilience of computing systems to be explored in early project stages.
In this work, we propose a novel fault injection platform to evaluate the impact of configuration faults on the quality metrics of communication systems, such as the bit error rate (BER) and the frame error rate (FER). As a case study, we evaluate a Reed-Solomon decoder; Reed-Solomon codes are a family of channel codes suitable for space applications [8].
This paper is structured as follows. Section II briefly presents and discusses related work on fault injection. Section III presents the platform structure in detail. Section IV presents experimental results on the platform performance and the resources it requires, as well as the validation of the platform. Section V presents the conclusions and indicates future work.
II. RELATED WORKS
Several works have proposed fault injection platforms to develop and evaluate fault tolerance techniques that protect critical systems from SEUs in the configuration memory of SRAM-based FPGAs.
The fault injector presented in [4] uses two FPGA boards. The circuit under test (CUT) is implemented in one board, while the other board performs the fault injection and applies input vectors generated by software. This platform injects each fault in 50μs.
The platforms presented in [5] and [6] use a single FPGA and offer fault injection times of around 10μs, by using the Internal Configuration Access Port (ICAP) available in Xilinx devices. The platform proposed in [5] uses a hardwired PowerPC microprocessor to perform fault injection through the ICAP, while the one proposed in [6] is built only from the most common elements that comprise an FPGA (LUTs, flip-flops, multiplexers and memory blocks), which is its main advantage. However, the performance of these platforms is limited by the overhead of the RS-232 link between the FPGA and the host computer that receives the results.
Compared to previous works, the fault injection platform herein proposed is optimized for communication systems in two main aspects: 1) it operates partially on an FPGA device and partially on a host computer, providing high flexibility, e.g., to easily generate various channel models and/or to measure different metrics that may be relevant for any particular communication system; 2) it attains high performance by using the ICAP to inject faults and a PCIe interface for communication between host and device. In the following section, we discuss the implementation of the platform and the partitioning of tasks between FPGA device and host computer, which was done to maximize injection throughput while maintaining ease of programmability.
III. FAULT INJECTION PLATFORM
The proposed fault injection platform consists of a hardware component on an FPGA and a software application on a host computer. Fig. 1 shows a simplified view of the platform structure. The following subsections describe its main components in more detail.
A. PCIe I/O Controller
The PCIe I/O controller handles the communication between the FPGA and the software application running on the host computer. The communication solution adopted in this work is the Xillybus system [10], a flexible and easy-to-implement PCIe solution for FPGAs. The Xillybus core interfaces with the PCIe bus through an IP core provided by the Xilinx Core Generator [11].
On the FPGA side, communication takes place over two asynchronous streams, through I/O operations on FIFO memories [12]. An 8-bit stream carries commands from the host computer to the system controller and returns the current state of the system to the host. A 32-bit stream carries input vectors from the host computer to the CUT I/O controller and returns the results generated by the CUT. High data throughput is achieved through extensive use of DMA (Direct Memory Access) buffers. Fig. 2 shows details of the PCIe I/O controller.
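As a minimal sketch of the host side of these streams, assuming the Xillybus device files behave like ordinary pipes (plain POSIX `read()`/`write()` on a file descriptor), the helpers below retry on short transfers, which a streaming device may legitimately return. The device file names in the comments (e.g., `/dev/xillybus_write_32`) are assumptions based on the Xillybus demo bundle, and `write_all`/`read_all` are hypothetical helpers, not part of the platform.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <unistd.h>

// Write exactly `len` bytes to `fd` (e.g., an fd opened on the assumed
// /dev/xillybus_write_32), retrying on short writes and on EINTR.
// Returns false on an unrecoverable error.
bool write_all(int fd, const void* buf, std::size_t len) {
    const auto* p = static_cast<const std::uint8_t*>(buf);
    while (len > 0) {
        ssize_t n = ::write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            return false;
        }
        p += n;
        len -= static_cast<std::size_t>(n);
    }
    return true;
}

// Read exactly `len` bytes from `fd` (e.g., the assumed /dev/xillybus_read_32).
// Returns false on error or if EOF arrives before `len` bytes were read.
bool read_all(int fd, void* buf, std::size_t len) {
    auto* p = static_cast<std::uint8_t*>(buf);
    while (len > 0) {
        ssize_t n = ::read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        if (n == 0) return false;  // premature EOF
        p += n;
        len -= static_cast<std::size_t>(n);
    }
    return true;
}
```

The same helpers serve both the 8-bit command stream and the 32-bit data stream; only the file descriptor changes.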
B. System Controller
The system controller requests fault injection and fault removal from the fault injector (SEU injector), and controls the execution of the CUT through the start and done signals. While the CUT is running, the system controller remains in a wait state until the CUT I/O controller reports that execution has finished. The system controller then requests the fault removal, restoring the correct state of the configuration memory. This cycle is repeated until the desired number of faults has been injected.
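The injection cycle above can be modeled in software as a small state machine. This is a hypothetical sketch, not the platform's VHDL: the hooks `inject`, `remove`, `start_cut` and `cut_done` stand in for the SEU injector and the start/done handshake, and the poll-count time-out is an added assumption (time-out outcomes are reported in Section IV, but the paper does not describe how they are detected).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical software model of the system controller's injection cycle.
enum class State { Inject, StartCut, WaitDone, Remove, Finished };

struct InjectionCampaign {
    std::uint64_t faults_target = 0;            // desired number of faults
    std::uint64_t timeout_polls = 1000;         // assumed time-out budget
    std::uint64_t timeouts = 0;                 // faults that hit the time-out
    std::function<void(std::uint64_t)> inject;  // flip fault k (SEU injector)
    std::function<void(std::uint64_t)> remove;  // restore fault k
    std::function<void()> start_cut;            // assert `start`
    std::function<bool()> cut_done;             // sample `done`

    // Runs the campaign; returns the number of completed inject/remove cycles.
    std::uint64_t run() {
        timeouts = 0;
        std::uint64_t k = 0, polls = 0;
        State s = faults_target > 0 ? State::Inject : State::Finished;
        while (s != State::Finished) {
            switch (s) {
            case State::Inject:
                inject(k);
                s = State::StartCut;
                break;
            case State::StartCut:
                start_cut();
                polls = 0;
                s = State::WaitDone;
                break;
            case State::WaitDone:  // wait state until the CUT reports done
                if (cut_done()) {
                    s = State::Remove;
                } else if (++polls >= timeout_polls) {
                    ++timeouts;  // assumed: give up and still restore the bit
                    s = State::Remove;
                }
                break;
            case State::Remove:  // restore the configuration memory
                remove(k);
                ++k;
                s = k < faults_target ? State::Inject : State::Finished;
                break;
            case State::Finished:
                break;
            }
        }
        return k;
    }
};
```

Removing the fault even after a time-out keeps the configuration memory consistent for the next injection, which is the property the cycle in the text relies on.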
The SEU injector is adapted from [6]. The ICAP provides controlled access to each configuration bit related to any region of the device. This makes it possible to define an area under test (AUT), in which the CUT is positioned through placement constraints. The fault injector can then flip only the bits that configure the AUT. This prevents fault injection into the injector itself or into other system components unrelated to the CUT, which would likely interrupt the injection campaign or generate invalid results.
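A sketch of AUT-restricted injection, assuming a Virtex-5 frame of 41 32-bit words and, for simplicity, a hypothetical AUT made of a contiguous range of linearly indexed frames (real frame addresses encode block type, row, column and minor fields). A bit is flipped by a read-modify-write of its frame, mimicking readback and reconfiguration through the ICAP; flipping the same bit again removes the fault.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed frame geometry: 41 32-bit words per frame (Virtex-5).
constexpr std::size_t kWordsPerFrame = 41;
constexpr std::size_t kBitsPerFrame = kWordsPerFrame * 32;  // 1312 bits
using Frame = std::array<std::uint32_t, kWordsPerFrame>;

// Hypothetical AUT: frames [first_frame, first_frame + frame_count).
struct Aut {
    std::size_t first_frame;
    std::size_t frame_count;
    std::size_t bit_count() const { return frame_count * kBitsPerFrame; }
};

// Flip the k-th bit *of the AUT* (0 <= k < aut.bit_count()). Because k is
// relative to the AUT, bits of the injector or of other modules cannot be
// addressed. Calling this twice with the same k removes the fault.
void flip_aut_bit(std::vector<Frame>& config, const Aut& aut, std::size_t k) {
    assert(k < aut.bit_count());
    const std::size_t frame = aut.first_frame + k / kBitsPerFrame;
    const std::size_t word = (k % kBitsPerFrame) / 32;
    const std::size_t bit = k % 32;
    Frame f = config.at(frame);          // models frame readback via the ICAP
    f[word] ^= std::uint32_t{1} << bit;  // invert exactly one bit
    config.at(frame) = f;                // models writing the frame back
}
```

An exhaustive campaign then simply iterates k from 0 to `aut.bit_count() - 1`.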
C. CUT I/O Controller
The CUT I/O controller receives the input vectors from the host computer and converts the raw stream from the PCIe controller to the specific width expected by the CUT. It also sends the CUT output results to the host computer. Thus, by adding hardware to the CUT I/O controller, the platform user can choose which information is transmitted to the software.
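The width conversion can be illustrated in software. The sketch below, with hypothetical function names, splits each 32-bit stream word into four 8-bit symbols (the symbol width of the decoder used in Section IV) and packs results back into words; the least-significant-byte-first order is an assumption that depends on the host and on the FIFO wiring.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split each 32-bit word into four 8-bit symbols, least significant byte
// first (assumed order).
std::vector<std::uint8_t> words_to_symbols(const std::vector<std::uint32_t>& words) {
    std::vector<std::uint8_t> symbols;
    symbols.reserve(words.size() * 4);
    for (std::uint32_t w : words)
        for (int i = 0; i < 4; ++i)
            symbols.push_back(static_cast<std::uint8_t>(w >> (8 * i)));
    return symbols;
}

// Inverse direction: pack CUT output symbols into 32-bit words,
// zero-padding a trailing partial word.
std::vector<std::uint32_t> symbols_to_words(const std::vector<std::uint8_t>& symbols) {
    std::vector<std::uint32_t> words((symbols.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < symbols.size(); ++i)
        words[i / 4] |= std::uint32_t{symbols[i]} << (8 * (i % 4));
    return words;
}
```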
D. Software Application
The software application communicates with the FPGA through I/O operations on device files in order to perform some basic functions:
• Control the hardware component by sending commands to the system controller and receiving the system state.
• Generate test vectors and send them to the CUT.
• Receive the results from the CUT and calculate any metric relevant to data communication systems that allows evaluating the impact of the injected faults on system reliability.
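As an example of such a metric, the sketch below accumulates BER and FER from the transmitted and received buffers by XOR-ing them and counting the differing bits. Treating each `update` call as one frame is an assumption made for illustration, and `BerCounter` is a hypothetical name.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BerCounter {
    std::uint64_t bit_errors = 0;
    std::uint64_t bits = 0;
    std::uint64_t frame_errors = 0;
    std::uint64_t frames = 0;

    // Compare one transmitted frame with the corresponding CUT output.
    void update(const std::vector<std::uint8_t>& sent,
                const std::vector<std::uint8_t>& received) {
        assert(sent.size() == received.size());
        std::uint64_t errs = 0;
        for (std::size_t i = 0; i < sent.size(); ++i)
            errs += std::bitset<8>(sent[i] ^ received[i]).count();  // differing bits
        bit_errors += errs;
        bits += 8 * sent.size();
        frame_errors += (errs != 0);  // a frame is wrong if any bit differs
        ++frames;
    }
    double ber() const { return bits ? static_cast<double>(bit_errors) / bits : 0.0; }
    double fer() const { return frames ? static_cast<double>(frame_errors) / frames : 0.0; }
};
```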
Additionally, the software uses a separate thread for each data stream to increase the throughput on the host computer. Fig. 3 shows the proposed strategy.
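A minimal sketch of the one-thread-per-stream strategy, using POSIX file descriptors as stand-ins for the device files (e.g., the assumed `/dev/xillybus_write_32` and `/dev/xillybus_read_32`). Running the writer and the reader concurrently keeps both directions busy; a single thread that writes everything before reading could stall once the device-side FIFOs and OS buffers fill up. `stream_through` is a hypothetical helper, and error handling is deliberately minimal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <unistd.h>
#include <vector>

// Send `tx` on write_fd while concurrently collecting the same number of
// bytes from read_fd (assumes the CUT returns one output byte per input byte).
std::vector<std::uint8_t> stream_through(int write_fd, int read_fd,
                                         const std::vector<std::uint8_t>& tx) {
    std::vector<std::uint8_t> rx(tx.size());
    std::thread writer([&] {
        std::size_t off = 0;
        while (off < tx.size()) {
            ssize_t n = ::write(write_fd, tx.data() + off, tx.size() - off);
            if (n <= 0) break;  // error: stop (a real tool would report it)
            off += static_cast<std::size_t>(n);
        }
    });
    std::thread reader([&] {
        std::size_t off = 0;
        while (off < rx.size()) {
            ssize_t n = ::read(read_fd, rx.data() + off, rx.size() - off);
            if (n <= 0) break;  // error or EOF: stop
            off += static_cast<std::size_t>(n);
        }
    });
    writer.join();
    reader.join();
    return rx;
}
```

With a pipe standing in for a loopback CUT, `stream_through(fds[1], fds[0], tx)` returns a copy of `tx`, even for buffers much larger than the pipe capacity.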
IV. RESULTS
In this section, we present the results obtained with the proposed fault injection platform. The hardware module was described in VHDL and synthesized for a Xilinx Virtex-5 FPGA on the XUPV5-LX110T development platform [13]. The software application was developed in C++, and communication with the FPGA used the available PCIe 1.0 x1 bus, which allows a theoretical transfer rate of up to 250MB/s. In the following subsections, we describe the resources occupied by the hardware module, the platform performance and its validation through experiments on a Reed-Solomon decoder. Table I shows the amount of resources used by the hardware components of the platform, as well as the fraction of the available device resources they occupy. Note that the CUT I/O controller includes the Reed-Solomon decoder used as a case study in this work. The low device occupation of the platform components is an important result, since it leaves resources available for the CUT.
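The 250MB/s figure follows from the PCIe 1.0 signaling rate of 2.5 GT/s per lane and its 8b/10b line encoding (8 data bits for every 10 transmitted bits), per direction:

$$R = 2.5\,\text{GT/s} \times \frac{8}{10} = 2\,\text{Gb/s} = \frac{2 \times 10^{9}}{8}\,\text{B/s} = 250\,\text{MB/s}$$

This is a raw rate; packet headers and link-layer traffic further reduce the payload throughput that can actually be achieved.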
A. Resource Occupation and Performance
The system controller, which includes the SEU injector and the ICAP, runs at 50MHz. At this clock speed, injecting and correcting each fault takes less than 10μs. The main performance concern is therefore the data throughput between the software application and the FPGA: a complex communication system may have millions of configuration bits, making an exhaustive fault injection campaign impractical if the data throughput is not high enough. The software application generates the input vectors with an xorshift+ pseudo-random number generator (PRNG) [14] and sends them through the 32-bit stream. The achieved data throughput was around 200MB/s, which was adequate for our experiments.
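For reference, a sketch of xorshift128+, one member of the xorshift+ family [14], is shown below. The paper does not state which variant or shift parameters were used; the shift triple (23, 18, 5) follows Vigna's published reference code and is an assumption here, as is the hypothetical `fill` helper that produces words for the 32-bit stream.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// xorshift128+ with the (23, 18, 5) shift triple (assumed variant).
class Xorshift128Plus {
public:
    // The state must not be all zero.
    Xorshift128Plus(std::uint64_t s0, std::uint64_t s1) : s_{s0, s1} {
        assert((s0 | s1) != 0);
    }
    std::uint64_t next() {
        std::uint64_t s1 = s_[0];
        const std::uint64_t s0 = s_[1];
        const std::uint64_t result = s0 + s1;  // output before the state update
        s_[0] = s0;
        s1 ^= s1 << 23;
        s_[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
        return result;
    }
    // Fill a buffer of 32-bit words for the 32-bit stream, keeping the upper
    // half of each output (the lowest bits of xorshift+ are the weakest).
    void fill(std::vector<std::uint32_t>& buf) {
        for (auto& w : buf) w = static_cast<std::uint32_t>(next() >> 32);
    }

private:
    std::uint64_t s_[2];
};
```

Keeping only the upper half costs one call per 32-bit word; emitting both halves would double the generation rate at some cost in statistical quality, a trade-off worth revisiting if the host-side generator ever becomes the bottleneck.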
B. Platform Validation
To validate the proposed platform, we performed experiments with a Reed-Solomon decoder. A detailed description of Reed-Solomon codes, with examples, can be found in [15]. As is typical for channel codes, the Reed-Solomon decoder is substantially more complex than the encoder. This work therefore focuses on the reliability of the decoder, which was implemented following the traditional decoder architecture described in [16], divided into three pipeline stages.
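A common three-stage split of such a decoder is syndrome computation, key equation solving (e.g., Berlekamp-Massey) and Chien search with Forney's algorithm; whether [16] divides the work exactly this way is an assumption. As an illustration of the first stage, the sketch below computes the 2t = 16 syndromes for the RS(255, 239) code used in the experiments. The field polynomial 0x11D, the generator alpha = 2 and the first consecutive root alpha^1 are assumptions, since the paper does not give these code parameters.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// GF(2^8) multiply, assuming the primitive polynomial
// x^8 + x^4 + x^3 + x^2 + 1 (0x11D).
std::uint8_t gf_mul(std::uint8_t a, std::uint8_t b) {
    std::uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        const bool carry = a & 0x80;
        a = static_cast<std::uint8_t>(a << 1);
        if (carry) a ^= 0x1D;  // reduce modulo 0x11D (x^8 term dropped)
        b >>= 1;
    }
    return p;
}

// Syndromes S_j = r(alpha^j), j = 1..2t, via Horner's rule. r[0] is the
// first received symbol (highest-degree coefficient). All-zero syndromes
// mean that no error was detected.
constexpr int kTwoT = 16;  // 2t = n - k = 255 - 239
std::array<std::uint8_t, kTwoT> syndromes(const std::vector<std::uint8_t>& r) {
    std::array<std::uint8_t, kTwoT> s{};
    std::uint8_t alpha_j = 1;
    for (int j = 0; j < kTwoT; ++j) {
        alpha_j = gf_mul(alpha_j, 2);  // alpha^(j+1)
        std::uint8_t acc = 0;
        for (std::uint8_t sym : r)
            acc = static_cast<std::uint8_t>(gf_mul(acc, alpha_j) ^ sym);
        s[j] = acc;
    }
    return s;
}
```

In hardware, each syndrome is typically a multiply-accumulate register updated once per incoming symbol, which matches the one-symbol-per-cycle streaming described next.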
For our experiments, we implemented an RS(255, 239) encoder/decoder with 8-bit symbols, running at 125MHz. One symbol is received and/or output per clock cycle, which yields a data throughput of up to 125MB/s. The encoder was initially intended to run in software, but due to poor performance it was moved to the FPGA, while being kept outside the AUT. The decoder was placed in the AUT with its main modules in separate regions, which allows evaluating how each module affects system reliability when its configuration bits are flipped. Fig. 4 shows this approach.
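The code parameters follow from the standard relations for Reed-Solomon codes over GF(2^m), with m = 8 bits per symbol:

$$n = 2^{m} - 1 = 255, \qquad n - k = 255 - 239 = 16, \qquad t = \left\lfloor \frac{n-k}{2} \right\rfloor = 8$$

so the decoder corrects up to 8 erroneous symbols per 255-symbol codeword. With one 8-bit symbol per clock cycle at 125MHz:

$$125 \times 10^{6}\,\text{symbols/s} \times 1\,\text{byte/symbol} = 125\,\text{MB/s}$$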
The software application generates the input vectors with a PRNG and calculates the BER of the system. For a statistically significant BER measurement, we send approximately 10MB for each injected fault. We performed an exhaustive fault injection campaign, i.e., every configuration bit of the AUT was flipped. The effects of configuration bit flips were evaluated in two cases: a perfect channel (no errors) and a noisy channel (errors within the correction capability of the code). In the noisy channel scenario, the number of corrupted symbols was defined by a PRNG. Table II shows the results.
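The noisy-channel scenario can be sketched as follows: each 255-symbol codeword receives between 1 and t = 8 symbol errors at distinct positions, so every codeword stays within the correction capability of RS(255, 239). The count range, the uniform distributions and the use of `std::mt19937_64` in place of the platform's PRNG are all assumptions, and `corrupt_codeword` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

constexpr std::size_t kN = 255;  // codeword length in symbols
constexpr int kT = 8;            // correctable symbol errors, (255 - 239) / 2

// Corrupt 1..kT distinct symbols with nonzero error values.
// Returns the number of corrupted symbols.
int corrupt_codeword(std::vector<std::uint8_t>& cw, std::mt19937_64& rng) {
    assert(cw.size() == kN);
    std::uniform_int_distribution<int> count_dist(1, kT);
    std::uniform_int_distribution<std::size_t> pos_dist(0, kN - 1);
    std::uniform_int_distribution<int> err_dist(1, 255);  // nonzero error value
    std::vector<bool> hit(kN, false);
    const int n_err = count_dist(rng);
    for (int e = 0; e < n_err;) {
        const std::size_t p = pos_dist(rng);
        if (hit[p]) continue;  // positions must be distinct
        hit[p] = true;
        cw[p] ^= static_cast<std::uint8_t>(err_dist(rng));
        ++e;
    }
    return n_err;
}
```

Because every corrupted codeword is correctable by construction, any residual bit error observed at the decoder output in this scenario can be attributed to the injected configuration fault.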
With a perfect channel, as expected, the majority of the configuration bits do not affect system operation. Among the critical bits, a small part introduces a BER of around 0.5, meaning that the output has become random noise, decoupled from the input. Another part causes a time-out condition by affecting the clock and control signals. The majority of the critical bits introduce a variable BER.
With a noisy channel, the number of critical bits increases. Moreover, almost all of the new critical bits introduce a variable BER. As can be seen, the Berlekamp module was the most affected component, becoming the most critical.
Fig. 4. Placement of the Reed-Solomon decoder modules.
In our implementation, the Key module is the only one without a signal that controls the sequence of operations through the system modules. Therefore, a failure in this module was expected not to cause a time-out.
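The outcome categories discussed above can be expressed as a simple per-fault classifier. This is a hypothetical sketch: the paper does not give the criteria used to build Table II, and the band of ±0.05 around BER = 0.5 used for "random output" is an assumed threshold.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

enum class Effect { NoEffect, TimeOut, RandomOutput, VariableBer };

// Classify one injected fault from its time-out flag and bit error counts.
Effect classify(bool timed_out, std::uint64_t bit_errors, std::uint64_t bits) {
    assert(bit_errors <= bits);
    if (timed_out) return Effect::TimeOut;         // CUT never signalled done
    if (bit_errors == 0) return Effect::NoEffect;  // fault had no visible effect
    const double ber = static_cast<double>(bit_errors) / static_cast<double>(bits);
    if (std::fabs(ber - 0.5) < 0.05) return Effect::RandomOutput;  // ~ noise
    return Effect::VariableBer;
}
```

Tallying these categories per decoder module, for each channel scenario, yields a breakdown in the spirit of Table II.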
V. CONCLUSION
In this work, we have presented a fault injection platform optimized for FPGA-based communication systems. The main goal was to provide a tool for evaluating the impact of the most critical FPGA faults (configuration memory faults) on communication metrics such as FER and BER. The tool partitions tasks between the FPGA device and a host computer, in order to maximize performance and to simplify modifying and extending the platform for different CUTs.
Future work includes extending the platform to support other FPGA families and using its injection results to guide the development of fault mitigation techniques for this class of applications.
