Introduction
Over the last decade, new manufacturing technologies have made the development of SRAM-based Field Programmable Gate Arrays (FPGAs) feasible. FPGAs have become very popular thanks to their ability to implement complex circuits in a short development time. They are used in applications such as signal processing, prototyping, and networking. However, an FPGA is vulnerable to radiation effects known as single event effects (SEEs) [1]. Although a single event upset (SEU), which is one type of SEE, does not cause physical damage to the chip, it is a critical issue for an FPGA. There are two reasons for this: an SEU flips bits in memory elements, and an FPGA consists of a large number of SRAM cells. In an FPGA, both the combinational and sequential logic are controlled by customizable SRAM cells, so an SEU can seriously affect the operation of the FPGA.
SEUs are classified into "transient errors" and "permanent errors" [2] [3]. A transient error occurs on a flip-flop (FF) or a latch and therefore only affects the output for an instant. A permanent error, on the other hand, occurs in configuration memory or storage memory and may cause faulty operation of the FPGA, as mentioned above. Permanent errors are more critical because they can only be corrected at the next load of the configuration bit-stream. SEUs are particularly significant in space environments because of the presence of very high doses of radiation. However, we also consider the effects of an SEU at ground level, because device shrinking coupled with voltage scaling significantly reduces the noise margins of transistors [4]. Current FPGAs are generally not considered reliable enough to be used in dependable systems such as automotive or infrastructure systems. One possible solution to this problem is to use radiation-hardened FPGAs, but these devices are very expensive. Therefore, reliable implementation techniques that can mitigate and repair these SEU effects are desirable.
The SEU sensitivity for a configuration memory can be mitigated using a hardware redundancy
Related Research
Many studies have been conducted on dependable implementation techniques such as time redundancy, dual modular redundancy (DMR), and triple modular redundancy (TMR) [1]. However, time redundancy cannot mitigate the effect of a permanent error, and DMR cannot identify which module is affected by an SEU. Hence, full reconfiguration is required to eliminate a fault from such systems. This is undesirable for a system in which continuous operation is required, because the full reconfiguration process stops and initializes the system. Consequently, we focus on the TMR and partial reconfiguration (PR) schemes.
The basic concept of the TMR scheme is to make a circuit robust against SEUs by implementing three copies of the same circuit and placing a majority voter on the outputs of the triplicated modules. The TMR scheme can recover from a transient error and mitigate a permanent error. However, these SEU effects must still be eliminated from the configuration memory because multiple soft errors cannot be mitigated by the TMR scheme alone. To prevent SEU accumulation, TMR is often coupled with reconfiguration techniques such as full reconfiguration, partial reconfiguration (PR), or scrubbing. Most previous studies have used TMR with scrubbing or dynamic partial reconfiguration (DPR). DPR hides the reconfiguration time and is effective for a combinational circuit [6] [7]. Scrubbing is a technique in which the entire content of the configuration memory is reloaded periodically [8] [9] [10]. However, these techniques cannot be applied directly to a sequential circuit because of its state information [10].
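As an illustration of the voting idea described above, the following is a minimal software sketch of a bitwise 2-of-3 majority voter together with a simple detector that reports which copy disagrees. The function names `majority_vote` and `find_faulty_module` are hypothetical and are not taken from the design in this paper; the actual voter and detector are hardware circuits.

```python
from typing import Optional


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of the three module outputs."""
    return (a & b) | (b & c) | (a & c)


def find_faulty_module(a: int, b: int, c: int) -> Optional[int]:
    """Return the index (0, 1, or 2) of the single module that disagrees
    with the voted output, or None if all modules agree or if more than
    one module disagrees (ambiguous case)."""
    voted = majority_vote(a, b, c)
    mismatches = [i for i, out in enumerate((a, b, c)) if out != voted]
    return mismatches[0] if len(mismatches) == 1 else None


# One bit of the third copy is flipped; the voter masks it.
voted_output = majority_vote(0b1010, 0b1010, 0b1110)
faulty_index = find_faulty_module(0b1010, 0b1010, 0b1110)
```

A single corrupted copy is outvoted by the other two, which is why a second fault in a different copy of the same group defeats the scheme.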
Pilotto [5] considered synchronization using a checkpoint state in the FSM. Once the reconfigured module reaches the checkpoint state, it is put on hold until the other modules reach the checkpoint state. This is suitable for simple circuits, but is not realistic for complicated systems. Some studies have considered the implementation of a reliable soft-core processor [11]. However, they only suggested an implementation method and did not focus on the recovery technique. Consequently, we have proposed a dependable design technique for a soft-core processor using TMR and PR with state synchronization.
Whatever hardening technique is adopted, designers need to evaluate the resulting system to prove [...] detector then outputs an error signal to an external pin, and the faulty MB is reconfigured immediately. While the faulty MB is reconfigured, the other MBs continue to run. Therefore, the reconfiguration process is performed on-the-fly. After reconfiguration, the interrupt process of the RTOS triggers the synchronization process. This is necessary because the registers of the reconfigured MB differ from those of the other MBs. The synchronization process is performed as described in the previous subsection. After these recovery processes, the SEU effect is removed from the faulty MB, and all MBs return to their normal state. The overhead of the recovery process is only the time required to store and restore the registers for synchronization because the reconfiguration is performed on-the-fly.
We verified the recovery processing using a 16802A 68-channel portable logic analyzer (Agilent Technologies) and observed the output of the detector. Table 2 shows the verification procedure. In Step 1, we inject the fault by editing a bit corresponding to the MB in the original bit-stream. In Steps 2, 4, and 6, the FPGA outputs the logical OR of the error signals of the detectors to the external pins. The signal is then checked using the logic analyzer.
Verification of the recovery sequence
In Step 4, the detector outputs an error signal after the reconfiguration because the values of the reconfigured MB registers differ from those of the other MBs. After the synchronization process, at Step 6, the error is not observed. The detector checks all outputs of the MBs and does not indicate any error. As a result, the proposed technique is found to be effective for recovering the soft-core processor from an SEU.
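The verification sequence can be mimicked with a small abstract model. The sketch below is not the hardware procedure: it represents each MB by a hypothetical list of register values, injects the fault directly into a register rather than into the bit-stream, and assumes that Steps 3 and 5 are the partial reconfiguration and the synchronization, respectively (the text only states what happens in Steps 1, 2, 4, and 6).

```python
def detector(regs):
    """Return True (error signal) when the three register files disagree."""
    return not (regs[0] == regs[1] == regs[2])


def run_recovery_demo():
    """Replay the assumed six-step sequence and log the detector output."""
    # Three copies of the same running state (hypothetical values).
    regs = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
    observed = []

    regs[0][1] ^= 0x4                # Step 1: inject a fault into MB 0
    observed.append(detector(regs))  # Step 2: error is observed

    regs[0] = [0, 0, 0]              # Step 3 (assumed): reconfiguration resets MB 0
    observed.append(detector(regs))  # Step 4: error persists, registers differ

    regs[0] = list(regs[1])          # Step 5 (assumed): synchronization restores state
    observed.append(detector(regs))  # Step 6: no error
    return observed


detector_log = run_recovery_demo()
```

The model only captures why the detector still fires after reconfiguration: the reconfigured copy is fault-free but holds a different register state until synchronization completes.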
Dependability Estimation Method
This section presents the reliability estimation technique for the triplicated circuit and the ECC integrated memory. The reliability is represented by the failure in time (FIT) criterion. The FIT is the number of system errors per billion hours of use. In this work, we assume the following conditions:
- A system error is only induced by SEU accumulation, and is recovered by full reconfiguration
- Only the SEUs in the configuration memory and the Block RAM (BRAM) are considered
- SEUs never occur on the same memory bit
- Each configuration memory bit has the same SEU probability
- Each BRAM memory bit has the same SEU probability
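To make the FIT unit concrete, the short sketch below converts a FIT value into a mean time between failures and an expected number of failures per year. It relies only on the definition above (failures per 10^9 hours); the helper names and the 8,760-hour year are our own choices for illustration.

```python
HOURS_PER_BILLION = 1e9   # FIT is defined per 10^9 device-hours
HOURS_PER_YEAR = 8760.0   # 365 days, an assumption made for this example


def fit_to_mtbf_hours(fit: float) -> float:
    """Mean time between failures, in hours, for a given FIT value."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit


def fit_to_failures_per_year(fit: float) -> float:
    """Expected number of failures per year of continuous operation."""
    return fit * HOURS_PER_YEAR / HOURS_PER_BILLION


# A system rated at 1,000 FIT fails on average once per million hours.
mtbf_hours = fit_to_mtbf_hours(1000.0)
```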
Our estimation technique calculates the FIT of the system from each module's bit-counts and FIT/bit. We take the SEU information from the Xilinx user guide [20]. An outline of this technique is presented in [17]. We enhanced the previously proposed estimation technique and revised its presentation to make it clearer.
Figure 5. Example of the SEU occurrence
Figure 5 shows an example of the SEU occurrence in the TMR scheme. In Figure 5(a), the SEUs are mitigated by the TMR scheme because they occur in different groups of modules, and both are mitigated by the voters. Moreover, the triplicated circuit can mitigate two or more SEUs if they occur on the same module. For example, if both the first and second SEUs occur on module A, they can be mitigated because the other modules are not damaged by an SEU. On the other hand, the SEUs induce a system-error when they occur on two or more of the triplicated modules, as in Figure 5(b). We present Eqs. 4 and 5 to calculate the FIT of the triplicated circuit. Eq. 4 calculates the system-error probability of the group 't', which is one of the groups of triplicated circuits. The calculation process of Eq. 4 is as follows:
1. Enumerate the triplicated module groups. For example, modules A, A', and A'' form Group '1', and modules B, B', and B'' form Group '2'. The group numbers correspond to 'j' in Table 3.
2. Choose the target Group 't' from the module groups.
3. Calculate the SEU mitigation probability using conditional probability. This is calculated in the middle part of Eq. 4. We assume that the first n − 1 SEUs are mitigated, and that at least one SEU occurs in the target group. Moreover, all 'p' combinations must be considered in this part.
4. Multiply the SEU mitigation probability by the probability of the nth SEU. If the nth SEU occurs on another module within the target group, 2 of the 3 modules have SEU damage. The TMR cannot mitigate them, and the nth SEU induces a system-error.
5. Multiply the calculated system-error probability by 3. This accounts for the combinations of modules that are hit by the SEUs.
Eq. 5 iterates Eq. 4 and adds up the system-error probabilities to calculate the total FIT of the triplicated modules. In Eq. 5, the term for the target group is divided by 'n'. This is because the first n − 1 SEUs are mitigated, and therefore only the nth SEU induces a system-error. Consequently, the number of system-errors is decreased to at least 1/n.
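The conditional-probability argument in steps 3 and 4 can be illustrated on a deliberately simplified case. The sketch below is not Eq. 4: it assumes a single TMR group with three equally sized modules and treats every SEU as equally likely to hit any of the three modules. Under these assumptions, the nth SEU is the first one to cause a system-error when SEUs 2 to n − 1 hit the same module as the first SEU and the nth SEU hits one of the other two modules. The function name is hypothetical.

```python
from fractions import Fraction


def first_failure_probability(n: int) -> Fraction:
    """Probability that the n-th SEU is the first to damage a second
    module of a single TMR group with three equally sized modules."""
    if n < 2:
        return Fraction(0)  # one SEU alone is always masked by the voter
    same_module = Fraction(1, 3)   # an SEU hits the already damaged module
    other_module = Fraction(2, 3)  # an SEU hits one of the two intact modules
    # SEUs 2..n-1 stay on the damaged module, then the n-th SEU leaves it.
    return same_module ** (n - 2) * other_module


# Cumulative probability of a system-error within the first 15 SEUs.
cumulative_15 = sum(first_failure_probability(n) for n in range(2, 16))
```

With several groups and unequal bit-counts the per-SEU probabilities are no longer 1/3 and 2/3, which is what the bit-count terms in Eqs. 4 and 5 account for.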
BRAM Memory
Eq. 6 is the FIT calculation formula for the ECC integrated memory. We assume that the ECC algorithm is a Hamming code. Therefore, the ECC integrated memory can mitigate one SEU per word. The basic concept of this equation is the same as that of Eqs. 4 and 5. The latter part of Eq. 6 calculates the failure probability of the ECC integrated memory. The ratio term calculates the SEU mitigation probability when n − 1 SEUs have occurred. Its denominator is the total number of patterns of SEU positions. Its numerator is the number of permutations of memory words that are hit by an SEU, where all of the n − 1 SEUs occur in different words. The product term represents the number of bits that are vulnerable to the nth SEU, because n − 1 words were already hit by an SEU. The system-error probability is then calculated from these components. In the first part of Eq. 6, FIT_ECC_BRAM is calculated for each 'n', and these results are added up.
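The counting argument behind the ECC case can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Eq. 6: the memory is taken to have `words` words of `bits_per_word` bits each (both hypothetical parameter names), every untouched bit is equally likely to be hit, and no bit is hit twice, in line with the assumptions listed earlier. The first n − 1 SEUs must land in distinct words, and the nth SEU fails the system if it lands on a remaining bit of an already hit word.

```python
from fractions import Fraction


def ecc_first_failure_probability(n: int, words: int, bits_per_word: int) -> Fraction:
    """Probability that the n-th SEU produces the first double-bit error
    in a word of a single-error-correcting memory."""
    if bits_per_word < 2:
        raise ValueError("a word needs at least 2 bits to take a double error")
    if n < 2 or n - 1 > words:
        return Fraction(0)
    total_bits = words * bits_per_word
    p_mitigated = Fraction(1)
    for k in range(n - 1):
        # k words are already hit; the next SEU must choose an untouched word.
        p_mitigated *= Fraction((words - k) * bits_per_word, total_bits - k)
    # Each of the n-1 hit words has bits_per_word-1 bits left to corrupt.
    vulnerable_bits = (n - 1) * (bits_per_word - 1)
    return p_mitigated * Fraction(vulnerable_bits, total_bits - (n - 1))


# Tiny memory of 2 words x 2 bits, small enough to check by hand.
p2 = ecc_first_failure_probability(2, words=2, bits_per_word=2)
p3 = ecc_first_failure_probability(3, words=2, bits_per_word=2)
```

In the 2 x 2 example the second SEU hits the already damaged word with probability 1/3, and otherwise the third SEU necessarily does, so the two probabilities sum to 1.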
Eq. 7 shows the reliability estimation formula for the triplicated BRAM memories. The outputs of these memories are decided by majority voting for each bit. This equation is based on the same concept as Eq. 6.
Bit Count Calculation
The bit-counts of each module are significant in our formulas. We targeted a system constructed using a Xilinx FPGA and EDA tools. Although the BRAM bit-counts can be calculated from the number of BRAM blocks, the counts for the configuration memories cannot be obtained from the Xilinx Synthesis Technology (XST) reports. Therefore, a method for calculating the bit-counts is necessary.
The Xilinx placement and routing (PAR) report only shows the slice usage of the entire system. It does not show details such as the bit-counts of routing memories, look-up tables (LUTs), and multiplexers. The bit-counts of each module are not supplied either. Therefore, we calculate the bit-counts of the system from the total amount of configuration memory and the LUT usage rate reported at the mapping phase. We consider that bit-counts calculated from the slices may be less accurate than those calculated from the LUTs, because a slice consists of 2 LUTs and both LUTs are not always used. Eq. 8 shows the bit-count calculation formula for the entire system.
$$\text{Bit-count}_{\text{system}} = \text{LUT usage rate}\,[\%] \times TB \qquad (8)$$
TB represents the total bit-counts in the FPGA device, excluding the BRAM. We assume that the usage rate of the configuration memories, which include the routing and multiplexer memories, is nearly equal to the LUT usage rate. This means that if the LUT usage rate is 20 %, then 20 % of the entire configuration memory is used. To calculate the reliability of the TMR circuit, the bit-counts of each module are required. Eq. 9 shows the bit-counts calculation formula for a module.
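The two bit-count formulas can be sketched numerically. Eq. 8 is implemented as described above. The form used for Eq. 9 is an assumption on our part: it scales the system bit-count by the ratio of the module's LUT usage rate to that of the entire design, which matches the two inputs the text names for Eq. 9. All numbers below are hypothetical, and the function names are ours.

```python
def system_bit_count(total_bits: float, lut_usage_percent: float) -> float:
    """Eq. 8: configuration bits used by the whole design, assuming the
    configuration-memory usage rate equals the LUT usage rate."""
    return total_bits * lut_usage_percent / 100.0


def module_bit_count(system_bits: float,
                     module_lut_rate: float,
                     entire_lut_rate: float) -> float:
    """Assumed form of Eq. 9: share of the system bits attributed to one
    module, in proportion to its LUT usage rate at logical synthesis."""
    if entire_lut_rate <= 0:
        raise ValueError("the entire design must use some LUTs")
    return system_bits * module_lut_rate / entire_lut_rate


# Hypothetical device with 1,000,000 configuration bits and 20 % LUT usage,
# of which one module accounts for 5 percentage points.
system_bits = system_bit_count(1_000_000, 20.0)
module_bits = module_bit_count(system_bits, 5.0, 20.0)
```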
The LUT usage rate of each module is not reported at the technology-mapping phase, so we take it from the logical-synthesis report. In Eq. 9, both "LUT Usage Rate/module" and "LUT Usage Rate/entire" are taken from the logical-synthesis reports. We consider that the LUT usage rate of each module may not change significantly through the implementation flow.
System Evaluation
Design tools: Embedded Development Kit (EDK) 9.1i; Integrated Software Environment (ISE) 9.1.02i_PR10; PlanAhead 9.1.4
Evaluation tools: Placement and routing (PAR) 9.1.02i_PR10; Timing analyzer 9.1.02i_PR10
Table 4 shows the system implementation environment, design tools, and evaluation tools. We evaluated the implementation results, recovery time, and dependability. To evaluate the implementation results, we used a placement-and-routing (PAR) tool and a timing analyzer. These are applications of the Integrated Software Environment 9.1.02i_PR10 (ISE 9.1.02i_PR10). Their input was the design file created after placement and routing.
Firstly, we show the implementation results such as resource usage and operating frequency. To analyze the resource usage, we created a placement and routing report using the PAR tool. The critical path delay was analyzed using the timing analyzer, and the overall operating frequency was calculated from this critical path delay. Secondly, we evaluated the execution time of the proposed recovery technique and compared it to the execution time for full reconfiguration. The time of the proposed recovery technique was evaluated using the timer function of the TOPPERS/JSP kernel. Finally, we estimated the FIT of the proposed system using Eqs. 1-9.
Table 5 shows the system implementation result. In the table, "base" refers to the base processor structure shown in Figure 1, and "Basic TMR" refers to the TMR structure in which all modules were triplicated and ECC was not implemented. In the "Basic TMR", the BRAM was triplicated to ensure reliability, and the voters placed between the BRAM and Mem_cntlr were also triplicated. "PR" and "ECC" refer to partial reconfiguration and ECC, respectively. The only difference between the "Basic TMR" and "TMR + ECC" is the configuration of the BRAM. The configuration of the proposed system is shown in Figure 2.
1) Resource Utilization: The "Basic TMR" requires about 700 slices in addition to 3 times the slices of the "base" because of the inclusion of the voters and the detectors. They occupy about 1,000 slices at the logical-synthesis phase, and are optimized at a later phase together with other modules. For example, although the voters between the BRAM and memory-controller are designed to deal with all signals, some signals, such as part of the address signals, are not used. The "TMR + ECC" requires 575 more slices than the "Basic TMR," an increase of about 10 %.
The voters between the BRAM and memory-controller are not triplicated in the "TMR + ECC," but the area of the ECC is larger than that of the triplicated voters. The voters around the data port of the BRAM occupy about 300 slices in the "Basic TMR" and about 120 slices in the "TMR + ECC". However, an ECC circuit requires about 420 slices. The ECC circuit and voters are implemented for both the instruction and data ports of the BRAM. Therefore, the ECC circuits require 840 slices, and the estimated area difference between the "Basic TMR" and "TMR + ECC" is about 480 slices. This estimate is roughly consistent with the implementation results.
Implementation results
Applying PR to the "TMR + ECC" requires about 1,600 additional slices. This increase in resources is caused by the bus-macros and by restricted area optimization. The bus-macro is the interface between the PRR and the static region. To implement an MB in the PRR, 50 bus-macros are needed. Since each bus-macro consists of 8 slices, a total of 400 slices are used for each MB. Thus, the total number of slices for the bus-macros is 1,200 in the proposed system. Area optimization is restricted because the PRR is dedicated to partial reconfiguration. The PRR is placed at design time and is not moved by a design tool such as the PAR. Moreover, the module in the PRR is separated from other modules, so area optimization such as resource sharing with other modules is not performed. Owing to the TMR structure and partial reconfiguration, the proposed system requires 8,124 slices, which is about 4.67 and 1.38 times the number of slices in the "base" and "Basic TMR", respectively. The proposed system can be used on medium-sized FPGAs such as the XC4VFX60. In addition, the proposed system can be implemented on a Xilinx Artix-7 FPGA, which is the lowest-cost model in the 7-series FPGAs [21]. We thus confirm that the proposed system can be put to practical use.
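The slice accounting in this subsection can be rechecked with a few lines of arithmetic. The sketch below uses only the approximate per-component figures quoted in the text (bus-macros, voters, and ECC circuits); the variable names are ours, and the results are estimates rather than entries of Table 5.

```python
# Bus-macro overhead of partial reconfiguration.
BUS_MACROS_PER_MB = 50
SLICES_PER_BUS_MACRO = 8
NUM_MBS = 3  # the MB is triplicated

bus_macro_slices_per_mb = BUS_MACROS_PER_MB * SLICES_PER_BUS_MACRO
bus_macro_slices_total = bus_macro_slices_per_mb * NUM_MBS

# Voter and ECC area per BRAM port (approximate values from the text).
VOTER_SLICES_BASIC_TMR = 300
VOTER_SLICES_TMR_ECC = 120
ECC_SLICES = 420
NUM_PORTS = 2  # instruction port and data port

per_port_difference = (VOTER_SLICES_TMR_ECC + ECC_SLICES) - VOTER_SLICES_BASIC_TMR
estimated_area_difference = per_port_difference * NUM_PORTS
```

The bus-macros account for 1,200 of the roughly 1,600 additional slices, which leaves about 400 slices attributable to the restricted area optimization.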
2) Operating frequency: The evaluated systems satisfied a clock net timing constraint of 20 ns, which corresponds to a frequency of 50 MHz. Table 5 shows that the operating frequency of the "Basic TMR" decreased to about 70.3 % of the operating frequency of the base processor. The critical path of the "Basic TMR" is the path from the memory-controller (the ready signal for the local memory bus) to the BRAM via the MB. The optimization of this path is difficult for the following two reasons: the signals that access the BRAM are gathered because they have to go through the voter, and the BRAM blocks are hard-macros, which cannot be changed by the designer, and which are placed discretely in the FPGA. These make it difficult to optimize the routing delay of the "Basic TMR".
In the case of the "TMR + ECC," the operating frequency decreased to about 63.9% of the operating frequency of the base processor. The critical path of the "TMR + ECC" is from the BRAM to the ECC-RAM. Since this path goes through the ECC decoder and encoder, the signals must pass through many XOR gates in order to determine the ECC code word. Moreover, it is difficult to optimize the routing delay since both the source and destination are hard macros. However, this critical path does not overlap that of the "Basic TMR." Consequently, the "TMR + ECC" has only a small degradation in performance.
The "TMR + ECC + PR" can operate at 57.9 % and 82.4 % of the operating frequency of the "base" and "Basic TMR", respectively. The critical path of the "TMR + ECC + PR" is also the path from the memory-controller to the BRAM via the MB. The critical path becomes longer because of the constraint of the PRR. In the EA PR design flow, the PRRs are designed before the mapping phase, and their placements are not changed in the subsequent design phases. The MBs that are placed in the PRRs cannot be moved at the placement and routing phase. This adds path delay to the signals between the MBs and BRAMs. On the other hand, the difference in the operating frequency between the "TMR + ECC + PR" and "TMR + ECC" is small. This is because the increase in delay due to the ECC implementation and that due to the implementation of the PRR do not affect each other. Consequently, the proposed system can decrease the BRAM usage with small performance degradation and sufficient reliability when compared to the "Basic TMR" scheme.
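The frequency figures quoted in this subsection are mutually consistent, as the following check shows. It uses only the percentages and the 20 ns constraint given in the text; the helper names are hypothetical.

```python
def period_ns_to_mhz(period_ns: float) -> float:
    """Clock frequency in MHz for a clock period given in nanoseconds."""
    return 1000.0 / period_ns


def relative_percent(value: float, reference: float) -> float:
    """Express `value` as a percentage of `reference`."""
    return 100.0 * value / reference


# Operating frequency relative to the "base" processor, in percent.
BASIC_TMR = 70.3
TMR_ECC = 63.9
TMR_ECC_PR = 57.9

constraint_mhz = period_ns_to_mhz(20.0)
pr_vs_basic_tmr = round(relative_percent(TMR_ECC_PR, BASIC_TMR), 1)
pr_vs_tmr_ecc = round(relative_percent(TMR_ECC_PR, TMR_ECC), 1)
```

The second ratio, about 90.6 %, quantifies the statement that adding PR to the "TMR + ECC" costs only a small further reduction in frequency.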
Evaluation of recovery time
We evaluated the performance of the fault recovery process in the case of the proposed system operating at 50 MHz. We compared the time required for the proposed recovery technique with that required for recovery by full reconfiguration.
The time required for recovery in the proposed system is equivalent to the synchronization process time, because the proposed system hides the time required for partial reconfiguration in the TMR scheme. We evaluated the synchronization process time using the system time reference function of the JSP kernel. The synchronization process time obtained was 6 μs, which indicates that the proposed technique can be adopted if a system allows a recovery time of 6 μs.
On the other hand, the time required for full reconfiguration depends heavily on the size of the configuration data and the data transfer rate. The FPGA used in this evaluation is a Virtex-4 XC4VFX60-FF1152 with 2,625,439 bytes (= 21,003,512 bits) of configuration data, and the default transfer rate of the Platform Cable USB is 6 Mbit/s. Consequently, the time required for full reconfiguration is 3,500.585 ms. At the minimum transfer rate (0.75 Mbit/s) or maximum transfer rate (24 Mbit/s) of the Platform Cable USB, the time required is 28,004.683 ms or 875.146 ms, respectively. When we use SelectMAP to perform parallel reconfiguration, its maximum transfer rate is 60 MByte/s, and the time required for full reconfiguration is 43.757 ms. All of these recovery times are significantly longer than that of the proposed technique. Moreover, in order to restart the system from the point at which full reconfiguration started, additional time is required to store and load the registers. Hence, the proposed recovery technique reduces the recovery time when compared to full reconfiguration.
Table 6 shows the result of the dependability estimates. In this work, we assigned N=15 to the proposed equations. The FITs of the configuration memory and BRAM were 263 FIT/Mbit and 484 FIT/Mbit, respectively [20]. Moreover, the SEUs on the MB were not considered when evaluating the proposed system, because they are eliminated by the proposed recovery technique. The "sensitive" in Table 6 means the sensitive part of a dependable system, such as the voter and the ECC-RAM's controller.
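The full-reconfiguration times reported in this subsection follow directly from the configuration size and the transfer rates. The sketch below reproduces them, assuming that 1 Mbit/s means 10^6 bits per second and 1 MByte/s means 10^6 bytes per second, which is the interpretation that matches the reported values. The function name is ours.

```python
CONFIG_BYTES = 2_625_439
CONFIG_BITS = CONFIG_BYTES * 8  # 21,003,512 bits


def reconfiguration_time_ms(bits: int, rate_bits_per_s: float) -> float:
    """Time in milliseconds to transfer `bits` at the given bit rate."""
    return bits / rate_bits_per_s * 1000.0


usb_default_ms = reconfiguration_time_ms(CONFIG_BITS, 6e6)
usb_min_ms = reconfiguration_time_ms(CONFIG_BITS, 0.75e6)
usb_max_ms = reconfiguration_time_ms(CONFIG_BITS, 24e6)
selectmap_ms = reconfiguration_time_ms(CONFIG_BITS, 60e6 * 8)  # 60 MByte/s

# Ratio of the fastest full reconfiguration to the 6 us synchronization.
SYNC_TIME_MS = 6e-3
speedup_vs_selectmap = selectmap_ms / SYNC_TIME_MS
```

Even against SelectMAP at its maximum rate, the 6 μs synchronization is shorter by a factor of several thousand.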
Dependability Evaluation
FIT_BRAM is small in the dependable designs because the ECC and TMR can mitigate an SEU. The BRAM had 16,384 words in the "TMR + ECC", and the "Basic TMR" had 524,288 groups of triplicated memory bits. Therefore, in most cases the SEUs did not accumulate on the same word or the same group.
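The claim that SEUs rarely accumulate on the same word or group can be checked with a birthday-problem estimate. The sketch below assumes that each of the N = 15 SEUs is equally likely to hit any word (or any triplicated bit group) and ignores the exclusion of repeated hits on the same bit; it is an illustration, not part of Eqs. 6 and 7. The function name is hypothetical.

```python
def prob_all_distinct(num_seus: int, num_groups: int) -> float:
    """Probability that `num_seus` uniformly placed SEUs all land in
    different groups (words or triplicated bit groups)."""
    p = 1.0
    for k in range(num_seus):
        # k groups are already hit; the next SEU must avoid all of them.
        p *= (num_groups - k) / num_groups
    return p


N_SEUS = 15
p_distinct_ecc_words = prob_all_distinct(N_SEUS, 16_384)
p_distinct_tmr_groups = prob_all_distinct(N_SEUS, 524_288)
```

Under these assumptions, more than 99 % of the 15-SEU patterns leave every ECC word with at most one upset, and the fraction is higher still for the triplicated bit groups.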
Although the "TMR + ECC" has more LUTs, its FIT_TMR is smaller than that of the "Basic TMR". The "TMR + ECC" integrates the triplicated ECC circuits, and therefore the SEU probability may increase. However, the SEU mitigation rate may also increase, because the SEUs may be distributed over separate TMR groups. We consider that this increase in the SEU mitigation capability may improve FIT_TMR.
As shown in Table 6, the proposed system reduces the FIT to 57.6 % of that of the "TMR + ECC" and achieves about 13 times the reliability of the "base". There are two reasons for the reduction in the FIT: the SEU on the MB can be recovered using PR, and the area of the MB is larger than that of the other modules. Although the SEU probability for the MB is high due to its area, the SEU effect on the MB is eliminated by PR. In addition, the FITs of the other modules are very small compared with that of the MB. As a result, the FIT of the proposed system becomes smaller. However, the proposed system only reduces the FIT to 76.3 % of that of the "Basic TMR", because FIT_sensitive of the "TMR + ECC + PR" is significantly larger than that of the "Basic TMR". If the system has enough BRAM resources, FIT_system can be improved further by applying the proposed recovery technique to the "Basic TMR". If not, an SEU handling technique for the sensitive parts must be considered.
Conclusion
In this paper, we have presented a technique for ensuring reliable soft-core processor implementation on FPGAs. This technique involves the use of TMR schemes coupled with partial reconfiguration (PR). TMR and PR can eliminate the effects of SEUs from the configuration memory of the FPGA, but by themselves they cannot recover the states of the sequential circuits. Our technique resolves this problem by performing a synchronization process after PR using an interrupt process. To validate the impact of the proposed system, we constructed a reliability evaluation technique based on a mathematical model. The reliability of the system is represented by failure in time (FIT). Our evaluation technique handles the SEU mitigation capability of the TMR and ECC by considering conditional probability.
Owing to these schemes, the proposed system requires about 1.38 times the resources of the "Basic TMR" structure and operates at 82.4 % of its maximum operating frequency. However, the proposed system can recover a faulty soft-core processor on-the-fly. Furthermore, the recovered MB becomes synchronized with the other soft-core processors after a recovery time of 6 μs. In conclusion, a soft-core processor can recover from an SEU by dynamic partial reconfiguration and synchronization. As a result, we reduced the FIT of the system to 76.3 % of that of the "Basic TMR" and 57.6 % of that of the "TMR + ECC."
