Abstract
This paper presents an approach for increasing the lifetime of systems implemented on SRAM-based FPGAs, by introducing fault tolerance properties that enable the system to autonomously manage the occurrence of both transient and permanent faults. On the basis of the foreseen mission time and application environment, the designer is supported in the implementation of a system able to reconfigure itself, either by reloading the correct configuration in case of transient faults, or by relocating part of the functionality in the presence of permanent faults. The result is a system implementation offering good performance and correct functionality even when faults occur. The proposed approach is evaluated in a case study to highlight the overall characteristics of the final implementation.
Introduction
The adoption of SRAM-based Field Programmable Gate Arrays (FPGAs) for the implementation of embedded systems targeted at long-term missions is today quite common, even in critical application scenarios, because of their limited cost and the flexibility provided by remote reconfiguration. This capability is attractive for space applications, where updates and upgrades to the functionality and/or protocols can be performed on-line, remotely, as discussed for example in [8]. The main obstacle to a straightforward adoption of this technology in such scenarios is the high susceptibility of SRAM elements to radiation-induced effects in harsh environments such as space. For this reason, in the last decade a number of studies have focused on the design of robust systems implemented on SRAM-based FPGAs, able to deal with transient soft errors induced by radiation, such as Single Event Upsets (SEUs) and Multiple Cell Upsets (MCUs) (e.g., [5, 7, 12, 13, 19]).
Nevertheless, as highlighted in 2003 by the International Technology Roadmap for Semiconductors [10], permanent faults have an increasing impact on the overall system reliability, because of smaller device and wire dimensions combined with higher operating temperatures. Furthermore, when the mission is expected to last several years, aging and wear-out effects seriously affect the device lifetime, a severe issue for current and near-future space missions. As a result, systems able to self-recover by autonomously tolerating the effects of both transient and permanent faults are of great interest, because they increase the overall lifetime and availability of the system.
Few studies in the literature deal with permanent failures occurring in FPGAs, especially when considering up-to-date fault models, and offer a solution for the design of systems able to autonomously recover from faults and remain operational [11, 16, 17, 18]. The common idea at the basis of these solutions is to mitigate the effects of permanent faults by relocating the portion of the system affected by the problem onto a spare area. In order to do so, a different configuration, called bitstream, needs to be (pre-)computed and loaded on the FPGA, placing the components in different positions. Most of the work presented in the literature focuses on the computation of such bitstreams, to be performed either off-line, at design time, or on-line. However, no effort is devoted to the design of the hardened system: the possibility to explore different granularity levels in the application of the hardening strategy is not taken into account, and a well-known technique (such as Triple Modular Redundancy, TMR) is applied in a straightforward way. More precisely, the authors do not relate the different alternatives for hardening the system to the reconfiguration costs (in terms of time, number and size of the bitstreams, and number of permanent faults that can be tolerated).
In this paper we present a methodology and enhanced design flow for the design and implementation of fault tolerant systems on SRAM-based FPGAs for the space scenario. The final system can mitigate both transient and permanent faults, so increasing the system lifetime, and the optimization of the solution can be pursued by using two different figures of merit. The work is an improvement with respect to the preliminary approach presented in [3] and, in detail, the main contributions are the following ones:
-an approach for designing hardened systems onto SRAM-based FPGAs, that selects the most convenient solution in terms of applying fault detection and tolerance with respect to the reconfiguration and relocation requirements, driven by two possible designer's inputs: the platform to be adopted or the number of permanent faults to be tolerated (in relation with the expected system lifetime), and
-an automated, enhanced design flow to implement not only the hardened payload (i.e., optimized system hardening and floorplanning), but also the remaining elements of the final system (e.g., bitstream computation).
The rest of the paper is organized as follows. The next section presents the past approaches dealing with the implementation of fault-tolerant FPGA systems, in particular able to deal with permanent faults, and discusses their limitations in order to identify the gap the proposed approach aims at filling. The considered system's architecture is introduced in Section 3, where the architectural platform and the adopted fault detection, mitigation and recovery strategies are discussed. The proposed design flow is described in Section 4, detailing the two supported optimization processes. Then, in Section 5, the proposed approach is applied to a real case study to show its effectiveness, and, finally, Section 6 draws the conclusions.
Related Work
The idea of exploiting (partial) reconfiguration to deal with faults affecting a system implemented on an SRAM-based FPGA is not new, and several studies have been performed in the past (e.g., [1, 5, 13]). In the recent past, the attention has been focused mainly on transient faults, i.e., Single Event Upsets (SEUs) and/or Multiple Cell Upsets (MCUs), soft errors for which a recovery action can be planned. In particular, once a fault has been detected, it is possible to trigger the reloading of the configuration memory content (called bitstream) to restore the original, correct functionality [1].
As an alternative, rather than performing the reconfiguration "on-demand", it is possible to schedule periodical refresh of the configuration, or scrubbing, to correct any possibly occurred problem [5, 13] . In both cases, the loaded configuration is the one that specifies the same functionality and the same distribution on the reconfigurable fabric, as before the occurrence of the transient fault.
When the nature of the detected fault is such that a portion of the fabric becomes unusable, possibly due to electromigration or other aging effects, a different configuration needs to be loaded, avoiding the permanently damaged area, also referred to as local permanent damage (LPD) [9]. This recovery action requires the problem to be identified as permanent, and the corrupted area to be delimited and localized. Provided these two pieces of information are available, a few works in the literature propose a strategy to mitigate the effects of such permanent faults. They can be classified into two categories, based on the strategy adopted to define the necessary alternative bitstreams:
-off-line bitstream computation, that creates the alternative configurations during the design phase, as in [15, 16, 18, 22], and
-on-line bitstream computation, that dynamically creates the recovery configurations, as in [9, 11, 17].
The former strategy requires the storage of all the bitstreams that might be needed during the system's lifetime; thus a large memory is required a priori, sized according to the number of permanent faults that can be tolerated and to the fault localization strategy. Moreover, the relocation of a portion of the system functionality into a spare portion of the FPGA depends on the possibility to localize and isolate the faulty portion of the device. Lach et al. [15] proposed a "tile-based" approach for reconfiguring an FPGA design to avoid permanently damaged areas. The strategy tolerates a limited number of faults per tile, and requires a memory holding precompiled configurations and a robust fault location mechanism to localize the problem. The repair mechanism requires tiles to be off-line during reconfiguration. The methodology works at a very fine granularity level, imposing several constraints on the size of the fault localization mechanism and on the size of the robust memory. A similar approach [22] likewise requires a set of golden pre-compiled configurations and a golden fault diagnosis system hosted on an extra FPGA. In particular, the work in [22] presents a permanent fault repair scheme in which the original design is reconfigured into another fault-tolerant design with a smaller area, so that the damaged element can be avoided. The reduced system still performs the desired functionality, but without the same level of robustness, because of the limited amount of resources available on the board. The solution is thus based on graceful degradation to improve the overall lifetime: the scheme achieves fault tolerance when no permanent fault has occurred and is degraded to providing only fault detection when the FPGA is damaged, according to the available resources.
Such a hybrid scheme derived from TMR is compared against traditional TMR, Concurrent Error Detection (CED) combined with Duplication With Comparison (DWC), and another scheme, to evaluate the benefit of this architectural solution. The analysis of the expected lifetime and costs is precise at the architectural level; however, there is no discussion on how to implement the proposed solution, and in particular on how to manage the implementation of the TMR architecture and of its degraded version, how to exploit the localization of the fault, and all the issues related to the actual realization of such a system. The approach presented in [16] requires the designer to manually specify a partitioning of the system into self-checking areas, hardened by means of DWC to detect the occurrence of a fault, and to set a maximum number of spare areas available for recovery. The paper does not consider any systematic and automated design space exploration that analyses and compares different partitioning and hardening strategies to identify the most promising solution, leaving the decision to the designer. In general, the choice of these two elements generates a large space of different hardened implementations of the system, each one characterized by a different area overhead, and thus a different spare area availability. Furthermore, the approach has been defined for older Xilinx Virtex-II devices, where reconfiguration is performed on an entire column, as tall as the device itself. Modern devices, such as Xilinx Virtex 4 and 5, not only support 2D reconfiguration, but are also characterized by a heterogeneous distribution of the columns of resources (DSPs and BRAMs). As a result, it is not possible to simply extend the existing approach, and new considerations need to be taken into account.
Similar limitations can also be noted in the solution proposed in [18]: the adopted model of the reconfigurable fabric consists only of basic configurable logic blocks; moreover, few details are provided on the strategy for the definition of the self-checking areas and for the floorplanning. As for the last aspect, no automated design flow is proposed, with the result that the designer can explore only a few alternatives among the numerous possible ones.
The latter strategy against permanent faults is based on the on-line computation of the new bitstream. An example is presented in [9] , not moving the functionality hosted on the corrupted area to another one, but rather repairing the modified functionality by implementing it in a different way. More precisely, the approach introduces the Jiggling architecture, extending TMR+scrubbing to mitigate FPGA transient and permanent faults. The TMR is applied to the nominal circuit, by triplicating and voting modules. When a mismatch is detected in a module due to a fault, scrubbing is performed to correct the potential transient fault. If the failure persists, Jiggling is initiated. It uses the remaining replicas as a template of desired behavior to guide a specific genetic algorithm, called (1+1) evolutionary strategy, towards a functional configuration which avoids or exploits the faulty element; here, the new configuration is computed on-line. The solution is applicable to combinational modules only and possibly requires the modules implementing the circuits to be implemented with additional, extra resources, to allow the evolutionary strategy to find a different way to implement the same functionality. In this way, though, the damaged portion of the fabric is not avoided, thus potentially leading to further degradation of the surrounding areas. Moreover, the authors give a generic guideline for the size of the modules to which the TMR is applied, to make the repair process feasible, which amounts to less than a thousand gate equivalents. However, no methodology is used to identify the most convenient organization of the circuit in modules, with respect to costs and performance.
The computation of the recovery bitstream on-line is also exploited by the approaches presented in [11, 17] , where partial dynamic reconfiguration and bitstream relocation are executed to cope with the occurrence of a permanent fault, using the tools provided by Xilinx for their FPGAs [21] , so that the memory space required for storing the alternative configurations can be significantly reduced. Both approaches refer to the Partial Reconfiguration (PR) flow, so that the FPGA hosts a static region not involved in the reconfiguration, and a dynamic one; in these works, no consideration is presented on the reliability aspects related to the static region, which contains the global communication infrastructure and, possibly, the internal reconfiguration controller. As a consequence, the region constitutes a single point-of-failure for the architecture, because a fault could not be recovered. In these approaches, the controller computes on-line where to move the functionality hosted by the damaged area, selecting a spare area. However, the complexity of the system implementation is such that the set of feasible relocation moves taken into consideration by the presented strategies is actually limited and only considers a few possibilities. Moreover, when moving portions of an existing design, often a new floorplanning and thus a synthesis of the overall system might be necessary, an aspect not investigated in the proposals. As for the hardening of the areas to provide fault detection capabilities, the authors in [11] do not provide any information.
The T3RSS project [20] proposes a tool for a systematic approach to the design of hardened systems hosted on SRAM-based FPGAs, by means of relocation and reconfiguration. The focus is on the design of a Configuration Manager for Dynamically Reconfigurable Systems, and in particular on techniques and methods to adaptively protect the memory; therefore, the hardening of the payload FPGA is done in a straightforward fashion, by applying TMR at a fixed, pre-defined granularity level.
Finally, a completely different solution is proposed in [14] , presenting a dual-FPGA platform, where hardware and information redundancy based on error detecting codes are adopted to obtain a system able to cope with the occurrence of transient faults.
As a conclusion, the approaches presented in literature focus the attention on the recovery strategy and architectural aspects, without proposing a complete flow and tools to support the designer in the implementation of a hardened system able to cope with the occurrence of faults, both transient and permanent, to improve the overall system lifetime. The work presented in this paper builds upon our previous preliminary study, presented in [3] , and aims at filling such a gap by proposing a solution that explores at design-time the different circuit partitioning and hardening alternatives, by means of a systematic analysis of costs and benefits. Then, once the most convenient solution is selected, the flow computes the initial and recovery systems implementations (e.g., bitstreams), together with all elements necessary to define the final autonomous fault tolerant system. With respect to the preliminary proposal, a new design flow and methodology is proposed, to better support the designer in the realization of the final system optimized for the target application environment.
Reliability-Aware Architecture
The architecture of the autonomous fault-tolerant system is similar to the one considered in [1, 3]; in particular, as shown in Fig. 1, it is composed of three main modules:
-an SRAM-based FPGA hosting the hardened payload application,
-a rad-hard FPGA hosting the reconfiguration controller, and
-a hardened memory storing the FPGA configurations for recovery.
The first module constitutes the specific reconfigurable fabric on which the circuit under design is developed and deployed, so that it can be updated and/or upgraded during its mission. The circuit under design, called payload application, usually consists of a data processing functionality. The other two elements are parametric modules, customized to support the monitoring and reconfiguration features necessary to achieve an autonomous fault-tolerant system. The requirement to use a rad-hard FPGA for hosting the reconfiguration controller can be relaxed, provided fault tolerance techniques are applied to guarantee that the controller exhibits a correct behavior. In the same way, the configuration memory is hardened by design or by means of standard Error Detection and Correction Codes.
Indeed, the considered architectural platform is very similar to the classical one used in satellite on-board data processing computers; the innovative aspects are the intelligent engine implemented in the reconfiguration controller, supporting the management of both transient and permanent faults, and the possibility to tune and customize several architectural parameters and the application hardening during the system implementation. These aspects are discussed in the following by presenting a detailed overview of the architecture's modules, whereas the next section focuses on the design flow for defining the various implementation and tuning choices for the system realization.
Hardened Payload Application
The payload application is a circuit specified with a structural approach and, thus, is based on a set of components interconnected to each other. The components are characterized in terms of their resource requirements (slices, BRAMs, and DSPs) and the communication between them is modeled by specifying the number of wires.
The circuit is hardened by applying a traditional fault tolerance technique, such as TMR, coupled with the dynamic reconfiguration capability offered by modern FPGAs, in particular of the Xilinx families, to achieve error mitigation. Borrowing from [2], TMR is applied to groups of the circuit's components and each of these hardened groups is mapped on a separate area: in case of fault, the area can be recovered by reconfiguration independently from the others. In addition to mitigating the errors that occur, the TMR voter of each area generates error signals that allow the detection of faults and the identification of the faulty area.

Fig. 1 The reliability-aware architectural platform [3]
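The voter behavior described above can be illustrated with a minimal software sketch. It is not the hardware voter used in the platform: the function name `tmr_vote` and the per-replica mismatch flags are assumptions introduced here only to show how a majority vote can both mask an error and raise an error signal for the area.

```python
from collections import Counter
from typing import Sequence, Tuple


def tmr_vote(outputs: Sequence[int]) -> Tuple[int, bool, Tuple[bool, ...]]:
    """Majority-vote three replica outputs (hypothetical helper).

    Returns the voted value, an area-level error signal, and one
    mismatch flag per replica.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # All three replicas disagree: no majority can be formed.
        raise RuntimeError("no majority among the replicas")
    mismatches = tuple(out != value for out in outputs)
    # The area raises its error signal whenever any replica disagrees.
    return value, any(mismatches), mismatches
```

For example, `tmr_vote([5, 7, 5])` masks the wrong output of the second replica and still raises the error signal that the reconfiguration controller would observe.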
The first issue arising from the considered reliability-aware architecture is related to the identification of the independently recoverable areas: in particular, it is necessary to identify a suitable partitioning of the circuit components into recoverable areas with respect to a selected set of figures of merit. Moreover, it is necessary to consider the FPGA's structure and properties to efficiently exploit its reconfiguration capabilities. Since each FPGA device has peculiar characteristics in terms of amount and distribution of the resources, the selection of the device is another parameter to be specified during the system implementation.
Reconfiguration Controller
The reconfiguration controller is the engine actuating the recovery actions in case of fault on the SRAM-based FPGA, contributing to the creation of the autonomous fault-tolerant system. It checks the behavior of the hardened payload application, by monitoring the error signals from the areas, and, in case of fault, it applies the required recovery activities. Three recovery activities are envisioned, based on the type of fault. Should the fault be considered as transient, it can affect the application registers or the configuration memory. In the former case, a simple system reset is performed. In the latter, only the configuration of the faulty area is restored by reloading the related bitstream portion: this action is dubbed on-demand partial scrubbing. Otherwise, if the fault is considered as permanent, the faulty area is tagged as unusable and its functionality is relocated into a spare region. A suitable spare region is reserved on the FPGA for recovery from permanent faults; moreover, it is necessary to define an initial placement for the defined recoverable areas and a set of alternative placements for recovering from each specific fault hitting an area.
To perform the suitable recovery action (reset, on-demand partial scrubbing or functionality relocation), the reconfiguration controller implements a classification to discriminate between transient and permanent faults, by means of algorithms such as the ones presented in [4]. On the basis of an analysis of the fault's frequency, an area is considered affected by a permanent fault when it has been recorded as faulty during a pre-specified number of subsequent observations. More in detail, the recovery procedure consists of the following three phases:
1. On the occurrence of the first error in an area, an application reset is performed, assuming a transient fault in the application registers;
2. If the error persists, and until the observation threshold is reached, a partial scrubbing is triggered;
3. When the observation threshold is reached, the whole FPGA is reconfigured by using an alternative recovery bitstream.
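The three phases above can be sketched as a small per-area state machine. This is only an illustration under stated assumptions: the names `Action` and `AreaMonitor` are hypothetical, the threshold is counted in consecutive faulty observations, and an error-free observation is assumed to clear the counter. The classification algorithms of [4] are not reproduced.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESET = "application reset"
    PARTIAL_SCRUB = "on-demand partial scrubbing"
    RELOCATE = "reconfiguration with a recovery bitstream"


class AreaMonitor:
    """Tracks consecutive faulty observations of one recoverable area."""

    def __init__(self, threshold: int) -> None:
        if threshold < 2:
            raise ValueError("threshold must allow at least a reset first")
        self.threshold = threshold
        self.consecutive_errors = 0

    def observe(self, error: bool) -> Action:
        if not error:
            # Assumption: a clean observation means the fault was transient.
            self.consecutive_errors = 0
            return Action.NONE
        self.consecutive_errors += 1
        if self.consecutive_errors == 1:
            return Action.RESET              # phase 1
        if self.consecutive_errors < self.threshold:
            return Action.PARTIAL_SCRUB      # phase 2
        self.consecutive_errors = 0          # area is moved elsewhere
        return Action.RELOCATE               # phase 3
```

With a threshold of 3, three consecutive faulty observations of the same area yield a reset, a partial scrubbing and finally a relocation.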
Thus, the second issue is that for each possible sequence of faults (in particular permanent ones), the reconfiguration controller must be provided with a suitable recovery bitstream stored in the configuration memory. In particular, each of these bitstreams defines a specific placement and shape of the recoverable areas, avoiding a set of faulty regions of the device.
At design time, based on the selected cardinality of multiple permanent faults to be managed (max_faults), it is possible to identify all possible situations that may arise, in terms of the faulty regions that cannot host functionalities. Therefore, it is possible to envision, for each possible combination of n permanent faults out of the max_faults, the most convenient distribution of the functionalities on the spare fabric. Given a circuit partitioned in #areas areas, the number of bitstreams for recovery from max_faults is

\[ \#bitstreams = \sum_{i=0}^{max\_faults} \#areas^{\,i} \]

Figure 2 reports a partial set of configurations derived from the simple situation where max_faults = 2.
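As a small worked example of the bitstream count, the sketch below enumerates one configuration per ordered sequence of area failures. It assumes that at every step any of the areas may be the next to fail, including an area that has already been relocated, which is the reading consistent with the order dependence shown in Fig. 2; the function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def fault_sequences(num_areas: int, max_faults: int) -> List[Tuple[int, ...]]:
    """All ordered sequences of failing areas, up to max_faults long.

    The empty sequence stands for the initial, fault-free configuration.
    Areas are numbered from 1.
    """
    areas = range(1, num_areas + 1)
    sequences: List[Tuple[int, ...]] = []
    for length in range(max_faults + 1):
        sequences.extend(product(areas, repeat=length))
    return sequences


def bitstream_count(num_areas: int, max_faults: int) -> int:
    """Closed form of the same count: sum of num_areas**i for i = 0..max_faults."""
    return sum(num_areas ** i for i in range(max_faults + 1))
```

With three areas and max_faults = 2, the enumeration contains the initial configuration, three single-fault configurations and nine two-fault configurations; the sequences (1, 2) and (2, 1) are distinct entries.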
It is interesting to note that the order in which permanent faults occur affects the final configuration to be adopted: should area 1 fail first (situation A) and area 2 follow, the final configuration of the system would be the one labeled E. On the other hand, if area 2 fails first (situation B) and area 1 fails next, the final configuration of the system is the one labeled G. This strategy has been adopted in order to have, for each given situation, the most convenient system implementation "around" the faulty isles of the fabric. Therefore, for each situation corresponding to a failed and unusable portion of the fabric, a new implementation is computed, by means of a new placement process, on the fabric avoiding the faulty portion. As a consequence, as shown in the example of Fig. 2, each new organization of the areas on the fabric might also generate different shapes and therefore, given a configuration corresponding to a certain sequence of permanent faults, the next configuration is not an evolution of the previous one.

Fig. 2 The adopted recovery strategy for permanent faults

These elements have been considered in order to determine the most interesting solution for computing and storing the configuration files necessary for recovering in the various situations. More precisely, we have adopted the off-line bitstream computation since it is more reliable than the on-line counterpart: as discussed in Section 2, in the on-line approaches the static region represents a single point of failure. Moreover, the on-line bitstream computation would make the reconfiguration controller heavily dependent on the specific FPGA device technology. As a last consideration, since the next configuration to be applied is not necessarily an evolution of the existing one (which would require only a partial reconfiguration), the static design-time approach provides the most convenient solution.
Possibly, future work could address optimisations towards a more efficient storage of the precomputed bitstreams, by adopting some mechanism to determine the new configuration starting from the current one and the history of the faults. Such an approach is feasible (in terms of complexity and computational time, since it would occur at run-time) only when the number of configurations to be taken into account (and of their possible evolutions) is limited.
Configuration Memory
A hardened memory stores the bitstreams for the recovery actions. In order to apply the off-line bitstream computation strategy, a recovery bitstream must be stored in the memory for each possible sequence of permanent faults (with reference to Fig. 2). Indeed, the higher the number of areas composing the hardened payload application, the higher the number of different recovery bitstreams. Moreover, the number of different recovery bitstreams grows in a similar way with the overall number of permanent faults to be recovered. Hence, the size of the configuration memory is a constraint when hardening the application.
In conclusion, we can notice that the considered platform is tunable and various choices need to be made for implementing the autonomous fault-tolerant system:
-the partitioning of the circuit components in recoverable areas,
-the selection of the FPGA device,
-the definition of the number of permanent faults to be tolerated, and
-the specification of the memory size.
In the next section, we cope with the issues arising from the considered architecture, by presenting the proposed design flow.
Design Flow
The proposed design flow for the implementation of the autonomous fault-tolerant system on the described FPGA-based platform is shown in Fig. 3. It takes as input the specification of the payload application's circuit and a specific architectural requirement. The output is the hardened implementation of the payload application on the considered platform. The design flow consists of three main phases:
-circuit modeling, that builds a graph-based representation of the circuit, annotated with implementation costs,
-circuit hardening, that defines an autonomous fault-tolerant solution based on self-checking and independently recoverable areas, by applying a suitable strategy depending on the selected requirement, and
-system implementation, that synthesizes and implements the system to obtain the bitstreams of the defined solution.
Two different architectural requirements can be specified in input in a mutually exclusive fashion:
-the FPGA device to be used, as a reference for the amount of available resources, or
-the number of permanent faults to be tolerated, as a reference to the harshness of the application scenario.
Thus, the circuit hardening phase, which is the core of the design flow, executes specific activities according to the specified parameter. The former case has been presented in our preliminary study in [3]. In that scenario, the effort was directed towards a hardening of the circuit that maximizes the number of tolerated permanent faults. Alternatively, in the latter approach, the designer can set the number of permanent faults to be recovered as input requirement. This value can be derived by considering the lifetime desired for the circuit and its operating conditions. In this alternative scenario, the minimum requirements for the FPGA device implementing the circuit are identified according to the circuit resource requirements. In the following, we report some main concepts from [3], with the aim of providing a complete overview of the design flow. Then, we focus on the hardening task for a given number of permanent faults, which constitutes the novel extension of the flow.
Circuit Modeling
The first phase of the design flow performs the synthesis of the system onto a selected FPGA device, big enough to host the nominal circuit, without any hardening. The goal is to extract the information on the implementation costs of each component, in terms of slices, BRAMs, DSPs, etc. When the fixed-platform approach is adopted, the specific FPGA device is known and can be used during the preliminary synthesis of the circuit. When, instead, the approach with a fixed number of permanent faults is pursued, a generic, temporary FPGA is used to derive a rough estimation of costs, necessary to foresee the costs of the hardened version of the system. The synthesis is performed by means of a commercial tool, usually the one adopted in the standard design flow.

Fig. 3 The proposed design flow

The output of the synthesis tool reporting the implementation costs is parsed, together with the circuit description, to generate an agile graph-based representation of the circuit, annotated with the components' costs. This is the model taken as input by the circuit hardening phase.
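A minimal sketch of such an annotated graph is given below. The class and method names (`Cost`, `CircuitGraph`, `group_cost`, `boundary_wires`) and the component figures used in the usage note are illustrative assumptions, not the data structures of the actual tool; the sketch only shows the two quantities the hardening phase needs from the model, namely the resource cost of a group of components and the number of wires crossing its boundary.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class Cost:
    """Implementation cost of a component, as reported by synthesis."""
    slices: int = 0
    brams: int = 0
    dsps: int = 0

    def __add__(self, other: "Cost") -> "Cost":
        return Cost(self.slices + other.slices,
                    self.brams + other.brams,
                    self.dsps + other.dsps)


@dataclass
class CircuitGraph:
    components: Dict[str, Cost] = field(default_factory=dict)
    wires: Dict[FrozenSet[str], int] = field(default_factory=dict)

    def add_component(self, name: str, cost: Cost) -> None:
        self.components[name] = cost

    def connect(self, a: str, b: str, n_wires: int) -> None:
        if a == b or a not in self.components or b not in self.components:
            raise ValueError("wires must join two distinct known components")
        edge = frozenset((a, b))
        self.wires[edge] = self.wires.get(edge, 0) + n_wires

    def group_cost(self, group: Iterable[str]) -> Cost:
        total = Cost()
        for name in group:
            total = total + self.components[name]
        return total

    def boundary_wires(self, group: Iterable[str]) -> int:
        """Wires with exactly one endpoint inside the group."""
        members = set(group)
        return sum(n for edge, n in self.wires.items()
                   if len(edge & members) == 1)
```

For instance, after adding three components and two connections, `group_cost(["fir", "fft"])` returns the summed resources of a candidate area and `boundary_wires(["fir", "fft"])` the wires it exchanges with the rest of the circuit.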
Circuit Hardening
The second phase, that represents the core of the proposed methodology, is devoted to the tuning of the various architectural parameters and the definition of an optimal application of the hardening strategy. The two defined versions of the circuit hardening considering different input requirements (fixed platform or fixed number of permanent faults) are discussed in the following.
Circuit Hardening for a Fixed Platform
As a first approach, the FPGA device hosting the circuit is selected by the designer, as initially proposed in [3]. The resources available on the FPGA and the capacity of the memory storing the recovery bitstreams (necessary to apply the off-line bitstream computation strategy) are constraints for the hardening activity. Given these constraints, it is necessary to maximize
- the number of tolerated permanent faults, called max_faults, and
- the number of self-checking and independently recoverable areas, called max_areas,
both to detect faults at a finer granularity and to reduce the scrubbing time in case of transient faults.

The first step of the hardening phase defines these two parameters. To compute max_faults, a hardened version of the circuit, where TMR is applied to single components, is considered; this hardened version is the one with the finest area granularity. max_faults is the number of times the largest area can be moved onto the spare region of the FPGA. Defining this parameter accounts for the constraint given by the resources available on the FPGA. To handle the other constraint, i.e., the memory capacity, recall the formula for computing the overall number of bitstreams according to the recovery strategy presented in Section 3.2; by inverting that formula with respect to the memory capacity, max_areas is obtained as the largest number of areas whose recovery bitstreams fit in the memory.

Given the two identified parameters and the circuit model obtained in the previous phase, the second step of the hardening task defines the independently recoverable areas composing the hardened circuit. The definition of the areas has been modeled as a Mixed Integer Linear Programming (MILP) problem, described by the constraints reported in Table 1. The constraints concern the definition of the areas, the wires between areas, and the distribution of resources. To help the reader understand the constraints, a description of the parameters and the decision variables is provided in Table 2. To solve the proposed problem, the following metrics have been considered: uniformity of the resource distribution among the areas, maximization of the number of areas, and minimization of the number of wires between areas.

This approach is immediate but requires the designer to know in advance which device will be used. However, this information might not be an input constraint for the designer, who could be more interested in implementing a circuit able to survive max_faults permanent faults before being unable to perform correctly. This might be the case when the designer is actually dealing with the expected lifetime of the system (deriving max_faults from the FIT characteristics of the expected application scenario) and needs to find the implementation that achieves it, the device family being a secondary output of the hardening design process. The next subsection proposes this new flow, guided by a different constraint.
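As a minimal sketch of the first-step computation described above for the fixed-platform flow, max_faults can be read as the number of times the largest area fits in the spare region of the device. The function name, the restriction to a single resource type (slices), and all numbers are illustrative assumptions, not values from the case study.

```python
def max_tolerated_faults(device_slices, area_slices):
    """Number of times the largest area fits in the spare region.

    Simplification (assumption): only one resource type is considered,
    whereas a real device also constrains BRAMs and DSPs.
    """
    used = sum(area_slices)
    if used > device_slices:
        raise ValueError("hardened circuit does not fit on the device")
    spare = device_slices - used
    return spare // max(area_slices)


# Hypothetical area sizes (in slices) and device capacity.
areas = [3000, 2500, 2000, 1500]
print(max_tolerated_faults(20000, areas))  # spare = 11000, largest = 3000 -> 3
```

Using the largest area is the conservative choice: any sequence of that many faults can be recovered, whichever areas are hit.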
Circuit Hardening for a Fixed Number of Permanent Faults
In this approach, the designer selects the number of permanent faults that the circuit should autonomously tolerate, in order to guarantee a certain lifetime. Here, the hardening task first defines the self-checking and independently recoverable areas, then identifies an FPGA device suitable for hosting the payload application's circuit.
For the areas definition, the same MILP model described above can be adopted. Here, however, the model is no longer subject to the limits imposed by a selected platform, i.e., FPGA resources and memory capacity. Thus, with reference to Table 1, the following remarks apply:
- For the FPGA resources, the bound can be removed by ignoring Constraint C14. This constraint sets the number of resources available on the FPGA and keeps the resource occupation of the hardened circuit, including the relocation actions, below this number.
- For the memory capacity, no limit applies when max_areas is not set to fulfill the memory constraint (as done above). The maximum value of max_areas is the number of components of the circuit, corresponding to the finest granularity of fault detection and recovery, and max_areas can here be set to this number. Should the resulting number of areas be too high, a lower value can be set by computing max_areas as explained above. In this case, the bitstream size required for the computation must be estimated by considering a target FPGA family (e.g., about 3 MB for Virtex4 FPGAs).
By solving the MILP problem, we obtain the hardened version of the circuit, described in terms of the mapping between the circuit's components and the areas. Finally, knowing the size and number of areas, it is possible to derive i) the minimum resources required for the FPGA hosting the hardened circuit and ii) the minimum capacity required for the memory storing the recovery bitstreams. This information is used to select a suitable FPGA device, possibly by choosing among available commercial solutions. If the identified minimum requirements turn out to be too high, the areas definition step can be re-executed with suitable constraints. Should the FPGA resource requirement be too high, Constraint C14 of the MILP model can be reintroduced, with a selected number of resources for the FPGA (dev_r). Should the required memory capacity be too large, a lower value of max_areas can be set.
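The derivation of the minimum requirements can be sketched as follows, under stated assumptions: the device must host all areas plus enough spare room to relocate the largest area once per tolerated fault, and the memory must hold every recovery bitstream. The function names, the single resource type, and the device catalog are hypothetical; only the figures of 73 bitstreams and about 3 MB per bitstream come from the case study.

```python
def min_device_slices(area_slices, max_faults):
    """All areas plus spare room for max_faults relocations of the largest area."""
    return sum(area_slices) + max_faults * max(area_slices)


def min_memory_mb(n_bitstreams, bitstream_mb):
    """Memory needed to store every precomputed bitstream."""
    return n_bitstreams * bitstream_mb


def pick_device(required_slices, catalog):
    """Smallest device in the catalog that satisfies the requirement, or None."""
    fitting = [(slices, name) for name, slices in catalog.items()
               if slices >= required_slices]
    return min(fitting)[1] if fitting else None


# Hypothetical area sizes and device catalog (name -> slices).
areas = [3000, 2500, 2000, 1500]
catalog = {"dev_small": 10000, "dev_mid": 16000, "dev_large": 25000}

need = min_device_slices(areas, max_faults=2)   # 9000 + 2 * 3000 = 15000
print(need, pick_device(need, catalog))         # 15000 dev_mid
print(min_memory_mb(73, 3.0))                   # 219.0
```

When `pick_device` returns `None`, the text above applies: the areas definition is re-run with Constraint C14 or with a lower max_areas.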
Finally, the output of this hardening phase is the hardened version of the circuit and the selected FPGA device.
System Implementation
The last phase of the design flow consists in the implementation of the identified solution on the target platform and in particular the synthesis of the hardened circuit on the selected FPGA.
The first step of the phase identifies a suitable placement of the hardened circuit on the FPGA and defines the set of recovery re-placements for each possible sequence of area failures (refer to Fig. 2). This placement of the various areas is commonly called floorplanning, and it has been automated in the presented flow by means of the approach proposed in [2]. In order to enable partial scrubbing and to identify feasible (re-)placements, the floorplanning step has to take into account the FPGA's resource organization, as well as further constraints on the rectangular shape of the areas and on their specific positions, due to the FPGA partial reconfiguration features. In particular, in this work, we consider Xilinx Virtex 4, Virtex 5 and Virtex 6, which represent up-to-date device families offering a large amount of resources and 2D partial reconfiguration features. Thus, the floorplanning step is devoted to the identification of a placement that fulfills the constraints discussed above and, at the same time, to the optimization of the overall wire length to maximize system performance. The output of this step is a set of placement constraints for the initial placement and for each recovery re-placement.
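To illustrate the two concerns of the floorplanning step, the sketch below checks that rectangular areas do not overlap and computes a wire-length estimate between area centers. It is not the floorplanner of [2]: the grid coordinates, the Manhattan-distance metric, and all names and numbers are assumptions made for illustration, and device-specific constraints (resource columns, faulty regions) are ignored.

```python
from itertools import combinations


def overlaps(a, b):
    """Rectangles are (x, y, width, height) in hypothetical grid units."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def is_feasible(placement):
    """True when no two areas share fabric."""
    return not any(overlaps(placement[p], placement[q])
                   for p, q in combinations(placement, 2))


def wire_length(placement, wires):
    """Sum of (wire count x Manhattan distance between area centers)."""
    def center(rect):
        x, y, w, h = rect
        return x + w / 2, y + h / 2

    total = 0.0
    for (p, q), count in wires.items():
        (px, py), (qx, qy) = center(placement[p]), center(placement[q])
        total += count * (abs(px - qx) + abs(py - qy))
    return total


# Hypothetical placement of three areas and wire counts between them.
placement = {"A1": (0, 0, 4, 2), "A2": (4, 0, 4, 2), "A3": (0, 2, 8, 2)}
wires = {("A1", "A2"): 10, ("A2", "A3"): 5}
print(is_feasible(placement), wire_length(placement, wires))  # True 60.0
```

A recovery re-placement would be evaluated the same way, with the rectangle of the faulty region excluded from the usable fabric.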
Finally, in a second step, the hardened circuit is synthesized and implemented on the selected FPGA by considering the various sets of placement constraints to obtain the initial bitstream and the recovery ones to be stored in the platform memory.
Experimental Results
The design flow has been applied to a case study considering a video processing circuit. The goal of this case study is to evaluate the benefits of the approach in identifying different interesting solutions on the basis of the input requirements specified by the designer, and to compare the specific characteristics of such solutions achieved with the two design space exploration flows. The comparison of the proposed flow with the existing alternative solutions has been addressed in [3] , with respect to the fixed-platform flow, achieving better results. Here we want to focus on the benefits of the hardening solutions, analyzed with respect to the two alternative hardening design paths.
The circuit selected for the case study is an H.264 video encoder. Fig. 4 shows the circuit's structure, annotated with the number of wires between components, while Table 3 reports the resources required by the components, in terms of slices, BRAMs, and DSPs. A Xilinx xc4vsx55 FPGA has been considered for the preliminary estimation, since it offers resources of different kinds to host both the nominal circuit and its TMR version. A prototype framework based on both commercial and proprietary tools has been developed to automate the flow as much as possible, thus supporting the designer. For the preliminary synthesis of the nominal circuit and for the final implementation of the hardened one, Xilinx ISE 12.1 has been used. To solve the MILP model defining the self-checking areas, the CPLEX 10.0 commercial tool has been used. Finally, for the reliability-aware floorplanner necessary to identify a possible placement of the areas on the portion of fabric not affected by faults, a proprietary tool has been used [2]. The resulting prototype framework allows the designer to identify a suitable reliable solution, by considering either a selected platform or a given number of permanent faults to be tolerated as the main constraint for the design space exploration. Moreover, a prototype of the described architectural platform has also been developed [6]; in particular, the reconfiguration controller has been preliminarily implemented on a Virtex 5 XUPV5-LX110T board. This prototype has been used to measure the average reconfiguration times.

As described in Section 4, the circuit hardening is based on three main metrics: i) distribution uniformity, ii) maximization of the number of areas, and iii) minimization of the number of wires. In the objective function of the MILP model, a weight is associated with each metric.
To set suitable weights, the parameters have been tuned, allowing us to identify the thresholds above which a particular metric is privileged with respect to the others. This tuning phase is necessary to select, for a given application, the values of the weights that drive a wide exploration of the search space, without computing only trivial solutions. The MILP model has been executed for different values of the weights to harden the case study circuit. Each weight has been evaluated by varying its value in the interval [0.1, 1] to identify the threshold above which the related metric is privileged with respect to the others. For the given application, the following values have been identified as suitable: w_res ≥ 0.7 for distribution uniformity, w_areas ≥ 0.3 for the maximization of the number of areas, and w_wires ≥ 0.6 for the minimization of the number of wires.
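The role of the weights can be illustrated with a weighted-sum score over the three metrics. This is only a sketch: the actual objective function belongs to the MILP model and is not reproduced here, so the normalization of each metric, the function name, and the candidate partitions are assumptions, and the thresholds reported above do not transfer to this toy scoring.

```python
def partition_score(area_slices, cut_wires, total_wires, max_areas,
                    w_res, w_areas, w_wires):
    """Weighted-sum score of a candidate partition (higher is better)."""
    # Uniformity: 1.0 when all areas have the same size.
    uniformity = 1.0 - (max(area_slices) - min(area_slices)) / max(area_slices)
    # Fraction of the allowed number of areas actually used.
    area_ratio = len(area_slices) / max_areas
    # Fraction of wires that do not cross an area boundary.
    wire_ratio = 1.0 - cut_wires / total_wires
    return w_res * uniformity + w_areas * area_ratio + w_wires * wire_ratio


# Two hypothetical candidates: few uniform areas vs. many uneven areas.
coarse = dict(area_slices=[3000, 3000, 3000], cut_wires=20)
fine = dict(area_slices=[3000, 1500, 1500, 1000, 1000, 1000], cut_wires=40)

for weights in [(0.7, 0.2, 0.1), (0.1, 0.8, 0.1)]:
    scores = {name: partition_score(total_wires=100, max_areas=6,
                                    w_res=weights[0], w_areas=weights[1],
                                    w_wires=weights[2], **cand)
              for name, cand in (("coarse", coarse), ("fine", fine))}
    print(weights, max(scores, key=scores.get))
```

With the first weight set the uniform partition ranks higher, while the second set favors the partition with more areas, which is the kind of threshold behavior the tuning phase looks for.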
In the following, we report the results obtained when maximizing the number of areas (to reduce the average reconfiguration time) and when achieving distribution uniformity; the minimization of the number of wires has also been considered, to reduce the resource requirements related to the voters, but it has not been privileged with respect to the other metrics. Both hardening for a selected platform and hardening for a given number of permanent faults are considered. For the fixed-platform hardening approach, the selected FPGA is a Xilinx xc4vsx55 and the memory size is set to 128 MB; according to the formulas presented in Section 4, the computed values of max_faults and max_areas are 2 and 6, respectively. For the approach considering a fixed number of permanent faults, the same maximum number of permanent faults (i.e., 2) is considered as input.
Reducing Reconfiguration Time
In the first experimental session, the maximization of the number of areas has been privileged with respect to the other metrics. In this way, it is possible to reduce the size of the areas, thus reducing the reconfiguration time in case of fault and increasing the availability of the system. The following weights have been adopted: w_res = 0.6, w_areas = 0.3, and w_wires = 0.1.
When considering the fixed-platform approach, the maximization of the number of areas leads to the use of all 6 areas allowed by the value of max_areas. The identified areas are reported in Table 4, together with their components, their resource requirements, and the reconfiguration times necessary in case of a transient fault. We considered the reconfiguration times of both the initial configuration and the recovery ones. In fact, these times differ, since the areas can have different shapes and different positions, thus implying a variable number of frames to be reloaded. Table 4 reports the minimum and maximum reconfiguration times among the configurations.
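The dependence of the reconfiguration time on the number of frames can be sketched as below. The default frame size and configuration-port throughput are assumptions chosen for illustration, as are the frame counts; none of them are measurements from the prototype.

```python
def reconfig_time_us(n_frames, words_per_frame=41, bytes_per_word=4,
                     throughput_bytes_per_s=400e6):
    """Time to reload n_frames configuration frames, in microseconds.

    Defaults are assumptions: 41 32-bit words per frame and a
    configuration port sustaining 400 MB/s.
    """
    n_bytes = n_frames * words_per_frame * bytes_per_word
    return n_bytes / throughput_bytes_per_s * 1e6


def time_range_us(frames_per_configuration):
    """Minimum and maximum reload time of one area across its placements."""
    times = [reconfig_time_us(f) for f in frames_per_configuration]
    return min(times), max(times)


# Hypothetical frame counts of one area in the initial placement
# and in two recovery placements.
print(time_range_us([100, 120, 90]))
```

The same area yields different times in different placements because its shape and position change the number of frames it spans.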
When considering the other approach (i.e., with a fixed number of permanent faults), the maximization of the number of areas naturally leads to placing each component in a different area, thus obtaining 15 self-checking areas. The number of recovery bitstreams corresponding to the possible sequences of 2 permanent faults is 241, which would require too large a memory. Therefore, we set a maximum number of areas by considering a maximum memory size of 256 MB and a bitstream size of about 3 MB (that of the xc4vsx55, the largest in the Virtex 4 SX family). The obtained maximum number of areas is 8. In this scenario, 73 bitstreams are required and the circuit can be implemented on the same device as in the previous scenario. The areas identified by the tool are reported in Table 5.
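The two bitstream counts reported above (241 for 15 areas, 73 for 8 areas, both with 2 permanent faults) are reproduced by the sum 1 + n + n^2. The sketch below assumes that this generalizes to a geometric sum over the number of faults; this is an inference from the reported numbers, not the formula of Section 3.2 as stated, and the function names are hypothetical.

```python
def n_bitstreams(n_areas, max_faults):
    """Assumed count: one bitstream per fault sequence of length 0..max_faults."""
    return sum(n_areas ** k for k in range(max_faults + 1))


def max_areas_for_memory(memory_mb, bitstream_mb, max_faults, n_components):
    """Largest number of areas whose bitstreams fit in the memory."""
    budget = int(memory_mb // bitstream_mb)  # bitstreams that fit
    best = 0
    for n in range(1, n_components + 1):
        if n_bitstreams(n, max_faults) <= budget:
            best = n
    return best


print(n_bitstreams(15, 2))                     # 241
print(n_bitstreams(8, 2))                      # 73
print(max_areas_for_memory(256, 3, 2, 15))     # 8
```

The search over n is the "inversion" of the count: it stops being satisfiable at 9 areas, which would need 91 bitstreams against a budget of 85.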
In both hardening strategies (fixed platform and fixed number of permanent faults), the obtained circuits run at between 75 and 79 MHz, depending on the specific initial or recovery placements identified with the floorplanner. In the fixed-platform strategy, the average reconfiguration time to cope with transient faults is 130 µs, whereas in the other strategy it is 110 µs, with a reduction of 18%. When coping with permanent faults, the reconfiguration time is the same for both strategies, since the same device is considered; for the xc4vsx55 FPGA, the time required to load a new complete bitstream (in order to recover from a permanent fault) is 1.2 ms. In conclusion, the time spent on recovery actions is very low, thus allowing a high availability of the system.
Achieving Distribution Uniformity
In a second test, we considered the uniform area distribution as the privileged metric. The following weights have been adopted: w_res = 0.7, w_areas = 0.2, and w_wires = 0.1. In this scenario, both hardening flows lead to the same solution. In fact, the FPGA selected for the first hardening strategy is large enough not to constrain the definition of the areas; the 5 areas identified by the framework are reported in Table 6. The obtained system is able to recover from 2 permanent faults also in this second scenario.
The maximum resource gap among the areas of the identified solution is 807 slices, whereas, in the previous two solutions, it was 1485 and 2202 slices for 6 and 8 areas, respectively. This reduced variance in the area sizes leads to a reduced variance in the reconfiguration times; indeed, the standard deviation of the reconfiguration times of the identified solution is 14 µs, whereas in the previous two solutions it was 27 µs and 34 µs, respectively. Despite this advantage, the solution presents a higher average reconfiguration time than the previous ones (145 µs).
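The two uniformity indicators used above can be computed as in the following sketch. The area sizes and reconfiguration times are hypothetical placeholders (the per-area values are in Tables 4 to 6), and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import pstdev


def resource_gap(area_slices):
    """Difference between the largest and the smallest area, in slices."""
    return max(area_slices) - min(area_slices)


# Hypothetical per-area sizes (slices) and reconfiguration times (µs).
sizes = [1000, 1200, 1400]
times_us = [100.0, 120.0, 140.0]

print(resource_gap(sizes))   # 400
print(pstdev(times_us))      # population standard deviation of the times
```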
When considering the approach with a fixed number of permanent faults, the only device suitable to host the hardened circuit while supporting the recovery from 2 permanent faults is again the xc4vsx55, which is the largest device in its family. For the selected case study, the approach cannot select a smaller device because of the characteristics of the circuit, which has many components requiring a large amount of resources. Thus, the largest device of the considered FPGA family is the only one able to host the overall circuit.
In conclusion, the proposed design flow made it possible to identify three suitable solutions for hardening the payload application circuit. For the selected case study, the main difference among the identified solutions lies in the reconfiguration times required in case of transient faults. The designer can choose the most suitable hardening approach based on the requirements of the application.
Conclusion
The paper presents a complete design flow for the implementation of self-healing systems on SRAM-based FPGAs, able to autonomously recover from transient and permanent faults, in order to improve their lifetime. The approach is suited for long-term missions that require the opportunity for upgrades/updates during their operational life and are characterized by high maintenance costs, such as space applications. Two alternative approaches can be pursued, according to the constraint driving the hardening of the circuit: the area available on the selected platform or the number of permanent faults to be tolerated. The proposed flow supports the designer in the implementation of the system in an automatic way, which is an improvement with respect to previous approaches. The experimental results reported in the case study detail the efficiency of the design flow in terms of design effort and the quality of the achieved system with respect to fault tolerance properties.
