Abstract-This paper presents a technique to improve the reliability and the mean time to failure (MTTF) of hardware task graphs (TGs) running on reconfigurable computers. This technique, named task early fetch, can be applied to a sequence of one or several applications, represented as TGs. It consists of carrying out the reconfiguration of some tasks within the execution of the previous TG and increasing the redundancy level of the early fetched tasks. Experimental results on actual TGs show the positive impact of the proposed technique: without deteriorating the execution time (makespan), a 114% MTTF improvement is achieved on average for non-fault-tolerant TGs, and the improvement is more significant when the technique is applied to fault-tolerant TGs. Finally, this paper presents a hardware implementation of a manager that applies these techniques at runtime and steers the execution of the running TGs. It demonstrates that, with 0.03% consumption of flip-flops and look-up tables and 1.22% occupancy of the block random access memory available on the Xilinx Virtex UltraScale XCVU095-2FFVA2104E field programmable gate array, the required runtime computations can be carried out with negligible delays.
I. INTRODUCTION

SRAM-BASED field-programmable gate arrays (FPGAs) have recently drawn the attention of researchers and manufacturers of complex electronic systems in fields such as avionics and aerospace [1]. The reason is that, unlike application-specific integrated circuits (ASICs), FPGAs can be reconfigured multiple times during the mission and also feature lower cost than ASICs, as well as less time to market [2]. Partial reconfigurability makes FPGAs able to configure only a portion of the device while the remaining resources continue their normal operation. In order to execute multiple functionalities in a time-multiplexed manner, a scheduler is required to steer the execution of the hardware tasks [3].
Partially runtime reconfigurable FPGAs suffer from reconfiguration delay and also from susceptibility to the consequences of single event effects (SEEs) [4]. To alleviate this susceptibility, fault tolerance (FT) techniques are required to increase the reliability of a given design, but in most cases, they come at the cost of degrading the system's performance. Therefore, reliability and performance should be optimized simultaneously. This paper aims at improving the reliability of applications, represented as task graphs (TGs), running on FPGA-based reconfigurable computers, without deteriorating their execution time, which is known as makespan. For this purpose, a novel technique, named task early fetch, is presented. It consists of carrying out two modifications on a pair of consecutive TG schedules. On the one hand, it loads the configuration data of some tasks of a given TG within the execution of the previous one. On the other hand, it increases the redundancy level of the involved tasks to improve their reliability, without deteriorating the makespan. This paper assumes a dynamic environment in which one or more TGs are known at design time, but their execution order at runtime is unknown.
The experiments on actual TGs show that the proposed technique improves the reliability and mean time to failure (MTTF) of the TGs without deteriorating their makespan. Additional experiments on hardware tasks in fault-varying environments show that the proposed technique outperforms other state-of-the-art FT techniques [5], [6]. Finally, the hardware implementation that has been presented demonstrates that, with a very affordable hardware cost, the runtime computation required to implement the proposed technique is negligible.
The remainder of this paper is organized as follows. Section II introduces some related work, and Section III shows illustrative examples. Then, Section IV describes the proposed early fetch technique. Experimental results are shown in Section V, and finally, this paper concludes in Section VI.
II. RELATED WORK
Many researchers have investigated the FT issues in FPGAs, which can be categorized into three groups of mitigation approaches, namely: design-based methods, placement- and routing-based methods, and recovery-based ones.
Design-based methods are typically built upon redundancy, which is a very effective approach to mitigate soft errors [7], especially in environments with dynamic fault rates [5]. These methods use different replications, at different granularities, to increase the system reliability [8]. In this regard, different fine- and coarse-grain redundancy-based FT techniques for space-computing systems have been investigated in [9] and [10]. As an example of the application of FT techniques in FPGAs, a system-level duplication with compare (DWC) FT technique has been presented in [11] to improve the reliability of adaptive equalizers implemented on FPGAs. In a similar approach, [12] presents a redundant FPGA-implemented speed controller core for high-speed trains. The application of both triple modular redundancy (TMR) and DWC approaches combined with a check-pointing technique to build reliable soft processors has also been investigated in [13].
Placement- and routing-based FT techniques increase the reliability of a design by adapting traditional FPGA place and route techniques to harsh environments, at different design phases. For example, the technique presented in [14] manages the signals between functions in such a way that multiple errors affecting two different connections are not possible. In a similar approach, [15] studies both fault occurrence and error propagation probabilities to propose a reliability-oriented placement and routing algorithm. In any case, all these techniques can be applied to a given hardware task, and, as indicated by [16], they can be used in combination with other design-based methods to increase the reliability of the circuits.
However, the aforementioned techniques cannot prevent fault accumulation at runtime. Recovery-based methods are designed to resolve the fault accumulation problem [17] . Most of these techniques are based on recovering the value of the faulty cells [18] . For example, the studies in [19] and [20] determine different scrubbing rates for different circuits, based on their failure rate, in such a way that the system reliability is maximized. Some other techniques are based upon replacing the faulty blocks with the previously generated ones, which are functionally equivalent block instances, that do not use the faulty resources [21] , [22] .
Combining design-based and recovery-based methods is very effective for mitigating soft errors in FPGA-based systems. For example, [23] addresses the problem of tolerating N failures in nanosatellite swarm-based systems, using spare swarms. That work presents general ideas that are not focused on a specific device, but they can be applied to reconfigurable computers. Similarly, [24] uses a redundancy-based approach with spare units in which each task has many redundant copies, so that some of them are active and, in order to reduce power consumption, the remaining ones are on standby. The work by Yousuf et al. [25] is another study in this area that combines hardware and software tasks to guarantee a given target reliability, while reducing the energy consumption. Similar studies have been carried out in [26] and [27], which introduce a task partitioning scheme to tolerate transient and permanent faults for software/hardware tasks in heterogeneous and reconfigurable platforms.
In embedded systems in general, and in FPGAs in particular, applications are usually represented as a directed acyclic graph (DAG) or a TG, whose nodes represent computational tasks and whose edges represent dependencies among tasks. When such TGs run on reconfigurable computers, they have to be scheduled in a way that both the task precedence constraints and the resource limitations are met. This requires considering task scheduling and task placement [28], [29]. The performance of the scheduling methods can be improved by employing task prefetch [30] or task reuse techniques [31]. These techniques configure a given task in advance [32], and they can be used to improve the makespan of TGs [33], as well as to alleviate the fragmentation problem of FPGAs [34]. However, as will be discussed in Section IV-B, these techniques have adverse effects on the task reliability. Prefetching a task increases its residency time on the FPGA, which in turn increases the time that the task is exposed to radiation. In these works, the negative effects of the prefetch technique on the task reliability have been neither evaluated nor considered. The scheduling methods can also be enhanced to consider FT requirements. These techniques aim at guaranteeing a given system performance while also increasing the system reliability. For example, a primary/backup scheme is proposed in [35] in which two versions of a task run with minimum time overlap. In [36], a real-time fault-tolerant scheduling algorithm is proposed, which schedules hybrid tasks able to tolerate f_i faults during task execution. Our previous study [6] showed that, by using optimization methods and choosing, for each task, the proper FT technique from the Pareto set (referred to as Pareto-based FT techniques), it is possible to increase the reliability of TGs without deteriorating their makespan.
The application of different FT strategies on different real-time scheduling algorithms in reconfigurable computers has been investigated in [37]. In order to manage these issues at runtime, some operating systems have been introduced in [38] and [39], which provide an environment for the execution of hardware tasks by considering task communication, task placement, and especially task FT [40]. This paper presents a novel technique, named task early fetch, which aims at increasing the reliability of applications, represented as TGs scheduled on an FPGA-based reconfigurable computer, without deteriorating their makespan.
III. MOTIVATIONAL EXAMPLES
In order to better clarify the proposed technique, this section presents a couple of illustrative examples. In this paper, applications are modeled as DAGs and a nonpreemptible as soon as possible (ASAP) scheduling strategy is used to manage configurations and executions of tasks. It is assumed there exists a set of one or more TGs in the system and they are executed serially. In this paper, the concept of stage refers to a complete execution of a TG.
The first example assumes the TG with five tasks, shown in Fig. 1 , which is executed periodically. Its characteristics have been detailed in Table I . Fig. 2 shows a simple schedule of the TG of Fig. 1 . In this case, the time and area occupied in the reconfigurable computer are presented in the horizontal and vertical axes, respectively. In Fig. 2 , the gray-color boxes denote the task configuration delay, whereas the dotted ones indicate that the task has finished its configuration, but it is waiting for its execution to start. Therefore, in this paper, it is assumed that the configuration and execution of a given task do not overlap, and a task can start its execution only when it is configured completely.
For the sake of simplicity, in this paper, it is also assumed that, at any point, the total area occupied in the reconfigurable computer is simply the sum of the resource consumption of all the tasks that are simultaneously under execution or being reconfigured. This actually depends on many factors, such as the partial reconfiguration model and granularity of the target device, or whether the hardware multitasking system that runs the tasks implements some sort of task defragmentation. In any case, the technique presented in this paper is orthogonal to all these issues, and any of the many systems that have been proposed in the literature for managing the TG execution in reconfigurable computers can be used to run the tasks [29], [38], [39], in combination with the presented approach.
As Fig. 2 shows, task prefetch allows hiding the reconfiguration delay of some tasks by overlapping them with the execution of other tasks. In addition, an active redundancy-based FT strategy has been applied to tasks τ_2, τ_4, and τ_5. In a prefetch-aware scheduling algorithm, there is a time point at which all the tasks have been configured completely, but the execution of the TG is not finished yet. In this paper, this point is referred to as LastConfigTime, and, in the example of Fig. 2, this value is 771 ms (i.e., the end of the reconfiguration of τ_{5,2}). In order to define time margins within the schedule to apply the proposed early fetch technique, a boundary value is defined so that LastConfigTime ≤ Boundary < Makespan. The time margin between boundary and makespan can be used to configure some tasks of the next TG. The criteria for choosing an appropriate value for boundary will be discussed in Section IV-C. Now, let the left side (LS) and the right side (RS) of the schedule of TG TG_i be defined as follows.
1) LS(TG_i) is the sequence of scheduling orders (i.e., starting of reconfiguration and starting of execution) comprised between t = 0 and t = Boundary(TG_i).
2) RS(TG_i) is the sequence of scheduling orders comprised between t = Boundary(TG_i) and t = Makespan(TG_i).
Therefore, if there are enough available resources in the target FPGA, the time elapsed within RS(TG_i) is a good time margin to carry out the reconfigurations of the early fetched tasks belonging to the TG running immediately after TG_i, because no task of TG_i is configured within this time margin. In this example, let us assume Boundary(TG_i) = LastConfigTime(TG_i).
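The LS/RS split defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this paper: the `Order` record, its field names, and the sample schedule are hypothetical, and only the boundary value of 771 ms is taken from the example of Fig. 2. Orders issued exactly at the boundary are assigned to RS here, which is one possible reading of the definition.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Order:
    """One scheduling order (hypothetical record layout)."""
    time_ms: int  # instant at which the order is issued
    task: str     # task or replica identifier
    kind: str     # "reconfigure" or "execute"


def split_schedule(orders: List[Order], boundary_ms: int) -> Tuple[List[Order], List[Order]]:
    """Return (LS, RS): orders issued before the boundary, and from the boundary on."""
    ls = [o for o in orders if o.time_ms < boundary_ms]
    rs = [o for o in orders if o.time_ms >= boundary_ms]
    return ls, rs


# Illustrative orders only; these times are NOT taken from Table I.
schedule = [
    Order(0, "t1", "reconfigure"),
    Order(100, "t1", "execute"),
    Order(600, "t5_2", "reconfigure"),
    Order(800, "t5_1", "execute"),
]
ls, rs = split_schedule(schedule, boundary_ms=771)
print(len(ls), len(rs))
```

In this toy schedule, the only order falling in RS is an execution start, which matches the observation that no task of TG_i is configured after the boundary.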
In order to illustrate the early fetch technique, Fig. 3 shows two successive executions of the TG presented in Fig. 1. In this case, the configuration of one replica of task τ_2 (τ_{2,1}) of Stage 2 has been early fetched. In other words, its reconfiguration now takes place within the RS of the first stage of the execution. In addition, the configuration delay hidden by this early fetch has been used to configure another replica of task τ_2 (τ_{2,3}) at Stage 2. Thus, this technique does not increase the total makespan of the TG execution in that stage (x-axis). In addition, it does not violate the FPGA size limitation either (y-axis). In this example, further stages of this TG execution are identical to the second stage of the TG execution in Fig. 3. Finally, note that the RS of both stages is identical, although the redundancy level applied to task τ_2 is different for each stage. The reason is that the early fetched task (τ_2) is completely executed within LS(TG_i). Let us bear this fact in mind for the next example.
Fig. 4 shows another example in which task τ_5, whose execution time falls within the RS of the schedule, is early fetched instead. In this case, the configuration of τ_{5,1} is early fetched within Stage 1, and a new replica of that task (τ_{5,3}) is configured at Stage 2. However, as Fig. 4 shows, as a consequence of this, there are not sufficient resources at the RS of Stage 2 to early fetch task τ_{5,1} from an additional execution of the same TG (in Stage 3, which is not shown in Fig. 4 for simplicity). Therefore, Stage 3 cannot benefit from this technique and its execution would be identical to that of Stage 1. The reason for this is the modification of the RS of the schedule at Stage 2, due to the addition of another instance of task τ_5. In particular, if the following condition is true:
then the applicability of the early fetch technique in Stage i is uniquely dependent on the TG that runs at Stage i − 1. However, if this does not happen, that is
then the applicability of the early fetch technique in Stage i depends on the TG execution sequence in Stages [1 .
Section IV-C will explain in detail the consequences of this important fact. It is also worth remembering that the TGs running in different stages can be the same, or completely different. At any rate, there exists a set of TGs that can run in the system (which is known in advance), but their execution order is completely unknown at runtime. This assumption is consistent with modern FPGA-based systems, which are dynamically adaptable depending on the runtime requirements [3]. Finally, it is very important to mention that all the modifications introduced in the original schedules are carried out at design time, and no modifications on such schedules are carried out at runtime. As a consequence, the presented approach always works with static schedules. The reason is that, if dynamic schedules were used instead, the described modifications would have to be computed at runtime, and, as will be described in Section IV-C, they are very computationally intensive. Hence, they may incur unaffordable runtime delays.
IV. EARLY FETCH AND RELIABILITY IMPROVEMENT
A. Scoring Function
In addition to the condition defined in (1), in order to decide if task τ is an appropriate candidate to apply the early fetch technique, a scoring function has also been defined
where the following holds. (1) are considered as candidates for early fetch. The objective of this technique is to improve the reliability (R_TG) and MTTF (MTTF_TG) of the TG. These metrics will be further elaborated in Section IV-B.
B. Reliability and Fault Model
In this paper, the failures induced by soft errors, and in particular, by single event upsets (SEUs), are the object of concern. As indicated by [41] , different altitudes above the Earth surface have different soft error rates (SERs). In this paper, the reliability model presented in our previous work [42] has been used to estimate the reliability of a hardware task τ (denoted as R τ ). R τ is the probability that the task executes from its start time to its finish time without any failure, with the condition that the task had no error when starting its execution.
This model assumes that at most one SEU occurs at a time, but one or more upsets might occur during task execution. Soft errors follow the Poisson distribution and they can be regarded as independent and random statistical events. Thus, the probability of an SEU in the sensitive bits of task τ , occurring j times, can be obtained as
where
in which ρ is the SER expressed in #SEUs per bit per time unit [41], TS_τ is the task size in configuration memory, SB_τ indicates the percentage of sensitive bits of task τ [43], CT_τ is the task computation time, and RT_τ is the residency time of task τ, indicating the time elapsed from when it is configured until it starts its execution. As this shows, although prefetch techniques increase the system performance, they also increase the probability of upsets in the task, which degrades its reliability. The SER can be estimated with modeling tools such as CREME96 [44]. Let P(F_τ) indicate the probability of failure of task τ given j SEUs during task execution, j ranging from 1 to ∞. Therefore, we have
By having P(F τ ), the reliability of task τ is obtained as
In this paper, it is assumed that an active redundancy-based FT technique is used for increasing task reliability [40]. With this technique, by replicating task τ r times, using the 1-out-of-r scheme, the reliability of the fault-tolerant task τ_ft is given by [45]
Hence, the reliability of TG, after applying FT techniques, is obtained as [26] 
Finally, MTTF of the TG is calculated as inversely proportional to the TG probability of failure [43] 
where MS_TG is the makespan of TG. This reliability model has been validated and discussed in more detail in [42]. Although it assumes that only SEUs can occur, this is a simplification that many authors make [43]. However, it would be easy to extend this model to k-bit multiple-cell upsets (MCUs), since for each multiplicity k, the value of P(F_{τ,j}) would be calculated exactly as in (4), but with a different value for the SER (ρ). It is even possible to model the occurrence of MCUs and SEUs altogether, but the demonstration is too long to be included in this paper. In addition, any other reliability estimation method (analytical, fault-injection, accelerated radiation tests, and so on) can be used instead [46], [47], since it would be completely orthogonal to the methodology that this paper presents.
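The quantities described in this section can be illustrated with a minimal numerical sketch. It is not the validated model of [42], and every closed form in it is an assumption made for illustration: the Poisson rate is taken as λ = ρ · TS_τ · SB_τ · (CT_τ + RT_τ), every upset in a sensitive bit is assumed to cause a failure (so that R_τ is the probability of zero upsets), the TG is treated as a series system of its tasks, and the MTTF is taken as the makespan divided by the TG failure probability. Only the 1-out-of-r expression is the standard textbook one. All function names are hypothetical.

```python
import math
from typing import Iterable


def upset_rate(rho: float, task_size_bits: float, sensitive_fraction: float,
               comp_time: float, residency_time: float) -> float:
    """Assumed Poisson rate: expected upsets in sensitive bits while the task is exposed."""
    return rho * task_size_bits * sensitive_fraction * (comp_time + residency_time)


def prob_j_upsets(lam: float, j: int) -> float:
    """Poisson probability of exactly j upsets."""
    return math.exp(-lam) * lam ** j / math.factorial(j)


def task_reliability(lam: float) -> float:
    """Assumption: any upset in a sensitive bit is a failure, so R = P(0 upsets)."""
    return prob_j_upsets(lam, 0)


def redundant_reliability(r_task: float, replicas: int) -> float:
    """1-out-of-r active redundancy: the task fails only if all replicas fail."""
    return 1.0 - (1.0 - r_task) ** replicas


def tg_reliability(task_reliabilities: Iterable[float]) -> float:
    """Assumption: the TG succeeds only if every task succeeds."""
    return math.prod(task_reliabilities)


def tg_mttf(makespan: float, r_tg: float) -> float:
    """Assumed form: makespan divided by the TG failure probability."""
    return makespan / (1.0 - r_tg)


# A longer residency time (e.g., due to prefetch) lowers the reliability of one replica,
# whereas an extra replica raises the reliability of the fault-tolerant task.
r_short = task_reliability(upset_rate(1e-9, 1e6, 0.1, 5.0, 0.0))
r_long = task_reliability(upset_rate(1e-9, 1e6, 0.1, 5.0, 5.0))
print(r_long < r_short, redundant_reliability(r_long, 2) > r_long)
```

Under these assumptions, the sketch reproduces the qualitative trade-off discussed above: prefetching increases RT_τ and thus the upset rate, and added redundancy compensates for it.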
C. Proposed Early Fetch Technique
The motivational examples of Section III have compared two possible modes of application of the proposed early fetch technique between the involved Stages i − 1 and i.
1) Some modifications are carried out in just RS(TG(i − 1)) and LS(TG(i)) (Fig. 3), where TG(i) indicates the TG executed at Stage i. As a consequence, the early fetch between Stages i − 1 and i does not impact the applicability of this technique between Stages i and i + 1.
2) Some modifications are carried out in RS(TG(i − 1)), LS(TG(i)), and RS(TG(i)) (Fig. 4). In this case, due to the modifications introduced in RS(TG(i)), the early fetch between these two stages does impact the applicability of the technique between Stages i and i + 1.
In the aforementioned examples, it was assumed that the same TG is executed twice in the system. However, it is clear that, if a different TG TG_j runs at Stage 2 (both in Figs. 3 and 4), the modifications carried out at the schedules of both TGs could be completely different. Without loss of generality, if n TGs can be executed after the TG of Fig. 1, n different pairs of modifications can be introduced at the schedules of the involved TGs.
In this paper, in order to apply the early fetch technique, the profiling of all the n TGs has been carried out at design time in order to obtain the modified versions of their schedules. At runtime, the proper version will be dynamically selected depending on the runtime conditions. In the previous case 1), the profiling of TG(i) involves examining all the n TGs that may run at Stage i − 1. However, for the previous case 2), such profiling would involve considering the complete sequence of TGs at Stages [1 . . . i − 1]. Given the potentially large number of TGs and stages that may exist in an actual system, in the latter case, such profiling is infeasible, since it would involve a combinatorial explosion of TG sequences. Therefore, the early fetch has been restricted to what is shown in (1) and Fig. 3.
In other words, only tasks whose execution does not go further than boundary can be candidates to be early fetched.
Thus, given a set of n TGs (TGS) that can be executed in the system, the methodology presented in this paper carries out an n × n design-time profiling for each TG TG_x ∈ TGS in order to modify the initial schedules of all the possible pairs RS(TG_x) and LS(TG_y), ∀ TG_x, TG_y ∈ TGS, by selecting the most appropriate task(s) from TG_y to be early fetched in TG_x, assuming that TG_y runs immediately after TG_x.
In the examples of Section III, the value of boundary was set to LastConfigTime. However, it was also stated that this value could actually be selected such that LastConfigTime ≤ Boundary < Makespan. The question that arises is how to select the most appropriate value for this parameter. In the example of Fig. 4, the only two tasks that are candidates for early fetch are τ_1 and τ_2, since they are the only ones whose execution time falls entirely within LS(TG). However, if boundary were set to 806 ms (i.e., the end of execution of task τ_3), then τ_3 would also be eligible for early fetch, at the cost of reducing the time margin of RS(TG) available to early fetch tasks from the next stage from 335 to 300 ms. In order to achieve a good tradeoff between these two metrics, in the presented approach, this parameter has been set as follows:
$$\mathit{Boundary}(TG) = \max\bigl(\mathit{LastConfigTime}(TG),\ \mathit{Makespan}(TG) - \mathit{MaxConfigDelays}\bigr) \tag{11}$$
where
$$\mathit{MaxConfigDelays} = \max_{TG_i \in TGS} \mathit{ConfigDelay}(TG_i) \tag{12}$$
and
$$\mathit{ConfigDelay}(TG_i) = \sum_{\tau_j \in TG_i} \mathit{ConfigDelay}(\tau_j) \tag{13}$$
in which ConfigDelay(τ_j) indicates the configuration delay of task τ_j in the target device.
The complete approach is described in Algorithm 1. First of all, in the proposed algorithm, the scoring function of the tasks of TG_y is calculated in Lines 2-8. In the next lines (Lines 9-11), each candidate task to be early fetched is examined to obtain its MTTF difference (δ_MTTF) when applying this technique. Afterward, tasks are sorted in decreasing order of their δ_MTTF (Line 12). Then, the algorithm calculates the time elapsed between boundary and the makespan of the previous TG execution (TG_x), which is referred to as FreeTime (Line 13). This time is used to determine how many tasks from the current TG (TG_y) can be early fetched in the previous one (TG_x). The candidate tasks to be early fetched are selected according to their δ_MTTF. At each iteration, it is assessed whether the candidate task τ_i can be early fetched within the FreeTime of the previous stage, and whether its additional replica can be added in the current one (Line 15). If this condition is true, τ_i is early fetched, and then the RS of the previous schedule, the LS of the current one, and FreeTime are updated (Lines 16-19). The algorithm returns these two new subschedules for RS(TG_x) and LS(TG_y) (Line 22).
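The boundary rule and the greedy selection loop described above can be sketched as follows. This is a simplified illustration and not Algorithm 1 itself: the `Candidate` record, its fields, and the candidate values are hypothetical, the scoring function is replaced by a precomputed δ_MTTF per task, and the feasibility check of Line 15 is reduced to two scalar budgets (free time and free area) instead of an update of the actual RS and LS subschedules. The values 771 and 1106 ms correspond to the LastConfigTime of the running example and to the makespan implied by its 335 ms margin.

```python
from dataclasses import dataclass
from typing import List


def boundary(last_config_time: float, makespan: float, max_config_delays: float) -> float:
    """Boundary rule: max(LastConfigTime, Makespan - MaxConfigDelays)."""
    return max(last_config_time, makespan - max_config_delays)


@dataclass(frozen=True)
class Candidate:
    """A task of the next TG that satisfies the eligibility condition (hypothetical record)."""
    name: str
    config_delay: float  # time needed to configure one replica
    area: int            # resources of one replica (arbitrary unit)
    delta_mttf: float    # MTTF gain obtained if the task is early fetched


def select_early_fetch(cands: List[Candidate], free_time: float, free_area: int) -> List[str]:
    """Greedy selection: largest MTTF gain first, while the time and area budgets allow."""
    chosen = []
    for c in sorted(cands, key=lambda c: c.delta_mttf, reverse=True):
        fits = c.config_delay <= free_time and c.area <= free_area
        if c.delta_mttf > 0 and fits:
            chosen.append(c.name)
            free_time -= c.config_delay  # part of the RS margin is now used
            free_area -= c.area          # the extra replica occupies resources
    return chosen


b = boundary(last_config_time=771, makespan=1106, max_config_delays=400)
free_time = 1106 - b  # FreeTime: margin between boundary and makespan
cands = [
    Candidate("A", config_delay=200, area=2, delta_mttf=5.0),
    Candidate("B", config_delay=200, area=2, delta_mttf=9.0),
    Candidate("C", config_delay=100, area=1, delta_mttf=1.0),
]
print(b, select_early_fetch(cands, free_time, free_area=4))
```

With these illustrative numbers, the boundary stays at 771 ms, and the margin of 335 ms accommodates tasks B and C but not A, whose configuration no longer fits once B has been selected.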
D. Hardware Implementation
For each pair of TGs TG_x and TG_y ∈ TGS, such that TG_y is executed immediately after TG_x, the result of the profiling described in Section IV-C is a pair of schedule versions: one for RS(TG_x) and another one for LS(TG_y). Therefore, with n TGs in the system, n × n schedule versions are generated at design time for each TG. At runtime, the proper ones are selected dynamically, depending on the runtime sequence of running TGs. This is shown in Fig. 5, where one can see that n + 1 versions of LS(TG_j) and another n + 1 versions of RS(TG_j) are possible (the n generated schedule versions plus the default one). In case no information exists at runtime about the previous or next TGs, the selected schedule is just the original one (this is indicated in Fig. 5 by means of the symbol ∅). This happens, for instance, when a TG is executed after a system reset, or when, at the time a TG finishes its execution, no other TG has been requested for execution yet (and hence, the system remains idle for a while).
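The runtime selection among the n + 1 versions of each side can be pictured as a table lookup. The following sketch is a software analogy only: the dictionary layout, the function names, and the placeholder strings standing for subschedules are all hypothetical, and `None` plays the role of the symbol ∅.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, Optional[str]]  # (this TG, neighbouring TG or None for "unknown")


def build_tables(tg_ids: List[str]) -> Tuple[Dict[Key, str], Dict[Key, str]]:
    """Build placeholder tables holding n + 1 LS versions and n + 1 RS versions per TG."""
    ls: Dict[Key, str] = {}
    rs: Dict[Key, str] = {}
    for cur in tg_ids:
        ls[(cur, None)] = f"LS({cur}) original"  # default version (no known previous TG)
        rs[(cur, None)] = f"RS({cur}) original"  # default version (no known next TG)
        for other in tg_ids:
            ls[(cur, other)] = f"LS({cur}) when run after {other}"
            rs[(cur, other)] = f"RS({cur}) when run before {other}"
    return ls, rs


def select_versions(ls: Dict[Key, str], rs: Dict[Key, str], prev_tg: Optional[str],
                    cur_tg: str, next_tg: Optional[str]) -> Tuple[str, str]:
    """Pick the LS version from the previous TG and the RS version from the next TG."""
    return ls[(cur_tg, prev_tg)], rs[(cur_tg, next_tg)]


ls_table, rs_table = build_tables(["A", "B", "C"])
# After a reset, the previous TG is unknown, so the original LS is used.
print(select_versions(ls_table, rs_table, None, "A", "B"))
```

The LS version depends only on the previous TG and the RS version only on the next one, which is what keeps the number of stored versions linear in n for each side of each TG.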
In order to carry out the proper runtime selection of the TG schedules in a transparent and efficient manner, this paper also presents a hardware architectural support (Fig. 6) that can be implemented using some of the reconfigurable resources of the target FPGA. In our implementation, the schedule pairs of all the possible TGs are stored in a memory (see Fig. 6). It is assumed that a schedule is composed of a set of instructions, each of which has the following information.
1) Task ID: The ID of the task that is going to be scheduled. The ID of a task is unique among all tasks of the TG.
2) Reconfiguration/Exec.: Indicates if the task will be reconfigured or executed. This field is just 1 bit ("1" = reconfiguration and "0" = execution).
This information corresponds to the output data port of the schedules' memory, where the instructions are read (see Fig. 6).
The proposed system has been designed to work autonomously from the moment when the schedule of a TG is requested. It features a queue of TGs to be executed (TGs' queue), which has been implemented using a fixed first-in-first-out approach. This architecture is assumed to communicate with an upper layer of middleware or an operating system that dispatches the TGs at runtime.
When the TGs' queue is not empty, the system starts carrying out the proper scheduling operations assigned to the first TG in the queue. The hardware described in Fig. 6 is steered by a control unit, which has been implemented as a finite state machine. It implements the pseudocode presented in Algorithm 2. Thus, if the TGs' queue is not empty, the first step is to read the first TG from the queue (Line 2). Two pieces of information are stored for a TG: its unique ID, and its value for boundary. Both of them are read from the TGs' queue and stored in separate registers in the architecture (see Fig. 6 ). An additional register stores the ID of the TG that was executed prior to the current one (previous TG ID register). This register is used to select the appropriate schedule from the memory, as it was explained earlier.
With this information, the schedule of the current TG is retrieved from the schedules' memory, instruction by instruction (Lines 3-7 in Algorithm 2). The address port of this memory is connected to the following four pieces of information, sorted from the most significant bit (MSB) to the least significant bit (LSB).
1) 1 bit indicating if the instruction belongs to the LS or the RS of the schedule. This is known by comparing the boundary value with the total number of clock cycles that have elapsed from the start of the current scheduling stage (which are stored in the total cycles counter). If Total Cycles Counter < Boundary, this bit is "0"; otherwise, its value is "1." Hence, the lower half of the memory stores the LSs of all the schedules, whereas the upper half of the memory does likewise with the schedules' RSs.
2) The ID of the previous (or the next) TG to be executed. This information is retrieved from the previous TG ID register, and from the data output port of the TGs' queue, respectively, and it is selected by the multiplexer that can be seen in Fig. 6. The selection signal of this multiplexer is the bit described in the previous paragraph.
3) The ID of the TG currently under execution (current TG ID register).
4) The output of a counter that keeps track of the schedule's instructions that have been executed so far (instructions counter).
This allows storing the information of the many possible schedules in the memory in a modular way: the instructions of each side (left or right) of the schedules are physically placed in adjacent positions in the memory, since the instructions counter's output is connected to the LSBs of the memory address port. The exact location of these instructions in the memory is determined by the values of the IDs of the previous and next TGs to be executed. Thus, this hardware support allows fetching the proper instructions in an automatic and transparent manner, with negligible delays and with low resource consumption. At runtime, depending on whether the total cycles counter is below or above the boundary value, the instructions are fetched from the lower or upper half of the memory, respectively, in a very simple but effective manner.
Algorithm 2. Implementation of the control unit.
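The composition of the memory address from these four fields can be modeled with plain bit operations. The sketch below is a software model under assumptions, not the hardware description: the field widths (`tg_id_bits`, `instr_bits`) are hypothetical parameters that the text does not specify, and the encoding of the null TG (∅) is left out; a reserved ID value would be one possible choice.

```python
def select_side(total_cycles: int, boundary: int) -> int:
    """Return 0 (LS, lower half of the memory) before the boundary, 1 (RS, upper half) otherwise."""
    return 0 if total_cycles < boundary else 1


def schedule_address(side_bit: int, neighbour_tg_id: int, current_tg_id: int,
                     instr_count: int, tg_id_bits: int, instr_bits: int) -> int:
    """Concatenate the fields, MSB to LSB: side | neighbour TG ID | current TG ID | instruction counter."""
    if side_bit not in (0, 1):
        raise ValueError("side_bit must be 0 or 1")
    if not 0 <= neighbour_tg_id < (1 << tg_id_bits):
        raise ValueError("neighbour_tg_id does not fit in tg_id_bits")
    if not 0 <= current_tg_id < (1 << tg_id_bits):
        raise ValueError("current_tg_id does not fit in tg_id_bits")
    if not 0 <= instr_count < (1 << instr_bits):
        raise ValueError("instr_count does not fit in instr_bits")
    addr = side_bit
    addr = (addr << tg_id_bits) | neighbour_tg_id
    addr = (addr << tg_id_bits) | current_tg_id
    addr = (addr << instr_bits) | instr_count
    return addr


# Illustrative widths: 2-bit TG IDs and a 4-bit instructions counter.
side = select_side(total_cycles=900, boundary=771)  # past the boundary, so RS
addr = schedule_address(side, neighbour_tg_id=2, current_tg_id=1,
                        instr_count=3, tg_id_bits=2, instr_bits=4)
print(format(addr, "09b"))  # side=1, neighbour=10, current=01, counter=0011
```

Because the instructions counter occupies the least significant bits, consecutive instructions of the same subschedule land in adjacent addresses, which is the modular layout described above.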
When an instruction is fetched from the memory (Line 4 in Algorithm 2), the signals current Task ID, Reconfiguration/Exec., and the output of the multiplexer that is connected to the Early fetched bit are transmitted simultaneously both to the hardware multitasking system that runs the tasks and to the reconfiguration circuitry (shown in Fig. 6 ). The latter multiplexer is used to select the ID of the TG that the current task belongs to. Thus, if Early fetched = 0, then the task belongs to the TG indicated in the current TG ID register. Otherwise, it belongs to the next TG, which is indicated by the next TG ID signal (in other words, it has been early fetched).
Describing the reconfiguration circuitry and the Hardware (HW) multitasking system is out of the scope of this paper, since there are many implementation options for both of them available in the literature [28] , [29] , [39] . All of them assume that the available resources are divided into a number of partially reconfigurable regions that host the execution of the hardware tasks. That system is also assumed to manage the communications among tasks, as well as the correct execution of the tasks considering their FT technique [38] . In addition, it is assumed that the physical placement of the tasks has been decided elsewhere: the hardware depicted in this section only triggers the reconfiguration/execution of the tasks in the reconfigurable hardware, exactly on the location specified in the programming file of the task. This location has been decided at design time by the placer in another step of the flow.
The value of the total cycles counter is used to determine whether the current schedule's instruction has finished (activation of the signal instruction complete in Fig. 6 ). Thus, when Reconfiguration/Exec. = "0" (task execution), the following condition is checked:
Total Cycles Counter == Starting time + Duration.
If this condition is true, the instruction complete line is activated, by selecting the result of the comparison with the multiplexer. In case Reconfiguration/Exec. = "1," an additional condition is checked: whether the reconfiguration circuitry has finished carrying out the reconfiguration of the current task (by selecting the other input line of the multiplexer and the AND gate). In either of these two cases, while the condition is not true, the control unit increases the total cycles counter by one and the comparison is repeated every cycle (Line 6 in Algorithm 2). When the condition finally becomes true, the control unit triggers the execution of the next schedule's instruction by increasing the instructions counter by one, reading the next instruction from the memory, and repeating the process. All this is equivalent to the iterations of the FOR loop in Algorithm 2.
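The behavior of the multiplexers and of the completion check can be mimicked with a short software model. The following is a minimal sketch under stated assumptions: the function names and the dictionary-based instruction format are hypothetical, the reconfiguration circuitry is assumed to report completion in time (reconfig_done defaults to True), and the instructions are assumed to be stored in order of completion time.

```python
def owner_tg_id(early_fetched, current_tg_id, next_tg_id):
    """Multiplexer driven by the Early fetched bit: selects the TG that the
    current task belongs to."""
    return next_tg_id if early_fetched else current_tg_id


def instruction_complete(reconfig, total_cycles, start, duration,
                         reconfig_done=True):
    """Model of the instruction complete line.

    reconfig=False (execution): only the timing comparison is checked.
    reconfig=True (reconfiguration): the comparison is ANDed with the
    'reconfiguration finished' signal.
    """
    time_reached = total_cycles == start + duration
    if reconfig:
        return time_reached and bool(reconfig_done)
    return time_reached


def run_schedule(instructions):
    """Step through a list of instructions (dicts with keys 'start',
    'duration', 'reconfig') and return the cycle at which each completes."""
    total_cycles = 0
    completed_at = []
    for ins in instructions:  # plays the role of the instructions counter
        if total_cycles > ins["start"] + ins["duration"]:
            raise ValueError("instruction deadline already passed")
        while not instruction_complete(ins["reconfig"], total_cycles,
                                       ins["start"], ins["duration"]):
            total_cycles += 1  # one clock cycle elapses
        completed_at.append(total_cycles)
    return completed_at
```

For example, a reconfiguration starting at cycle 0 with a duration of 3 cycles, followed by an execution starting at cycle 3 with a duration of 5 cycles, completes at cycles 3 and 8, respectively.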
It is important to highlight that the control word "111……111" is used to identify the end of the schedule of the current TG. Thus, when the end schedule line in Fig. 6 is activated, the schedule finishes; the TGs' queue and the previous TG ID register are updated, and the two counters are reset (Lines 9-14). Note that the previous TG ID register is updated to the value of current TG ID only when, at that time, there is another TG in the TGs' queue. Otherwise, it is updated to a null value. This ensures that, if the current TG was executed assuming that the following one is null, then the following one is also executed assuming that the previous one is null, and vice versa. In addition, the TGs' queue is updated only after the execution of each TG (hence, if the execution of a TG is requested while another one is running, the TGs' queue will be updated only at the time instant marked by boundary). Finally, when a TG finishes its execution and the TGs' queue is not empty, the algorithm runs again; otherwise, it waits until the next request of a TG execution.
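The update rule applied at the end of a schedule can be summarized with a small sketch. The function name, the use of None as the null value, and the deque-based queue are assumptions made for illustration only; the counter resets are omitted.

```python
from collections import deque

NULL_TG = None  # assumed encoding of the null TG ID


def end_of_schedule(current_tg_id, tg_queue):
    """Return (previous_tg_id, new_current_tg_id) once the end-of-schedule
    control word is read.

    The previous TG ID register takes the value of current TG ID only if
    another TG is waiting in the queue; otherwise it is set to null.
    """
    if tg_queue:
        return current_tg_id, tg_queue.popleft()
    return NULL_TG, NULL_TG
```

In this model, a TG that starts after an idle period always sees a null previous TG, which matches the assumption under which the preceding TG was executed.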
V. EXPERIMENTAL RESULTS
A. Experimental Setup
In order to evaluate the proposed technique, several experiments have been done on actual TGs obtained from multimedia applications. These TGs are categorized in two groups.
1) Image Applications: Two versions of the JPEG decoder (serial and parallel), an MPEG-1 encoder, and a pattern recognition application (HOUGH) [3] . (9)]. This application contains nine different TGs with two, four, five, and six consecutive tasks [3] . The model presented in Section IV-B has been employed to estimate the reliability of the tasks. For this purpose, different values for the SER have been used. As indicated in [48] , different altitudes above the Earth have different SERs, which can be measured as #SEUs per bit per time unit. In order to have realistic estimations, we have used the SERs of the following four "harsh" orbits: Geosynchronous (GEO), Global Positioning System, Molniya, and Polar. In addition, a low Earth orbit (LEO) has been used as a point of reference, as it features the lowest SER (see Table II ). For each orbit, the SER is estimated for the following solar conditions: worst week, worst day, peak five minutes, and solar max conditions of a solar energetic particle event [42] , [47] for the Xilinx Virtex-5 XUPV5LX110T FPGA [49] , using the CREME96 tools [44] . We believe that the estimations are reliable, because the selected FPGA's technology has been extensively studied in the literature against different sources of radiation [50] , [51] . By using the documentation provided by the manufacturer and by carrying out experimental measurements, it was possible to calculate the reconfiguration overhead of tasks in this device.
In addition, the HW manager that was described in Section IV-D has been implemented on an FPGA. In this case, the Xilinx Virtex UltraScale XCVU095-2FFVA2104E FPGA has been used. We have selected that device for implementation, because it is included in the UltraScale VCU108 evaluation kit, which is a prototyping board that includes the necessary elements to easily implement any hardware design, at a reasonable cost, on a state-of-the-art FPGA [52] .
B. Performance Evaluation for Static Soft Error Rates
In the first experiment, TGs are executed assuming that the SER does not change over time. This experiment examines two cases: executing TGs individually and executing sequences of multiple TGs. The SER used in this experiment is the average of the lowest (LEO-worst day) and the highest (GEO-peak 5 min) SERs listed in Table II. The results for individual TGs are presented in Table III , which first lists the TGs' characteristics, including task count, makespan, boundary value, and the MTTF obtained with the ASAP scheduling strategy. It then shows the MTTF and the MTTF improvement achieved by applying the proposed early fetch technique. Finally, the last column shows the number of early fetched tasks.
This experiment shows the positive impact of applying the proposed technique to actual TGs: without deteriorating their makespan, the MTTF is improved by 114% on average. Notably, using other SERs yields very similar results in terms of MTTF improvement.
The proposed technique has also been applied to sequences of multiple TGs. The obtained results are presented in Table IV . This experiment examines three different groups of TGs: image applications, video applications, and a combination of all the TGs. For each set of TGs, two different cases have been examined. 1) Early Fetch: The performance of the proposed technique. Recall that only the tasks that finish completely before the boundary are eligible to be early fetched. Otherwise, as discussed earlier, the n × n TG profiling is unfeasible. 2) Ideal Early Fetch: An ideal scenario, where the runtime TG execution order is known in advance. Here, a customized TG profiling has been made to obtain the modified schedules, and all the tasks (even those finishing after the boundary) were eligible to be early fetched. For this experiment, a sequence of 100 random stages has been generated. The obtained results show that the proposed early fetch technique has a very positive impact on the MTTF. In addition, the ideal case yields an MTTF improvement three or four times greater than that of the early fetch technique. The reason is that the MTTF improvement grows exponentially when the reliability of the TGs approaches 1 [see (10) ]. In other words, the number of early fetched tasks has an exponential impact on the MTTF improvement. For instance, when one task is early fetched, the MTTF improvement is, on average, +25%. When two tasks are early fetched, this improvement becomes +110%, and when three tasks are early fetched, a +415% MTTF improvement is achieved.
C. Performance Evaluation for Dynamic Soft Error Rates
In the second experiment, the proposed technique has been evaluated under a dynamic SER environment. In this case, the aforementioned TGs have been hardened with two state-of-the-art FT techniques, and then, the early fetch technique has been applied to them. These two techniques are as follows.
1) Adaptive Technique: An adaptive FT technique, also known as "three-mode adaptive strategy," which has been presented in [5] . It employs different FT techniques for different ranges of SERs, but for each SER, a single FT technique is used for all the tasks. Thus, no redundancy is applied when the SER is lower than 10% of the expected range of SERs, TMR is applied when the SER is above 50% of the expected range of SERs, and DWC is used otherwise. 2) Pareto-Based Technique: Ramezani et al. [6] addressed the problem of applying optimal FT techniques to TGs, with respect to a given schedule, using multiobjective optimization methods. That work showed that it is possible to increase the MTTF of a TG without deteriorating its makespan, by using some solutions of the Pareto set obtained from the optimization method. The obtained results are shown in Fig. 7 . The experiments have been performed on the SERs presented in Table II . The SERs have been categorized according to the adaptive technique, but for the sake of clarity, for each SER category, a uniformly distributed subset of three of them has been evaluated. The obtained results show that the early fetch technique outperforms both the adaptive and the Pareto-based techniques.
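The thresholds of the adaptive technique can be illustrated with a short sketch. It assumes that "10% (50%) of the expected range of SERs" refers to the position of the current SER between the minimum and maximum expected values, normalized to [0, 1]; this interpretation, as well as the function name and the returned labels, is an assumption and is not taken from [5].

```python
def adaptive_ft(ser, ser_min, ser_max):
    """Select the FT technique applied to all tasks for a given SER.

    Assumed interpretation: the thresholds apply to the normalized position
    of ser within [ser_min, ser_max].
    """
    if ser_max <= ser_min:
        raise ValueError("ser_max must be greater than ser_min")
    position = (ser - ser_min) / (ser_max - ser_min)
    if position < 0.10:
        return "none"  # no redundancy
    if position > 0.50:
        return "TMR"   # triple modular redundancy
    return "DWC"       # duplication with comparison
```

Under this reading, an SER located at 30% of the expected range selects DWC, whereas one at 90% selects TMR.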
In addition, the results demonstrate that the improvements achieved over the Pareto-based FT technique are much more significant in environments with lower SERs. As with the results shown in Table IV , the reason is that the MTTF increases much faster as reliability approaches 1, and it tends to infinity when reliability = 1 [see (10)].
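The sensitivity of the MTTF to small reliability gains near 1 can be illustrated numerically. The sketch below does not reproduce (10); it assumes, purely for illustration, a simple geometric model in which a TG with makespan T and per-execution reliability R is executed repeatedly and independently, so that the expected time to the first failure is T / (1 - R). The function name and this model are assumptions.

```python
import math


def mttf_geometric(makespan, reliability):
    """Illustrative MTTF under an assumed geometric failure model
    (not the expression (10) of this paper)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    if reliability == 1.0:
        return math.inf  # a TG that never fails has unbounded MTTF
    return makespan / (1.0 - reliability)
```

Under this assumed model, raising the reliability from 0.9 to 0.99 multiplies the MTTF by about 10, whereas raising it from 0.5 to 0.59 multiplies it by only about 1.22, which is consistent with the larger gains observed at lower SERs.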
D. Hardware Implementation
Finally, the amount of hardware resources used to implement the proposed hardware architectural support is shown in Table V, which reports the number of look-up tables (LUTs) and flip-flops (FFs) used, broken down by module: the TGs' queue, the schedules' memory, and the control unit. It can be observed that the amount of consumed resources is very affordable: no more than 0.03% of the total FFs and LUTs, and 1.22% of the available block random access memories (BRAMs). The latter value is reasonable, considering that the system needs space to store all the schedule versions for all the TGs, in all the possible scenarios that can exist at runtime.
These values refer to the Xilinx Virtex UltraScale XCVU095-2FFVA2104E FPGA [52] . They correspond to a system with a maximum of 16 TGs (hence, the number of bits to represent the TG ID is n TG ID = 4), at most 16 tasks in each TG (n task ID = 4), schedules with up to 32 different schedule instructions per side (n instr = 5), a TGs' queue with 256 positions, and counters with a width of n cycles = 20 bits. The latter parameter determines the range of times that can be measured for task reconfigurations and executions: a 20-bit counter spans 2^20 = 1 048 576 time units, for instance, from 1 μs to 1048.6 ms with a time base of 1 μs (if the counter is incremented on every cycle of a 100-MHz task clock, the range is instead 10 ns to about 10.49 ms). For the sake of simplicity, in this case, the widths of the fields starting time and duration (from the memory data output), as well as those of the boundary register and the total cycles counter, were set to the same value n cycles . This system scales well for different values of the parameters described earlier, but it must be considered that, every time the width of the memory's address port (2×n TG ID +n instr +1) increases by 1, the number of BRAMs needed doubles. Figs. 8 and 9 show the resource consumption for different values of this sum. As can be seen, the FF and LUT consumption remains below 0.12% in all cases, but when 2×n TG ID +n instr +1 > 17, the BRAM consumption reaches double-digit percentages. However, this is still an affordable cost for a system that supports a reasonably high number of different TGs. In addition, if the width of the memory's output data port (see Fig. 6 again) , i.e., 2 × n cycles + n task ID + 2, becomes greater than 64, then the total number of BRAMs doubles as well. Nevertheless, this does not happen unless TGs with thousands of different tasks are used.
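The port widths and counter ranges discussed above follow from simple arithmetic, which the sketch below reproduces. The function names are hypothetical, and the address width uses the expression 2 × n TG ID + n instr + 1, which matches the fields of the address port (two TG IDs, the instructions counter, and the lower/upper half selection bit). The mapping from bits to BRAM primitives is device dependent and is deliberately not modeled.

```python
def address_width(n_tg_id, n_instr):
    """Width of the schedules' memory address port, in bits."""
    return 2 * n_tg_id + n_instr + 1


def data_width(n_cycles, n_task_id):
    """Width of the memory output data port: starting time, duration,
    task ID, Reconfiguration/Exec. bit, and Early fetched bit."""
    return 2 * n_cycles + n_task_id + 2


def schedule_memory_bits(n_tg_id, n_instr, n_cycles, n_task_id):
    """Raw storage needed by the schedules' memory, in bits."""
    depth = 1 << address_width(n_tg_id, n_instr)
    return depth * data_width(n_cycles, n_task_id)


def counter_range_seconds(n_cycles, tick_seconds):
    """(shortest, longest) time measurable with an n_cycles-bit counter
    that advances once every tick_seconds."""
    return tick_seconds, (1 << n_cycles) * tick_seconds
```

For the reported configuration (n TG ID = 4, n instr = 5, n cycles = 20, n task ID = 4), this gives a 14-bit address port and a 46-bit data port. Increasing n instr from 5 to 6 doubles the raw storage, which is consistent with the doubling of the number of BRAMs noted above.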
Focusing on the width of the output data port of the TGs' queue, i.e., n cycles + n TG ID , if this value becomes greater than 32, then the system will need one additional BRAM. The same happens when the depth of the queue is greater than 1024 positions. In these two cases, the increase in FF and LUT consumption is negligible.
As discussed earlier, in this implementation, the bottleneck is clearly the embedded BRAMs consumption. Thus, if, for instance, a small FPGA is used, a good solution would be to store the schedules in an off-chip memory (such as a FLASH, or a DDR2, commonly available in commercial FPGA-based prototyping boards), or a memory hierarchy composed of an on-chip cache plus an off-chip memory, which is very common in computer architecture. Of course, in this case, a cost in terms of performance loss has to be paid. However, even if the performance of the proposed implementation decreased drastically, this would not involve significant runtime delays, since this system needs no more than 100 additional clock cycles to carry out the runtime computations. In addition, both this hardware and the multitasking system that steers the execution of the hardware tasks can work at different frequencies.
VI. CONCLUSION
This paper has presented a technique, named task early fetch, to improve the MTTF of hardware applications represented as TGs running on FPGA-based reconfigurable computers under harsh environments, without deteriorating their makespan. This technique receives as input a set of TGs that can potentially run in the target system, and, at design time, it applies two modifications to their schedules. On the one hand, it prefetches some tasks from a given TG within the execution of the previous one. On the other hand, it increases the redundancy level of the selected tasks. Since the actual sequence of TGs that will run in the system is not known at design time, this technique performs an n×n profiling, n being the number of TGs. This paper has also presented a hardware architecture that carries out the proper runtime management of the modified schedules in an efficient and transparent manner, and with negligible runtime overheads.
The impacts of the proposed technique have been examined using a set of actual TGs extracted from multimedia applications. Experimental results have demonstrated the positive effects of the proposed technique to improve the MTTF of hardware TGs running on FPGA-based reconfigurable computers, in environments with static and dynamic SERs. Finally, the low cost and the high performance of the presented prototype have been demonstrated.
