Abstract-A partially reconfigurable system is an architecture consisting of general-purpose processors and FPGAs, in which the FPGAs can be reconfigured at run time. On this architecture, software tasks executed on processors coexist with hardware tasks executed on FPGAs. This paper proposes a real-time fault-tolerant scheduling algorithm for software/hardware hybrid tasks. A sufficient schedulability condition for hybrid tasks is derived by analyzing the system's operating conditions at the moment the first deadline is missed. For fault tolerance, software subtasks are scheduled with a rollback/recovery approach and hardware subtasks with a TMR approach. The experimental results show that all deadlines of accepted hybrid tasks are met and that, when multiple faults occur, the processor utilization ratio is considerably higher than that of existing approaches.
I. INTRODUCTION
An FPGA's configuration information is stored in static RAM, which is easily disturbed by space particles and electromagnetic waves. When the static RAM is disturbed, an SEU (Single Event Upset) occurs and the internal circuit fails. To avoid the serious consequences of a system failure, the FPGA needs a fault-tolerance capability which ensures that all real-time task deadlines are met even when part of the internal circuit has failed.
Fault-tolerant solutions for FPGAs are usually divided into two categories, based on hardware and on software respectively. The first follows the idea of hardware redundancy: spare resources are set aside in the FPGA chip, and if resources are damaged somewhere, they are replaced by the spare resources [1]. Doumar et al. [2] proposed a solution in which a specially designed SRAM allows configuration data to be moved between rows, columns, and I/O modules. When a failure occurs, the configuration data of the failed resource is transferred to adjacent free resources according to specific rules, which returns the system to normal operation. The second approach is based on hardware/software coordination: software first tests the FPGA chip and stores the damage data in a database; the test results are then read from the database to ensure that the damaged parts are not used; finally, the design is laid out again so that the FPGA resumes normal operation [3]. These solutions require the system to stop working while the FPGA chip is being tested, which reduces the performance and flexibility of the system. Some researchers have therefore proposed schemes for online detection and dynamic reconfiguration while the system is still working [4, 5]. All of these solutions, whether they detect faults offline or online, mainly target non-real-time systems and cannot guarantee real-time tasks.
Real-time fault-tolerant scheduling is a technique that achieves system fault tolerance through software and improves system reliability at limited hardware cost. The fault-tolerant scheduling algorithms proposed so far [6]-[15] were mainly developed on the basis of the primary/backup version technique. When multiple processors fail at the same time, each real-time task needs multiple backup versions, which causes processor utilization to decrease rapidly.
To address the rapid decrease of processor utilization and the problem of fault tolerance for real-time hardware tasks, this paper proposes FT-SSHTNB (Fault Tolerant SSHTNB), a fault-tolerant scheduling algorithm for real-time hardware/software hybrid tasks. FT-SSHTNB is based on the SSHTNB algorithm [17].
II. TASK AND SYSTEM MODELS
In the system, hybrid tasks are represented by a set T = {T_1, T_2, …, T_n}, and each task T_i ∈ T is represented by a directed acyclic graph (DAG). Without loss of generality, assume that each task T_i has exactly one entry subtask Entry_i and one exit subtask Exit_i. Figure 1 shows an instance of a task that contains 8 subtasks, where rectangles denote hardware subtasks and circles denote software subtasks.
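The task model above can be illustrated with a minimal Python sketch. The class name `HybridTask`, its fields, and the "sw"/"hw" labels are hypothetical and chosen only for illustration; they are not taken from the paper. The sketch stores a task as a DAG of software and hardware subtasks and checks the single-entry, single-exit assumption.

```python
from dataclasses import dataclass


@dataclass
class HybridTask:
    """Hypothetical container for one hybrid task T_i (illustration only)."""
    kinds: dict   # subtask name -> "sw" (software) or "hw" (hardware)
    edges: list   # precedence constraints as (predecessor, successor) pairs

    def entries(self):
        # Subtasks that never appear as a successor have no predecessor.
        successors = {v for _, v in self.edges}
        return sorted(n for n in self.kinds if n not in successors)

    def exits(self):
        # Subtasks that never appear as a predecessor have no successor.
        predecessors = {u for u, _ in self.edges}
        return sorted(n for n in self.kinds if n not in predecessors)

    def is_acyclic(self):
        # Kahn's algorithm: repeatedly remove subtasks with in-degree zero.
        indegree = {n: 0 for n in self.kinds}
        for _, v in self.edges:
            indegree[v] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        removed = 0
        while ready:
            n = ready.pop()
            removed += 1
            for u, v in self.edges:
                if u == n:
                    indegree[v] -= 1
                    if indegree[v] == 0:
                        ready.append(v)
        return removed == len(self.kinds)

    def has_single_entry_and_exit(self):
        return len(self.entries()) == 1 and len(self.exits()) == 1


# A small diamond-shaped task: one entry, two parallel subtasks, one exit.
task = HybridTask(
    kinds={"entry": "sw", "a": "hw", "b": "sw", "exit": "sw"},
    edges=[("entry", "a"), ("entry", "b"), ("a", "exit"), ("b", "exit")],
)
```

A task with several source or sink subtasks can be brought into this form by adding a dummy entry or exit subtask, which is why the single-entry, single-exit assumption does not restrict the model.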
As shown in Figure 2, the reconfigurable system consists of multiple processors and multiple FPGAs, which are linked together through a high-speed serial bus. Some free resources (slots) are reserved in each FPGA for hardware tasks, and hardware subtasks can be configured into the slots dynamically while the system is running. The hardware subtasks communicate with the software subtasks through the serial bus interface and the hardware task interfaces (HTI).
III. FT-SSHTNB SCHEDULING ALGORITHM
The FT-SSHTNB algorithm places some restrictions on the fault model and on the number of faults. We make the following assumptions about the failure model:
1) At most f computing units are failed at any one time; when the (f+1)th failure occurs, at least one of the previous f faults has already been repaired and the unit has been put back into operation.
2) The interval between two failures is longer than the period of the real-time tasks; that is, at most one failure occurs while a real-time task executes.
Definition 1: An algorithm is called k-fault-tolerant for a task T_i if T_i can tolerate k failures under it; that is, when no more than k failures occur, T_i can be scheduled with fault tolerance by the algorithm, and when the (k+1)th failure occurs, the algorithm enters its exception process [17].
Definition 2: For a task set T = {T_1, T_2, …, T_n}, if a scheduling algorithm is k_i-fault-tolerant for each task T_i, then the algorithm is (k_1, k_2, …, k_n)-fault-tolerant for the task set T.
We can see from the fault model that the FT-SSHTNB algorithm is f-fault-tolerant for each task T_i.
Depending on where the SEU occurs, system failures can be divided into two categories: software failures and hardware failures. If the SEU occurs in main memory, the program's data flow and/or instruction flow is prone to error; if the SEU occurs in the configuration RAM, a hardware error occurs. The FT-SSHTNB algorithm schedules each software subtask on two processors at the same time and compares the subtask's execution state at checkpoints. In this way, processor errors and software task errors can be detected. Hardware subtask errors are detected and tolerated through the TMR structure.
A. Fault-tolerance for Software Subtasks
As Figure 3(a) shows, each software subtask st_ij corresponds to two threads, thread_ij^1 and thread_ij^2, which are scheduled onto two processors and execute simultaneously. By checking the consistency of the checkpoints and the synchronization of the two threads, processor errors and software task errors can be detected. If a failure is detected, the threads are rolled back to the appropriate checkpoint.
From the point of view of thread_ij^1, the fault detection process is shown in Figure 3. In Figure 3, the checkpoint number is generated dynamically during the execution of the thread: it increases by 1 each time the thread encounters a checkpoint. The checkpoint's state is stored as an information block and expressed as a 4-tuple ckp = {id 1 , …
(3) If thisid < otherid, the task T_sa is started to save the state of checkpoint CP_i, and thisid increases by 1.
In algorithm CmpSync, thisid and otherid are the checkpoint numbers of the current thread and of the other thread, respectively. If thisid > otherid, the gap between the two threads exceeds one detection length, so we conclude that the other processor has failed, and the checkpoint's state is recovered. If thisid = otherid, the thread is ahead of the thread on the other processor and the checkpoint has not yet been compared, so the checkpoint is compared and, depending on whether the checkpoint states are consistent, either the new checkpoint is saved or the thread rolls back to the previous checkpoint. If thisid < otherid, the thread is behind the thread on the other processor and the comparison of the checkpoint has already been completed, so only the checkpoint's state needs to be saved.
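The three-way decision described in this paragraph can be sketched in Python as follows. This is only an illustration of the case analysis, not the authors' CmpSync implementation: the function name `cmp_sync_action`, the `states_match` flag, and the action labels are hypothetical, and the bookkeeping of the checkpoint counters is deliberately left out.

```python
def cmp_sync_action(thisid, otherid, states_match):
    """Return the action a thread takes at a checkpoint (illustrative only).

    thisid       -- checkpoint number of the current thread
    otherid      -- checkpoint number of the thread on the other processor
    states_match -- True if the two checkpoint states are consistent;
                    only consulted when a comparison is actually needed
    """
    if thisid > otherid:
        # The other thread lags by more than one detection length:
        # the other processor is presumed failed, so recover the state.
        return "recover"
    if thisid == otherid:
        # This thread arrived first and the checkpoint is not yet compared:
        # compare, then save on agreement or roll back on disagreement.
        return "save" if states_match else "rollback"
    # thisid < otherid: the other thread already did the comparison,
    # so only the checkpoint state needs to be saved.
    return "save"
```

Only the thisid = otherid branch performs a comparison, which is the source of the load asymmetry between the two threads discussed next.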
Both thread_ij^1 and thread_ij^2 execute algorithm CmpSync whenever they reach a checkpoint, so they execute it the same number of times. The execution load of CmpSync, however, is not the same for both threads, because the checkpoint's state sometimes needs to be compared and sometimes does not. The following theorem proves that this load asymmetry does not cause thread_ij^1 and thread_ij^2 to lose synchronization. As shown in Figure 4,
When r faults occur, the maximum execution time of a software subtask of task T_i is:
Checkpoints can effectively reduce task execution time, but if the cost of a checkpoint is large or too many checkpoints are set, the total execution time may be longer than the execution time without checkpoints [16]. According to equation (3), when the task's execution time C_ij, the checkpoint recovery time C_re, and the checkpoint saving time C_sa are fixed, the task's maximum execution time depends on the number of checkpoints m_ij. C will obtain the minimum value. □
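The trade-off between too few and too many checkpoints can be made concrete with a small Python sketch. It does not use the paper's equation (3); it assumes a common textbook cost model, labeled as an assumption here and in the code: the subtask is split into m equal segments, each checkpoint save costs c_sa, and each of the r faults costs one recovery c_re plus the re-execution of at most one segment c / m. The function names are hypothetical.

```python
from math import ceil, floor, sqrt


def worst_case_time(c, c_sa, c_re, r, m):
    """Worst-case execution time under the ASSUMED cost model:
    c + m * c_sa + r * (c / m + c_re), with m >= 1 checkpoints."""
    return c + m * c_sa + r * (c / m + c_re)


def best_checkpoint_count(c, c_sa, c_re, r):
    """Integer m >= 1 that minimizes worst_case_time.

    The model is convex in m with real-valued minimizer sqrt(r * c / c_sa),
    so the best integer is one of its two integer neighbours."""
    m_real = sqrt(r * c / c_sa) if r > 0 else 1.0
    candidates = {max(1, floor(m_real)), max(1, ceil(m_real))}
    return min(candidates, key=lambda m: worst_case_time(c, c_sa, c_re, r, m))


# Example: execution time 100, save cost 1, recovery cost 2, r = 4 faults.
m_best = best_checkpoint_count(100, 1, 2, 4)
```

Under this assumed model the example gives 20 checkpoints and a worst-case time of 148, compared with 509 for a single checkpoint; a different cost model would shift the optimum, so the numbers illustrate only the shape of the trade-off.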
B. Fault-tolerance for Hardware Subtasks
With existing technology, saving and recovering a circuit's state either cannot be implemented or costs too much, so a rollback/recovery strategy is not suitable when a hardware subtask fails. In this paper, we use TMR technology to detect and tolerate failures of hardware subtasks.
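The principle behind TMR can be sketched in a few lines of Python: three replicas compute the same output, a majority voter masks a single faulty replica, and comparing each output with the voted value identifies which replica disagreed. This models only the voting logic in software; the function names are hypothetical, and the paper's actual voter is a hardware structure in the FPGA that is not described in this text.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three replica outputs (non-negative integers).

    Each result bit is 1 exactly when at least two replicas have a 1 there."""
    return (a & b) | (b & c) | (a & c)


def faulty_replicas(a, b, c):
    """Indices (0, 1, 2) of the replicas whose output differs from the vote."""
    voted = tmr_vote(a, b, c)
    return [i for i, out in enumerate((a, b, c)) if out != voted]


# Replica 2 has two flipped bits; the voter still returns the correct word.
voted_output = tmr_vote(0b1010, 0b1010, 0b0110)
```

Here `voted_output` equals `0b1010` and `faulty_replicas(0b1010, 0b1010, 0b0110)` returns `[2]`. The vote is correct as long as at most one replica is wrong in any given bit position, which matches the assumption that at most one failure occurs while a task executes.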
To increase the utilization of reconfigurable resources, hardware subtasks with precedence constraints are partitioned into the same group, and all subtasks in a group are configured into one slot. The partition algorithm is described as follows:
C. Tasks' Schedulability Test
According to Theorem 4-1 in [17], if a set of periodic tasks T = {T_1, T_2, …, T_n} satisfies the
then T can be scheduled by the SSHTNB algorithm on m processors without fault-tolerance requirements. During the execution of task T_k, the critical path changes if some processors fail, so the critical path must be determined for each case in which different processors fail. According to Theorem 4-1 [17], if inequality (5) holds, the task set T can be scheduled by the FT-SSHTNB algorithm on m-f processors. □
D. FT-SSHTNB Algorithm Description
The SSHTNB algorithm [17] can schedule software/hardware hybrid real-time tasks without fault-tolerance requirements. On this basis, a rollback/recovery mechanism is introduced into the FT-SSHTNB algorithm to schedule tasks in the presence of multiple software and hardware faults. The following part describes only the content of FT-SSHTNB that relates to fault tolerance; the rest is similar to SSHTNB in [17].
The FT-SSHTNB algorithm includes two parts: a static algorithm and a dynamic algorithm. In the static algorithm, the fault-tolerant number k_i of each real-time task T_i is determined and the schedulability of the tasks is tested according to Theorem 3. The dynamic algorithm includes the following parts:
Scheduling hardware subtasks: 
E. The Analysis of Real-time Capability
Because computing units have limited reliability, no real-time system can guarantee every real-time task's deadline when some computing units fail. We therefore define the real-time capability of a task as follows:
Definition 5: The real-time capability of a task T_i is the probability P(T_i) that task T_i finishes within its deadline.
SEU is a random event with the following characteristics: (1) in a time interval [t, t+Δt], the probability that k (k ≥ 0) SEUs happen depends only on the interval length Δt and not on the interval endpoints t and t+Δt; (2) SEUs in non-overlapping time intervals happen independently of one another; (3) the probability of two or more SEUs happening can be regarded as zero when the time interval is small enough. The SEU flow can therefore be regarded as a Poisson flow. Let X_k denote the number of SEUs that strike a computing unit within the period P_k of task T_k, and let λ be the intensity of the SEU flow. The probability that the computing unit fails during the execution of task T_k is then:
Let n_F denote the number of computing units that fail. Because computing units fail independently, the probability that q computing units fail at the same time during the execution of task T_k, in a system containing M computing units, is:
When q ≤ f, task T_k can be finished within its deadline under the FT-SSHTNB algorithm, so the real-time capability of task T_k is:
Suppose the real-time system contains 50 computing units and the SEU flow intensity is λ = 10^-5; the real-time capability of task T_k according to formula (8) is then shown in Table I.
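The calculation described in this subsection can be sketched in Python. The formulas in the code are assumptions reconstructed from the prose, since the paper's displayed equations are not available here: a unit is taken to fail when at least one SEU strikes it within the period, giving p = 1 - exp(-λ·P_k) for a Poisson flow; the number of failed units among M independent units is taken to be binomial; and the real-time capability is the probability that at most f units fail. The function names and the period value used in the example are hypothetical.

```python
from math import comb, exp


def unit_failure_probability(lam, period):
    """ASSUMED: P(at least one SEU in one period) for a Poisson flow."""
    return 1.0 - exp(-lam * period)


def real_time_capability(f, units, lam, period):
    """ASSUMED: probability that at most f of `units` independent
    computing units fail within one period (binomial tail sum)."""
    p = unit_failure_probability(lam, period)
    return sum(
        comb(units, q) * p**q * (1.0 - p) ** (units - q)
        for q in range(f + 1)
    )


# 50 computing units and lambda = 1e-5 as in the text; the period of 100
# time units is a made-up value chosen only for this example.
capabilities = [real_time_capability(f, 50, 1e-5, 100) for f in range(4)]
```

Under these assumptions the capability increases with f and approaches 1 quickly, because several units failing within the same period is far less likely than a single failure.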
IV. EXPERIMENTAL RESULTS AND ANALYSIS
In these experiments, we simulated processor utilization under different conditions and compared FT-SSHTNB with related algorithms; the results are shown in Table II, Table III, and Table IV.
FT-RMFF [6] is a classic fault-tolerant scheduling algorithm for periodic tasks based on the RM scheduling algorithm. The HTFS algorithm [12] uses FT-RMFF to test the schedulability of periodic tasks and can also schedule aperiodic tasks with fault tolerance. The DABCBF algorithm [13] improves FT-RMFF by deferring the execution of a task's active slave copy to increase processor utilization. Liu [8] presents a new algorithm to test the schedulability of periodic tasks with fault-tolerance requirements. As can be seen from Table II, these algorithms can only schedule independent software tasks and tolerate one processor failure. Compared with them, the FT-SSHTNB algorithm is considerably more capable.
Because the FT-RMFF, HTFS, Liu, and DABCBF algorithms adopt the primary/slave copy technique, they can also tolerate multiple processor failures if each real-time task has multiple slave copies. In this experiment, task execution times follow a uniform distribution on (0, 0.5P_i], and independent periodic software tasks are scheduled with fault tolerance on 32 processors; the scheduling results are shown in Table III. Table III shows that FT-SSHTNB has low processor utilization when the fault-tolerant number is small, but its processor utilization becomes higher than that of the other algorithms as the fault-tolerant number increases.
Table IV shows the scheduling results for software/hardware hybrid tasks, with the same parameters as for the SSHTNB algorithm. As can be seen from Table IV, the average processor utilization ratio decreases as the fault-tolerant number increases, but the decrease becomes less pronounced as the number of processors increases.
For example, with 8 processors in the system, the processor utilization ratio decreases by 24.7%, whereas with 64 processors it decreases by only 2.8%. The reason is that, if the fault-tolerant number is f, the task set T must be scheduled on m-f processors according to Theorem 3. When m is small, the task load is too large to be scheduled on m-f processors, so the processor utilization ratio is lower.
V. CONCLUSIONS
This paper first presents a method for detecting and tolerating processor and software task faults; when multiple processors fail, this method effectively improves processor utilization. Second, it addresses fault detection and tolerance for hardware subtasks: each hardware subtask is configured into 3 slots in the FPGA, and fault tolerance is realized by TMR technology. Finally, a real-time fault-tolerant algorithm (FT-SSHTNB) is proposed to schedule software/hardware hybrid tasks. The experimental results show that the FT-SSHTNB algorithm can tolerate multiple hardware failures and guarantee that all real-time task deadlines are met at low hardware cost.
