We propose an approach to build fault-tolerant distributed real-time embedded systems. From a given system description (application algorithm and architecture) and a given fault hypothesis (type and number of faults to be tolerated), we automatically generate a static fault-tolerant multiprocessor schedule of the algorithm components on the target architecture, which minimizes the schedule length and tolerates transient faults of both processors and communication media. Our approach is dedicated to heterogeneous architectures with multiple processors linked by several shared buses. It is based on hybrid redundancy and data fragmentation strategies, which allow fast fault detection and handling. This scheduling problem is NP-hard, and we rely on a heuristic algorithm to efficiently obtain an approximate solution. Our simulation results show that our approach generally reduces the schedule length overhead.
Introduction
Today, embedded real-time systems pervade many sectors of human activity, such as transportation, robotics, and telecommunication. Progress in electronics and data processing improves the performance of these systems. As a result, new systems are increasingly small and fast, but also more complex and critical, and thus more sensitive to faults. Because of the catastrophic consequences (human, ecological, and/or financial disasters) that could result from a fault, these systems must be fault-tolerant. This is why fault-tolerant techniques are necessary to make sure that the system continues to deliver a correct service in spite of faults [1]. A fault can affect either the hardware or the software of the system. Thanks to formal validation techniques, such as model-checking and theorem proving, many software faults can be prevented. Although software faults are still an important issue, we chose to concentrate on hardware faults. More particularly, we consider processor and bus faults. A bus is a multipoint connection characterized by a physical medium that connects all the processors of the architecture. As we are targeting embedded systems with limited resources (for reasons of weight, volume, energy consumption, or price constraints), we investigate only software redundancy solutions based on scheduling algorithms.
The paper is organized as follows. Sections 2 and 3 describe related work and the system models, respectively. Section 4 states the fault assumptions and our fault-tolerance problem. Section 5 presents our approach for providing fault tolerance, and Section 6 details the performance of our approach. Finally, Section 7 concludes the paper and proposes future research directions.
Related work
The literature about fault tolerance of distributed embedded real-time systems is very abundant. Yet, there are very few methods that manage to tolerate both processor and bus faults. Here, we present related work involving scheduling heuristics to tolerate processor faults, bus faults, or both.
Processor faults. Several scheduling heuristics have been proposed to tolerate exclusively processor faults. They are based on active software redundancy [2, 3] or passive software redundancy [4] [5] [6]. In active redundancy, multiple replicas of a task are scheduled on different processors and run in parallel to tolerate a fixed number of processor faults. [2] presents an off-line scheduling algorithm that tolerates a single processor fault in multiprocessor systems, while [3] tolerates multiple processor faults. In passive redundancy, also called the primary/backup approach, a task is replicated into one primary and several backup replicas, but only the primary replica is executed. If it fails, one of the backup replicas is selected to become the new primary. For instance, [5] presents a scheduling algorithm that tolerates one processor fault.
Bus faults. Techniques proposed to tolerate exclusively bus faults are based on proactive or reactive schemes. In the proactive scheme [7, 8], multiple redundant copies of a message are sent along distinct buses. In contrast, in the reactive scheme [9], only one copy of the message, called the primary, is sent; if it fails, another copy of the message, called the backup, is transmitted.
Processor and bus faults. Few techniques have been proposed to tolerate both processor and bus faults [10] [11] [12] . In [10] , faults of buses are tolerated using a TDMA (Time Division Multiple Access) communication protocol and an active redundancy approach, while faults of processors are tolerated using a hardware redundancy approach. The approach proposed in [11] tolerates only a specified set of processor and bus permanent faults. The method proposed in [12] is only suited to one class of algorithms called fan-in algorithms. Our approach is more general since it uses only software redundancy solutions, i.e., no extra hardware is required, because hardware resources in embedded systems are limited. Moreover, our approach can tolerate up to a fixed number of arbitrary processor and bus transient faults. This is important since transient faults [13] are increasingly the majority of faults in logic circuits, due to radiation, energetic particles, and so on.
System description
In this section, we present the system models (algorithm and architecture), and define the execution characteristics of the algorithm on the architecture.
Algorithm model. The algorithm is modeled by a data-flow graph, called the algorithm graph and noted Alg. Each vertex of Alg is an operation and each edge is a data-dependency. A data-dependency $(o_1, o_2)$ corresponds to a data transfer from a producer operation $o_1$ to a consumer operation $o_2$, defining a partial order on the execution of operations. We say that $o_2$ is a successor of $o_1$, and that $o_1$ is a predecessor of $o_2$. An operation of Alg can be either an external input/output operation or a computation operation. Operations with no predecessor (resp. no successor) are the input interfaces (resp. output interfaces), handling the events produced by the sensors (resp. actuators). The inputs of a computation operation must precede its outputs. Moreover, computation operations are side-effect free, i.e., the output values depend only on the input values.
Architecture model. The architecture is composed of two kinds of components: processors and buses. A processor $P_i$ consists of an operator $op_i$, a memory resource $m_i$ of type RAM (Random Access Memory), and several communicators $c_{ij}$. A bus $B_i$ consists of one communicator for each existing processor and one memory resource $s_i$ of type SAM (Sequential Access Memory). Each operator executes sequentially a set of operations of Alg, and reads and writes data from and into its local memory. The communicators of the processors cooperate to execute sequentially the transfers of data stored in memory between processors through a SAM. The architecture is modeled by a non-directed graph, called the architecture graph and noted Arc. Vertices of Arc are operators, communicators, and memory resources. Edges of Arc are connections between these components.
Figure 1 (right) gives an example of Arc, with three processors $P_1$, $P_2$, and $P_3$, and two buses $B_1 = \{s_1, c_{11}, c_{21}, c_{31}\}$ and $B_2 = \{s_2, c_{12}, c_{22}, c_{32}\}$, where each processor $P_i$ is made of one operator $op_i$, one local memory $m_i$, and two communicators $c_{i1}$ and $c_{i2}$.
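As an illustration of these two models, the following minimal Python sketch encodes a small algorithm graph and the architecture of Figure 1 (three processors, two buses). The four-operation graph and the helper names `build_alg` and `build_arc` are hypothetical and chosen only for this example; only the component naming (op, m, c, s) follows the text.

```python
from collections import defaultdict

# Hypothetical algorithm graph Alg (not from the paper): each pair is a
# data-dependency (producer, consumer).
ALG_EDGES = [("I", "A"), ("I", "B"), ("A", "O"), ("B", "O")]


def build_alg(edges):
    """Return (operations, predecessors, successors) of a data-flow graph."""
    ops, preds, succs = set(), defaultdict(set), defaultdict(set)
    for producer, consumer in edges:
        ops.update((producer, consumer))
        succs[producer].add(consumer)
        preds[consumer].add(producer)
    return ops, preds, succs


def build_arc(n_processors, n_buses):
    """Build the processors and buses of Arc with the naming of Figure 1."""
    processors = {
        f"P{i}": {
            "operator": f"op{i}",
            "memory": f"m{i}",
            # One communicator per bus the processor is connected to.
            "communicators": [f"c{i}{j}" for j in range(1, n_buses + 1)],
        }
        for i in range(1, n_processors + 1)
    }
    # A bus is one SAM plus one communicator of each processor.
    buses = {
        f"B{j}": [f"s{j}"] + [f"c{i}{j}" for i in range(1, n_processors + 1)]
        for j in range(1, n_buses + 1)
    }
    return processors, buses


ops, preds, succs = build_alg(ALG_EDGES)
input_ops = sorted(o for o in ops if not preds[o])    # no predecessor
output_ops = sorted(o for o in ops if not succs[o])   # no successor
processors, buses = build_arc(3, 2)
```

Here `input_ops` and `output_ops` correspond to the input and output interfaces of Alg, and `buses["B1"]` lists the same components as $B_1$ in the example above.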
Execution characteristics. We target systems based on a cyclic execution model; this means that a fixed schedule of the operations of Alg is executed cyclically on Arc at a fixed rate. This schedule must satisfy one real-time constraint Rtc and a set of distribution constraints Dis. In our execution model Exe, we associate with each operator op a list of pairs $\langle o, d/op \rangle$, where d is the worst case execution time (WCET) of the operation o on op. Likewise, we associate with each communicator c a list of pairs $\langle dpd, d/c \rangle$, where d is the worst case transmission time (WCTT) of the data-dependency dpd on c. Since we target heterogeneous architectures, the WCET (resp. WCTT) of a given operation (resp. data-dependency) can be distinct on each operator (resp. communicator). Specifying the distribution constraints Dis amounts to associating the value "∞" with some pairs of Exe: $\langle o, \infty/op \rangle$ means that o cannot be executed on op. Finally, since we produce static schedules, we can compute their length and compare it to the real-time constraint Rtc.
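The execution model can be pictured as a lookup table. The sketch below is a minimal illustration in Python, assuming a dictionary keyed by (operation, operator); the WCET values and the helper names `allowed_operators` and `meets_rtc` are hypothetical and do not come from the paper.

```python
import math

INF = math.inf  # encodes the distribution constraint "o cannot run on op"

# Hypothetical WCET table Exe: (operation, operator) -> worst case execution time.
WCET = {
    ("A", "op1"): 2.0, ("A", "op2"): 3.0, ("A", "op3"): INF,
    ("B", "op1"): 1.5, ("B", "op2"): 1.0, ("B", "op3"): 2.5,
}


def allowed_operators(operation, wcet):
    """Operators on which `operation` may be placed (finite WCET only)."""
    return sorted(op for (o, op), d in wcet.items() if o == operation and d != INF)


def meets_rtc(schedule_length, rtc):
    """A static schedule is acceptable if its length does not exceed Rtc."""
    return schedule_length <= rtc
```

With this table, `allowed_operators("A", WCET)` excludes `op3`, which is how a distribution constraint of Dis shows up in Exe.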
Fault model and scheduling problem definition
In our fault hypothesis, we assume only hardware faults and fault-free software. We consider only transient processor and bus faults. Transient faults, which persist for a "short" duration, are significantly more frequent than other faults in systems [13]. Permanent faults are a particular case of transient faults. We assume that at most Npf processor faults and Nbf bus faults can occur in the system, and that the architecture includes at least Npf+1 processors and Nbf+1 buses. Our problem is therefore formally stated as:
a distributed heterogeneous architecture Arc composed of a set P of processors and a set B of buses: 
The proposed approach
Our solution is based on hybrid redundancy and data fragmentation techniques. To minimize the communication overhead, we use active redundancy to tolerate processor faults, and passive redundancy to tolerate bus faults. We use data fragmentation to minimize the fault detection latency, i.e., the time it takes to detect a fault.
Hybrid redundancy and data fragmentation. In order to tolerate Npf processor faults and Nbf bus faults, each operation is replicated into Npf+1 replicas scheduled on Npf+1 distinct processors. The replica with the earliest ending time is the primary replica, while the other ones are the backup replicas. The earliest ending time is the sum of the earliest starting time (computed in the absence of faults) and the operation's WCET. The data of each data-dependency is fragmented into Nbf+1 packets, sent by the primary replica of the data-dependency source via Nbf+1 distinct buses to each of the Npf+1 replicas of the data-dependency destination. For example, in the schedule of Figure 2b
Communication mechanism. Each operation receives each of its data inputs via Nbf+1 buses; when it has received all the packets of each data input, it defragments these packets and starts its execution. In some cases, the replica of an operation will receive some of its inputs only once, through an intra-processor communication; this occurs whenever one of its predecessor operations has one of its replicas scheduled on the same processor.
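The fragmentation and defragmentation described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the data is a byte string, packets are near-equal slices, and each packet carries its index so the receiver can reorder them. The function names and the packet layout are hypothetical; the paper does not specify them.

```python
def fragment(data, nbf, buses):
    """Split `data` into nbf + 1 packets, one per distinct bus.

    Returns a dict: bus name -> (packet index, packet bytes).
    """
    n_packets = nbf + 1
    if len(buses) < n_packets:
        raise ValueError("need at least Nbf + 1 buses")
    size = -(-len(data) // n_packets)  # ceiling division
    return {
        buses[k]: (k, data[k * size:(k + 1) * size])
        for k in range(n_packets)
    }


def defragment(packets):
    """Rebuild the data once all packets have been received."""
    # Packet indices are unique, so sorting orders the chunks by index.
    return b"".join(chunk for _, chunk in sorted(packets.values()))


# Usage: tolerate Nbf = 2 bus faults on an architecture with four buses.
data = b"sensor-frame"
packets = fragment(data, 2, ["B1", "B2", "B3", "B4"])
restored = defragment(packets)
```

A consumer replica would call `defragment` only after all Nbf+1 packets of an input have arrived, which matches the rule that execution starts once every packet of each data input is received.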
Figure 3. Tolerating Npf processor faults and Nbf bus faults.
Transient fault recovery and handling. In Figure 3, three cases can occur:
1. All the packets $data_m$ sent by $o_j^1$ are received: in this case, each replica of $o_i$ defragments these packets and starts its execution. Also, each replica of $o_j$ receives a copy of these packets, which it ignores.
2. None of the packets $data_m$ sent by $o_j^1$ are received: this concerns Nbf+1 packets, and since no more than Nbf bus faults may occur in the system (by hypothesis), this means that the processor $P_1$ executing the replica $o_j^1$ has failed. To deal with this failure, one backup replica among the Npf other replicas of $o_j$ is selected to re-send all the packets $data_m$ via the same buses. Since the fault of processor $P_1$ can be transient, it is not marked as faulty by the other processors. This scheme can be improved by deciding that, if a processor remains faulty during some number of consecutive executions of the schedule (e.g., 5), then its fault is permanent and the processor is permanently removed from the schedule.
3. Some packets $\{data_m, \ldots, data_k\}$ sent by $o_j^1$ are not received: let $data^-$ be this set of missing packets, and $B^- = \{B_m, \ldots, B_k\}$ be the set of buses that were supposed to transmit them. Since other packets have been received, $P_1$, the processor executing $o_j^1$, is not faulty, and hence the buses of $B^-$ are faulty. Therefore, the same replica $o_j^1$ re-sends the packets $data^-$ via other buses chosen among the set $B \setminus B^-$. Since the fault of the buses of $B^-$ can be transient, they are not marked as faulty. This scheme can be improved with an approach similar to that of case 2.
In summary, this communication mechanism yields three advantages: fast fault detection; fast distinction between processor and bus faults; and fast fault recovery.
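The three cases above amount to a small decision procedure on the receiver side. The following Python sketch is one possible reading of it, assuming that the sender used Nbf+1 buses and that at most Nbf of them can fail; the function names, the status labels, and the returned dictionary are hypothetical.

```python
def diagnose(sent_buses, received_buses):
    """Classify a transmission from the set of buses that delivered a packet.

    Assumes Nbf + 1 buses were used and at most Nbf bus faults can occur,
    so a total loss can only be explained by a sender processor fault.
    """
    sent = set(sent_buses)
    received = set(received_buses) & sent
    missing = sent - received
    if not missing:
        return "no_fault", set()          # case 1: all packets received
    if not received:
        return "processor_fault", set()   # case 2: no packet received
    return "bus_fault", missing           # case 3: some packets missing


def recovery(sent_buses, received_buses, all_buses):
    """Decide who re-sends the packets and over which buses."""
    status, faulty = diagnose(sent_buses, received_buses)
    if status == "no_fault":
        return {"status": status, "sender": None, "buses": []}
    if status == "processor_fault":
        # A backup replica re-sends all the packets via the same buses.
        return {"status": status, "sender": "backup", "buses": sorted(sent_buses)}
    # The primary replica re-sends the missing packets via buses of B \ B-.
    spare = sorted(set(all_buses) - faulty)
    return {"status": status, "sender": "primary", "buses": spare}
```

For instance, `recovery(["B1", "B2"], ["B1"], ["B1", "B2", "B3"])` diagnoses a fault of B2 and proposes B1 and B3 for the re-transmission by the primary replica.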
We have implemented these principles in a greedy list scheduling heuristic, called FT-AAA (Fault-Tolerant Adequation Algorithm Architecture). In the following algorithm of FT-AAA, the superscript numbers in parentheses refer to the steps of the heuristic, e.g., O 
end While
END OF THE ALGORITHM
The algorithm of FT-AAA is divided into four main steps:
Initialization step. The set of candidate operations $O^{(1)}_{cand}$ is initialized as the operations without predecessor. Later, an operation is said to be a candidate if all its predecessors are already scheduled. The set of scheduled operations $O^{(1)}_{sched}$ is initially empty.
Selection step. For each candidate operation $o_{cand} \in O^{(n)}_{cand}$, a set $P_{best}$ of Npf+1 processors is selected among all the processors of P to schedule Npf+1 replicas of $o_{cand}$. The selection rule is based on the dependable schedule pressure function, noted $\sigma^{(n)}$. It is computed, for each operation $o_i \in O^{(n)}_{cand}$ and each processor $P_j \in P$, as follows:
$$\sigma^{(n)}(o_i, P_j) = S^{(n)}(o_i, P_j) + \overline{S}^{(n)}(o_i) - R^{(n-1)}$$
where $S^{(n)}(o_i, P_j)$ is the earliest time at which operation $o_i$ can start its execution on processor $P_j$, $\overline{S}^{(n)}(o_i)$ is the latest start time from end of $o_i$ (defined to be the length of the longest path from the output operations to $o_i$), and $R^{(n-1)}$ is the schedule length at step (n−1). The set $P_{best}$ of each $o_{cand} \in O^{(n)}_{cand}$ is composed of the Npf+1 processors that minimize $\sigma^{(n)}$. Then, among all the operations of $O^{(n)}_{cand}$, the most urgent candidate $o_{best}$, with a processor $P_{best} \in P_{best}(o_{best})$ that maximizes this function, is selected to be replicated and scheduled.
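The selection step can be sketched in Python as below. This is a minimal illustration, assuming that the schedule pressure is the earliest start time plus the latest start time from end, minus the current schedule length, which is consistent with the three quantities defined above. The function names, the input dictionaries, and the numeric values are hypothetical.

```python
import heapq


def pressure(earliest_start, latest_from_end, schedule_length):
    """Assumed form of the schedule pressure for one (operation, processor) pair."""
    return earliest_start + latest_from_end - schedule_length


def select(candidates, processors, start, tail, schedule_length, npf):
    """Return (o_best, P_best): the most urgent candidate and its processors."""
    best_sets = {}
    for o in candidates:
        scored = [
            (pressure(start[(o, p)], tail[o], schedule_length), p)
            for p in processors
        ]
        # Keep the Npf + 1 processors that minimize the pressure.
        best_sets[o] = heapq.nsmallest(npf + 1, scored)
    # The most urgent candidate has the largest pressure within its best set.
    o_best = max(candidates, key=lambda o: max(s for s, _ in best_sets[o]))
    return o_best, [p for _, p in best_sets[o_best]]


# Hypothetical data: two candidates, three processors, Npf = 1.
start = {("A", "P1"): 0, ("A", "P2"): 2, ("A", "P3"): 5,
         ("B", "P1"): 1, ("B", "P2"): 1, ("B", "P3"): 4}
tail = {"A": 6, "B": 3}
choice = select(["A", "B"], ["P1", "P2", "P3"], start, tail, 4, 1)
```

In this example, operation A has pressures 2, 4, and 7 on the three processors, so its best set is {P1, P2} with a maximum of 4; operation B has pressures 0, 0, and 3, so its maximum over {P1, P2} is 0. A is therefore the most urgent candidate.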
Distribution and scheduling step. This step involves first replicating the best candidate $o_{best}$ into Npf+1 replicas, and second scheduling each replica $o^k_{best}$ of $o_{best}$ on the corresponding processor $P^k_{best}$ of $P_{best}$. Before scheduling each of these replicas, the data of each data-dependency are fragmented into Nbf+1 packets that are scheduled on Nbf+1 distinct buses.
Updating step. The scheduled operation $o_{best}$ is removed from $O^{(n)}_{cand}$, and the operations of Alg which have all their predecessors in the new set of scheduled operations are added to this set.
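The initialization and updating steps can be sketched as follows in Python. The graph is a hypothetical four-operation example, and the function name `update_candidates` is not from the paper. Only the successors of the operation just scheduled are examined, since they are the only operations whose predecessors can newly all be scheduled.

```python
def update_candidates(candidates, scheduled, o_best, preds, succs):
    """Remove o_best from the candidates and add the newly ready operations."""
    scheduled = scheduled | {o_best}
    candidates = candidates - {o_best}
    for s in succs.get(o_best, ()):
        # An operation becomes a candidate once all its predecessors are scheduled.
        if preds.get(s, set()) <= scheduled:
            candidates = candidates | {s}
    return candidates, scheduled


# Hypothetical algorithm graph: I -> A, I -> B, A -> O, B -> O.
preds = {"A": {"I"}, "B": {"I"}, "O": {"A", "B"}}
succs = {"I": {"A", "B"}, "A": {"O"}, "B": {"O"}}

# Initialization: candidates are the operations without predecessor.
candidates, scheduled = {"I"}, set()
order = []
while candidates:
    o_best = min(candidates)  # stand-in for the schedule pressure selection
    order.append(o_best)
    candidates, scheduled = update_candidates(
        candidates, scheduled, o_best, preds, succs
    )
```

The loop terminates when every operation has been scheduled, and `order` is a valid topological order of the graph.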
Simulations
To evaluate FT-AAA, we have implemented it in SYNDEX, a CAD tool for optimizing and implementing real-time embedded systems (http://www.syndex.org). Then, we have applied the FT-AAA heuristic to a set of randomly generated algorithm graphs and an architecture graph composed of five processors (|P| = 5) and four buses (|B| = 4). In our simulations, we study the impact of Npf, Nbf, the number of operations N, and CCR (Communication to Computation Ratio) on the schedule length overhead introduced by FT-AAA, computed as follows:
$$\text{overhead} = \frac{\text{length}(\text{FT-AAA}(Npf, Nbf)) - \text{length}(\text{AAA})}{\text{length}(\text{AAA})} \times 100$$
where FT-AAA takes as parameters the numbers of processor and bus faults (Npf, Nbf), AAA is exactly FT-AAA(0, 0), and "length" is a function that computes the schedule's length.
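As a worked example of this metric, the short Python sketch below computes the overhead as a percentage, assuming it is the relative increase of the FT-AAA schedule length over the AAA (non fault-tolerant) schedule length. The function name and the numeric values are hypothetical.

```python
def overhead_percent(ft_length, base_length):
    """Relative increase (in %) of the fault-tolerant schedule length.

    ft_length:   length of the schedule produced by FT-AAA(Npf, Nbf)
    base_length: length of the schedule produced by AAA = FT-AAA(0, 0)
    """
    if base_length <= 0:
        raise ValueError("the baseline schedule length must be positive")
    return 100.0 * (ft_length - base_length) / base_length


# Hypothetical lengths: a baseline of 10 time units stretched to 14.5.
example = overhead_percent(14.5, 10.0)
```

Under this definition, scheduling every computation twice with no parallelism gain would double the schedule length, which corresponds to an overhead of 100%.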
Impact of Nbf and N. We have plotted in Figure 4 the average overheads on the schedule length of 100 random algorithm graphs for each N, Npf=0, CCR=1, and Nbf=1, 2, 3. This figure shows that the average overhead is low (between 6% and 18%) and increases slightly with N. This is due first to Npf=0, i.e., operations of Alg are not replicated, and second to the use of passive redundancy of communications. Also, for the three values of Nbf, the heuristics FT-AAA(0,1), FT-AAA(0,2), and FT-AAA(0,3) yield similar results, with no significant advantage for any of the three variants.
Impact of Npf and N. We have plotted in Figure 5 the average overheads on the schedule length of 100 random Alg for each N, Nbf=0, CCR=1, and Npf=1, 2. This figure shows that the average overhead when Npf=1 is 45%, while for Npf=2 it is 75%. These figures are much lower than the expected 100% when all computations are scheduled twice, and 200% when all computations are scheduled thrice. It also shows that the performance of FT-AAA decreases when Npf increases. This is due to the fact that FT-AAA uses the active redundancy of operations. However, for the two values of Npf, FT-AAA(Npf,0) shows no significant difference between the overheads obtained for the different values of N.
Impact of CCR. We have plotted in Figure 6 the average overheads on the schedule length of 100 random Alg for N=40, Npf=1, Nbf=1, 2, 3, and each CCR. This figure shows that, thanks to data fragmentation, when the communications are less expensive than the computations (CCR<1), the performance is almost identical for Nbf=1 to 3. In contrast, when the communications are more expensive (CCR>1), the performance decreases when Nbf increases. Also, for Nbf≤2, CCR has no significant impact on the performance of FT-AAA; again, this is due to data fragmentation. This no longer holds when Nbf≥3, because the number of buses, 4, becomes limiting.
Figure 4. Impact of Nbf and N.
Conclusion
We have proposed in this paper a solution to tolerate transient faults of both processors and communication media in distributed heterogeneous architectures with a multiple-bus topology. Our solution, based on hybrid redundancy and data fragmentation strategies, is a list scheduling heuristic called FT-AAA. It automatically generates a static multiprocessor schedule of a given algorithm on a given architecture, which minimizes the schedule length and tolerates up to Npf processor faults and up to Nbf bus faults, with respect to real-time and distribution constraints. The communication mechanism, based on data fragmentation, allows the fast distinction between processor and bus faults, the fast detection of faults, and the fast handling of faults. Simulations show that our approach can generally reduce the schedule length overhead. Currently, we are working on an improved solution to take sensor and actuator faults into account.
