Abstract: We propose a safe design method for safe execution systems, based on fault-tolerance techniques: it uses optimal discrete controller synthesis (DCS) to generate a correct-by-construction fault-tolerant system. The properties enforced concern consistent execution, functionality fulfillment (whatever the faults, under some failure hypothesis), and several optimizations (of the tasks' execution time). We propose an algorithm for optimal DCS on bounded paths. We propose model patterns for a set of periodic tasks with checkpoints, a set of distributed, heterogeneous and fail-silent processors, and an environment model that expresses potential fault patterns. The implementation is illustrated using the Sigali symbolic DCS tool and the Mode Automata programming language. Copyright © 2007 IFAC
Keywords: Optimal discrete controller synthesis, fault-tolerance, checkpointing, reactive systems, embedded systems.
MOTIVATION
Reactive systems must respond continuously to their environment, at a speed imposed by the latter. This property imposes a real-time constraint: the environment sends stimuli to the system, which must react to each stimulus before the next one arrives. In safety-critical systems, this constraint is strict, and hence it must be guaranteed by the implementation. We wish to design multi-task, multi-processor reactive systems. As processors are subject to failures, we make sure that our systems are fault-tolerant (safety-critical constraint). We also guarantee that they react within some fixed temporal bound (real-time constraint). Our proposal is to use discrete controller synthesis (DCS) to automatically obtain a controlled system that satisfies its fault-tolerance and real-time constraints by construction.
We represent a system by its behavior in terms of starting and reconfiguring lower-level computations [Altisen et al. (2003)]. Each computation is represented as a periodic task, whose period matches the reaction time imposed by the environment. Each task is itself decomposed into several successive phases, separated by checkpoints.
The major advantage of using DCS for fault-tolerance is that, if DCS succeeds, the resulting controlled system dynamically reconfigures itself upon the occurrence of a failure, while being guaranteed by construction to satisfy its real-time constraints whatever the failures. This paper builds upon previous results where DCS was applied with objectives of invariance, cost bounding, and one-step optimal DCS [Girault and Rutten (2004)]. Here, we have a richer task model with phases and checkpoints, and we apply optimal DCS on finite paths. To achieve this, we propose a variant of the classical optimal DCS algorithm of Bellman [Bellman (1957)] that copes with paths containing infinite loops, as occur in reactive systems.
In the remainder, we first describe our model of distributed systems with faults, then we describe our DCS technique for fault-tolerance, and particularly optimal DCS on paths dealing with self-loops encountered in reactive systems. Next we define fault tolerance in terms of properties on the model, which are used as objectives for DCS, and lead to the automatic derivation of a controller for fault tolerance.
DCS ALGORITHMS AND MODELS
We adopt an existing DCS framework and propose a modeling methodology. Thus we only introduce the useful definitions and technical aspects of the tools, and summarize their functionality.
Labeled transition systems (LTS)
Formally, an LTS is a tuple $\langle Q, q_0, I, O, \delta \rangle$, where Q is a finite set of states, $q_0$ is the initial state, I is a finite set of input signals (produced by the environment), O is a finite set of output signals (issued to the environment), and δ is the transition function, i.e., a mapping $Q \times Bool(I) \times O^* \to Q$. Each transition has a label of the form g/a, where $g \in Bool(I)$ must be true for the transition to be taken (g is the guard of the transition), while $a \in O^*$ is a conjunction of outputs that are issued when the transition is taken (a is the action of the transition). A transition $(s, g, a, s')$ is depicted graphically as $s \xrightarrow{g/a} s'$. We use this level of definition for our modelling work, in graphical form in the figures of this paper. A path is a (possibly infinite) sequence of transitions starting from the initial state $q_0$. In this paper, we only focus on LTSs which are deterministic and reactive:
• determinism guarantees that the system always reacts in the same manner to the same sequence of input events;
• reactivity guarantees sensitivity to any event fed by its environment.
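The two properties can be checked on a small explicit LTS by enumerating input valuations. The following is a minimal sketch, not the representation used by the tools of this paper: the dictionary layout, the function names, and the encoding of guards as Python predicates are all assumptions made for illustration. The example instance is a two-state processor with states OK and ERR and a single input f; its self-loops are assumed.

```python
from itertools import product

def valuations(inputs):
    """Yield every Boolean valuation of the input signals as a dict."""
    for bits in product([False, True], repeat=len(inputs)):
        yield dict(zip(inputs, bits))

def enabled(lts, state, env):
    """Transitions leaving `state` whose guard holds under valuation `env`."""
    return [t for t in lts["transitions"] if t[0] == state and t[1](env)]

def is_deterministic(lts):
    # Sufficient condition: at most one transition is enabled in every
    # state and under every input valuation.
    return all(len(enabled(lts, q, env)) <= 1
               for q in lts["states"] for env in valuations(lts["inputs"]))

def is_reactive(lts):
    # At least one transition is enabled in every state and under
    # every input valuation.
    return all(len(enabled(lts, q, env)) >= 1
               for q in lts["states"] for env in valuations(lts["inputs"]))

# Hypothetical processor LTS: (source, guard, action, target).
processor = {
    "states": ["OK", "ERR"],
    "initial": "OK",
    "inputs": ["f"],
    "transitions": [
        ("OK", lambda env: not env["f"], (), "OK"),   # assumed self-loop
        ("OK", lambda env: env["f"], (), "ERR"),      # failure
        ("ERR", lambda env: True, (), "ERR"),         # assumed self-loop
    ],
}
```

Explicit enumeration is exponential in the number of inputs, so this only serves to make the definitions concrete on toy models.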
Discrete controller synthesis (DCS)
DCS emerged in the 80's [Ramadge and Wonham (1987) ], with foundations in language theory. Its purpose is, given two languages P and D, to obtain a third language C such that P ∩ C ⊆ D. This is a kind of inversion problem, since one wants to find C from D and P. Here, P is called the plant, D the desired system or objective, and C the controller. Several teams proposed extensions and applications of this language theory technique to labeled transition systems (LTS).
In our approach, P is specified as an LTS, and D is an objective to be satisfied by the controlled system, typically making a subset of states invariant in the controlled system, or keeping it always reachable. The controller C obtained with DCS is a constraint restricting the transitions of P, i.e., inhibiting those that would jeopardize the objective. The key point is that the set of inputs I is partitioned into two subsets, $I_c$ and $I_u$, respectively the sets of controllable and uncontrollable inputs. The controller C can only inhibit those transitions of P whose guard contains at least one controllable signal, i.e., a signal in $I_c$. As illustrated in Figure 1, the objective is expressed in terms of the system's outputs. The controller is obtained automatically from a user-specified LTS and objective.
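To make the invariance objective concrete, here is a minimal explicit-state sketch of synthesis for "stay inside a set of good states". It is an illustration under assumptions, not the symbolic algorithm of the tool used in this paper: the plant is encoded as a hypothetical nested dictionary `delta[state][u][c] = successor`, where `u` is an uncontrollable input valuation and `c` a controllable choice, and all state and input names are invented for the example.

```python
def synthesize_invariance(delta, good):
    """Return (safe, controller) for the objective 'always stay in good'.

    safe is the largest subset of `good` such that, from each of its
    states and for every uncontrollable input, at least one controllable
    choice leads back into the subset. controller lists, per safe state
    and uncontrollable input, the controllable choices that remain allowed.
    """
    safe = set(good)
    changed = True
    while changed:
        changed = False
        for q in list(safe):
            keeps_safe = all(
                any(target in safe for target in choices.values())
                for choices in delta[q].values()
            )
            if not keeps_safe:
                safe.discard(q)
                changed = True
    controller = {
        q: {u: {c for c, target in choices.items() if target in safe}
            for u, choices in delta[q].items()}
        for q in safe
    }
    return safe, controller

# Hypothetical plant: one task, two processors, P1 may fail.
delta = {
    "ready": {"ok": {"a1": "on_p1", "a2": "on_p2", "pin": "pinned_p1"},
              "fail_p1": {"a2": "on_p2", "pin": "bad"}},
    "on_p1": {"ok": {"stay": "on_p1", "move": "on_p2"},
              "fail_p1": {"stay": "bad", "move": "on_p2"}},
    "pinned_p1": {"ok": {"stay": "pinned_p1"}, "fail_p1": {"stay": "bad"}},
    "on_p2": {"ok": {"stay": "on_p2"}, "fail_p1": {"stay": "on_p2"}},
    "bad": {"ok": {"stay": "bad"}, "fail_p1": {"stay": "bad"}},
}
good = {"ready", "on_p1", "pinned_p1", "on_p2"}
safe, controller = synthesize_invariance(delta, good)
```

In this toy plant the state `pinned_p1` is removed because a failure of P1 forces it into `bad`, and the controller consequently forbids the `pin` choice in `ready`. Synthesis succeeds only if the initial state belongs to `safe`.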
Optimal discrete controller synthesis
It is also possible to consider weights assigned to the states and/or inputs/outputs of P, and to specify that some upper or lower bound must never be reached. Optimal DCS [Kumar and Garg (1995) ; Tronci (1996) ; Sengupta and Lafortune (1998) ; Marchand and Samaan (2000)] can then be used to control transitions so as to minimize/maximize, in one step, some function w.r.t. these weights; i.e., go only to next states with optimal weight [Girault and Rutten (2004) ].
In this paper, the problem of finding an optimal strategy on a path is considered. For a system M together with a complete cost function C, the optimal strategy leads M from its initial state $q_0$ to some final states $Q_f$. The execution path E which is followed from $q_0$ to $q_f \in Q_f$ must have the lowest execution cost that can be guaranteed. Indeed, even if one or more minimal-cost paths exist, it cannot be guaranteed that they are systematically followed: uncontrollable events might drive the system into "bad" cost states such that a global minimum is not reached.
In terms of the concrete transition systems seen previously, we define C to be a cost function mapping each potential state of an LTS to a strictly positive integer cost: $C : Q \to \mathbb{N}$. The execution cost EC of a path of length k starting at state $q_1$, $E(q_1) = (q_1, \ldots, q_k)$, is obtained by adding the static costs of the states in E.
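Written as a formula, which is a direct transcription of the sentence above using the same symbols, the execution cost of a path is:

```latex
EC\bigl(E(q_1)\bigr) \;=\; \sum_{i=1}^{k} C(q_i)
```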
Bellman's algorithm [Bellman (1957)] takes this aspect into account. It operates in two steps. In the first step, each state q of the transition system M is mapped to the best execution cost achievable to reach $Q_f$, taking into account the worst-case plays of the adversary. Note that this cost value is not necessarily the minimal execution cost achievable. If no such execution path exists, then the best cost achievable equals +∞. Let W : B p → N be the mapping function. W is defined as the greatest fixed point of the following recurrent equations:
The second step uses W to generate the best trajectory reaching $Q_f$. For any state q, compute the best immediate successor $q'$ such that:
According to Bellman, this algorithm is guaranteed to find a greatest fixed point corresponding to the optimal solution of the synthesis problem. The actual implementation of this algorithm is presented in [Marchand and Le Borgne (1998) ].
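For orientation, the two steps can be written in one standard min-max form. This is a sketch under assumptions, not necessarily the exact recurrence of the cited implementation: it assumes that W ranges over $\mathbb{N} \cup \{+\infty\}$, that the adversary fixes a valuation $u$ of the uncontrollable inputs $I_u$ before the controller picks a valuation $c$ of the controllable inputs $I_c$, and that $\delta(q, u, c)$ denotes the resulting successor state (outputs are left out).

```latex
W(q) =
\begin{cases}
  C(q) & \text{if } q \in Q_f, \\[2pt]
  C(q) + \max_{u} \, \min_{c} \, W\bigl(\delta(q, u, c)\bigr) & \text{otherwise,}
\end{cases}
\qquad
q' = \delta(q, u, c^{*}) \ \text{ with } \
c^{*} \in \operatorname*{arg\,min}_{c} W\bigl(\delta(q, u, c)\bigr)
```

Under these assumptions, starting from $W = +\infty$ everywhere and iterating until nothing changes yields the greatest fixed point; a state keeps the value $+\infty$ when the adversary can prevent $Q_f$ from being reached.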
Dealing with self-loops. A characteristic of the Bellman solution is that the paths that solve the optimal synthesis problem are direct paths (i.e., without loops). All the loops present in the model are pruned during the computation of the solution, for one of two reasons: either they are controllable (control can prevent the system from entering the loop), or the best execution cost of the states that belong to the loop is infinite (meaning that there is no way to prevent the system from going through the loop); loops of the latter kind are said to be uncontrollable.
As mentioned, Bellman's algorithm marks all uncontrollable loops with an infinite best execution cost. We argue that the particular case of 0-length loops (self-loops) deserves special processing. Uncontrollable self-loops are a common artifact for modeling the waiting for an event occurrence, which is one of the basic mechanisms in reactive systems. Indefinite waiting for an event occurrence should not penalize the best execution cost of an execution path.
To overcome this problem, we restrict Bellman's algorithm so that self-loops only count once in the computation of the execution cost of $E_l$. This is done by requiring, in the fixed-point computation, that the target and source states be different (i.e., $q \neq q'$ in the equation below):
Examples of how this algorithm works, and of the differences between the classical and the adapted Bellman algorithm, are presented later in the paper.
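The effect of the self-loop restriction can be seen on a toy model. The following explicit-state sketch is an illustration under assumptions, not the symbolic implementation referenced above: the plant encoding `delta[state][u][c] = successor` (uncontrollable input `u`, controllable choice `c`), the function names, and the example states and costs are all hypothetical. With `skip_self_loops=False` it behaves like the classical min-max computation; with `skip_self_loops=True`, successors equal to the source state are ignored, so waiting in a state is counted once.

```python
import math

def best_costs(delta, cost, finals, skip_self_loops):
    """Best guaranteed cost-to-target for every state (math.inf if none)."""
    W = {q: math.inf for q in delta}
    changed = True
    while changed:
        changed = False
        for q in delta:
            if q in finals:
                new = cost[q]
            else:
                worst = None
                for choices in delta[q].values():
                    # Successors reachable under this uncontrollable input.
                    targets = [t for t in choices.values()
                               if not (skip_self_loops and t == q)]
                    if not targets:
                        continue  # pure waiting step, ignored when adapted
                    best = min(W[t] for t in targets)      # controller picks
                    worst = best if worst is None else max(worst, best)
                new = math.inf if worst is None else cost[q] + worst
            if new != W[q]:
                W[q] = new
                changed = True
    return W

def best_successor(delta, W, q, u):
    """Successor with the lowest W among those offered under input u."""
    targets = [t for t in delta[q][u].values() if t != q] or [q]
    return min(targets, key=lambda t: W[t])

# Hypothetical model: wait for an event, then take a fast or a slow branch.
delta = {
    "wait": {"no_event": {"-": "wait"},
             "event": {"fast": "fast", "slow": "slow"}},
    "fast": {"tick": {"-": "done"}},
    "slow": {"tick": {"-": "done"}},
    "done": {},
}
cost = {"wait": 1, "fast": 2, "slow": 5, "done": 1}

classical = best_costs(delta, cost, {"done"}, skip_self_loops=False)
adapted = best_costs(delta, cost, {"done"}, skip_self_loops=True)
```

In this toy model the classical computation assigns an infinite cost to `wait`, because the environment may withhold the event forever, whereas the adapted computation charges the waiting state once and then follows the cheaper branch.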
DISTRIBUTED SYSTEM MODEL
This section outlines the different parts of our model. Details can be found in a companion research paper [Dumitrescu et al. (2007)].
Architecture model
Local processor model. It is given by the LTS of Figure 2(a), where $OK_i$ means that processor i is running fine, while $ERR_i$ means that it has crashed. We assume that only the processors can fail, with a fail-silent behavior. Recent studies have shown that a fail-silent behavior can be achieved at a reasonable cost [Baleani et al. (2003)]. Failures are also permanent, hence a processor cannot go back from the ERR state to the OK state. Modelling intermittent failures or degraded modes (e.g., running at a slower speed, or overloaded) would also be possible [Dumitrescu et al. (2007)]. Processors can be used by tasks in a time-sharing manner, so that several tasks can be active on the same processor at the same time.
Heterogeneous architecture model. The processors are embedded in a fully connected network of point-to-point communication links. We denote by S the set of all these processors. We assume that the communication links cannot fail. One processor, $P_0$, is dedicated to executing the controller, and only the other processors are available for executing the system's tasks. The model consists of the composition of all the processor LTSs described above. In our running example, we have three of them, one for each of the processors $P_1$, $P_2$, and $P_3$, whose capacity bounds $b_i$ w.r.t. power consumption are, respectively, 5, 3, and 6.
Environment or fault models specify what failures can occur in the system. For instance, how many failures should the system tolerate? Can failures occur in any order, or are there known sequences or patterns? In terms of our processor model of Figure 2, the question is how the $f_i$ events can occur. It seems natural that all the $f_i$ events be uncontrollable (i.e., $\in I_u$), since a failure is an intrinsically uncontrollable event. But this would mean that there would be no constraints whatsoever on them. In particular, all events $f_i$ could occur, meaning that all processors could fail at the same time. Of course, this would result in a total failure of the system, with no possibility at all to ensure its fault-tolerance. No one expects a system to tolerate a failure of all the processors it is made of. Therefore, we need to specify the failure patterns that we consider.
To model this, we choose to have one LTS modeling the environment. Its purpose is to issue the signals $f_i$ from signals $e_i$ produced by the environment. These signals $e_i$ are uncontrollable (i.e., $\in I_u$), reflecting the fact that a failure can occur at any time, while the signals $f_i$ are local, i.e., neither in $I_u$ nor in $I_c$, and are used only for computing the synchronous product of all the LTSs. The simple environment model of Figure 2(b) allows only one failure to occur in the system. According to the available knowledge about the system, one can specify the failure patterns directly by giving the LTS producing the local signals $f_i$ from the input signals $e_i$ [Dumitrescu et al. (2007)]. Providing such an environment model is part of the design work. For convenience, we introduce a failure event f that signals the occurrence of at least one failure.
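A fault model that tolerates a single failure can be pictured as a two-state filter between the environment signals $e_i$ and the local signals $f_i$. The sketch below is only an illustration of that idea, not a transcription of Figure 2(b): the class name, the set-based encoding of signals, and in particular the tie-breaking rule applied when several $e_i$ arrive in the same reaction (lowest index wins) are assumptions.

```python
class SingleFailureEnvironment:
    """Lets through at most one failure over the whole execution."""

    def __init__(self):
        self.failure_seen = False  # False: no failure yet; True: one occurred

    def step(self, e):
        """One reaction.

        e: set of processor indices i whose signal e_i is present.
        Returns (emitted, f): the set of indices i for which f_i is
        emitted, and the global flag f (at least one failure emitted).
        """
        if self.failure_seen or not e:
            return set(), False
        self.failure_seen = True
        chosen = min(e)  # assumed tie-break among simultaneous e_i
        return {chosen}, True
```

The filter never returns to its initial state, so it is acyclic apart from waiting steps; this is the structural property that the rest of the section relies on.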
When the LTS of the fault model is acyclic, then by composition with the other parts of the model it will result in an acyclic global LTS. This corresponds to the intuitively satisfying fact that we want to design systems tolerating bounded failure patterns. The technique of optimal DCS on paths is then applicable, and can enforce guarantees on the performance of the resulting fault-tolerant system.
Task model
Basic control structure pattern. Each task j is formally modeled by an LTS, which describes how the activity of the task is controlled in reaction to events. For example, we assume that the task can be executed on three processors, as in Figure 3 (some transitions have been omitted or dotted for readability; transitions going to "..." are meant to go to the mentioned state on the other side of the figure). It features an initial idle state $I^j$, a ready state $R^j$ reached upon reception of the request signal $r^j$, a terminal state $T^j$, and several active states $A^j_i$, representing task configurations, one for each processor in the system. By convention, subscripts refer to processors and superscripts to tasks. In the state $A^j_i$, task j is executed on processor i, until the occurrence of the control event $c^j$. Such tasks can be modeled by Mode Automata [Maraninchi and Rémond (2003)]. A transition from state $A^j_i$ to state $A^j_k$ represents a reconfiguration of the system: task j is stopped on processor i, then migrated to and restarted on processor k. Cycles of such reconfigurations could make sense, especially w.r.t. load balancing, but they would introduce cycles in the global LTS modelling the system, which would prevent us from applying optimal DCS on paths. We therefore consider reconfigurations only upon a control event (e.g., $c^j \wedge a^j_3$) or upon a failure (e.g., $a^j_2 \wedge f$), and condition all reconfiguration transitions with the f event introduced earlier in the fault model: $a^j_i \wedge f$. Hence, the optimization on paths is applicable.
We introduce a notion of phases and checkpoints. Going from one phase (e.g., $A^j_2$) to the next in sequence (e.g., $B^j_2$) is acknowledged by an uncontrollable checkpoint event $c^j$. The last checkpoint is actually the termination. When a task is migrated, it is restarted from the beginning of the current phase and not from the very start of the task. In that sense, each phase transition is a control checkpoint, and when a task is migrated it rolls back to the latest checkpoint.
In terms of controller synthesis, the signals $r^j$, $c^j$, and $e_i$ are uncontrollable (i.e., $\in I_u$), while the signals $a^j_i$ are controllable (i.e., $\in I_c$).
Quantitative characteristics. Interesting characteristics can be modeled as weights associated with states [Marchand and Rutten (2002)]; we consider just simple mappings from states to integers. Execution time is the CPU load required by each phase of a task, as measured by a Worst-Case Execution Time (WCET) analysis. When a task migrates, it rolls back to its latest checkpoint; hence its new processor must fully accept the CPU load of the restarted phase. The power consumption $C^j_i$ of a task j is given relative to each processor i. It is related to the WCET, but not linearly [Chandrakasan et al. (1992)].
System model
The model of the multi-processor, multi-task system is built by composing the different local models introduced previously: one for the environment model, one for each processor, one for each task, one for the scheduler. This defines a global LTS, of all possible reconfigurations of the system under study. The application of DCS provides us with the controller that restrains its behaviours to those respecting the properties required for distributed fault-tolerant systems.
Finally, a scheduler or program schedules the tasks according to the precedence graph specified by the user: it must issue the signals $r^j$ in the correct order, so that the tasks become ready (in the $R^j$ state) in such a way that the precedence constraints are satisfied. In our setting, the scheduler always models a finite acyclic computation sequence. Hence, its composition with the tasks and fault models yields a globally acyclic model. The scheduler is represented by an LTS, which features a terminal state T. Such schedulers or programs can be obtained from higher-level, domain-specific languages [Delaval and Rutten (2006)].
FAULT-TOLERANCE PROPERTIES
Fault-tolerance is specified declaratively by a set of properties and objectives. The fault-tolerance specificity of these properties is twofold. On the one hand, they are meant to be considered on models as described above, where all fault, recovery, and failure behaviors are represented. On the other hand, they characterize failed states (e.g., consistent placement constraints characterize states where the system is not viable), as well as tolerance, meaning the notion of fulfilling functionality whatever the faults.
Invariance properties
We briefly recall basic properties used as objectives for DCS [Girault and Rutten (2004) ; Dumitrescu et al. (2007) ].
Ensuring consistent execution is formulated as the fact that no task is active on a failed processor. This property is contradicted whenever a task $T^j$ is active on processor $P_i$ (i.e., in state $A^j_i$) while $P_i$ is in $ERR_i$. The synthesis objective is to make it invariantly true. It is also required that the active tasks stay within processor capacity: $\forall i,\ C_i \le b_i$. This property is contradicted whenever the accumulated cost of all tasks active on a given processor exceeds its capacity bound. The synthesis objective is again to make it invariantly true.
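Both invariants are plain state predicates, which the following sketch evaluates on an explicit description of a global state. The capacity bounds 5, 3, and 6 are those of the running example; everything else is assumed for illustration, in particular the function names, the encoding of a state as a task-to-processor placement plus a set of failed processors, and the power values in `POWER`, which are hypothetical and not taken from the paper.

```python
CAPACITY = {1: 5, 2: 3, 3: 6}  # bounds b_i of processors P1, P2, P3

# Hypothetical power consumption of task j on processor i, keyed by (j, i).
POWER = {(1, 1): 3, (1, 2): 2, (1, 3): 2,
         (2, 1): 2, (2, 2): 2, (2, 3): 3}

def consistent_execution(placement, failed):
    """True iff no active task is placed on a failed processor.

    placement: dict mapping each active task j to its processor i.
    failed: set of indices of processors in their error state.
    """
    return all(proc not in failed for proc in placement.values())

def within_capacity(placement, power=POWER, capacity=CAPACITY):
    """True iff the accumulated cost on each processor respects its bound."""
    load = {}
    for task, proc in placement.items():
        load[proc] = load.get(proc, 0) + power[(task, proc)]
    return all(total <= capacity[proc] for proc, total in load.items())
```

A synthesis objective of invariance then amounts to keeping the conjunction of the two predicates true in every reachable state of the controlled system.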
Ensuring functionality is not just a state property: it involves paths. We want to prevent DCS from indefinitely inhibiting the start of a task: tasks are activated only when "the path is clear and wide enough all the way down" to termination, even in case of failures. The functionality is fulfilled iff, from all reachable states, the terminal state T of the program is reachable.
For the case of a task server, without a scheduler or program, one should make the state $(T^1, \ldots, T^n)$ reachable. This property is instrumental in characterizing fault-tolerance, as it excludes behaviors where all activity would be frozen in the waiting states in order to avoid jeopardizing the previous properties.
Optimizing costs
Optimizing costs along paths through phases consists of choosing the sequence of phases where migrations minimize the cost between the ready and termination states. In the example of Figure 3, different paths across checkpoints and migrations can have different costs, and the choice of migration must be made according to the other constraints of the application: the invariance properties must always hold, and the schedule must have the best cost that can be achieved despite the worst scenarios driven by the uncontrollable events. The optimal synthesis algorithm achieves the cost optimization through phases. It computes $W_{Q_f}$, the best cost function, which maps each state of M to the best execution cost achievable to reach the target. If an optimal solution exists ($W_{Q_f}(q_0) < +\infty$), then the reachability of the target state is guaranteed and hence the functionality fulfillment property is also satisfied.
EXAMPLE
Figure 4 illustrates the execution of the synthesized controller, for the particular scenario of two tasks running on three processors, each task having two phases. The controller is generated under the assumption that both tasks start at the same moment, i.e., events $r^1$ and $r^2$ are received simultaneously. Besides, we assume that both tasks execute only once, i.e., $r^1$ and $r^2$ are only received once. The fault model used is presented in Figure 2(b). Static costs are represented as integer numbers next to their corresponding states.
In this example, the best execution cost for task $T^1$ would be 1 + 1 + (2 + 1) + 1 = 6, which corresponds to executing its first phase on $P_3$ and its second phase on $P_1$. The best execution cost for task $T^2$ is 1 + 1 + (1 + 1) + 1 = 5, which corresponds to executing its first phase on $P_2$ and its second phase on $P_1$.
The run proceeds as follows. At first, $T^1$ is scheduled on $P_3$ and $T^2$ is scheduled on $P_2$. At that moment, processor $P_2$ fails. $T^2$ must migrate immediately, and the best-cost solution is offered by processor $P_3$. Task $T^1$ remains on processor $P_3$. The tasks can execute their own checkpoints independently of each other, when receiving the corresponding uncontrollable event $c^{1,2}$. Just after a checkpoint, processor migrations can also occur for optimality reasons: both $T^1$ and $T^2$ migrate from $P_3$ to $P_1$ in order to achieve their best execution cost. Each task terminates when receiving an uncontrollable event $t^{1,2}$. Thus, $T^1$ ends with
