Abstract
For autonomous critical real-time embedded systems (e.g., satellites), guaranteeing a very high level of reliability is as important as keeping the power consumption as low as possible. We propose an off-line scheduling heuristic which, from a given software application graph and a given multiprocessor architecture (homogeneous and fully connected), produces a static multiprocessor schedule that optimizes three criteria: its length (crucial for real-time systems), its reliability (crucial for dependable systems), and its power consumption (crucial for autonomous systems). Our tricriteria scheduling heuristic, called TSH, uses the active replication of the operations and the data-dependencies to increase the reliability and uses dynamic voltage and frequency scaling to lower the power consumption. We demonstrate the soundness of TSH. We also provide extensive simulation results to show how TSH behaves in practice: first, we run TSH on a single instance to provide the whole Pareto front in 3D; second, we compare TSH with the ECS heuristic (Energy-Conscious Scheduling) from the literature; and third, we compare TSH with an optimal mixed integer linear program.
Introduction

Motivations
Autonomous critical real-time embedded applications are commonly found in embedded devices such as satellite systems. Because they are real-time systems, their execution time must be as low as possible to guarantee that the system interacts with its environment in a timely way. Because they are critical, their reliability must be as close to 1 as possible, typically above $1 - 10^{-9}$. And because they are autonomous, their power consumption must be as low as possible. The main problem when addressing these issues is that they are antagonistic. Intuitively, lowering the probability of failure requires some form of redundancy, meaning more computing load. This is antagonistic to achieving the lowest possible execution time. In the same manner, lowering the power consumption is usually achieved by lowering the voltage and frequency operating point of the processors, which means that the same software function will take more time to execute. Finally, lowering the voltage and frequency operating point also has an impact on the failure rate of processors, because a lower voltage leads to a smaller critical energy; hence the system becomes sensitive to lower-energy particles. As a result, the failure probability increases. These three antagonisms make the problem very challenging.
In order to offer the best compromises between these three measures, we present an off-line scheduling heuristic that, from a given software application graph and a given multiprocessor architecture, produces a static multiprocessor schedule that optimizes three criteria: its schedule length (crucial for real-time systems), its reliability (crucial for dependable systems), and its power consumption (crucial for autonomous systems). We target homogeneous distributed architectures, such as multicore processors. Our tricriteria scheduling heuristic uses the active replication of the operations and the data-dependencies to increase the reliability and uses dynamic voltage and frequency scaling (DVFS) to lower the power consumption.
Multicriteria optimization
Let us address the issues raised by multicriteria optimization. Figure 1 illustrates the particular case of two criteria, Z 1 and Z 2 , that must be minimized. For the clarity of the presentation, we stick here to two criteria but this discussion extends naturally to any number of criteria. In Fig. 1 , each point x 1 to x 7 represents a solution, that is, a different tradeoff between the two criteria. The points x 1 , x 2 , x 3 , x 4 , and x 5 are called Pareto optima [28] . Among those solutions, the points x 2 , x 3 , and x 4 are called strong Pareto optima (no other point is strictly better on all criteria) while the points x 1 and x 5 are called weak Pareto optima (no other point is better on all criteria, possibly not strictly). The set of all Pareto optima is called the Pareto front.
It is fundamental to understand that no solution among the points x 2 , x 3 , and x 4 (the strong Pareto optima) can be said to be the best one. Indeed, those three solutions are noncomparable, so choosing among them can only be done by the user, depending on the precise requirements of his/her application. But such a user-dependent choice can only be made if we are able to compute the whole Pareto front. If we compute only a single solution, then obviously no choice is possible. This is why we advocate producing, for a given problem instance, the whole Pareto front rather than a single solution. Since we have three criteria, it will be a surface in the 3D space (execution time, reliability, power consumption).
Now, several approaches exist to tackle bicriteria optimization problems (these methods extend naturally to multicriteria) [28]:
1. Aggregation of the two criteria into a single one, so as to transform the problem into a classical single criterion optimization one.
2. Hierarchization of the criteria, which allows the total ordering of the criteria and then the solving of the problem by optimizing one criterion at a time.
3. Interaction with the user to guide the search for a Pareto optimum.
4. Transformation of one criterion into a constraint, which allows the solving of the problem by optimizing the other criterion under the constraint of the first one.
Any multicriteria optimization method that aggregates the criteria (for instance with a linear combination of all the criteria) can only produce one point of the Pareto front, leaving no choice to the user between several tradeoffs. Of course, such a method could be run several times (for instance by changing the coefficients of the linear combination), but there is no way to control what part of the Pareto front will be produced by doing so and the Pareto front is likely to be far from complete.
Similarly, any multicriteria optimization method that hierarchizes the criteria can only produce one point of the Pareto front. For instance, in the case of Fig. 1 , we could first minimize Z 1 and obtain the subset {x 4 , x 5 } of solutions, and then minimize Z 2 among the solutions in {x 4 , x 5 }, thereby obtaining the point x 4 . Alternatively, we could first minimize Z 2 and then Z 1 , thereby obtaining the point x 2 . In both cases, only one point of the Pareto front is obtained.
Finally, we do not want to consider the third class of methods (interaction with the user) because it too would produce a single point of the Pareto front and also because we want to provide a standalone multicriteria optimization method.
Contrary to the first three classes of methods, the transformation approach can produce the whole Pareto front: take the Z 1 criterion as a constraint, fix its maximum value, minimize Z 2 under the constraint that Z 1 remains below this value, and iterate this process with different maximum values of Z 1 so as to produce each time a new point of the Pareto front. This is why the proposed method follows this approach.
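The transformation approach can be illustrated on a finite set of candidate solutions. The following is a minimal sketch, not the method of this paper: the function name `pareto_front_by_constraint` and the sample points are hypothetical, and the "minimization" step is a plain search over an explicit list rather than a scheduling heuristic.

```python
from math import inf

def pareto_front_by_constraint(solutions):
    """Enumerate Pareto points of a finite set of (z1, z2) pairs, both minimized.

    Z1 is turned into a constraint Z1 < bound, Z2 is minimized under it,
    and the bound is then tightened to the Z1 value of the point just found.
    """
    front = []
    bound = inf  # first run: Z1 < +infinity
    while True:
        feasible = [s for s in solutions if s[0] < bound]
        if not feasible:
            break
        # Minimize Z2; ties are broken on Z1 so that no dominated point is kept.
        best = min(feasible, key=lambda s: (s[1], s[0]))
        front.append(best)
        bound = best[0]  # next run: Z1 strictly below the value just obtained
    return front

# Hypothetical candidate solutions (z1, z2).
candidates = [(1, 9), (2, 6), (4, 4), (7, 2), (5, 5), (6, 7)]
print(pareto_front_by_constraint(candidates))
```

Each iteration yields one point, so the number of runs equals the number of points found plus one final run that detects an empty feasible set.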
Contributions and outline
The main contribution of this paper is TSH, the first tricriteria scheduling heuristic able to produce, starting from an application algorithm graph and an architecture graph, a Pareto front in the space (schedule length, reliability, power consumption), and taking into account the impact of voltage on the failure probability. Thanks to the use of active replication, TSH is able to provide any required level of reliability. TSH is an extension of our previous bicriteria (schedule length, reliability) heuristic called BSH [12] . The tricriteria extension presented in this paper is necessary because of the crucial impact of the voltage on the failure probability.
We first present in Sect. 2 an overview of TSH. Then, in Sect. 3 we introduce the models that we used, regarding the target architecture, the software application that must be scheduled on it, the execution characteristics of the software elements onto the processing elements, the failure hypothesis, and the power consumption. TSH itself is presented in detail in Sect. 4. In particular, we prove the soundness of TSH by demonstrating that the produced schedules always meet the desired constraint on the reliability and on the power consumption. Then, in Sect. 5, we define a mixed integer linear programming model (MILP) for our scheduling problem, to compute the optimal Pareto front. Section 6 presents our simulation results, including the comparison with the Energy-Conscious Scheduling heuristic (ECS [18]), and with the optimal Pareto front in the case of small problem instances. Finally, we review the related work in Sect. 7 and provide concluding remarks in Sect. 8.
Principle of the method and overview
The approach we have chosen to produce the whole Pareto front involves (1) transforming all the criteria except one into as many constraints, then (2) minimizing the last remaining criterion under those constraints, and (3) iterating this process with new values of the constraints. Figure 2 illustrates the particular case of two criteria $Z_1$ and $Z_2$. To obtain the Pareto front, $Z_1$ is transformed into a constraint, with its first value set to $K_1^1 = +\infty$. The first run involves minimizing $Z_2$ under the constraint $Z_1 < +\infty$, which produces the Pareto point $x_1$. For the second run, the constraint is set to the value of $x_1$, that is, $K_1^2 = Z_1(x_1)$: we therefore minimize $Z_2$ under the constraint $Z_1 < K_1^2$, which produces the Pareto point $x_2$, and so on. This process converges provided that the number of Pareto optima is bounded. Otherwise, it suffices to slice the interval $[0, +\infty)$ into a finite number of contiguous sub-intervals of the form $[K_1^{i+1}, K_1^{i})$, resulting in one point for each such interval. That way, the grain of the Pareto front can be improved by reducing the size of the intervals, at the cost of more iterations of the method. Note that each point obtained in this way is not necessarily a point of the Pareto front since it may be dominated by other points.
Now, the application algorithm graphs we are dealing with are large (tens to hundreds of operations, each operation being a software block), which makes exact scheduling methods infeasible, and likewise approximated methods with backtracking, such as branch-and-bound. We therefore chose to use list scheduling heuristics, first introduced in [15], which have demonstrated their good performance for scheduling large graphs [19]. We propose in this paper a tricriteria list scheduling heuristic, called TSH, adapted from [12]. TSH improves on [12] by working with three criteria: the schedule length, the reliability, and the power consumption.
Using list scheduling to minimize a criterion $Z_2$ under the constraint that another criterion $Z_1$ remains below some threshold $K_1^i$ (as in Fig. 2) requires that $Z_1$ be an invariant measure, not a varying one. For instance, the energy is a strictly increasing function of the schedule, in the mathematical sense: if $S'$ is a prefix schedule of $S$, then the energy consumed by $S$ is strictly greater than the energy consumed by $S'$. Hence, the energy is not an invariant measure; more precisely, it is additive. Figure 3a illustrates this fact. The operations are scheduled in the order 1, 2, and so on. Up to the operation number 6, the energy criterion is satisfied: $\forall\, 1 \le i \le 6,\ E(S^{(i)}) \le E_{obj}$, where $S^{(i)}$ is the partial schedule at iteration (i). But there is no way to prevent $S^{(7)}$ from failing to satisfy the criterion, because whatever the operation scheduled at iteration (7), $E(S^{(7)}) > E_{obj}$. And with list scheduling, it is not possible to backtrack.
As a consequence, using the energy as a constraint (i.e., $Z_1 = E$) and the schedule length as a criterion to be minimized (i.e., $Z_2 = L$) is bound to fail. Indeed, the fact that all the scheduling decisions made at the stage of any intermediary schedule $S'$ meet the constraint $E(S') < K$ cannot guarantee that the final schedule $S$ will meet the constraint $E(S) < K$. In contrast, the power consumption is an invariant measure (being the energy divided by the time), and that is why we take the power consumption as a criterion instead of the energy consumption (see Sect. 3.5).
The reliability is not an invariant measure either, because the contribution of each scheduled operation i is a probability in [0, 1], which is multiplied by the reliability of the partial schedule computed so far, $R(S^{(i-1)})$. The consequence is a "so far so good" situation, which results in a "funnel" effect on the replication level of the operations. This is illustrated by Fig. 3b: up to operation 4, the replication level is 1 because this choice minimizes the increase in the schedule length, and the reliability objective is satisfied: $R(S^{(i)}) > R_{obj}$ for $i \le 4$. But at this point, it is not possible to schedule operation 5 with no replication and at the same time satisfy the reliability objective. However, replicating this operation on all the processors of the target architecture (say four for the sake of the example) results in a probability very close to 1, therefore allowing the reliability to decrease only very slightly (Fig. 3b shows a horizontal line for R(S) after the fifth operation, but actually it decreases very slightly). That is why we take instead, as a criterion, the global system failure rate per time unit (GSFR), first defined in [12]. By construction, the GSFR is an invariant measure of the schedule's reliability (see Sect. 4.1).
For these reasons, each run of our tricriteria scheduling heuristic TSH minimizes the schedule length under the double constraint that the power consumption and the GSFR remain below some thresholds, noted, respectively, $P_{obj}$ and $\Lambda_{obj}$. By running TSH with decreasing values of $P_{obj}$ and $\Lambda_{obj}$, starting with $(+\infty, +\infty)$, we are able to produce the Pareto front in the 3D space (length, GSFR, power). This Pareto front shows the existing tradeoffs between the three criteria, allowing the user to choose the solution that best meets his/her application needs.
Finally, our method for producing a Pareto front could work with any other scheduling heuristic minimizing the schedule length under the constraints of both the reliability and the power.
Models
Application algorithm graph
Embedded real-time systems are reactive, and therefore consist of some algorithm executed periodically, triggered by a periodic execution clock. We follow the periodic task model of [17] , shown in Fig. 4b . Our model is, therefore, that of a synchronous application algorithm graph Alg, which is repeated infinitely to take into account the reactivity of the modeled system, that is, its reaction to external stimuli produced by its environment. In other words, the body of the periodic loop of Fig. 4b is captured by the Alg graph.
Alg is an acyclic oriented graph (X, D) (see Fig. 4a). Its nodes (the set X) are software blocks called operations. Each arc of Alg (the set D) is a data-dependency between two operations. If X → Y is a data-dependency, then X is a predecessor of Y, while Y is a successor of X. The set of predecessors of X is noted pred(X) while its set of successors is noted succ(X). X is also called the source of the data-dependency X → Y, and Y is its destination.
Operations with no predecessor are called input operations (I 1 , I 2 , and I 3 in Fig. 4a); they capture the "Read Inputs" phase of the periodic execution loop, each one being a call to a sensor driver. Operations with no successor are called output operations (e.g., O 2 ); they capture the "Update Outputs" phase, each one being a call to an actuator driver. The other operations (A to G) capture the "Compute" phase and have no side effect.
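As an illustration of these definitions, the sketch below builds pred and succ from a list of data-dependencies, checks that the graph is acyclic, and extracts the input and output operations. The dependency list is hypothetical and is not the graph of Fig. 4a; the function names are ours.

```python
from collections import defaultdict

# Hypothetical data-dependencies (source, destination); NOT the graph of Fig. 4a.
dependencies = [("I1", "A"), ("I2", "A"), ("I2", "B"),
                ("A", "C"), ("B", "C"), ("C", "O1")]

def build_graph(deps):
    """Return the set of operations and the pred/succ maps of the graph."""
    pred, succ = defaultdict(set), defaultdict(set)
    operations = set()
    for src, dst in deps:
        operations.update((src, dst))
        succ[src].add(dst)
        pred[dst].add(src)
    return operations, pred, succ

def is_acyclic(operations, pred, succ):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    remaining = {op: len(pred[op]) for op in operations}
    ready = [op for op, count in remaining.items() if count == 0]
    visited = 0
    while ready:
        op = ready.pop()
        visited += 1
        for nxt in succ[op]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)
    return visited == len(operations)

operations, pred, succ = build_graph(dependencies)
inputs = sorted(op for op in operations if not pred[op])   # no predecessor
outputs = sorted(op for op in operations if not succ[op])  # no successor
print(is_acyclic(operations, pred, succ), inputs, outputs)
```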
Architecture model
We assume that the architecture is a homogeneous and fully connected multiprocessor. It is represented by an architecture graph Arc, which is a non-oriented bipartite graph (P, L, A) whose set of nodes is P ∪ L and whose set of edges is A (see Fig. 5). P is the set of processors and L is the set of communication links. A processor is composed of a computing unit, to execute operations, and one or more communication units, to send or receive data to/from communication links. Typically, communication units are DMAs, which have the advantage of sending data in parallel with the processor. A point-to-point communication link is composed of a sequential memory that allows it to transmit data from one processor to another. Each edge of Arc (the set A) always connects one processor and one communication link. Here we assume that the Arc graph is complete, that is, there exists a communication link between any two processors.
Execution characteristics
Along with the algorithm graph Alg and the architecture graph Arc, we are also given a function Exe nom : (X × P) ∪ (D × L) → R + giving the nominal worst-case execution time (WCET) of each operation onto each processor and the worst-case communication time (WCCT) of each data-dependency onto each communication link. An intraprocessor communication takes no time to execute. Since the architecture is homogeneous, the WCET of a given operation is identical on all processors (similarly for the WCCT of a given data-dependency). We call Exe nom the nominal WCET because we will see in Sect. 3.5 that the actual WCET varies according to the voltage / frequency operating point of the processor.
WCET analysis is the topic of much work [29]. Knowing the execution characteristics is not a critical assumption, since WCET analysis has been applied with success to real-life processors actually used in embedded systems, with branch prediction, caches, and pipelines. In particular, it has been applied to one of the most critical embedded systems that exist, the Airbus A380 avionics software [6, 27] running on the Motorola MPC755 processor [10, 26].
Static schedules
The graphs Alg and Arc are the specification of the system. Its implementation involves finding a multiprocessor schedule of Alg onto Arc. This consists of four functions: the two spatial allocation functions $\Omega_O$ and $\Omega_L$ give, respectively, for each operation and each data-dependency of Alg, the subset of processors and of communication links of Arc that will execute it; and the two temporal allocation functions $\Theta_O$ and $\Theta_L$ give, respectively, the starting date of each operation and each data-dependency on its processor or its communication link:
$$\Omega_O : \mathcal{X} \to 2^{\mathcal{P}}, \qquad \Omega_L : \mathcal{D} \to 2^{\mathcal{L}}, \qquad \Theta_O : \mathcal{X} \times \mathcal{P} \to \mathbb{R}^+, \qquad \Theta_L : \mathcal{D} \times \mathcal{L} \to \mathbb{R}^+$$
In this work we only deal with static schedules, for which the functions $\Theta_O$ and $\Theta_L$ are static, and our schedules are computed off-line; i.e., the start time of each operation (resp. each data-dependency) on its processor (resp. its communication link) is statically known. A static schedule is without replication if for each operation X and each data-dependency D, $|\Omega_O(X)| = 1$ and $|\Omega_L(D)| = 1$. Otherwise, the schedule is with replication, and the number $|\Omega_O(X)|$ (resp. $|\Omega_L(D)|$) is called the replication factor of X (resp. of D). A schedule is partial if not all the operations and data-dependencies of Alg have been scheduled, but all the operations that are scheduled are such that all their predecessors are also scheduled. Finally, the length of a schedule is the max of the termination times of the last operation scheduled on each of the processors of Arc (in the literature, it is also called the makespan). For a schedule S, we note it L(S):
$$L(S) = \max_{p_j \in \mathcal{P}} \; \max_{X \,:\, p_j \in \Omega_O(X)} \big( \Theta_O(X, p_j) + Exe_{nom}(X, p_j) \big)$$
In the sequel, we will write X ∈ P instead of X : P ∈ Ω O (X ) for the sake of simplicity. We will also number the processors from 1 to |P| and use their number in index, for instance p j (and similarly for the communication links).
Voltage, frequency, and power consumption
The maximum supply voltage is noted $V_{max}$ and the corresponding highest operating frequency is noted $f_{max}$. The WCET of any given operation is computed with the processor operating at $f_{max}$ and $V_{max}$ (and similarly for the WCCT of the data-dependencies). Because the circuit delay is almost linearly related to 1/V [5], there is a linear relationship between the supply voltage V and the operating frequency f. From now on, we will assume that the operating frequencies are normalized, that is, $f_{max} = 1$ and any other frequency f is in the interval (0, 1). Accordingly, we define in Eq. (2) a new function Exe that gives the execution time of the operation or data-dependency X placed onto the hardware component C, be it a processor or a communication link, which is running at frequency f. In other words, f is taken as a scaling factor:
$$Exe(X, C, f) = \frac{Exe_{nom}(X, C)}{f} \qquad (2)$$
The power consumption P of a single operation or data-dependency placed on a single hardware component is computed according to the classical model found for instance in [21, 30]:
$$P = P_s + h\,(P_{ind} + P_d), \qquad P_d = C_{ef}\, V^2 f \qquad (3)$$
where P s is the static power (the power needed to maintain basic circuits and to keep the clock running), h is equal to 1 when the circuit is active and 0 when it is inactive, P ind is the frequency-independent active power (the power portion that is independent of the voltage and the frequency; it becomes 0 when the system is put to sleep, but doing so is very expensive [9]), P d is the frequency-dependent active power (the processor dynamic power and any power that depends on the voltage or the frequency), C ef is the switching capacitance, V is the supply voltage, and f is the operating frequency. C ef is assumed to be constant for all operations; this is a simplifying assumption, since one would normally need to take into account the actual switching activity of each operation to compute the consumed energy accurately. However, such an accurate computation is infeasible for the application sizes we consider here. For processors, this model is widely accepted for average-size applications, where C ef can be assumed to be constant for the whole application [30]. For communication links on a multicore platform, this model is also relevant, as communication links are specialized processing elements [21]. Of course, the coefficients in Eq. (3) should be distinct for processors and communication links. We use the following notations:
Since the architecture is homogeneous, each processor (resp. communication link) has an identical value $P^p_{ind}$ (resp. $P^\ell_{ind}$) and similarly an identical value $C^p_{ef}$ (resp. $C^\ell_{ef}$). In contrast, since the voltage and frequency vary, the frequency-dependent power of each processor $p_j$ varies over time.
For a multiprocessor schedule S, we cannot apply Eq. (3) directly because each processor is potentially operating at a different V and f, which vary over time. Instead, we must compute the total energy E(S) consumed by S and then divide by the schedule length L(S):
$$P(S) = \frac{E(S)}{L(S)} \qquad (4)$$
We compute E(S) with Eq. (5) below, by summing the contribution of each processor and of each communication link:
The first sum over |P| accounts for the processors while the second sum over |L| accounts for the communication links. Irrespective of whether a processor is active or idle, it always consumes at least P p ind watts; hence the first term P (6):
Failure hypothesis
Both processors and communication links can fail, and they are fail-silent (a behavior that can be achieved at a reasonable cost [3]). Classically, we adopt the failure model of Shatz and Wang [25]: failures are transient and the maximal duration of a failure is such that it affects only the current operation executing onto the faulty processor, and not the subsequent operations (same for the communication links); this is the "hot" failure model. The occurrence of failures on a processor (resp. a communication link) follows a Poisson law with a constant parameter λ, its failure rate per time unit. Modern fail-silent processors can have a failure rate around $10^{-6}$/h [3]. Transient failures are the most common failures in modern embedded systems, all the more so when the processor voltage is lowered to reduce the energy consumption, because even very low-energy particles are likely to create a critical charge leading to a transient failure [30]. Besides, failure occurrences are assumed to be statistically independent events. For hardware faults, this hypothesis is reasonable, but this would not be the case for software faults [16].
The reliability of a system is defined as the probability that it operates correctly during a given time interval [1]. According to our model, the reliability of the processor P (resp. the communication link L) during the duration d is $R = e^{-\lambda d}$. Conversely, the probability of failure of the processor P (resp. the communication link L) during the duration d is $F = 1 - R = 1 - e^{-\lambda d}$. Hence, the reliability of the operation or data-dependency X placed onto the hardware component C (be it a processor or a communication link), whose failure rate per time unit is $\lambda_C$ and which runs at frequency f, is
$$R(X, C) = e^{-\lambda_C \, Exe(X, C, f)} \qquad (7)$$
From now on, the function R will either be used with two variables as in Eq. (7), or with only one variable to denote the reliability of a schedule (or a part of a schedule).
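To give an order of magnitude, the sketch below evaluates $R = e^{-\lambda d}$ and $F = 1 - R$ for a single operation. The failure rate is the $10^{-6}$/h figure quoted above; the 10 ms duration is a hypothetical value chosen for illustration, and the function names are ours.

```python
import math

def reliability(failure_rate, duration):
    """R = exp(-lambda * d) for a constant failure rate (Poisson model)."""
    return math.exp(-failure_rate * duration)

def failure_probability(failure_rate, duration):
    """F = 1 - exp(-lambda * d), computed with expm1 to stay accurate for tiny values."""
    return -math.expm1(-failure_rate * duration)

lam = 1e-6                 # failures per hour (order of magnitude quoted in the text)
d = 10e-3 / 3600.0         # hypothetical 10 ms operation, expressed in hours

print(f"R = {reliability(lam, d)!r}")
print(f"F = {failure_probability(lam, d):.3e}")
```

Using `math.expm1` avoids the cancellation that `1 - math.exp(-x)` suffers when x is close to zero, which is the typical regime for a single operation.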
Since the architecture is homogeneous, the failure rate per time unit is identical for each processor (noted $\lambda_p$) and similarly for each communication link (noted $\lambda_\ell$).
Figure 6 shows a simple schedule S where operations X and Z are placed onto P 1 , operation Y onto processor P 2 , and the data-dependency X → Y is placed onto the link L 12 . We detail below the contribution of each hardware component to the consumed energy according to Eq. (6):
5. Failures are transient: the maximal duration of a failure is such that it affects only the current operation executing onto the faulty processor, and not the subsequent operations (same for the communication links). Single-event upsets (SEUs), which are the most common failures affecting hardware elements, fall in this category.
6. Failure occurrences are statistically independent events. For hardware faults, this hypothesis is reasonable, but this would not be the case for software faults [16].
7. The occurrence of failures on a hardware element follows a Poisson law with a constant parameter λ. Over the lifetime of the element, λ changes according to a "bathtub" curve, with a "flat" portion in the middle. Thanks to this flat portion, a constant λ can be reasonably assumed for the processors usually deployed in safety-critical systems.
The tricriteria scheduling algorithm TSH
Global system failure rate
As we have shown in Sect. 2, we must use the global system failure rate (GSFR) instead of the system's reliability as a criterion. The GSFR is the failure rate per time unit of the obtained multiprocessor schedule, seen as if it were a single operation scheduled onto a single processor [12]. The GSFR of a static schedule S, noted Λ(S), is computed by Eq. (8):
$$\Lambda(S) = \frac{-\log R(S)}{U(S)} \qquad (8)$$
Equation (8) uses the reliability R(S), which, in the case of a static schedule S without replication, is simply the product of the reliability of each operation and data-dependency of S (by definition of the reliability, Sect. 3.6), each one being placed onto its hardware component C:
$$R(S) = \prod_{(X, C) \in S} R(X, C) \qquad (9)$$
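Equation (8) also uses the total processor utilization U(S) instead of the schedule length L(S), so that the GSFR can be computed compositionally. With f the frequency at which the processor P executes the operation X:
$$U(S) = \sum_{(X, P) \in S} Exe(X, P, f) \qquad (10)$$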
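Thanks to Eqs. (8), (9), and (10), the GSFR is invariant: for any schedules $S_1$ and $S_2$ such that $S = S_1 \bullet S_2$, where "•" is the concatenation of schedules, if $\Lambda(S_1) \le K$ and $\Lambda(S_2) \le K$, then $\Lambda(S) \le K$. Indeed, $-\log R(S) = -\log R(S_1) - \log R(S_2)$ and $U(S) = U(S_1) + U(S_2)$, so $\Lambda(S)$ is a weighted mean of $\Lambda(S_1)$ and $\Lambda(S_2)$.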
Finally, it is very easy to translate a reliability objective R obj into a GSFR objective Λ obj : one just needs to apply the formula Λ obj = − log R obj /D, where D is the mission duration. This shows how to use the GSFR criterion in practice.
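The sketch below works through the GSFR of a schedule without replication and the conversion of a reliability objective into a GSFR objective. It assumes the form $\Lambda(S) = -\log R(S) / U(S)$ with R(S) a product of terms $e^{-\lambda d}$; the function names, the sample placements, and the 10 h mission duration are hypothetical.

```python
import math

def gsfr(placements):
    """GSFR of a schedule without replication.

    placements: list of (failure_rate, execution_time) pairs, one per scheduled item.
    Assumes Lambda(S) = -log R(S) / U(S), with R(S) the product of exp(-lambda * d).
    """
    minus_log_r = sum(lam * d for lam, d in placements)  # -log of the product
    utilization = sum(d for _, d in placements)
    return minus_log_r / utilization

def gsfr_objective(r_obj, mission_duration):
    """Lambda_obj = -log(R_obj) / D."""
    return -math.log(r_obj) / mission_duration

# Hypothetical partial schedules: (failure rate per hour, duration in hours).
s1 = [(1e-6, 2.0)]
s2 = [(3e-6, 1.0)]
print(gsfr(s1), gsfr(s2), gsfr(s1 + s2))

# Reliability objective 1 - 1e-9 over a hypothetical 10-hour mission.
print(gsfr_objective(1 - 1e-9, 10.0))
```

The GSFR of the concatenation lies between the GSFRs of its two parts, which is the invariance property that list scheduling needs: if every step stays below the threshold, so does the final schedule.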
Decreasing the power consumption
Two operating parameters of a chip can be modified to lower the power consumption: the frequency and the voltage. We assume that each processor can be operated with a finite set of supply voltages, noted $\mathcal{V}$. We thus have $\mathcal{V} = \{V_0, V_1, \ldots, V_{max}\}$. To each supply voltage V corresponds an operating frequency f. We choose not to modify the operating frequency and the supply voltage of the communication links.
We assume that the cache size is adapted to the application, therefore ensuring that the execution time of an application is linearly related to the frequency [22] (i.e., the execution time is doubled when frequency is halved).
To lower the energy consumption of a chip, we use dynamic voltage and frequency scaling (DVFS), which lowers the voltage and increases proportionally the cycle period. However, DVFS has an impact on the failure rate [30]. Indeed, a lower voltage leads to a smaller critical energy, and hence the system becomes sensitive to lower-energy particles. As a result, the fault probability increases due both to the longer execution time and to the lower energy: the voltage-dependent failure rate λ(f) is
$$\lambda(f) = \lambda_0 \cdot 10^{\frac{b\,(1 - f)}{1 - f_{min}}} \qquad (11)$$
where $\lambda_0$ is the nominal failure rate per time unit, b > 0 is a constant, f is the frequency scaling factor, and $f_{min}$ is the lowest operating frequency. At $f_{min}$ and $V_{min}$, the failure rate is maximal:
$$\lambda_{max} = \lambda(f_{min}) = \lambda_0 \cdot 10^{b}$$
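The combined effect of DVFS on the failure probability can be sketched numerically. The code below assumes the exponential model of [30], $\lambda(f) = \lambda_0 \cdot 10^{b(1-f)/(1-f_{min})}$, together with an execution time that scales as 1/f; the numerical values of $\lambda_0$, b, $f_{min}$, and the nominal WCET are hypothetical, as are the function names.

```python
import math

def failure_rate(f, lam0, b, f_min):
    """Assumed model: lambda(f) = lam0 * 10 ** (b * (1 - f) / (1 - f_min))."""
    if not (0.0 < f_min < 1.0 and f_min <= f <= 1.0):
        raise ValueError("expected 0 < f_min < 1 and f_min <= f <= 1")
    return lam0 * 10 ** (b * (1.0 - f) / (1.0 - f_min))

def operation_failure_probability(wcet_nom, f, lam0, b, f_min):
    """Failure probability of one operation run at normalized frequency f."""
    duration = wcet_nom / f  # execution time stretches as the frequency drops
    return -math.expm1(-failure_rate(f, lam0, b, f_min) * duration)

# Hypothetical parameters, chosen only for illustration.
LAM0, B, F_MIN, WCET = 1e-6, 2.0, 0.25, 0.01
for f in (1.0, 0.75, 0.5, 0.25):
    rate = failure_rate(f, LAM0, B, F_MIN)
    prob = operation_failure_probability(WCET, f, LAM0, B, F_MIN)
    print(f"f={f:.2f}  lambda={rate:.3e}  F={prob:.3e}")
```

Lowering f penalizes the failure probability twice: the rate grows exponentially and the exposure time grows as 1/f.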
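We apply DVFS to the processors and we assume that the voltage switch time can be neglected compared with the WCET of the operations. To take into account the voltage in the schedule, we modify the spatial allocation function $\Omega_O$ to give the supply voltage of the processor for each operation:
$$\Omega_O : \mathcal{X} \to \mathcal{Q}$$
where $\mathcal{Q}$ is the set of non-empty sets of pairs $\langle p, v \rangle \in \mathcal{P} \times \mathcal{V}$ in which each processor appears at most once.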
To compute the number of elements in $\mathcal{Q}$, we count the number of sets of pairs $\langle p, v \rangle$ for each element of $2^{\mathcal{P}}$ except the empty set. Each element $E \in 2^{\mathcal{P}}$ accounts for $|\mathcal{V}|^{|E|}$ elements in $\mathcal{Q}$. Take, for example, P = {p 1
Decreasing the GSFR
According to Eq. (8), decreasing the GSFR is equivalent to increasing the reliability. Several techniques can be used to increase the reliability of a system. Their common point is to include some form of redundancy (this is because the target architecture Arc, with the failure rates of its components, is fixed) [11] . We have chosen the active replication of the operations and the data-dependencies, which consists in executing several copies of a same operation onto as many distinct processors (resp. data-dependencies onto communication links). Adding more replicas increases not only the reliability, but also, in general, the schedule length: in this sense, we say that the two criteria, length and reliability, are antagonistic.
To compute the GSFR of a static schedule with replication, we use Reliability Block-Diagrams (RBD) [2, 20] . An RBD is an acyclic-oriented graph (N , E), where each node of N is a block representing an element of the system, and each arc of E is a causality link between two blocks. Two particular connection points are its source S and its destination D. An RBD is operational if and only if there exists at least one operational path from S to D. A path is operational if and only if all the blocks in this path are operational. The probability that a block be operational is its reliability. By construction, the probability that an RBD be operational is thus the reliability of the system it represents.
In our case, the system is the multiprocessor static schedule, possibly partial, of Alg onto Arc. Each block represents an operation X placed onto a processor P i or a data-dependency X → Y placed onto a communication link L j . The reliability of a block is, therefore, computed according to Eq. (7).
Computing the reliability in this way requires the occurrences of the failures to be statistically independent events. Without this hypothesis, the fact that some blocks belong to several paths from S to D makes the reliability computation infeasible. At each iteration of the scheduling heuristic, we compute the RBD of the partial schedule obtained so far, then we compute the reliability based on this RBD, and finally we compute the GSFR of the partial schedule with Eq. (8) .
Finally, computing the reliability of an RBD with replications is, in general, exponential in the size of the schedule. To avoid this problem, we insert routing operations so that the RBD of any partial schedule is always serial-parallel (i.e., a sequence of parallel macro-blocks), hence making the GSFR computation linear [12]. The idea is that, for each data-dependency X → Y such that it has been decided to replicate X k times and Y ℓ times, a routing operation R will collect all the data sent by the k replicas of X and send it to the ℓ replicas of Y (see Fig. 7). This scheme, known as "replication for reliability" [13], has a drawback in terms of schedule length, because the routing operation R cannot complete before it has received the data sent by all the replicas of X. However, it has been shown in [12] that, on average, the overhead of inserting routing operations on the schedule length is less than 4 %.
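The reason a serial-parallel RBD is cheap to evaluate can be sketched as follows. Under the independence hypothesis, a parallel macro-block fails only if all its replicas fail, and the sequence is operational only if every macro-block is. The function names and the block reliabilities below are hypothetical illustration values, not taken from any schedule of this paper.

```python
import math

def parallel_block_reliability(replica_reliabilities):
    """A parallel macro-block fails only if every one of its replicas fails."""
    prob_all_fail = 1.0
    for r in replica_reliabilities:
        prob_all_fail *= (1.0 - r)
    return 1.0 - prob_all_fail

def serial_parallel_reliability(macro_blocks):
    """Reliability of a sequence of parallel macro-blocks (independent failures)."""
    result = 1.0
    for block in macro_blocks:
        result *= parallel_block_reliability(block)
    return result

# Hypothetical RBD: one non-replicated block, then a block replicated twice,
# then a block replicated three times.
rbd = [[0.9], [0.9, 0.9], [0.8, 0.8, 0.8]]
print(serial_parallel_reliability(rbd))
```

Each block is visited exactly once, so the cost is linear in the total number of replicas, whereas a general RBD requires enumerating paths or cut sets.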
Principle of the scheduling heuristic TSH
To obtain the Pareto front in the space (length, GSFR, power), we predefine a virtual grid in the objective plane (GSFR, power), and for each cell of the grid we solve one different single-objective problem constrained to this cell, using the scheduling heuristic TSH presented below. The single objective is the schedule length, which TSH aims at minimizing.
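The grid-based exploration can be sketched as follows. This is only an illustration of the principle: `solve_cell` is a hypothetical stand-in for one run of the scheduling heuristic under the two constraints, and it is assumed to return the schedule length, or `None` when no schedule meets the constraints of the cell.

```python
from itertools import product

def sweep_grid(power_bounds, gsfr_bounds, solve_cell):
    """Solve one constrained single-objective problem per grid cell.

    solve_cell(p_obj, gsfr_obj) is a hypothetical callable returning the
    schedule length found under these two bounds, or None if infeasible.
    Each returned point is (length, gsfr_obj, p_obj), i.e., the length
    together with the bounds of the cell it was computed for.
    """
    points = []
    for p_obj, gsfr_obj in product(power_bounds, gsfr_bounds):
        length = solve_cell(p_obj, gsfr_obj)
        if length is not None:
            points.append((length, gsfr_obj, p_obj))
    return points

# Toy stand-in for the heuristic: tighter power bounds give longer schedules,
# and bounds below 1.5 are declared infeasible.
def toy_solver(p_obj, gsfr_obj):
    return 10.0 / p_obj if p_obj >= 1.5 else None

grid_points = sweep_grid([2.0, 1.0], [1e-3], toy_solver)
```

The candidate points collected in this way still have to be filtered to keep only the non-dominated ones.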
TSH is a ready list scheduling heuristic. It takes as input an algorithm graph Alg, a homogeneous architecture graph Arc, the function Exe giving the WCETs and WCCTs, and two constraints Λ obj and P obj . It produces as output a static multiprocessor schedule S of Alg onto Arc, such that the GSFR of S is smaller than Λ obj , the power consumption is smaller than P obj , and such that its length is as small as possible. TSH uses active replication of operations to meet the Λ obj constraint, dynamic voltage scaling to meet the P obj constraint, and the power-efficient schedule pressure as a cost function to minimize the schedule length.
Besides, TSH inserts routing operations to make sure that the RBD of any partial schedule is serial-parallel (otherwise, computing the reliability is exponential in the size of the schedule, as discussed above). For ease of notation, we sometimes write $P^{(n)}$ for $P(S^{(n)})$, and similarly for the schedule length L, the energy E, or the GSFR Λ.
Power-efficient schedule pressure
The power-efficient schedule pressure is a variant of the schedule pressure cost function [14], which tries to minimize the length of the critical path of the algorithm graph by exploiting the scheduling margin of each operation. The schedule pressure σ is computed for each ready operation $o_i$ and each processor $p_j$ by Eq. (12), where $CPL^{(n)}$ is the critical path length of the partial schedule at step (n), composed of the already scheduled operations. For the operations that are not scheduled yet, we do not know what their execution time will be (this will only be known when these future operations are actually scheduled). Hence, for each future operation, we compute its average WCET over all existing supply voltages. Equation (13) generalizes the schedule pressure to a set of processors:
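Then, we consider the schedule length as a criterion to be minimized, and the energy increase and the GSFR as two constraints to be met: for each ready operation $o_i \in O^{(n)}_{ready}$, we select the subset of pairs that minimizes the schedule pressure under these two constraints:

$$Q^{(n)}_{best}(o_i) = \arg\min_{Q_k \in \mathcal{Q}} \left\{ \sigma^{(n)}(o_i, Q_k) \;\middle|\; E^{(n+1)} - E^{(n)} \le P_{obj}\,(L^{(n+1)} - L^{(n)}) \;\wedge\; \Lambda_B(o_i, Q_k) \le \Lambda_{obj} \right\} \quad (14)$$

where $\mathcal{Q}$ is the set of all subsets of pairs ⟨p, v⟩ such that p is a processor of Arc and v is one of its supply voltages. The constraint on the energy increase, $E^{(n+1)} - E^{(n)} \le P_{obj}\,(L^{(n+1)} - L^{(n)})$, is equivalent to a constraint on the power consumption.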
The local constraint on the current macro-block of the RBD, Λ B (o i , Q k ) ≤ Λ obj , guarantees that the global constraint on the schedule at iteration (n + 1), Λ (n+1) ≤ Λ obj , is met. This will be formally established by Proposition 1 (see Sect. 4.7).
Similarly, we would like the local constraint on the energy increase due to $o_i$, namely $E^{(n+1)} - E^{(n)} \le P_{obj}\,(L^{(n+1)} - L^{(n)})$, to guarantee that the global constraint at iteration (n + 1) on the full schedule, $P^{(n+1)} \le P_{obj}$, is met. Unfortunately, there is a counter-example.
Consider the case where the operation $o_i$ scheduled at iteration (n) does not increase the schedule length, because it fits in a slack at the end of the previous schedule: $L^{(n+1)} = L^{(n)}$.
In contrast, the total energy always increases strictly because of $o_i$: $E^{(n+1)} - E^{(n)} > 0$. It follows that, whatever the choice of processors, voltage, and frequency for $o_i$, it is impossible to schedule it such that the energy increase constraint is met.
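The counter-example can be checked with a few lines of arithmetic. The sketch below uses hypothetical values (a partial schedule sitting exactly on the power bound, and a new operation that fits entirely in an ending slack); it only illustrates why the local constraint cannot hold when the length does not increase.

```python
def power(energy, length):
    """Average power of a schedule: total energy divided by schedule length."""
    return energy / length

# Hypothetical values, chosen only for illustration.
p_obj = 1.0
energy_n, length_n = 10.0, 10.0   # partial schedule exactly at the power bound
delta_e = 0.5                     # energy added by the new operation (always > 0)
delta_l = 0.0                     # the operation fits in a slack: no length increase

# Local constraint on the energy increase: cannot hold when delta_l is 0.
local_ok = delta_e <= p_obj * delta_l

# Power of the new partial schedule: it now exceeds the bound.
power_next = power(energy_n + delta_e, length_n + delta_l)
```

Here the local constraint is violated (0.5 is not at most 0), and the power of the partial schedule rises from 1.0 to 1.05, above the bound.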
Over-estimation of the energy consumption
To prevent this and to guarantee the invariance property of P, we over-estimate the power consumption by computing the consumed energy as if all the ending slacks were "filled" by an operation executed at $(f_{over}, V_{over})$. We choose the largest frequency and voltage $(f_{over}, V_{over})$ such that Condition (15) is met.
We start with $(f_{max}, V_{max})$. If Condition (15) is not met, then we select the next highest operating frequency, and so on until Condition (15) is met. Thanks to this over-estimation, even if the next scheduled operation fits in a slack and does not increase the length, we are sure that it will not increase the power consumption either. This is illustrated in Fig. 8.
(Note: since $\Lambda_B(o_i, Q_k)$ is not the GSFR of the partial schedule $S^{(n+1)}$ but only of the macro-block of $o_i$, it does not bear the superscript (n+1).)
Formally, we now compute the total energy consumed by the schedule S with Eq. (16) instead of Eq. (6). We call E + the over-estimated energy consumption:
where $L(S) - M_j$ is the slack available at the end of processor $p_j$, for each processor $p_j$. Compared with Eq. (6), we see that the over-estimating term $V_{over}^2\,(L(S) - M_j)$ has been added. Accordingly, we now compute the power-efficient schedule pressure with Eq. (17) instead of Eq. (14):
Once we have computed, for each ready operation $o_i$ of $O^{(n)}_{ready}$, the best subset of pairs ⟨processor, voltage⟩ to execute $o_i$, with the power-efficient schedule pressure of Eq. (17), we compute the most urgent of these operations with Eq. (18):
Finally, we schedule this most urgent operation $o_{urg}$ on the processors of the set $Q^{(n)}_{best}(o_{urg})$.
Soundness of TSH
The soundness of TSH is based on two propositions. The first one establishes that the schedules produced by TSH meet their GSFR constraint. Its proof can be found in [12] :
Proposition 1 Let S be a multiprocessor schedule of Alg onto Arc. If each operation o of Alg has been scheduled according to Eqs. (17) and (18) such that the reliability is computed with Eq. (8), then the GSFR Λ(S) is less than Λ obj .
The second proposition establishes that the schedules produced by TSH meet their power consumption constraint:
Proposition 2 Let S be a multiprocessor schedule of Alg onto Arc. If each operation o of Alg has been scheduled according to Eqs. (17) and (18) such that the energy consumption is computed with Eq. (16), then the total power consumption P(S) is less than P obj .
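Proof First, we observe that, for any non-empty schedule S, $P(S) \le P_{obj}$ is equivalent to $E(S) \le P_{obj}\,L(S)$. Moreover, since $E(S) \le E^{+}(S)$, it is sufficient to prove the inequality of Eq. (19):

$$E^{+}(S) \le P_{obj}\,L(S) \quad (19)$$

We prove Eq. (19) by induction on the scheduling iteration (n). The induction hypothesis [H] is on the energy consumed by the partial schedule $S^{(n)}$:

$$E^{+}(S^{(n)}) \le P_{obj}\,L(S^{(n)}) \quad [H]$$

[H] is satisfied for the initial empty schedule $S^{(0)}$ because $E^{+}(S^{(0)}) = 0$ and $L(S^{(0)}) = 0$. For the induction step, the operation scheduled at iteration (n) is chosen according to Eq. (17), so the over-estimated energy increase satisfies $E^{+}(S^{(n+1)}) - E^{+}(S^{(n)}) \le P_{obj}\,(L(S^{(n+1)}) - L(S^{(n)}))$.
Thanks to [H], we thus have

$$E^{+}(S^{(n+1)}) \le E^{+}(S^{(n)}) + P_{obj}\,(L(S^{(n+1)}) - L(S^{(n)})) \le P_{obj}\,L(S^{(n)}) + P_{obj}\,(L(S^{(n+1)}) - L(S^{(n)})) = P_{obj}\,L(S^{(n+1)})$$

As a conclusion, [H] holds for the schedule $S^{(n+1)}$.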
The TSH algorithm
The TSH scheduling heuristic is shown in Fig. 9. Initially, $O^{(0)}_{sched}$ is empty and $O^{(0)}_{ready}$ is the list of operations without any predecessors. At the end of each iteration (n), these lists are updated according to the data-dependencies of Alg.
At each iteration (n), one operation $o_i$ of the list $O^{(n)}_{ready}$ is selected to be scheduled. For this, we select at micro-steps ➀ and ➁, for each ready operation $o_i$, the best subset of processors $Q^{(n)}_{best}(o_i)$ to replicate and schedule $o_i$, such that the GSFR of the resulting partial schedule is less than $\Lambda_{obj}$ and the power consumption is less than $P_{obj}$; at this point, each processor is selected with a voltage. Then, among those best pairs ⟨$o_i$, $Q^{(n)}_{best}(o_i)$⟩, we select at micro-step ➂ the one having the biggest power-efficient schedule pressure value, i.e., the most urgent pair ⟨$o_{urg}$, $Q^{(n)}_{best}(o_{urg})$⟩.
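The overall structure of such a ready-list heuristic can be sketched in Python. This is a simplified illustration, not the algorithm of Fig. 9: the callables `pressure` and `feasible` are hypothetical placeholders for the power-efficient schedule pressure and for the two constraints (GSFR and power), and the sketch assumes that, for a given operation, the best candidate subset is the feasible one with the smallest pressure.

```python
def list_schedule(preds, candidates, pressure, feasible):
    """Generic ready-list scheduling loop (simplified sketch).

    preds      : dict mapping each operation to the set of its predecessors
    candidates : list of candidate subsets of (processor, voltage) pairs
    pressure   : hypothetical callable (op, cand, scheduled) -> float
    feasible   : hypothetical callable (op, cand, scheduled) -> bool
    Returns the list of (operation, chosen subset) in scheduling order.
    """
    scheduled = []
    done = set()
    ready = {op for op, ps in preds.items() if not ps}
    while ready:
        best = {}
        for op in sorted(ready):
            ok = [c for c in candidates if feasible(op, c, scheduled)]
            if not ok:
                raise ValueError(f"no feasible candidate for {op}")
            # Micro-steps 1 and 2: best feasible subset for this operation.
            best[op] = min(ok, key=lambda c: pressure(op, c, scheduled))
        # Micro-step 3: the most urgent operation has the biggest pressure.
        urgent = max(sorted(ready), key=lambda op: pressure(op, best[op], scheduled))
        scheduled.append((urgent, best[urgent]))
        done.add(urgent)
        ready.remove(urgent)
        # Update the ready list according to the data-dependencies.
        ready |= {op for op, ps in preds.items()
                  if op not in done and op not in ready and ps <= done}
    return scheduled

# Toy instance with hypothetical pressures: "a" precedes "b" and "c".
toy_preds = {"a": set(), "b": {"a"}, "c": {"a"}}
toy_cands = [("p1",), ("p1", "p2")]
toy_base = {"a": 1, "b": 3, "c": 2}
toy_order = list_schedule(
    toy_preds, toy_cands,
    pressure=lambda op, c, s: toy_base[op] + len(c),
    feasible=lambda op, c, s: True,
)
```

On this toy instance, the loop schedules "a" first, then "b" (the most urgent of the two ready operations), then "c", each on the single-processor candidate.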
In Sect. 6, we will present a complete set of simulation results, first involving TSH alone, then comparing TSH with a multicriteria heuristic from the literature, and finally comparing TSH with an optimal Mixed Integer Linear Program.
Mixed integer linear programming approach
In this section, we define a mixed integer linear programming model (MILP) for our scheduling problem. Our goal is to compare the optimal results obtained by our MILP program with those achieved by TSH on small Alg graphs. Comparisons will be shown in Sect. 6.4.
For each operation t i of Alg, let s ik ∈ R + be the starting execution time of its kth replica:
Let p ik ∈ N be the processor index where the kth replica of operation t i is to be executed. The value 0 indicates that no processor is selected:
Let $x_{ik\ell}$ be 1 if the kth replica of operation $t_i$ is assigned to processor number ℓ, and 0 otherwise:

Fig. 9 The TSH tricriteria scheduling heuristic
The constraints (24) link the mapping variables x with the processors indices p:
Let $x_{ik\ell m}$ be 1 if the kth replica of operation $t_i$ is assigned to processor number ℓ and runs with frequency m, and 0 otherwise:
We can then define W as the schedule length of Alg on Arc (the makespan):
Let U be the total utilization of the processors of Arc:
The two global objectives $\Lambda_{obj}$ and $R_{obj}$ are related by the reliability formula, where U is the total utilization defined above:

$$R_{obj} = e^{-\Lambda_{obj}\,U}$$
In order to model the non-overlapping of operations and to reflect the fact that the multiprocessor schedule must enforce the precedence of the Alg graph, we define two sets of binary variables $\sigma_{ik,jk'}$ and $\varepsilon_{ik,jk'}$ such that:
- for each i, j, $\sigma_{ik,jk'}$ is equal to 1 if the kth replica of operation $t_i$ ends before the k'th replica of operation $t_j$ starts, and 0 otherwise:
- for each i, j, $\varepsilon_{ik,jk'}$ is equal to 1 if the index of the processor of the kth replica of operation $t_i$ is strictly less than the processor index of the k'th replica of operation $t_j$, and 0 otherwise:
These two variables σ and ε must satisfy the following constraints:
We define the time order on operations in terms of the σ variables in (32), and similarly we define the processors indices order on operations in terms of the ε variables in (33) where |P| is the number of processors in Arc. By (34), we ensure that operations do not overlap on a processor. By (35), we ensure that an operation cannot be scheduled both before and after another operation. Similarly, by (36), an operation cannot be placed both on a higher and on a lower processor index than another operation. Finally, (37) enforces the task precedence constraints.
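The role of the σ and ε variables can be illustrated by a small checker that tests the same disjunction on an already computed schedule: two replicas must either be placed on different processors or be ordered in time. This Python sketch paraphrases the prose above rather than the exact constraints (32) to (37); the tuple layout `(processor_index, start, end)` is a hypothetical encoding.

```python
from itertools import combinations

def no_overlap(replicas):
    """Check that no two replicas overlap on the same processor.

    replicas is a list of (processor_index, start, end) tuples.
    For each pair, either the processor indices differ (one of the two
    epsilon variables would be 1) or one replica ends before the other
    starts (one of the two sigma variables would be 1).
    """
    for (p1, s1, e1), (p2, s2, e2) in combinations(replicas, 2):
        different_processor = p1 != p2
        ordered_in_time = e1 <= s2 or e2 <= s1
        if not (different_processor or ordered_in_time):
            return False
    return True

# Hypothetical schedules: the first is valid, the second overlaps on processor 1.
valid = no_overlap([(1, 0, 10), (1, 10, 20), (2, 5, 15)])
invalid = no_overlap([(1, 0, 10), (1, 5, 15)])
```

In the MILP, the same condition is expressed with linear constraints on binary variables, since a solver cannot evaluate a disjunction directly.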
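Let $Y_{iK}$ be a binary variable equal to 1 if the replication level for operation $t_i$ is K, with $1 \le K \le R_{max}$, and 0 otherwise. Here, $R_{max}$ is the maximal allowed replication level for the operations, and each operation has exactly one replication level:

$$\forall i,\quad \sum_{K=1}^{R_{max}} Y_{iK} = 1$$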
We constrain the power and the reliability of Alg on Arc in (40) and (41), respectively. Here, R B i K is the reliability of the operation i when replicated exactly K times on processors identified by the set L composed of K processor indices, with frequencies identified by the set M composed of K frequency values:
Based on these definitions, the formulation of the MILP is to minimize the execution length W under the constraints specified by Eqs. (20) to (41). This formulation is a bilinear program, where the bilinearities arise from the reliability constraints. We have linearized this model by introducing a new set of variables that replace the bilinear terms.
In Sect. 6.4, we compare, on a given instance, the schedules obtained with this MILP and with TSH. 
Examples of Pareto fronts produced by TSH
The aim of our first simulations is to produce Pareto fronts. Figures 10 and 11 show the Pareto fronts produced by TSH for a randomly generated Alg graph of 30 operations and a fully connected and homogeneous Arc graph of 3 and 4 processors, respectively; we have used the same random graph generator as in [12]. The nominal failure rate per time unit (i.e., the $\lambda_0$ of Eq. (11)) [...] The virtual grid of the Pareto front is defined such that both large and small values of $P_{obj}$ and $\Lambda_{obj}$ are covered within a reasonable grid size. Hence, the values of $P_{obj}$ and $\Lambda_{obj}$, from the least to the most constrained, are selected from two sets of values: $P_{obj} \in \{3.0, 2.8, 2.6, \ldots, 1.0\}$ and $\Lambda_{obj} \in \{8 \cdot 10^{-1}, 4 \cdot 10^{-1}, 8 \cdot 10^{-2}, \ldots, 4 \cdot 10^{-14}\}$. TSH being a heuristic, changing the parameters of this grid could change locally some points of the Pareto front, but not its overall shape.
Figures 10 and 11 connect the set of non-dominated Pareto optima (the surface obtained in this way is only depicted for a better visual understanding; by no means do we assume that points interpolated in this way are themselves Pareto optima, only the computed dots are). The figures show an increase of the schedule length for points with decreasing power consumptions and/or failure rates. The "cuts" observed at the top and the left of the plots are due to low power constraints and/or low failure rate constraints.
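Extracting the non-dominated points from a set of computed schedules can be sketched as follows. This is a generic dominance filter, given here as an illustration; it assumes that each point is a tuple (length, GSFR, power) and that all three criteria are to be minimized. The sample values are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every criterion and strictly
    better on at least one (all criteria are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (length, GSFR, power) points.
sample = [(10, 1e-3, 2.0), (12, 1e-3, 2.0), (15, 1e-5, 1.5)]
front = pareto_front(sample)
```

The second point is dominated by the first (same GSFR and power, longer schedule), so only the first and third points remain.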
Figures 10 and 11 expose to the designer a choice of several tradeoffs between the execution time, the power consumption, and the reliability level. For instance, in Fig. 11, we see that, to obtain a GSFR of 10^−10 with a power consumption of 1.5 W, we must accept a schedule three times longer than if we impose no constraint on the GSFR or the power. We also see that, by providing a 4-processor architecture (Fig. 11), we can obtain schedules with a shorter execution length than with only three processors, even though we impose identical constraints on the GSFR and the power.
Figure 12a shows how the schedule length varies as a function of the required power consumption, with Λ obj set to 10^−5. This curve is averaged over 30 randomly generated Alg graphs. We can see that the average schedule length increases when the constraint P obj on the power consumption decreases. This was expected since the two criteria, schedule length and power consumption, are antagonistic. Figure 12b shows how the schedule length varies as a function of the required GSFR, with P obj set to 2.5 W. Again, this curve is averaged over 30 randomly generated Alg graphs. We can see that the average schedule length increases when the constraint Λ obj on the GSFR decreases. Again, this is expected because the two criteria, schedule length and GSFR, are antagonistic.
Comparison with ECS
We have compared the performance of TSH with the algorithm proposed in [18], called ECS (Energy-Conscious Scheduling heuristic). ECS is a bicriteria scheduling heuristic that takes as input a DAG of tasks and a set of p fully connected, heterogeneous, DVFS-enabled processors. The power consumption model is the same as ours, but the energy consumed by an application does not take into account the energy consumed by the inter-task data-dependencies on the communication links. The cost function used by ECS sums two terms, one for the energy and one for the schedule length (aggregation method). Since ECS is not tricriteria, we proceed as follows:
1. We first invoke ECS on a given instance (an Alg graph and an Arc graph). We then compute the overall reliability R_ECS, the total energy E_ECS, the schedule length L_ECS, and the total utilization U_ECS of the schedule produced by ECS.
2. We use these values to compute the objectives required to run TSH: Λ_obj = −log(R_ECS)/U_ECS and P_obj = E_ECS/L_ECS.
3. Finally, we invoke TSH with these values of the objectives.
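The second step of this procedure amounts to two divisions, as the following Python sketch shows. It assumes that log denotes the natural logarithm, which is consistent with an exponential failure model; the function name and the numerical values are hypothetical.

```python
import math

def tsh_objectives(reliability, energy, length, utilization):
    """Derive the two TSH constraints from the metrics of a reference schedule.

    Returns (gsfr_obj, p_obj), assuming GSFR = -ln(R) / U and power = E / L.
    """
    gsfr_obj = -math.log(reliability) / utilization
    p_obj = energy / length
    return gsfr_obj, p_obj

# Hypothetical metrics of a schedule produced by the reference heuristic.
gsfr_obj, p_obj = tsh_objectives(
    reliability=math.exp(-0.002), energy=250.0, length=100.0, utilization=200.0
)
```

With these values, the GSFR objective is 0.002 / 200 = 10^−5 and the power objective is 250 / 100 = 2.5.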
We have plotted in Figs. 13, 14, and 15, respectively, the average schedule length, the average energy consumption, and the average reliability of the schedules computed by ECS and by TSH. The values have been averaged over 50 randomly generated Alg graphs of size N varying between 10 and 100 operations. The Arc graph has P = 6 processors, and the nominal failure rate per time unit of all the processors is λ_p = 10^−5; the nominal failure rate per time unit of all the links is λ_ℓ = 5 × 10^−4.
Our experimental results (Figs. 13, 14, and 15) show that TSH performs systematically better than ECS over the range of graph sizes tested.
MILP and TSH simulation results
For the evaluation of the MILP approach, we used an algorithm graph Alg of 5 operations and an architecture graph Arc consisting of 3 fully connected processors. The execution times of the operations were assigned randomly between 10 and 30 time units. In this simulation, we assumed that the communication links were reliable.
The nominal failure rate per time unit of the processors is λ_p = 10^−5. The set of processor frequencies is set to {0.25, [...] We have used the ILOG CPLEX solver [7], version 11.2, on an Intel Core-2 Duo CPU E7500 2.93 GHz computer with 2 GB of RAM. Even with an Alg graph of 7 operations, a run of the MILP can take more than 40 h without finding the optimal value. That is why we have limited the Alg graph to 5 operations and the Arc graph to 3 processors. The processing time of TSH is, as expected, much shorter than that of the MILP: on the order of 1 s for TSH versus between a few seconds and 40 min for the MILP.
The Pareto fronts generated by the MILP and by TSH are shown in Fig. 16, where the colored surface corresponds to the MILP results while the uncolored one corresponds to TSH. For small values of P obj and Λ obj (i.e., when the multicriteria problem is highly constrained), the TSH surface is significantly above the MILP one. For large values of P obj and Λ obj (i.e., when the multicriteria problem is not so constrained), the two surfaces almost coincide. The average overhead of the schedule length achieved by TSH versus the length achieved by the MILP is only 15.6 % (the exact approximation ratio is 1.1563051). This shows that, on this instance, TSH performs very well compared with the optimal result obtained by the MILP.
Related work
Many solutions exist in the literature to optimize the schedule length and the energy consumption (e.g., [23]), or to optimize the schedule length and the reliability (e.g., [4, 8, 13]), but very few tackle the problem of optimizing the three criteria (length, reliability, energy). The closest to our work are [24, 30].
Zhu et al. have studied the impact of the supply voltage on the failure rate [30] in a passive redundancy framework (primary backup approach). They use DVFS to lower the energy consumption and they study the tradeoff between the energy consumption and the "performability" (defined as the probability of finishing the application correctly within its deadline in the presence of faults). A lower frequency implies a higher execution time and, therefore, less slack time for scheduling backup replicas, meaning a lower performability. However, their input problem is not a multiprocessor scheduling one since they study the system as a single monolithic operation executed on a single processor. Thanks to this simpler setting, they are able to provide an analytical solution based on the probability of failure, the WCET, the voltage, and the frequency.
Pop et al. have addressed the (length, reliability, energy) tricriteria optimization problem on a heterogeneous architecture [24]. Both length and reliability are taken as constraints, with a given upper and lower bound, respectively.
First, these two criteria are not invariant measures, and we have demonstrated in Sect. 2 that such a method cannot always guarantee that the constraints are met. Indeed, their experimental results show that the reliability decreases with the number of processors, therefore making it impossible to meet an arbitrary reliability constraint. Second, they assume that the user will specify the number of processor failures to be tolerated to satisfy the desired reliability constraint. Third, they assume that all the communications take place through a reliable bus. For these three reasons, it is not possible to compare TSH with their method.
Conclusion
We have presented a new off-line tricriteria scheduling heuristic, called TSH, which takes as input an application graph (a DAG of operations) and a multiprocessor architecture (homogeneous and fully connected), and produces a static multiprocessor schedule that optimizes three criteria: its length, its global system failure rate (GSFR), and its power consumption. TSH uses the active replication of the operations and the data-dependencies to increase the reliability, and uses DVFS to lower the power consumption.
Since the three criteria of this optimization problem are antagonistic to each other, there is no best solution in general. That is why we use the notion of Pareto optima. To address this issue, both the power and the GSFR are taken as constraints, and TSH attempts to minimize the schedule length while satisfying these constraints. By running TSH with several values of these constraints, we are able to produce a set of non-dominated Pareto solutions, the Pareto front, which is a surface in the 3D space (length, GSFR, power). This surface exposes the existing tradeoffs between the three antagonistic criteria, allowing users to choose the solution that best meets their application needs.
Transforming two criteria into constraints and minimizing the third criterion is a natural approach to produce Pareto fronts. However, some care must be taken when doing so. As we have demonstrated, each criterion that is transformed into a constraint must be an invariant measure of the schedule, not a varying one. For this reason, the two constraints imposed on TSH are the power consumption (instead of the energy consumption) and the global system failure rate (instead of the reliability).
TSH is an extension of our previous bicriteria (length, reliability) heuristic BSH [12] . Studying the three criteria together makes sense because of the impact of the voltage on the failure probability. Indeed, lower voltage leads to smaller critical energy, and hence the system becomes sensitive to lower energy particles. As a result, the fault probability increases both due to the longer execution time and to the lower energy.
To the best of our knowledge, this is the first reported method that allows the user to produce the Pareto front in the 3D space (length,GSFR,power). This advance comes at the price of several assumptions: the architecture is assumed to be homogeneous and fully connected, the processors are assumed to be fail-silent, their failures are assumed to be statistically independent, the power switching time is neglected, and the failure model is assumed to be exponential. In the future, we shall work on relaxing those assumptions.
