Synthesis of Fault-Tolerant Embedded Systems by Eles, Petru et al.
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
Synthesis of Fault-Tolerant Embedded Systems
Eles, Petru; Izosimov, Viacheslav; Pop, Paul; Peng, Zebo
Published in:
Design, Automation, and Test in Europe Conference
Link to article, DOI:
10.1109/DATE.2008.4484825
Publication date:
2008
Document Version
Publisher's PDF, also known as Version of record
Link back to DTU Orbit
Citation (APA):
Eles, P., Izosimov, V., Pop, P., & Peng, Z. (2008). Synthesis of Fault-Tolerant Embedded Systems. In Design,
Automation, and Test in Europe Conference (pp. 1117-1122). DOI: 10.1109/DATE.2008.4484825
Abstract
This work addresses the issue of design optimization for fault-
tolerant hard real-time systems. In particular, our focus is on the
handling of transient faults using both checkpointing with roll-
back recovery and active replication. Fault tolerant schedules
are generated based on a conditional process graph representa-
tion. The formulated system synthesis approaches decide the as-
signment of fault-tolerance policies to processes, the optimal
placement of checkpoints and the mapping of processes to pro-
cessors, such that multiple transient faults are tolerated, trans-
parency requirements are considered, and the timing constraints
of the application are satisfied.
1. Introduction
Safety-critical applications have to function correctly and meet
their timing constraints even in the presence of faults. Such
faults can be permanent (e.g., damaged microcontrollers or
communication links), transient (e.g., caused by electromagnetic
interference), or intermittent (appearing and disappearing repeatedly).
Transient faults are the most common, and their number is
continuously increasing due to high complexity, smaller transistor
sizes, higher operational frequency, and lower voltage levels [5].
The rate of transient faults is often much higher than that of
permanent faults. Transient-to-permanent fault ratios
can vary between 2:1 and 100:1 or higher [22].
From the fault tolerance point of view, transient faults and in-
termittent faults manifest themselves in a similar manner: they
happen for a short time and then disappear without causing
permanent damage. Hence, fault tolerance techniques against
transient faults are also applicable for tolerating intermittent
faults and vice versa. Therefore, in this paper, we will refer to
both types of faults as transient faults and we will talk about
fault tolerance against transient faults, meaning tolerating both
transient and intermittent faults.
Traditionally, hardware replication was used as a fault-tolerance
technique against transient faults [21]. However, such solutions
are very costly, in particular with an increasing number of
transient faults to be tolerated.
In order to reduce cost, other techniques are required such as
software replication [3, 28], recovery with checkpointing [18,
27, 29], and re-execution [19]. However, if applied in a straight-
forward manner to an existing design, these techniques intro-
duce significant time overheads, which can lead to
unschedulable solutions. On the other hand, using faster compo-
nents or a larger number of resources may not be affordable due
to cost constraints. Therefore, efficient design optimization tech-
niques are required in order to meet time and cost constraints in
the context of fault tolerant systems.
Transient faults are also common for communication chan-
nels, even though, in this paper, we do not deal with them explic-
itly. Fault tolerance against multiple transient faults affecting
communications has been studied and solutions such as a cyclic
redundancy code (CRC) are implemented in communication
protocols available on the market [10, 23].
Researchers have shown that schedulability of an application
can be guaranteed for preemptive on-line scheduling under the
presence of a single transient fault [1, 2, 12].
Liberato et al. [24] propose an approach for design optimiza-
tion of monoprocessor systems in the presence of multiple tran-
sient faults and in the context of preemptive earliest-deadline-
first (EDF) scheduling.
Hardware/software co-synthesis with fault tolerance is ad-
dressed in [6] where the minimum amount of additional hard-
ware is determined in order to achieve a certain level of
dependability. Xie et al. [28] propose a technique to decide how
replicas can be selectively inserted into the application, based on
process criticality. Introducing redundant processes into a pre-
designed schedule is used in [4] in order to improve error detec-
tion. The above approaches only consider one single fault.
Kandasamy et al. [19] propose constructive mapping and
scheduling algorithms for transparent re-execution on multipro-
cessor systems. The work was later extended with fault-tolerant
transmission of messages on a time-division multiple access bus
[20]. Both papers consider only one fault per computation node.
Only process re-execution is used as a fault-tolerance policy.
Very little research work is devoted to global system optimization
in the context of fault tolerance. For example, Pinello et al.
[25] propose a heuristic for combining several static schedules
in order to mask fault patterns. Multiple failures are addressed
with active replication in [11] in order to guarantee a required
level of fault tolerance and satisfy time constraints.
None of the previous work has considered fault-tolerance pol-
icies in the global context of system-level design for distributed
embedded systems. Thus, we consider hard real-time safety-crit-
ical applications mapped on distributed embedded systems.
Both the processes and the messages are scheduled using non-
preemptive quasi-static cyclic scheduling. We consider two dis-
tinct fault-tolerance techniques: process-level checkpointing
with rollback recovery [9], which provides time-redundancy,
and active replication [26], which provides space-redundancy.
The main aspects of the work discussed here are:
• a quasi-static cyclic scheduling framework to schedule pro-
cesses and messages, that can handle transparency/perfor-
mance trade-offs imposed by the designer;
• mapping and fault tolerance policy assignment strategies for
mapping of processes to computation nodes and assignment of
a proper combination of fault tolerance techniques to pro-
cesses, such that performance is maximized;
• an approach to the optimization of checkpoint distribution in
rollback recovery.
Authors and affiliations:
Petru Eles (1), Viacheslav Izosimov (1), Paul Pop (2), Zebo Peng (1)
(1) {petel|viaiz|zebpe}@ida.liu.se, Dept. of Computer and Information Science, Linköping University, SE–581 83 Linköping, Sweden
(2) Paul.Pop@imm.dtu.dk, Dept. of Informatics and Mathematical Modelling, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
978-3-9810801-3-1/DATE08 © 2008 EDAA

2. System Architecture and Fault Model
We consider architectures composed of a set N of nodes which
share a broadcast communication channel. Every node Ni ∈ N
consists, among others, of a communication controller and a
CPU. The communications are scheduled statically based on
schedule tables, and are fault-tolerant, using a TDMA based pro-
tocol, such as the Time Triggered Protocol (TTP) [23].
In this work we are interested in fault-tolerance techniques for
transient faults. If permanent faults occur, we consider that they
are handled using a technique such as hardware replication. Note
that an architecture that tolerates n permanent faults, will also
tolerate n transient faults. However, we are interested in tolerat-
ing a much larger number of transient faults than permanent
ones, for which using hardware replication alone is too costly.
We have generalized the fault model from [19], which assumes
that only a single transient fault may occur on any node during
an application execution. In our model, we consider that at most a
given number k of transient faults (footnote 1) may occur anywhere in
the system during one operation cycle of the application. Thus, not only
may several transient faults occur simultaneously on several processors,
but several faults may also occur on the same processor.
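
To make the fault model concrete, the following sketch enumerates the fault scenarios it admits. It is only an illustration: attributing each fault to a process (rather than to a node or to a point in time), the function name fault_scenarios, and the four-process example are assumptions that do not come from the paper.

```python
from itertools import combinations_with_replacement
from math import comb


def fault_scenarios(processes, k):
    """Yield every way of placing at most k transient faults on the processes.

    A scenario is a tuple naming the process hit by each fault. A process may
    appear several times, because several faults may hit the same process.
    """
    for faults in range(k + 1):
        yield from combinations_with_replacement(processes, faults)


# Hypothetical example: four processes, at most k = 2 faults per cycle.
scenarios = list(fault_scenarios(["P1", "P2", "P3", "P4"], k=2))

# Multisets of size 0, 1 and 2 over four processes: 1 + 4 + 10 scenarios.
assert len(scenarios) == comb(4 + 2, 2) == 15
```

Under this assumption the number of scenarios for n processes is C(n + k, k), which already hints at why the number of alternative schedules grows quickly with k.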
3. Fault Tolerance Techniques
The error detection and fault-tolerance mechanisms are part of
the software architecture. The software architecture, including
the real-time kernel and the error detection and fault-tolerance
mechanisms, is itself fault-tolerant.
We use two mechanisms for tolerating faults: equidistant
checkpointing with rollback recovery and active replication.
Once a fault is detected, a fault tolerance mechanism has to be
invoked to handle this fault. The simplest fault tolerance tech-
nique to recover from fault occurrences is re-execution [19]. In
re-execution, a process is executed again if affected by faults.
The time needed for the detection of faults is accounted for by
the error-detection overhead α. When a process is re-executed
after a fault was detected, the system restores all initial inputs of
that process. Restoring these inputs takes some time, which is
captured by the recovery overhead µ.
3.1 Rollback Recovery with Checkpointing
The time overhead for re-execution can be reduced with more
complex fault tolerance techniques such as rollback recovery
with checkpointing [27, 29]. The basic principle of this tech-
nique is to restore the last non-faulty state of the failing process,
i.e., to recover from faults. The last non-faulty state, or check-
point, has to be saved in advance in the static memory and will
be restored if the process fails. The part of the process between
two checkpoints or between a checkpoint and the end of the
process is called an execution segment.
An example of rollback recovery with checkpointing is presented
in Fig. 1. We consider process P1 with the worst-case execution
time of 60 ms and error-detection overhead α of 10 ms,
as depicted in Fig. 1a. Fig. 1b presents the execution of P1 in
case no fault occurs, while Fig. 1c shows a scenario where a fault
(depicted with a lightning bolt) affects P1. In Fig. 1b, two
checkpoints are inserted at equal intervals. The first checkpoint is the
initial state of process P1. The second checkpoint, placed in the
middle of process execution, is for storing an intermediate process
state. Thus, process P1 is composed of two execution segments.
We will name the k-th execution segment of process Pi as
$P_i^k$. Accordingly, the first execution segment of process P1 is
$P_1^1$ and its second segment is $P_1^2$. Saving process states,
including saving initial inputs, at checkpoints, takes an amount of time
that is considered in the checkpointing overhead χ, depicted as a
black rectangle. In Fig. 1c, a fault affects the second execution
segment $P_1^2$ of process P1. This faulty segment is executed again
starting from the second checkpoint. Note that the error-detection
overhead α is not considered in the last recovery in the context
of rollback recovery with checkpointing because, in this
example, we assume that a maximum of one fault can happen.
We will denote the j-th execution of the k-th execution segment
of process Pi as $P_{i/j}^k$. Accordingly, the first execution of
execution segment $P_1^2$ has the name $P_{1/1}^2$ and its second
execution is named $P_{1/2}^2$. Note that we will not use the index j if we
only have one execution of a segment or a process, as, for example,
P1’s first execution segment $P_1^1$ in Fig. 1c.
When recovering, similar to re-execution, we consider a re-
covery overhead µ, which includes the time needed to restore
checkpoints. In Fig. 1c, the recovery overhead µ, depicted with
a light gray rectangle, is 10 ms for process P1.
The fact that only a part of a process has to be restarted for tol-
erating faults, not the whole process, can considerably reduce
the time overhead of rollback recovery with checkpointing com-
pared to simple re-execution. Simple re-execution is a particular
case of rollback recovery with checkpointing, in which a single
checkpoint is applied, at process activation.
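
The trade-off between checkpointing overhead and recovery time can be illustrated with a small calculation. The cost model below is an assumption built from the overheads defined above (α after every segment, χ for every checkpoint, one segment plus µ re-run per fault, α skipped in the last recovery); it is a sketch, not the formula used by the authors, and the function names are hypothetical.

```python
def worst_case_time(c, alpha, chi, mu, n, k):
    """Worst-case time of one process with n equidistant checkpoints and k faults.

    Assumed model: every segment pays error detection (alpha) and every
    checkpoint pays chi; each fault re-runs one segment (c / n) after a
    recovery (mu); error detection is skipped in the last recovery.
    """
    no_fault = c + n * (alpha + chi)
    recoveries = k * (c / n + mu) + max(k - 1, 0) * alpha
    return no_fault + recoveries


def best_checkpoint_count(c, alpha, chi, mu, k, max_n=20):
    """Exhaustively search for the checkpoint count with the smallest worst case."""
    return min(range(1, max_n + 1),
               key=lambda n: worst_case_time(c, alpha, chi, mu, n, k))


# Parameters of process P1 in Fig. 1 (C1 = 60, alpha = 10, chi = 5, mu = 10 ms).
re_execution = worst_case_time(60, 10, 5, 10, n=1, k=1)    # single checkpoint
two_checkpoints = worst_case_time(60, 10, 5, 10, n=2, k=1)
assert two_checkpoints < re_execution
```

With these assumed costs, two checkpoints are the best choice for one fault, which is consistent with the two checkpoints used in Fig. 1; more checkpoints add more overhead than they save in recovery time.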
3.2 Active and Passive Replication
The disadvantage of recovery techniques is that they are unable
to exploit spare capacity of available computation nodes and, by
this, to possibly reduce the schedule length. In contrast to
rollback recovery and re-execution, active and passive replication
techniques can utilize spare capacity of other computation
nodes. Moreover, active replication provides the possibility of
spatial redundancy, i.e., the ability to execute process replicas in
parallel on different computation nodes.
In the case of active replication, all replicas are executed inde-
pendently of fault occurrences. In the case of passive replication,
also known as primary-backup, on the other hand, replicas are
executed only if faults occur. In Fig. 2 we illustrate primary-
backup and active replication. We consider process P1 with the
worst-case execution time of 60 ms and error-detection overhead
α of 10 ms, see Fig. 2a. Process P1 will be replicated on two
computation nodes N1 and N2, which is enough to tolerate a sin-
gle fault. We will name the j-th replica of process Pi as Pi(j).
Figure 1. Rollback Recovery with Checkpointing: a) process P1 with C1 = 60 ms and α = 10 ms; b) execution with two checkpoints and no fault; c) execution with a fault in the second execution segment (χ = 5 ms, µ = 10 ms).

Figure 2. Active Replication (b) and Primary-Backup (c): a) process P1 with C1 = 60 ms and α = 10 ms; b1), c1) replicas P1(1) and P1(2) on nodes N1 and N2 without faults; b2), c2) the same with a fault.

Footnote 1 (Section 2, number of faults k): The number of faults k can be larger than the number of processors in the system.

In the case of active replication, illustrated in Fig. 2b, replicas
P1(1) and P1(2) are executed in parallel, which, in this case,
improves system performance. However, active replication
occupies more resources compared to primary-backup because P1(1)
and P1(2) have to run even if there is no fault, as shown in
Fig. 2b1. In the case of primary-backup (Fig. 2c), the “backup”
replica P1(2) is activated only if a fault occurs in P1(1). However,
if faults occur, primary-backup takes more time to complete,
compared to active replication, as shown in Fig. 2c2 and Fig. 2b2.
In our work, we are interested in active replication. This type
of replication provides the possibility of spatial redundancy,
which is lacking in rollback recovery. Moreover, rollback recovery
with a single checkpoint is, in fact, a restricted case of
primary-backup where replicas are only allowed to execute on the
same computation node as the original process.
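
The time/resource trade-off between the two replication schemes can be sketched numerically for a single fault. The model is deliberately simplified and is an assumption: both replicas have the same worst-case execution time, the backup starts exactly when the fault in the primary is detected, and communication and activation delays are ignored. The function names are hypothetical and the values are not read from Fig. 2.

```python
def active_replication(c, alpha):
    """Both replicas always run in parallel on different nodes.

    Returns (completion time, total processor time used). One fault leaves
    the other replica intact, so the completion time does not change.
    """
    return c + alpha, 2 * (c + alpha)


def primary_backup(c, alpha, fault):
    """The backup runs only after a fault in the primary has been detected.

    Returns (completion time, total processor time used).
    """
    primary = c + alpha
    if not fault:
        return primary, primary
    return 2 * primary, 2 * primary


# Process P1 of Fig. 2: C1 = 60 ms, alpha = 10 ms.
assert active_replication(60, 10) == (70, 140)
assert primary_backup(60, 10, fault=False) == (70, 70)
assert primary_backup(60, 10, fault=True) == (140, 140)
```

Under these assumptions active replication finishes earlier when a fault occurs, while primary-backup uses half the processor time when no fault occurs.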
3.3 Transparency
Tolerating transient faults leads to many alternative execution
scenarios, which are dynamically adjusted in the case of fault
occurrences. The number of execution scenarios grows expo-
nentially with the number of processes and the number of toler-
ated transient faults. In order to debug, test, or verify the system,
all its execution scenarios have to be taken into account. There-
fore, debugging, verification and testing become very difficult.
A possible solution to this problem is transparency.
Originally, Kandasamy et al. [19] proposed transparent re-execution,
where recovering from a transient fault on one computation
node is hidden from other nodes. In our work we apply a
more flexible notion of transparency by allowing the designer to
declare arbitrary processes and messages as frozen (see
Section 4). Transparency has the advantage of fault containment
and increased debuggability. Since the occurrence of faults in
certain processes does not affect the execution of other processes, the
total number of execution scenarios is reduced. Therefore, fewer
execution alternatives have to be considered during
debugging, testing, and verification. However, transparency can
increase the worst-case delay of processes, reducing the
performance of the embedded system [14, 16].
4. Application Model
We consider a set of real-time periodic applications Ak. Each ap-
plication Ak is represented as an acyclic directed graph
Gk(Vk,Ek). Each process graph is executed with the period Tk.
The graphs are merged into a single graph with a period T ob-
tained as the least common multiple (LCM) of all application pe-
riods Tk. This graph corresponds to a virtual application A,
captured as a directed, acyclic graph G(V, E). Each node Pi ∈V
represents a process and each edge eij ∈ E from Pi to Pj indi-
cates that the output of Pi is the input of Pj.
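
As a small illustration of the merging step, the period of the merged graph and the number of instances of each application inside it can be computed as follows. The application names and periods are hypothetical; math.lcm requires Python 3.9 or later.

```python
from math import lcm


def merge_periods(periods):
    """Return the period T of the merged graph and the instances per application.

    periods maps an application name to its period Tk. Each application has
    to be repeated T / Tk times inside one period of the merged graph.
    """
    hyper_period = lcm(*periods.values())
    instances = {app: hyper_period // t for app, t in periods.items()}
    return hyper_period, instances


# Hypothetical periods in ms for three applications.
T, instances = merge_periods({"A1": 20, "A2": 30, "A3": 50})
assert T == 300
assert instances == {"A1": 15, "A2": 10, "A3": 6}
```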
Processes are non-preemptable. They send their output values
encapsulated in messages, when completed. All required inputs
have to arrive before activation of the process. Fig. 3a shows an
application represented as a graph composed of five nodes.
Time constraints are imposed with a global hard deadline D,
at which the application A has to complete. Some processes may
also have local deadlines dlocal.
The mapping of an application process is determined by a
function M: V → N, where N is the set of nodes in the architec-
ture. The mapping will be determined as part of the design opti-
mization. For a process Pi, M(Pi) is the node to which Pi is
assigned for execution. Each process can potentially be mapped
on several nodes. Let NPi ⊆ N be the set of nodes to which Pi
can potentially be mapped. We consider that for each Nk ∈ NPi,
we know the worst-case execution time (WCET) $C_{P_i}^{N_k}$ of process
Pi, when executed on Nk.
Fig. 3c shows the worst-case execution times of processes of
the application depicted in Fig. 3a when executed on the archi-
tecture in Fig. 3b. For example, P2 has the worst-case execution
time of 40 ms if mapped on computation node N1 and 60 ms if
mapped on node N2. By “X” we show mapping restrictions. For
example, process P3 cannot be mapped on computation node N2.
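
One possible way to represent such a WCET table with mapping restrictions is sketched below. The rows for P2 and P3 follow the values quoted in the text; the rows for P1, P4 and P5 are read from Fig. 3c and should be treated as assumptions. The helper names are hypothetical.

```python
# Worst-case execution times in ms; None plays the role of "X" in Fig. 3c.
WCET = {
    "P1": {"N1": 20, "N2": 30},   # assumed, read from the figure
    "P2": {"N1": 40, "N2": 60},
    "P3": {"N1": 60, "N2": None},
    "P4": {"N1": 40, "N2": 60},   # assumed, read from the figure
    "P5": {"N1": 40, "N2": 60},   # assumed, read from the figure
}


def candidate_nodes(process):
    """Nodes on which the process may be mapped (the set NPi)."""
    return [node for node, time in WCET[process].items() if time is not None]


def wcet(process, node):
    """WCET of the process on the node; raises if the mapping is restricted."""
    time = WCET[process][node]
    if time is None:
        raise ValueError(f"{process} cannot be mapped on {node}")
    return time


assert candidate_nodes("P3") == ["N1"]
assert wcet("P2", "N1") == 40 and wcet("P2", "N2") == 60
```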
In the case of processes mapped on the same node, message
transmission time between them is accounted for in the worst-
case execution time of the sending process. If processes are
mapped on different nodes, then messages between them are
sent through the communication network. We consider that the
worst-case size of messages is given, which, implicitly, can be
translated into a worst-case transmission time on the bus. 
Figure 3. A Simple Application and a Hardware Architecture: a) application graph with processes P1 to P5; b) architecture with nodes N1 and N2; c) worst-case execution times in ms, where "X" marks a mapping restriction:

WCET   N1   N2
P1     20   30
P2     40   60
P3     60   X
P4     40   60
P5     40   60

The combination of fault-tolerance policies to be applied to
each process (Fig. 4) is given by four functions:
• P: V → {Replication, Checkpointing, Replication & Checkpointing}
determines whether a process is replicated, checkpointed, or
replicated and checkpointed.
• The function Q: V → ℕ indicates the number of replicas for
each process. For a certain process Pi, and considering k the
maximum number of faults, if P(Pi) = Replication, then Q(Pi)
= k; if P(Pi) = Checkpointing, then Q(Pi) = 0; if P(Pi) =
Replication & Checkpointing, then 0 < Q(Pi) < k.
• Let VR be the set of replica processes introduced into the
application. Replicas can be also checkpointed, if necessary.
Function R: V ∪ VR → ℕ determines the number of possible
recoveries for each process or replica. In Fig. 4a, P(P1) =
Checkpointing, R(P1) = 2. In Fig. 4b, P(P1) = Replication,
R(P1(1)) = R(P1(2)) = R(P1(3)) = 0. In Fig. 4c, P(P1) =
Replication & Checkpointing, R(P1(1)) = 0 and R(P1(2)) = 1.
• Function X: V ∪ VR → ℕ indicates the number of checkpoints
to be applied to processes in the application and the replicas in
VR. We consider equidistant checkpointing, thus the checkpoints
are equally distributed throughout the execution time of
the process. If process Pi ∈ V or replica Pi(j) ∈ VR is not
checkpointed, then we have X(Pi) = 0 or X(Pi(j)) = 0, respectively.

Figure 4. Policy Assignment: Checkpointing and Replication, for process P1 with C1 = 30 ms, α1 = 5 ms, χ1 = 5 ms, µ1 = 5 ms and k = 2: a) checkpointing on node N1; b) replication, with P1(1), P1(2) and P1(3) on nodes N1, N2 and N3; c) replication and checkpointing, with P1(1) on N1 and P1(2) on N2.
Each process Pi ∈ V , besides its worst-case execution time Ci,
is characterized by an error detection overhead αi, a recovery
overhead µi, and checkpointing overhead χi.
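
The constraint that links the policy function P to the number of replicas Q can be checked mechanically. The sketch below is one possible encoding; the class and function names are hypothetical, and only the rule stated for Q above is implemented.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    REPLICATION = "Replication"
    CHECKPOINTING = "Checkpointing"
    BOTH = "Replication & Checkpointing"


@dataclass(frozen=True)
class Assignment:
    policy: Policy   # P(Pi)
    replicas: int    # Q(Pi)


def replicas_consistent(assignment, k):
    """Check Q(Pi) against P(Pi) for a maximum of k faults."""
    if assignment.policy is Policy.REPLICATION:
        return assignment.replicas == k
    if assignment.policy is Policy.CHECKPOINTING:
        return assignment.replicas == 0
    return 0 < assignment.replicas < k


# The three cases of Fig. 4 with k = 2.
assert replicas_consistent(Assignment(Policy.CHECKPOINTING, 0), k=2)
assert replicas_consistent(Assignment(Policy.REPLICATION, 2), k=2)
assert replicas_consistent(Assignment(Policy.BOTH, 1), k=2)
```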
The transparency requirements imposed by the user are captured
by a function T: V → {frozen, not_frozen}, where vi ∈ V is
a node in the application graph, which can be either a process or a
communication message. In a fully transparent system, all messages
and processes are frozen. If T(vi) = frozen, our scheduling algorithm
will handle this transparency requirement by allocating the same start
time for vi in all the alternative fault-tolerant schedules of application A.
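
The frozen requirement can be expressed as a simple check over a set of alternative schedules. The sketch below assumes that each alternative schedule is available as a mapping from node name to start time; this representation, the function name, and the example start times are illustrative assumptions.

```python
def transparency_violations(schedules, frozen):
    """Return the frozen nodes whose start time differs between schedules.

    schedules: list of dicts mapping a process or message name to its start
               time in one alternative fault scenario
    frozen:    names vi with T(vi) = frozen
    """
    violations = set()
    for node in frozen:
        starts = {s[node] for s in schedules if node in s}
        if len(starts) > 1:
            violations.add(node)
    return violations


# Two hypothetical scenarios: without faults, and with one fault in P1.
no_fault = {"P2": 30, "m2": 105, "m3": 120, "P3": 136}
fault_in_p1 = {"P2": 65, "m2": 105, "m3": 120, "P3": 136}

assert transparency_violations([no_fault, fault_in_p1], {"P3", "m2", "m3"}) == set()
assert transparency_violations([no_fault, fault_in_p1], {"P2"}) == {"P2"}
```

P2 is allowed to start at different times because it is not frozen; declaring it frozen would force the scheduler to delay it to a common start time in every scenario.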
5. Fault Tolerant Schedules
Our approach to the generation of fault-tolerant system schedules
is based on the fault-tolerant conditional process graph (FT-CPG)
representation, an application of Conditional Process
Graphs [7, 8]. The final schedules are produced as a set of schedule
tables that capture the alternative execution scenarios
corresponding to possible fault occurrences.
5.1 Fault Tolerant Conditional Process Graph
An FT-CPG captures alternative execution scenarios in the case of
possible fault occurrences. A fault occurrence is captured as a
condition, which is true if the fault happens and false otherwise.
An FT-CPG is a directed acyclic graph G(VP∪VC∪VT, ES∪EC).
We denote a node in the FT-CPG with $P_i^m$, which corresponds to
the m-th copy of process Pi ∈ A. A node $P_i^m$ ∈ VP with simple edges
at the output is a regular node. A node $P_i^m$ ∈ VC with conditional
edges at the output is a conditional process that produces a condition.
A node $v_i$ ∈ VT is a synchronization node and represents the
synchronization point corresponding to a frozen process or message
(i.e., T(vi) = frozen). We denote with $P_i^S$ the synchronization
node of process Pi ∈ A and with $m_i^S$ the synchronization node of
message mi ∈ A. Synchronization nodes take zero time to execute.
ES and EC are the sets of simple and conditional edges, respectively.
An edge $e_{ij}^{mn}$ ∈ ES from $P_i^m$ to $P_j^n$ indicates that the
output of $P_i^m$ is the input of $P_j^n$. Synchronization nodes $P_i^S$ and
$m_i^S$ are also connected through edges to regular and conditional
processes and other synchronization nodes.
Edges $e_{ij}^{mn}$ ∈ EC are conditional edges and have an associated
condition value. The condition value produced is “true” (denoted
with $F_{P_i^m}$) if $P_i^m$ experiences a fault, and “false” (denoted
with $\overline{F}_{P_i^m}$) if $P_i^m$ does not experience a fault. Alternative paths
starting from such a process, which correspond to complementary
values of the condition, are disjoint (footnote 1). Regular and conditional
processes are activated when all their inputs have arrived. A
synchronization node can be activated after inputs coming on
one of the alternative paths, corresponding to a particular fault
scenario, have arrived.
Fig. 5a depicts an application A modelled as a process graph
G, which can experience at most two faults (for example, as in
the figure, during the execution of processes P2 and P4).
Transparency requirements are depicted with rectangles on the
application graph, where process P3, message m2 and message m3 are
set to be frozen. For scheduling purposes we will convert the
application A to a fault-tolerant conditional process graph (FT-CPG)
G, represented in Fig. 5b. In an FT-CPG the fault occurrence
information is represented as conditional edges and the
frozen processes/messages are captured using synchronization
nodes. One of the conditional edges is $P_1^1$ to $P_4^1$ in Fig. 5b, with
the associated condition $\overline{F}_{P_1^1}$ denoting that $P_1^1$ has no faults.
Message transmission on conditional edges takes place only if
the associated condition is satisfied.
The FT-CPG in Fig. 5b captures all the fault scenarios that can
happen during the execution of application A in Fig. 5a. The subgraph
marked with thicker edges and shaded nodes in Fig. 5b
captures the execution scenario when processes P2 and P4
experience one fault each. The fault scenario for a given process
execution, for example $P_4^1$, the first execution of P4, is captured by
the conditional edges $F_{P_4^1}$ (fault) and $\overline{F}_{P_4^1}$ (no-fault). The
transparency requirement that, for example, P3 has to be frozen, is
captured by the synchronization node $P_3^S$ inserted before the
conditional edge with copies of process P3. In Fig. 5b, process
$P_1^1$ is a conditional process because it “produces” condition
$F_{P_1^1}$, while $P_1^3$ is a regular process.
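
How quickly the number of process copies grows can be illustrated by counting the copies generated for a chain of processes. The sketch below is a simplification made for illustration only: it assumes plain re-execution of whole processes, no frozen nodes, and one copy per distinct fault history; the function name is hypothetical and this is not the authors' graph construction algorithm.

```python
def copies_in_chain(num_processes, k):
    """Count FT-CPG copies of each process in a chain P1 -> P2 -> ...

    Each execution either succeeds (the next process starts) or fails (the
    same process is re-executed), and at most k failures fit in one cycle.
    """
    counts = [0] * num_processes

    def visit(index, faults_so_far):
        if index == num_processes:
            return
        counts[index] += 1
        visit(index + 1, faults_so_far)        # no-fault edge
        if faults_so_far < k:
            visit(index, faults_so_far + 1)    # fault edge: re-execution

    visit(0, 0)
    return counts


# Two processes in sequence, at most k = 2 faults.
assert copies_in_chain(2, k=2) == [3, 6]
```

For k = 2 this gives three copies of the first process and six of its successor, which agrees with the copy indices of P1 (up to $P_1^3$) and P2 (up to $P_2^6$) that appear in Fig. 5b and Fig. 6. In the last copy of the first process no further fault can occur, so it has no fault edge, matching the regular node $P_1^3$.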
5.2 Schedule Table
The output produced by the FT-CPG scheduling algorithm is a
schedule table that contains all the information needed for a distrib-
uted run time scheduler to take decisions on activation of pro-
cesses. It is considered that, during execution, a non-preemptive
scheduler located in each node decides on process and commu-
nication activation depending on the actual values of conditions. 
Figure 5. Fault Tolerant Conditional Process Graph: a) application A with processes P1 to P4 and messages m1 to m3; b) the corresponding FT-CPG.

Footnote 1 (Section 5.1, disjoint alternative paths): They can only meet in a synchronization node.

Only one part of the table has to be stored in each node, namely,
the part concerning decisions that are taken by the corresponding
scheduler. Fig. 6 presents the schedules for the nodes
N1 and N2 produced for the FT-CPG in Fig. 5. In each table there
is one row for each process and message from application A. A
row contains activation times corresponding to different values
of conditions. In addition, there is one row for each condition
whose value has to be broadcast to other computation nodes.
Each column in the table is headed by a logical expression
constructed as a conjunction of condition values. Activation times in
a given column represent starting times of the processes and
transmission of messages when the respective expression is true.
According to the schedule for node N1, process P1 is activated
unconditionally at the time 0, given in the first column of the
table. Activation of the rest of the processes, in a certain execution
cycle, depends on the values of the conditions, i.e., the
unpredictable occurrence of faults during the execution of certain
processes. For example, process P2 has to be activated at t = 30 if
$\overline{F}_{P_1^1}$ is true, at t = 100 if $F_{P_1^1} \wedge F_{P_1^2}$ is true, etc.
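
The way a run-time scheduler might consult one row of such a table can be sketched as follows. The representation of a column heading as a dictionary of required condition values, the condition names, and the function name are assumptions made for illustration. The first two entries of the P2 row follow the example in the text; the third entry is taken from Fig. 6 and its column heading is an assumption.

```python
def activation_time(row, known):
    """Return the activation time whose column expression is satisfied, else None.

    row:   list of (expression, time); an expression maps a condition name
           to the value it must have (True means the fault occurred)
    known: condition values broadcast so far
    """
    for expression, time in row:
        if all(known.get(cond) is value for cond, value in expression.items()):
            return time
    return None


P2_ROW = [
    ({"F_P1^1": False}, 30),
    ({"F_P1^1": True, "F_P1^2": True}, 100),
    ({"F_P1^1": True, "F_P1^2": False}, 65),   # column heading assumed
]

assert activation_time(P2_ROW, {"F_P1^1": False}) == 30
assert activation_time(P2_ROW, {"F_P1^1": True}) is None   # must wait for F_P1^2
assert activation_time(P2_ROW, {"F_P1^1": True, "F_P1^2": True}) == 100
```

A result of None means that the scheduler cannot decide yet, because a condition in every remaining column is still unknown.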
At a certain moment during the execution, when the values of
some conditions are already known, they have to be used to take
the best possible decisions on process activations. Therefore,
after the termination of a process that produces a condition, the
value of the condition is broadcast from the corresponding
computation node to all other computation nodes. This broadcast
is scheduled as soon as possible on the communication channel,
and is considered together with the scheduling of the messages.
In [13, 14, 17] we have presented several algorithms for the
synthesis of fault tolerant schedules. They allow for various
trade-offs between the worst-case schedule length, the size
of the schedule tables, the degree of transparency, and the
duration of the schedule generation procedure.
6. Fault Tolerant System Design
By policy assignment we denote the decision whether a certain
process should be checkpointed or replicated, or a combination
of the two should be used. Mapping a process means placing it
on a particular node in the architecture.
There are cases when the policy assignment decision is taken
based on the experience of the designer, considering aspects like
the functionality implemented by the process, the required level
of reliability, hardness of the constraints, legacy constraints, etc.
Many processes, however, do not exhibit particular features or
requirements which obviously lead to checkpointing or replica-
tion. Decisions concerning the policy assignment for these pro-
cesses can lead to various trade-offs concerning, for example,
the schedulability properties of the system, the amount of com-
munication exchanged, the size of the schedule tables, etc.
For part of the processes in the application, the designer might
have already decided their mapping. For example, certain pro-
cesses, due to constraints like having to be close to sensors/actu-
ators, have to be physically located in a particular hardware unit.
For the rest of the processes (including the replicas) their map-
ping is decided during design optimization.
Thus, our problem formulation for mapping and policy assign-
ment with checkpointing is as follows:
• As an input we have an application A (Section 4) and a system
consisting of a set of nodes N connected to a bus B (Section 2).
• The parameter k denotes the maximum number of transient
faults that can appear in the system during one cycle of execution.
We are interested in finding a system configuration ψ such that the
k transient faults are tolerated, the transparency requirements T are
observed, and the imposed deadlines are guaranteed to be satisfied,
within the constraints of the given architecture N.
Determining a system configuration ψ = <F, M, S> means:
1. finding a fault tolerance policy assignment, given by F = <P,
Q, R, X>, for each process Pi (see Section 4) in the application
A for which the fault-tolerance policy has not been set a priori
by the designer; this also includes the decision on the number
of checkpoints X for each process Pi in the application A
and each replica in VR;
2. deciding on a mapping M for each unmapped process Pi in the
application A;
3. deciding on a mapping M for each unmapped replica in VR;
4. deriving the set S of schedule tables.
Based on the scheduling approaches described in [13, 14, 17] we
have developed several heuristics that solve the design problem
formulated above. In particular, we have addressed the problem
of fault-tolerant application mapping in [16], and the issue of
checkpointing optimization in [15]. An approach to optimal fault
tolerance policy assignment has been presented in [13].
Figure 6. Schedule Tables (nodes N1 and N2: entries for processes P1 to P4 and messages m1 to m3 under each fault condition)
Figure 7. Efficiency of fault tolerance policy assignment (x-axis: Number of processes, 20 to 100; y-axis: Avg. % deviation, 0 to 100; curves: MR, SFX, MX)
The graph in Fig. 7 illustrates the efficiency of the mapping and
fault tolerance policy assignment approach described in [13].
Experiments with applications consisting of 20 to 100 processes
implemented on architectures consisting of 2 to 6 nodes have been
performed. The number of tolerated faults was between 3 and 7.
The parameter we are interested in is the fault tolerance overhead
(FTO), which represents the percentage increase of the schedule
length due to fault tolerance considerations. We obtained the FTO
by comparing the schedule length obtained using our techniques
with the length of the schedules produced by the same (mapping and
scheduling) techniques while ignoring the fault tolerance issues.
As a baseline in Fig. 7 we use the FTO produced by our approach
proposed in [13], which optimizes the process mapping and also
assigns a fault tolerance policy (re-execution or replication) to
tasks such that the schedule length is minimized. We compared our
approach with two extreme approaches: MX, which only considers
re-execution, and MR, which only relies on replication for
tolerating faults. As the graph shows, optimizing the assignment
of fault tolerance policies leads to results that are, on average,
77% and 17.6% better than MR and MX, respectively.
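The FTO defined above is a relative increase in schedule length, so it can be computed directly from two schedule lengths. The following is a minimal sketch; the function and parameter names are hypothetical, and the two lengths are assumed to come from the same mapping and scheduling technique, once with and once without fault tolerance.

```python
def fault_tolerance_overhead(length_ft: float, length_nft: float) -> float:
    """Percentage increase of the schedule length due to fault tolerance.

    length_ft:  schedule length with fault tolerance taken into account
    length_nft: schedule length of the same application when fault
                tolerance is ignored (must be positive)
    """
    if length_nft <= 0:
        raise ValueError("the non-fault-tolerant schedule length must be positive")
    return 100.0 * (length_ft - length_nft) / length_nft


# Example: a 100 ms schedule that grows to 150 ms once faults are tolerated.
print(fault_tolerance_overhead(150.0, 100.0))
```

A lower FTO means that the fault tolerance requirements are met at a smaller cost in schedule length, which is the quantity the compared approaches try to reduce.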
In Fig. 8 we illustrate the efficiency of our checkpointing
optimization technique proposed in [15]. This technique extends
the one proposed in [13] by considering re-execution with
checkpointing and by optimizing the number of checkpoints. The
baseline for the graph in Fig. 8 is the FTO produced by optimizing
the number of checkpoints using a technique proposed in [27]. This
technique determines the optimal number of checkpoints considering
each process in isolation, as a function of the checkpointing
overhead (which depends on the time needed to create a
checkpoint). However, calculating the number of checkpoints for
each individual process will not produce a solution that is
globally optimal for the whole application. In Fig. 8 we show the
average percentage deviation of the FTO obtained with the system
optimization technique proposed in [15] from the baseline obtained
with the checkpoint optimization proposed in [27] (in this graph,
a larger deviation means a smaller overhead).
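To make the per-process baseline concrete, the sketch below picks the number of checkpoints for a single process in isolation. It is not the analysis of [27]: it assumes a simplified cost model in which a process with worst-case execution time `c` is split into `n` equal segments, each checkpoint costs `chi`, and each of the `k` tolerated faults forces one segment of length `c / n` to be re-executed after a recovery overhead `mu`. All names and the model itself are assumptions made for illustration.

```python
from typing import Tuple


def worst_case_length(c: float, n: int, k: int, chi: float, mu: float) -> float:
    """Worst-case execution time of one process with n checkpoints and k faults
    under the assumed model: c + n*chi + k*(c/n + mu)."""
    if n < 1:
        raise ValueError("at least one checkpoint segment is required")
    return c + n * chi + k * (c / n + mu)


def best_checkpoint_count(
    c: float, k: int, chi: float, mu: float, n_max: int = 50
) -> Tuple[int, float]:
    """Try n = 1..n_max and return (n, length) with the shortest worst case.
    On ties the smallest n is returned."""
    candidates = (
        (n, worst_case_length(c, n, k, chi, mu)) for n in range(1, n_max + 1)
    )
    return min(candidates, key=lambda item: item[1])


# Example: c = 60, two faults, checkpoint overhead 5, recovery overhead 5.
print(best_checkpoint_count(60.0, k=2, chi=5.0, mu=5.0))
```

More checkpoints shorten the segment that has to be re-executed but add overhead even when no fault occurs, so the worst case has a minimum at some intermediate `n`. Because the choice is made per process, it does not account for how processes interact in the global schedule, which is the limitation that the system-level optimization in [15] targets.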
7. Conclusions
In this paper we have addressed the issue of design optimization
for real-time systems with fault tolerance requirements. In
particular, we have emphasized the problem of transient faults
since their number is continuously increasing with new electronic
technologies. We have shown that efficient system-level design
optimization techniques are required in order to meet the imposed
design constraints for fault-tolerant embedded systems in the
context of a limited amount of available resources.
References
[1] A. Bertossi, L. Mancini, “Scheduling Algorithms for Fault-Tolerance
in Hard-Real-Time Systems”, Real-Time Systems Journal, 7(3),
229-256, 1994.
[2] A. Burns, R.I. Davis, S. Punnekkat, “Feasibility Analysis for
Fault-Tolerant Real-Time Task Sets”, Euromicro Worksh. on Real-Time
Systems, 29-33, 1996.
[3] P. Chevochot, I. Puaut, “Scheduling Fault-Tolerant Distributed
Hard Real-Time Tasks Independently of the Replication Strategies”,
Real-Time Comp. Syst. and Appl. Conf., 356-363, 1999.
[4] J. Conner, Y. Xie, M. Kandemir, G. Link, R. Dick, “FD-HGAC: A
Hybrid Heuristic/Genetic Algorithm Hardware/Software Co-synthesis
Framework with Fault Detection”, ASP-DAC Conf., 709-712, 2005.
[5] C. Constantinescu, “Trends and Challenges in VLSI Circuit
Reliability”, IEEE Micro, 23(4), 14-19, 2003.
[6] B. P. Dave, N. K. Jha, “COFTA: Hardware-Software Co-Synthesis of
Heterogeneous Distributed Embedded System for Low Overhead
Fault Tolerance”, IEEE Trans. on Computers, 48(4), 417-441, 1999.
[7] P. Eles, K. Kuchcinski, Z. Peng, P. Pop, A. Doboli, “Scheduling of
Conditional Process Graphs for the Synthesis of Embedded Systems”,
DATE Conf., 132-138, 1998.
[8] P. Eles, A. Doboli, P. Pop, Z. Peng, “Scheduling with Bus Access
Optimization for Distributed Embedded Systems”, IEEE Trans. on
VLSI Syst., 8(5), 472-491, 2000.
[9] E. N. Elnozahy, L. Alvisi, Y. M. Wang, D. B. Johnson, “A Survey of
Rollback-Recovery Protocols in Message-Passing Systems”, ACM
Computing Surveys, 34(3), 375-408, 2002.
[10] FlexRay Consortium, “FlexRay Protocol Specification”, Ver. 2.0, 2004.
[11] A. Girault, H. Kalla, M. Sighireanu, Y. Sorel, “An Algorithm for
Automatically Obtaining Distributed and Fault-Tolerant Static
Schedules”, Int. Conf. on Dependable Syst. and Netw., 159-168, 2003.
[12] C. C. Han, K. G. Shin, J. Wu, “A Fault-Tolerant Scheduling
Algorithm for Real-Time Periodic Tasks with Possible Software
Faults”, IEEE Trans. on Computers, 52(3), 362-372, 2003.
[13] V. Izosimov, P. Pop, P. Eles, Z. Peng, “Design Optimization of
Time- and Cost-Constrained Fault-Tolerant Distributed Embedded
Systems”, DATE Conf., 864-869, 2005.
[14] V. Izosimov, P. Pop, P. Eles, Z. Peng, “Synthesis of Fault-Tolerant
Schedules with Transparency/Performance Trade-offs for Distributed
Embedded Systems”, DATE Conf., 706-711, 2006.
[15] V. Izosimov, P. Pop, P. Eles, Z. Peng, “Synthesis of Fault-Tolerant
Embedded Systems with Checkpointing and Replication”, IEEE Intl.
Worksh. on Electron. Design, Test & Appl. (DELTA), 440-447, 2006.
[16] V. Izosimov, P. Pop, P. Eles, Z. Peng, “Mapping of Fault-Tolerant
Applications with Transparency on Distributed Embedded Systems”,
Euromicro Conf. on Digital System Design (DSD), 313-320, 2006.
[17] V. Izosimov, P. Pop, P. Eles, Z. Peng, “Scheduling of Fault-Tolerant
Embedded Systems with Soft and Hard Time Constraints”, DATE
Conf., 2008.
[18] Jie Xu, B. Randell, “Roll-Forward Error Recovery in Embedded
Real-Time Systems”, Int. Conf. on Parallel and Distr. Syst., 414-421, 1996.
[19] N. Kandasamy, J. P. Hayes, B. T. Murray, “Transparent Recovery
from Intermittent Faults in Time-Triggered Distributed Systems”,
IEEE Trans. on Computers, 52(2), 113-125, 2003.
[20] N. Kandasamy, J. P. Hayes, B. T. Murray, “Dependable
Communication Synthesis for Distributed Embedded Systems”,
Computer Safety, Reliability and Security Conf., 275-288, 2003.
[21] H. Kopetz, H. Kantz, G. Grunsteidl, P. Puschner, J. Reisinger,
“Tolerating Transient Faults in MARS”, 20th Int. Symp. on
Fault-Tolerant Computing, 466-473, 1990.
[22] H. Kopetz, R. Obermaisser, P. Peti, N. Suri, “From a Federated to an
Integrated Architecture for Dependable Embedded Real-Time
Systems”, Tech. Rep. 22, TU Vienna, 2003.
[23] H. Kopetz, G. Bauer, “The Time-Triggered Architecture”,
Proceedings of the IEEE, 91(1), 112-126, 2003.
[24] F. Liberato, R. Melhem, D. Mosse, “Tolerance to Multiple
Transient Faults for Aperiodic Tasks in Hard Real-Time Systems”,
IEEE Trans. on Computers, 49(9), 906-914, 2000.
[25] C. Pinello, L. P. Carloni, A. L. Sangiovanni-Vincentelli,
“Fault-Tolerant Deployment of Embedded Software for Cost-Sensitive
Real-Time Feedback-Control Applications”, DATE Conf., 1164-1169, 2004.
[26] S. Poledna, “Fault Tolerant Systems - The Problem of Replica
Determinism”, Kluwer, 1996.
[27] S. Punnekkat, A. Burns, R. Davis, “Analysis of Checkpointing for
Real-Time Systems”, Real-Time Systems Journal, 20(1), 83-102, 2001.
[28] Y. Xie, L. Li, M. Kandemir, N. Vijaykrishnan, M.J. Irwin,
“Reliability-Aware Co-synthesis for Embedded Systems”, Proc. 15th
IEEE Intl. Conf. on Appl.-Spec. Syst., Arch. and Proc., 41-50, 2004.
[29] Ying Zhang, K. Chakrabarty, “A Unified Approach for Fault
Tolerance and Dynamic Power Management in Fixed-Priority
Real-Time Embedded Systems”, IEEE Trans. on Computer-Aided Design
of Integrated Circuits and Systems, 25(1), 111-125, 2006.
Figure 8. Efficiency of checkpointing optimization (x-axis: Number of processes, 40 to 100; y-axis: Avg. % deviation, 0 to 45)
