ORTEGA: An Efficient and Flexible Online Fault Tolerance Architecture for Real-Time Control Systems by Xue Liu et al.
IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 4, NO. 4, NOVEMBER 2008 213
Xue Liu, Member, IEEE, Qixin Wang, Member, IEEE, Sathish Gopalakrishnan, Member, IEEE,
Wenbo He, Member, IEEE, Lui Sha, Fellow, IEEE, Hui Ding, and Kihwal Lee
Abstract—Fault tolerance is an important aspect in real-time
computing. In real-time control systems, tasks could be faulty
due to various reasons. Faulty tasks may compromise the perfor-
mance and safety of the whole system and even cause disastrous
consequences. In this paper, we describe On-demand Real-TimE
GuArd (ORTEGA), a new software fault tolerance architecture
for real-time control systems. ORTEGA has high fault coverage
and reliability. Compared with existing real-time fault tolerance architectures, such as Simplex, ORTEGA allows more efficient resource utilization and enhances flexibility. These advantages are achieved through the on-demand detection and recovery of faulty tasks. ORTEGA is applicable to most industrial control applications where both efficient resource usage and high fault coverage are desired.
Index Terms—Industrial control, real-time and embedded sys-
tems, reliability and robustness.
I. INTRODUCTION
REAL-TIME and embedded systems now play an important role in our lives, with products covering a large variety of markets, such as aerospace, communication systems, automobiles, healthcare, and personal electronics. Real-time and embedded systems research is regarded as one of the next information technology (IT) frontiers [1].
A real-time system has well-deﬁned timing constraints.
Different from general purpose computer systems, a real-time
system is considered to function correctly only if it returns the
correct result within the system-wide timing constraints [2],
[3].
Reliable functioning of real-time systems is of paramount concern to the millions of users that depend on these systems every day. However, faults and failures can occur in real-time systems. Failures can be caused by hardware malfunctions and/or faults (e.g., electro-mechanical device), software faults (e.g., the processes/threads running on a computer), or communication medium faults.
Manuscript received October 13, 2008; revised November 25, 2008. Current version published January 21, 2009. Paper no. TII-08-10-0127.
X. Liu is with the School of Computer Science, McGill University, Montreal, QC H3A 2A7 Canada (e-mail: xueliu@cs.mcgill.ca).
Q. Wang is with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, China (e-mail: wchshapp@yahoo.com).
S. Gopalakrishnan is with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4 Canada (e-mail: sathish@ece.ubc.ca).
W. He is with the Department of Computer Science, University of New Mexico, Albuquerque, NM 87131 USA (e-mail: wenbohe@cs.unm.edu).
L. Sha, H. Ding, and K. Lee are with the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61820 USA (e-mail: lrs@uiuc.edu; huiding@uiuc.edu; klee7@uiuc.edu).
Digital Object Identifier 10.1109/TII.2008.2010774
Hardware and communication medium faults are typically
tolerated by hardware redundancy [4] and techniques such as
message buffering [5]. In this paper, we focus on how to tol-
erate software faults in real-time control systems.
One of the major deployments of real-time systems is in control applications. Real-time control system software faults can be categorized along three dimensions [6]:
1) resource sharing faults: the corruption of other modules' code and data;
2) timing faults: failure to meet timing constraints;
3) semantic faults: producing wrong values.
Simplex [6], [7] is a software architecture which facilitates the building of dependable real-time control systems. It provides dynamic toleration of software faults. In Simplex, resource sharing faults are handled by address space protections. Timing faults are handled through real-time scheduling methods such as generalized rate-monotonic scheduling (GRMS) [8]. Semantic faults are handled through analytical redundancy, by running a redundant high-assurance controller to guard the system. Simplex has been successfully used in applications such as automated aircraft maneuvering [9] and semiconductor wafer-making [7].
However, the Simplex fault tolerance architecture has two major drawbacks.
1) Lack of efficiency: In Simplex, the analytically redundant high-assurance controller runs in parallel with the high-performance controller even when there are no faults. This wastes CPU cycles. In well-tested industrial applications, failures are infrequent. A parallel high-assurance controller nearly doubles the CPU execution time for a single controlled system. This makes Simplex a high-cost scheme, and hampers its application to many real-time embedded systems, which are resource-constrained.
2) Lack of flexibility: Simplex enforces the same execution period on the high-assurance controller and the high-performance controller. This simplifies the real-time scheduling analysis of systems running under Simplex. However, a control system's performance is affected by the sampling/control periods¹ used [10], [11]. As a result, in practice, different controllers may use different periods for different performance considerations. For example, when a fault occurs in the high-performance controller, ideally the system designer prefers to run the high-assurance controller at a faster rate to help recover from the fault and protect the system promptly. Unfortunately, this design is not possible under Simplex because it lacks the flexibility to allow the high-assurance controller and the high-performance controller to run at different periods.
¹In this paper, we refer to the “sampling/control period” as the “period” for brevity.
1551-3203/$25.00 © 2008 IEEE
Authorized licensed use limited to: McGill University. Downloaded on September 11, 2009 at 15:07 from IEEE Xplore. Restrictions apply.
In this paper, we present a new fault tolerance architecture called On-demand Real-TimE GuArd (ORTEGA). ORTEGA maintains the same high fault coverage and reliability as the original Simplex, while significantly improving flexibility and resource utilization efficiency. Hence ORTEGA is applicable to a wider range of real-time control systems.
In ORTEGA, the high-assurance controller runs in an on-demand fashion. Only when a fault is detected will the high-assurance controller be activated to replace the faulty high-performance controller. Since only one controller is active to control the system at any time instant, substantial resources are saved. Through careful design and schedulability analysis, ORTEGA also allows the high-assurance controller and the high-performance controller to run at different rates, which greatly increases flexibility in control and fault tolerance design. We implemented ORTEGA on a real-time inverted pendulum control system and carried out extensive evaluations. Results confirm the effectiveness and efficacy of ORTEGA.
The rest of this paper is organized as follows. Section II dis-
cusses related work. Section III presents the ORTEGA archi-
tecture. Section IV discusses in detail several design challenges
related to ORTEGA and presents our solutions. Section V de-
scribes the implementation and evaluations of ORTEGA. Fi-
nally, Section VI summarizes the paper and points out future
research directions.
II. RELATED WORK AND BACKGROUNDS
Fault tolerance is always an important issue in software sys-
tems [12]. There are well-developed fault tolerance techniques
for general software systems. They can be classiﬁed into three
general categories: fault masking (e.g., N-version program-
ming [13]); backward fault recovery (e.g., recovery blocks
[14] and/or checkpointing techniques in transaction-based
systems [15]); and forward fault recovery (e.g., roll-forward
checkpointing scheme in [16]).
A typical example of fault masking is N-Version program-
ming [13], where multiple presumably independent program
units are developed to accomplish the same task via (perhaps)
different algorithms. The multiple units are executed in parallel
and a majority determines the correct set of outputs.
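The voting step of N-version programming can be sketched as follows. This is a minimal illustration rather than the scheme of [13]: the function name `majority_vote`, the rule that a crashed version casts no vote, and the three toy versions are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(
    versions: Sequence[Callable[[float], Hashable]], x: float
) -> Optional[Hashable]:
    """Run every version on the same input and return the majority output.

    Returns None when no single output is produced by more than half
    of the versions (assumed policy for this sketch).
    """
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            # A crashed version simply casts no vote.
            continue
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(versions) // 2 else None


# Hypothetical versions of the same task: two agree, one is faulty.
versions = [lambda v: v * v, lambda v: v ** 2, lambda v: v + 1]
print(majority_vote(versions, 3))
```

Because every version runs on every input, the cost grows linearly with the number of versions, which is the kind of always-on redundancy that the on-demand design discussed in this paper tries to avoid.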
A typical example of backward recovery techniques is recovery blocks [14], where a program unit is executed and then an acceptance test is applied. In the event that the acceptance test fails, the system is rolled back to a recovery point and an alternate program unit is executed. The sequence (execute → apply acceptance test → rollback → execute alternative) is repeated until either the acceptance test is passed or there are no additional alternates. The checkpointing technique commonly used in transaction-based database systems [15], whose primary concern is the atomicity, consistency, isolation, and durability (ACID) properties, also belongs to the backward recovery techniques.
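The recovery block loop described above can be sketched in a few lines. The sketch below is illustrative only and is not taken from [14]: the names `recovery_block`, `faulty_primary`, `simple_alternate`, and `accept` are hypothetical, and the recovery point is modeled as a deep copy of a dictionary holding the program state.

```python
import copy
from typing import Any, Callable, Dict, Sequence

State = Dict[str, Any]


def recovery_block(
    state: State,
    alternates: Sequence[Callable[[State], State]],
    acceptance_test: Callable[[State], bool],
) -> State:
    """Try each alternate from the same recovery point until one is accepted."""
    checkpoint = copy.deepcopy(state)  # recovery point
    for alternate in alternates:
        # Rollback: every attempt restarts from a fresh copy of the checkpoint.
        candidate = copy.deepcopy(checkpoint)
        try:
            result = alternate(candidate)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("no alternate passed the acceptance test")


def faulty_primary(s: State) -> State:
    s["root"] = -s["x"]  # deliberately wrong result
    return s


def simple_alternate(s: State) -> State:
    s["root"] = s["x"] ** 0.5
    return s


def accept(s: State) -> bool:
    return s["root"] >= 0 and abs(s["root"] ** 2 - s["x"]) < 1e-9


result = recovery_block({"x": 16.0}, [faulty_primary, simple_alternate], accept)
print(result["root"])
```

Each failed attempt costs a full re-execution, so the worst-case recovery time grows with the number of alternates; this is one reason such schemes need extra care before they are used under hard deadlines.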
A typical example of forward fault recovery is the roll-forward checkpointing scheme [16]. In this scheme, it is assumed that the organization consists of a pool of active processing modules and either a small number of spare modules or active modules with some spare processing capacity. The fault-tolerance scheme is based on checkpoints. Unlike traditional checkpointing schemes, though, it requires no rollbacks for recovering from single faults. The objective of this design is to achieve the performance of a triple modular redundant system using duplex system redundancy.
However, none of the above-mentioned general fault tolerance schemes is readily applicable to real-time systems. For real-time systems, as discussed in Section I, software faults can be categorized along three dimensions: 1) resource sharing faults; 2) timing faults; and 3) semantic faults. The general fault tolerance schemes above do not sufficiently take timing faults into consideration. What is more, how to recover the system from faults in a timely manner is an important research problem for real-time systems and has not been fully addressed in the general fault tolerance schemes.
Fault tolerance for real-time systems has been an active research topic in the past decades. A comprehensive review of the recent progress in this area can be found in [17] and [18]. Here we only list the research which is most relevant to our work.
The researchers of the FORTS project [19], [20] from the University of Pittsburgh address the CPU scheduling of recovery jobs in real-time systems. They assume that all faults can be detected at the end of each periodic execution of the job, and the recovery is done by re-executing the job before its deadline. This is effective for faulty controllers in which faults are nonrecurrent. However, for control systems in which faults are recurrent or persistent, recovery usually cannot be done by re-executing the same job within the same period. This is because the re-execution will most probably lead to the same fault again, since it is still executed within the same faulty controller process. If we replace the faulty controller process with a higher-assurance controller process, the fault could be safely removed. A process replacement is different from a job re-execution and brings up many interesting research issues.
Lee et al. [21] develop a technique called process resurrection to recover a process from a crash failure in order to meet the real-time requirements. Process resurrection is tightly coupled with the crash detection mechanism of the underlying operating system, which offers signals as an event notification mechanism. Common error notification signals include SIGSEGV (segmentation faults), SIGBUS (bus error), SIGFPE (arithmetic operation error such as divide by zero), and SIGILL (execution of an illegal instruction). In process resurrection, a special signal handler is assigned to every crash-related signal in order to override the default signal handler and trigger the recovery. However, the fault coverage of this technique is limited to process crashes. Faults such as a missed deadline, an infinite loop, or a deadlock cannot be handled.
Simplex [6], [7] is a software architecture which facilitates
the building of dependable real-time systems. It provides dy-
namic toleration of system faults. The plant under control is
divided into a high-assurance-control (HAC) subsystem and
a high-performance-control (HPC) subsystem. The HAC subsystem is control software that has been proven to be reliable. HAC's simple construction lets the system designer leverage the power of formal methods and a rigorous development process. At the system level, high-assurance OS kernels such as certifiable runtimes are usually used in the HAC. At the application level, well-understood classical controllers designed to maximize the controlled plant's stability region are also used.
The HPC subsystem complements the conservative HAC
core. From application level, an HPC can use more complex
and advanced control technologies for higher control perfor-
mance, including those difﬁcult to verify, for example, neural
network control [22]. These more complex and advanced control schemes may yield better control performance; however, they may have bugs or faults. At the system level, COTS real-time OS and middleware designed to simplify application development can be used in the HPC. However, these software
components may not be certiﬁable and could contain faults. In
Simplex, the HAC and HPC subsystems run in parallel, but the
software stays in separate processes. By using the redundant
HAC to guard against possible faults in the HPC, Simplex
achieves fault tolerance.
However, as we discussed in the Introduction, Simplex is neither resource-efficient nor flexible. In comparison, ORTEGA achieves the same fault coverage and functionalities as Simplex, but with significantly lower CPU resource usage, and allows flexible controller implementations.
The conference version of this paper is published in [23].
III. ORTEGA ARCHITECTURE
In this section, we present the overall architecture and design
considerations of ORTEGA.
A. Components Organization and Fault Recovery Procedure
of ORTEGA
The architecture of ORTEGA is shown in Fig. 1. We call a plant for which ORTEGA provides fault tolerance protection an FT-enabled plant. In ORTEGA, there can be multiple FT-enabled plants. For each FT-enabled plant, there are three ORTEGA logical components: 1) a decision module; 2) an HPC module; and 3) an HAC module. Fig. 1 illustrates the case where there is one FT-enabled plant.
Similar to the Simplex architecture [7], the HAC subsystem is highly reliable but only provides basic performance. The HPC subsystem complements the conservative HAC core, but may contain faulty software components.
For both Simplex and ORTEGA, the decision module plays a key role in providing fault detection and recovery using its decision logic. In Simplex, at any time, both the HPC and the HAC are running; the decision module determines which of the two control commands is used for online control in each period. In contrast, ORTEGA runs the HAC only when it is necessary, i.e., only when the decision module detects an HPC fault/failure. As a result, only one instance of either the HAC or the HPC is running at any time. By eliminating the redundant execution of controllers, ORTEGA's run-time overhead is significantly less than that of Simplex. The saved CPU cycles can be used for other real-time or non-real-time tasks. It is worth noting that the on-demand execution can be implemented efficiently under ORTEGA to remove the potential overhead (in terms of delay) associated with controller thread/process start and stop.
Fig. 1. Illustration of the ORTEGA fault tolerance architecture.
In our implementation, the decision module uses a mutex semaphore to control which of the HAC or HPC is running. When the HPC is running well, the HAC thread simply blocks on the semaphore, i.e., the HAC is suspended. Only when a fault is detected in the HPC does the decision module release the semaphore to let the HAC become active.
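The blocking behavior just described can be sketched with a counting semaphore that starts at zero. This is only an illustration of the idea, not the authors' implementation: the class `OnDemandGuard`, its method names, and the use of Python's `threading.Semaphore` in place of the real-time OS semaphore are assumptions.

```python
import threading


class OnDemandGuard:
    """Hypothetical sketch: the HAC thread sleeps until a fault is reported."""

    def __init__(self) -> None:
        # Initial value 0: the HAC blocks from system start-up onward.
        self._hac_gate = threading.Semaphore(0)
        self.log = []

    def hac_loop(self) -> None:
        self._hac_gate.acquire()  # the HAC is suspended here while the HPC is healthy
        self.log.append("HAC active")

    def report_fault(self) -> None:
        # Called by the decision module when it detects an HPC fault.
        self.log.append("fault detected")
        self._hac_gate.release()


guard = OnDemandGuard()
hac_thread = threading.Thread(target=guard.hac_loop, daemon=True)
hac_thread.start()          # started at start-up, but immediately blocked
guard.report_fault()        # the decision module releases the semaphore
hac_thread.join(timeout=2)
print(guard.log)
```

Keeping the HAC thread created but blocked is what lets the switch happen without paying thread or process start-up delay at the moment of the fault.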
In ORTEGA, the HPC controls the plant most of the time. In order to detect and recover from HPC faults in a timely manner, the decision module should ensure that the HPC-controlled plant state stays within the HAC-established stability region. A method to determine stability regions for digital controllers will be presented in Section IV-A. When this condition is violated, the suspended HAC is activated and takes over the plant to recover.
A second feature of ORTEGA is that the HAC and the HPC can run at different sampling rates. This allows better flexibility in designing the HAC and the HPC. For example, when the HPC is detected to be faulty, the carefully designed HAC can run at a faster rate to help recover the plant promptly.
In summary, the realization of ORTEGA is described as fol-
lows. On system start-up, all components are started but the
HAC is blocked. As soon as a fault is detected in the HPC, the
decision module deactivates the HPC and activates the HAC.
Now the plant is under the control of HAC for recovery. If the
type of fault is semantic (e.g., control command out of stability
region), the HPC is allowed to be switched back later, after the
HAC recovers the system from the previous fault and stabi-
lizes the plant. If the failure is due to an execution error such
as segmentation fault or entering inﬁnite loop, the HPC will be
restarted for later retry.
After the HAC has recovered and stabilized the plant, the active controller will switch back to the HPC for retry or performance considerations. Note that, unlike the recovery procedure, which must be done in a timely fashion in order for the plant state to stay within the HAC-established stability region,² the initiation time of the switch-back can be more flexible. In practice, the decision module can initiate the switch-back when the plant state is near the control set point and the CPU is in an idle interval [24]. This mechanism avoids the possible mode-change problem and simplifies schedulability analysis. We will discuss the details of schedulability analysis in Section IV-B.
²Schedulability analysis of the controller switching will be discussed in Section IV-B.
Fig. 2. Illustration of the extra delay in recovery using ORTEGA (the legend gives the periods of the HPC and the HAC).
B. Analysis of CPU Usage Savings of ORTEGA
ORTEGA eliminates the redundant execution of controllers and hence reduces the resource usage significantly. In this section, we quantify the resource savings of ORTEGA in terms of CPU usage.
We use the commonly adopted periodic task model [25] to model the control tasks. Each periodic task is denoted by $\tau_i$. The timing parameters of a task $\tau_i$ are represented by the tuple ($C_i$, $T_i$), where $C_i$ is the worst-case execution time, and $T_i$ is the task period. For control systems, a task's deadline $D_i$ is usually the same as its period, i.e., we have $D_i = T_i$.
Compared with Simplex, ORTEGA greatly reduces CPU resource usage due to the on-demand execution of the HAC. Suppose the timing parameters of an FT-enabled plant under the control of its HPC are $(C^{HPC}, T^{HPC})$, while the timing parameters of the same plant under the control of its HAC are $(C^{HAC}, T^{HAC})$. We use $\alpha$ to denote the fraction of time used for recovery (i.e., when the HAC is active) during a total run time of $L$ (in the unit of milliseconds).
In the original Simplex, the total CPU resource usage (in the unit of milliseconds) is
$$\left(\frac{C^{HPC}}{T^{HPC}} + \frac{C^{HAC}}{T^{HAC}}\right) L \tag{1}$$
While in ORTEGA, the total CPU resource usage (in the unit of milliseconds) is
$$\left((1-\alpha)\,\frac{C^{HPC}}{T^{HPC}} + \alpha\,\frac{C^{HAC}}{T^{HAC}}\right) L \tag{2}$$
As we can see, ORTEGA saves $\left(\alpha\,\frac{C^{HPC}}{T^{HPC}} + (1-\alpha)\,\frac{C^{HAC}}{T^{HAC}}\right) L$ (ms) of CPU usage during a total run time of $L$ (ms). In practical industrial applications where faults are rare, $\alpha$ is very small, thus ORTEGA saves much of the CPU resource.
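A small numerical sketch of this comparison follows. It assumes that Simplex runs both controllers for the whole run time while ORTEGA runs the HAC only for a fraction `alpha` of it; the function names and the example numbers (a 2 ms HPC and a 1 ms HAC, both at a 10 ms period, faulty 1% of the time) are hypothetical and are not measurements from the paper.

```python
def simplex_cpu_ms(c_hpc: float, t_hpc: float, c_hac: float, t_hac: float,
                   run_ms: float) -> float:
    """CPU time when both controllers execute throughout the run (assumed model)."""
    return (c_hpc / t_hpc + c_hac / t_hac) * run_ms


def ortega_cpu_ms(c_hpc: float, t_hpc: float, c_hac: float, t_hac: float,
                  run_ms: float, alpha: float) -> float:
    """CPU time when the HAC executes only during the recovery fraction alpha."""
    return ((1 - alpha) * c_hpc / t_hpc + alpha * c_hac / t_hac) * run_ms


# Hypothetical parameters, all times in milliseconds.
simplex = simplex_cpu_ms(2.0, 10.0, 1.0, 10.0, 1000.0)
ortega = ortega_cpu_ms(2.0, 10.0, 1.0, 10.0, 1000.0, alpha=0.01)
print(simplex, ortega, simplex - ortega)
```

With these illustrative numbers the saving is close to the full cost of the HAC, which matches the observation that the benefit is largest when faults are rare.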
C. Extra One Period Delay
The on-demand execution of the HAC in ORTEGA saves much of the CPU resource. However, we note that, due to the on-demand recovery, ORTEGA may introduce a delay of up to one period in the recovery procedure. In this section, we detail the cause of the delay and provide a solution.
Fig. 2 illustrates this extra delay. Suppose a fault is detected in the HPC at time $t$. The upper half of Fig. 2 shows the recovery timeline under the original Simplex, while the lower half of Fig. 2 shows the recovery timeline under ORTEGA. For Simplex, since the HAC and the HPC are running in parallel, on detection of a fault, the HAC control command is immediately available and can take effect at the beginning of the next control period. For ORTEGA, the HAC is running on-demand. On detection of a fault, the HAC needs to carry out the control computation and its control command only becomes available during the first period when the HAC begins running. So the HAC control command takes effect one HAC period ($T^{HAC}$) later, i.e., at time $t + T^{HAC}$. As a result, compared with the recovery procedure using Simplex, the recovery procedure using ORTEGA will incur an extra delay of up to $T^{HAC}$.
The extra delay incurred in ORTEGA can be compensated using a state projection technique. The idea is illustrated in Fig. 3. At any decision time $t$, the decision module projects the plant state one HAC period ahead, i.e., projects $x(t + T^{HAC})$. If the projected state is still within the stability region associated with the HAC, the HPC can still run; otherwise, the HAC will be activated to take over the plant.
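The decision rule above can be sketched as a one-step projection followed by a membership test. Everything concrete here is an assumption made for illustration: a linear discrete-time plant model with a single input, a stability region written as the ellipsoid of states `x` with `x' P x <= 1`, the function names, and the double-integrator numbers used in the example.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def mat_vec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Matrix-vector product for small dense matrices."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def project_state(phi: Matrix, gamma: Sequence[float],
                  x: Sequence[float], u: float) -> List[float]:
    """One-step-ahead state under a scalar input u: phi * x + gamma * u."""
    return [px + g * u for px, g in zip(mat_vec(phi, x), gamma)]


def inside_region(p: Matrix, x: Sequence[float]) -> bool:
    """Membership test for the assumed ellipsoidal region x' P x <= 1."""
    return sum(xi * pxi for xi, pxi in zip(x, mat_vec(p, x))) <= 1.0


def choose_controller(phi: Matrix, gamma: Sequence[float], p: Matrix,
                      x: Sequence[float], u_hpc: float) -> str:
    """Keep the HPC only if the projected state stays inside the HAC region."""
    return "HPC" if inside_region(p, project_state(phi, gamma, x, u_hpc)) else "HAC"


# Hypothetical double integrator sampled at one HAC period of 0.1 s.
PHI = [[1.0, 0.1], [0.0, 1.0]]
GAMMA = [0.005, 0.1]
P = [[1.0, 0.0], [0.0, 1.0]]

print(choose_controller(PHI, GAMMA, P, x=[0.5, 0.0], u_hpc=0.0))
print(choose_controller(PHI, GAMMA, P, x=[0.6, 0.5], u_hpc=10.0))
```

In the second call the current state is still inside the region, but the projected state is not, so the switch is triggered one period early; this is how the projection absorbs the extra delay.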
It is worth noting that the flexibility of ORTEGA allows the period of the HAC ($T^{HAC}$) to be smaller than that of the HPC ($T^{HPC}$) for fast recovery. Hence the potential impact of the extra delay is small compared with the benefits gained from the increased stability region of the HAC in ORTEGA. The increased stability region allows for better fault tolerance. This is discussed in Section IV-A. A simple and computationally inexpensive state projection method is also presented in Section IV-A.
Fig. 3. Illustration of using state projection to handle the extra delay.
IV. DESIGN CHALLENGES AND SOLUTIONS
The design and implementation of ORTEGA raise several important research challenges. In this section, we discuss the two most important challenges and present their solutions. The first challenge is how to determine the HAC stability region, which is used by the decision module for fault detection and recovery. The other is how to carry out schedulability analysis for ORTEGA.
A. Maximum Stability Region for Digital Controllers
ORTEGA's forward recovery scheme is based on the maximum stability region of the plant under the HAC controller. At any decision time, the decision module will check if the plant's (projected) state is still within the stability region associated with the HAC. If so, the HPC controls the plant; otherwise, the HAC will be activated to take over the control. In order to minimize undue restriction of the state space the HPC can use, we prefer a large stability region associated with the HAC. In this section, we propose a formal approach to determine the maximum stability region for a digitally implemented control system.
Assume the plant to be controlled is governed by a continuous-time state space model as described in the following:
$$\dot{x}(t) = A x(t) + B u(t) \tag{3}$$
where $x(t) \in \mathbb{R}^{n}$ is the system state and $u(t) \in \mathbb{R}^{m}$ is the control input. $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$ are the corresponding system matrices. Controllers are typically designed in a state feedback form, i.e., $u(t) = K x(t)$, where $K \in \mathbb{R}^{m \times n}$ is the corresponding controller gain.
Modern control systems are typically implemented on digital computers. Due to the digitization of continuous controllers, the sampling and control of the continuous-time system (3) is enforced at discrete time points. As a result, for the purpose of design and analysis, we need to convert the continuous-time system to its discrete-time form according to the digital implementation method used. For example, the continuous-time system (3), when implemented using a zero-order hold with a sampling period $T$, can be represented as follows [26]:
$$x(k+1) = \Phi x(k) + \Gamma u(k) \tag{4}$$
where $x(k)$ is the state of the plant at the $k$th sample time, and
$$\Phi = e^{AT}, \qquad \Gamma = \int_{0}^{T} e^{As}\, ds\; B \tag{5}$$
Using (4) and (5), we can get a simple and computationally inexpensive state projection method to compensate for the extra delay when using ORTEGA, as discussed in Section III-C. That is, at control interval $k$, since the decision module knows the current plant state $x(k)$ and the current control output value $u(k)$, it can calculate the predicted plant state at interval $k+1$ according to (4) and (5).
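The zero-order-hold discretization can be computed offline with a truncated power series, as in the sketch below. The notation is an assumption: `a` and `b` stand for the continuous-time system matrices, `period` for the sampling period, and the returned pair for the discrete-time state and input matrices. The series truncation at 20 terms and the double-integrator example are choices made for illustration; a production design would more likely rely on a numerical library.

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_scale(a: Matrix, s: float) -> Matrix:
    return [[s * x for x in row] for row in a]


def zoh_discretize(a: Matrix, b: Matrix, period: float, terms: int = 20):
    """Return (phi, gamma) with phi = exp(a*T) and gamma = (integral_0^T exp(a*s) ds) * b.

    Both are accumulated from the series sum_k a^k T^k / k!, truncated after `terms` terms.
    """
    n = len(a)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    phi = [row[:] for row in identity]
    integral = mat_scale(identity, period)       # k = 0 term of the integral: I * T
    power = [row[:] for row in identity]         # holds a^k T^k / k!
    for k in range(1, terms):
        power = mat_scale(mat_mul(power, a), period / k)
        phi = mat_add(phi, power)
        # The integral's k-th term is a^k T^(k+1) / (k+1)! = power * T / (k+1).
        integral = mat_add(integral, mat_scale(power, period / (k + 1)))
    return phi, mat_mul(integral, b)


# Hypothetical plant: a double integrator sampled every 0.1 s.
phi, gamma = zoh_discretize([[0.0, 1.0], [0.0, 0.0]], [[0.0], [1.0]], 0.1)
print(phi, gamma)
```

Once these two matrices are stored, the one-step projection used by the decision module is a single matrix-vector multiply and add per decision, which is why the method is cheap at run time.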
Corresponding to the continuous-time state feedback controller $u(t) = K x(t)$, the digitally implemented state feedback controller is $u(k) = K x(k)$. By replacing the term $u(k)$ with $K x(k)$ in the discrete-time system (4), we get the closed-loop discretized control system as
$$x(k+1) = \bar{\Phi}\, x(k) \tag{6}$$
where $\bar{\Phi} = \Phi + \Gamma K$, with $\Phi$ and $\Gamma$ the discrete-time system matrices of (4) and (5).
To determine the stability of a closed-loop discretized control system such as (6), we use the celebrated Lyapunov stability criterion, which is summarized in the following theorem [26].
Theorem 1: A discrete-time linear time-invariant (LTI) system (6) is stable iff there exists a positive definite matrix $P$ such that
$$\bar{\Phi}^{\top} P\, \bar{\Phi} - P < 0 \tag{7}$$
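As a sanity check on Theorem 1, the scalar case can be worked out by hand. The notation below is assumed for the illustration: a scalar closed-loop system $x(k+1) = \phi\, x(k)$ and a scalar Lyapunov candidate $p > 0$.

```latex
% Assumed notation: scalar closed-loop system x(k+1) = \phi x(k), candidate p > 0.
\begin{align*}
\phi\, p\, \phi - p &= p\,(\phi^{2} - 1) < 0 \iff |\phi| < 1, \\
V(x) = p\,x^{2}: \quad V\bigl(x(k+1)\bigr) - V\bigl(x(k)\bigr)
  &= p\,(\phi^{2} - 1)\,x(k)^{2} < 0 \quad \text{for } x(k) \neq 0 .
\end{align*}
```

The Lyapunov condition therefore reduces to the familiar requirement that the closed-loop pole lies strictly inside the unit circle, and the quadratic function decreases along every nonzero trajectory.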
For a real-life application, due to the limitations of the physical plant and control actuators, there are constraints on the system states and/or control inputs. For system (4), these constraints can be represented as
(8)
(9)
With the digital state feedback controller, the constraints (8)–(9) can be combined and represented as a polytope (a multidimensional figure whose faces are hyperplanes) in the system's state space as follows:
(10)
Fig. 4. Illustration of a stability region under state constraints.
where , for ; for
, , and . The states inside
the polytope are called admissible states, because they obey the
operational constraints.
We formally deﬁne a stability region as a subset of the states
within the polytope such that if the closed-loop discretized con-
trol system starts from a state within the stability region, the
system states’ future trajectory will always stay within the re-
gion and ﬁnally converge to the control set point.
From Lyapunov theory, we see that a Lyapunov function inside the state constraint polytope represents a stability region [7], [27], [28]. Geometrically, it defines an $n$-dimensional ellipsoid in the $n$-dimensional system state constraint polytope, as illustrated in Fig. 4, where the state space is two-dimensional. An important property of a Lyapunov function is: if the system state is within the ellipsoid associated with a controller, it will always stay within the ellipsoid and finally converge to the equilibrium position (set point) under this controller.
Mathematically, for a closed-loop discretized control system (6), a stability region under a specific Lyapunov matrix $P$ with state constraints is defined as the following ellipsoid:
$$\mathcal{S}(P) = \{\, x \mid x^{\top} P x \le 1 \,\} \tag{11}$$
where $P$ satisfies the Lyapunov stability criterion [i.e., (7)] and $\mathcal{S}(P)$ satisfies the state constraints [i.e., (10)].
However, the Lyapunov matrix is not unique for a given stable closed-loop discretized control system. As we discussed, in order not to unduly restrict the state space within the operational constraints, we should find the maximum stability region (MSR). To get the MSR, we first give Lemma 1.
Lemma 1: Given a discretized LTI system (6) with state con-
straints (10), the stability region deﬁned in (11) satisﬁes the
constraints in (10) iff , .
Proof: Please refer to [29, Lemma 4.1].
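The geometric idea behind a containment condition of this kind can be seen from a short calculation. The notation is assumed for the illustration: an ellipsoid $E = \{x : x^{\top} P x \le 1\}$ with $P$ positive definite, and a single face of the polytope written as $a^{\top} x \le 1$.

```latex
% Assumed notation: E = {x : x^T P x <= 1}, P positive definite; one face a^T x <= 1.
\begin{align*}
\max_{x^{\top} P x \le 1} a^{\top} x
  &= \max_{\|y\|_2 \le 1} \bigl(P^{-1/2} a\bigr)^{\top} y
     && (y = P^{1/2} x) \\
  &= \bigl\| P^{-1/2} a \bigr\|_2
   = \sqrt{a^{\top} P^{-1} a}
     && \text{(Cauchy--Schwarz)} \\
\text{hence}\quad E \subseteq \{x : a^{\top} x \le 1\}
  &\iff a^{\top} P^{-1} a \le 1 .
\end{align*}
```

Applying this to every face of the polytope turns the geometric requirement that the ellipsoid stays inside the admissible states into one scalar inequality per face, each of which is linear in the inverse of the Lyapunov matrix.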
Note that the area of the stability region defined in (11) grows with the determinant of the inverse of the Lyapunov matrix. Based on Lemma 1, the determination of the MSR of a closed-loop discretized control system (6) with state constraints (10) is then reduced to the following linear matrix inequality (LMI) problem [30].
1) Problem 1: Maximize
Further, if the inverse of the Lyapunov matrix exists, taking this inverse as the new matrix variable converts the above LMI problem into the following new problem.
2) Problem 2: Maximize
Problem 2 is a MAXDET problem [30]. It is readily solvable using numerical software packages such as sdpsol [31] or YALMIP [32]. Note that the maximum stability region and its solution via LMI formulations (i.e., Problems 1 and 2) are different from those presented in [29]. Here we are dealing with a discretized system under digital controllers, while in [29], the authors were dealing with a continuous system under continuous-time controllers.
The solution resulting from the MAXDET problem above defines the maximum stability region in the system state polytope (see Fig. 4).
Using the method presented above, when the continuous-time system model, the underlying continuous-time controller and its control loop period are given, a system designer can calculate the maximum stability region of the corresponding closed-loop discretized control system offline.
It is obvious that the maximum stability region (MSR) is a function of the control loop period $T$. Hence we denote the MSR for a plant under a digital controller with respect to the control loop period $T$ as $\mathrm{MSR}(T)$. We further use $|\mathrm{MSR}(T)|$ to denote the area of $\mathrm{MSR}(T)$. Since $\mathrm{MSR}(T)$ encloses the admissible states of the system that keep it stable with respect to the control loop period $T$, we call $|\mathrm{MSR}(T)|$ the stability index of the closed-loop discretized control system under control loop period $T$.
Now, let us look at a real-life control example. Consider
a controlled inverted pendulum. The continuous-time system
model of the inverted pendulum is shown in the following:
(12)
where the system matrices are
(13)
(14)
The designed high-assurance controller for the inverted pen-
dulum uses state feedback control [26] as shown in the fol-
lowing:
(15)
Fig. 5. Stability index versus control loop period for the inverted pendulum
control system.
Here the plant state vector consists of four components, in order: the inverted pendulum's track position, its angle, its track velocity, and its angular velocity, all at time $t$.
We implemented the method presented above to determine the maximum stability region when varying the control loop period from 0.001 s to 0.032 s. Fig. 5 illustrates the corresponding stability index versus control loop period for the inverted pendulum system.
The physical meaning of the decreasing shape of the stability index shown in Fig. 5 is clear: as the control loop period decreases (i.e., the controller runs faster), the system becomes more stable, hence the stability index increases. On the other hand, as the control loop period increases (i.e., the controller runs slower), the plant becomes less stable, hence the stability index decreases. So in order to have a larger maximum stability region, the HAC should run faster. This is supported in ORTEGA as the HAC can run at a rate faster than that of the HPC, as long as system schedulability is guaranteed. We provide the schedulability analysis of ORTEGA in the next section.
B. Schedulability Analysis for ORTEGA Fault
Tolerance Architecture
The new ORTEGA fault tolerance architecture saves CPU resource compared with Simplex. The saved CPU resource could be used for other real-time tasks. In this section, we discuss the schedulability analysis of ORTEGA together with these real-time tasks.
Consider an FT-enabled plant under the protection of
ORTEGA. The plant is served by one real-time task while it is
controlled by the HPC and by a different real-time task while it
is controlled by the HAC; each task has its own timing
parameters (execution time, period). When a fault is detected in
the HPC, the decision module will issue a recovery request (RR)
to initiate the recovery procedure. As a result, the HPC task
will be aborted and the new HAC task will be activated for
recovery.
In the same way, the decision module is modeled as one
real-time task when the plant is controlled by the HPC and as
another task when the plant is controlled by the HAC.
In ORTEGA, when the controller switches, the decision module's
period also switches accordingly. So in each mode the controller
task and the decision module task have the same phase
and period. From a schedulability analysis perspective, the two
can be modeled as one single task whose execution time is the
sum of the controller's and the decision module's execution
times and whose period is their common period. Similarly, the
HAC controller task and its decision module task can also be
modeled as one single task, with its execution time and period
obtained in the same way.
As a result, when scheduled together with other real-time
tasks, the decision modules and controllers for a single
FT-enabled plant can be modeled as one abstract task. It has two
subtasks: one runs when the plant is under the control of the
HPC and the other runs when the plant is under the control of
the HAC. Such a task is called an FT-enabled task. This
model abstraction simplifies the following schedulability
analysis. In the discussions below, where no confusion arises,
we use the same task notation for the combined decision and
control task of an FT-enabled plant and for the other real-time
tasks.
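The abstraction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Task`, `FTEnabledTask`, and `combine` are hypothetical, timing parameters are taken to be (execution time, period), and the numbers are made up for the example rather than taken from the testbed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A periodic task; the class and field names are illustrative assumptions."""
    exec_time: float  # worst-case execution time
    period: float     # release period (deadline assumed equal to the period)


def combine(controller: Task, decision: Task) -> Task:
    """Merge a controller task and its decision-module task into one task.

    Both run with the same phase and period, so the merged task keeps that
    period and its execution time is the sum of the two execution times.
    """
    if controller.period != decision.period:
        raise ValueError("controller and decision module must share a period")
    return Task(controller.exec_time + decision.exec_time, controller.period)


@dataclass(frozen=True)
class FTEnabledTask:
    """Abstract FT-enabled task with one subtask per operating mode."""
    hpc_mode: Task  # runs while the plant is controlled by the HPC
    hac_mode: Task  # runs while the plant is controlled by the HAC


# Hypothetical numbers, chosen only to exercise the code.
ft_task = FTEnabledTask(
    hpc_mode=combine(Task(2.0, 30.0), Task(1.0, 30.0)),
    hac_mode=combine(Task(0.5, 20.0), Task(0.5, 20.0)),
)
```

Only one of the two subtasks is active at any time, which is why the rest of the analysis can treat the pair as a single task whose parameters change at a mode switch.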
1) Task Model and Definitions: Assume there are a number of
real-time tasks running on the CPU. They are ordered in the
sequence of their priorities, from the task with the highest
priority to the task with the lowest priority. Among them, some
tasks are FT-enabled tasks; together they form the set of all
FT-enabled tasks. Each FT-enabled task is composed of two
subtasks: the first represents the combined decision and control
task when the plant is being controlled by its HPC; the second
represents the combined decision and control task when the
plant is being controlled by its HAC.
Definition 1: Given a task set in which some of the tasks are
FT-enabled tasks, if the task set is schedulable under the
ORTEGA task recovery and switch-back scheme with random
failures, it is called FT-schedulable.
We will focus on the schedulability analysis when there is one
FT-enabled task in the task set.
In ORTEGA, when a fault in the HPC is identified, the decision
module will issue an RR. At the same time, the HPC subtask will
be aborted, and the HAC subtask will be activated for recovery.
This potentially raises a mode-change problem [33], since the
two subtasks may have different timing parameters. We need to
carry out schedulability analysis to guarantee that the tasks
are schedulable during the recovery. Similarly, after the
recovery, the HAC subtask can be switched back to the HPC
subtask for retry or performance considerations.
We also need to guarantee the tasks are still schedulable with
the switch-back.
If a task set is schedulable under the ORTEGA task re-
covery, it is called FT-schedulable for recovery. If a task set is
schedulable under the ORTEGA task switch-back, it is called
FT-schedulable for switch-back.
Since the schedulability analysis methods are similar for both
the recovery procedure and the switch-back procedure, we only
present the analysis for recovery, i.e., under the mode-change
from the HPC subtask to the HAC subtask, in the following
discussion.
2) Schedulability Analysis for ORTEGA Recovery Scheme:
When the RR is initiated, the recovery procedure drives
the system from the old operating mode (abbreviated as
“old-mode”) to the new operating mode (abbreviated as
“new-mode”). The plant is controlled by the HPC in the
old-mode and by the HAC in the new-mode. Tasks in the
old-mode are called old-mode tasks, while tasks in the
new-mode are called new-mode tasks. In the following
discussions, when we refer to the FT-enabled task, we mean
its HPC subtask when the context is old-mode and its HAC
subtask when the context is new-mode.
The switching from the HPC subtask to the HAC subtask imposes
transitional scheduling overhead, which may make the whole task
set unschedulable. In real-time analysis, this is called the
mode-change problem [34], [33]. In order to determine whether
the task set is FT-schedulable under potential recoveries, we
develop a schedulability test. It is based on a variant of the
proposal presented in [33] by Real et al., with simplifications.
The basic idea here is to consider each real-time task which may
be affected by the transitional scheduling overhead in either
old-mode or new-mode. Then we perform an offline response time
analysis to test if it is schedulable in both modes.
When there is only one FT-enabled task in the task set, it is
the only task which may initiate the mode-change. For any other
task, the task release pattern will not change before
and after the mode-change. We call these tasks unchanged tasks.
Our schedulability test is divided into three parts, referred to
below as cases (I)-(III) in order.
Schedulability of Steady State Task Set: Let us define two task
sets as follows:
(16)
(17)
They are the old-mode and new-mode steady state task sets, i.e.,
the unchanged tasks together with the HPC subtask (old-mode) or
the HAC subtask (new-mode) of the FT-enabled task.
We first test if both task sets are schedulable. This can be done
using the standard response time analysis [25], [35], [36] for
each task in the set. By solving the recurrence equations in the
standard response time analysis, we get the maximal response
time of each task under the old-mode (and under the new-mode)
in steady state. It should be smaller than or equal
to the task deadline (period) for schedulability. If the
recurrence value exceeds the period, then the task is
unschedulable.
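The steady state test can be sketched as follows. This is a minimal version of the standard response time recurrence under a few assumptions: tasks are listed from highest to lowest priority, deadlines equal periods, and all parameters are integers. The function names are hypothetical; the (2, 4) and (3, 30) entries are borrowed from Example 1 of this section, while the middle entries are made-up stand-ins for the HPC and HAC subtasks.

```python
from typing import List, Optional, Tuple

# (execution time, period); the deadline is assumed equal to the period.
Task = Tuple[int, int]


def response_time(index: int, tasks: List[Task]) -> Optional[int]:
    """Worst-case response time of tasks[index], or None if it exceeds its period.

    `tasks` must be sorted from highest to lowest priority.
    """
    c_i, t_i = tasks[index]
    r = c_i
    while True:
        # -(-r // t) is an integer ceiling: jobs of a higher-priority task
        # with period t that are released within a window of length r.
        nxt = c_i + sum(-(-r // t) * c for c, t in tasks[:index])
        if nxt > t_i:
            return None          # recurrence value exceeds the period
        if nxt == r:
            return r             # fixed point reached
        r = nxt


def steady_state_schedulable(tasks: List[Task]) -> bool:
    """True if every task in the set meets its deadline in steady state."""
    return all(response_time(i, tasks) is not None for i in range(len(tasks)))


# Hypothetical FT-enabled subtasks: (1, 6) under the HPC, (1, 3) under the HAC.
old_mode = [(2, 4), (1, 6), (3, 30)]
new_mode = [(2, 4), (1, 3), (3, 30)]
case_one_passes = (steady_state_schedulable(old_mode)
                   and steady_state_schedulable(new_mode))
```

Both task sets must pass before the transitional cases are worth examining, since a set that is unschedulable in steady state cannot be FT-schedulable.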
Schedulability of Old-Mode Tasks With Transitional Scheduling
Overhead: In the old-mode, first, it is obvious that the
schedulability of each task with higher priority than the
FT-enabled task is not affected by the mode-change, precisely
because such tasks have higher priorities than the FT-enabled
task.
Secondly, we consider the HPC subtask, which is the task aborted
upon the RR. It cannot be affected by the new-mode HAC subtask
during the old-mode, hence its schedulability has already been
covered by the steady state schedulability analysis above
[i.e., case (I)].
Fig. 6. Illustration of mode-change incurred by the recovery.
Lastly, we consider every old-mode task that has lower priority
than the FT-enabled task. In order to account for the worst case
phasing in the schedulability test, the RR is assumed to occur
at a variable offset (in time units) after the task's
activation. We also define a temporal window, starting at the
activation of this old-mode task and finishing when the task
completes. Fig. 6 illustrates the RR offset and the window.
Now we carry out the response time analysis by quantifying
the possible interference that the lower-priority task under
analysis may receive from other tasks.
Firstly, in the old-mode, it can be affected by the old-mode
aborted HPC subtask. The total worst-case interference from this
subtask is
(18)
Secondly, it can also be affected by the newly released HAC
subtask. The worst-case interference from this subtask is
(19)
where .
Finally, the task can also be affected by the unchanged tasks
that have higher priorities than it. These interferences are
(20)
By summing up all the sources of interference, we get the
total size of the window
(21)
The solution to this equation is obtained by performing a
recurrence calculation on the window size to find the smallest
positive integer that satisfies it, just as in ordinary response
time analysis. Then
the worst-case response time of the old-mode task is obtained
as the duration of the largest time window over all RR offsets,
i.e.,
(22)
In practice, when calculating the response time using (22), we
only care about significant values of the RR offset, namely
those that produce changes in the values of the ceiling and
floor functions in (21).
After the response time is obtained, it is compared with the
task's deadline to determine whether the task is schedulable.
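The shape of this transitional analysis can be illustrated with a simplified sketch. It is not a transcription of (18)-(22); it is an assumed, coarser bound that treats time as integer ticks, charges every old-mode job released before the RR its full execution time, and assumes the HAC subtask is released exactly at the RR. All function names and the FT-enabled subtask parameters are hypothetical.

```python
from typing import List, Optional, Tuple

Task = Tuple[int, int]  # (execution time, period) in integer ticks


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for a >= 0 and b > 0."""
    return -(-a // b)


def transition_window(c_i: int, limit: int, hp_unchanged: List[Task],
                      old_ft: Task, new_ft: Task, x: int) -> Optional[int]:
    """Window length when the RR arrives x ticks after the task's activation.

    Returns None when the window grows past `limit` (the task's deadline).
    """
    c_old, t_old = old_ft
    c_new, t_new = new_ft
    w = c_i
    while True:
        demand = c_i
        # Unchanged higher-priority tasks interfere over the whole window.
        demand += sum(ceil_div(w, t) * c for c, t in hp_unchanged)
        # Old-mode subtask: jobs released before the RR, each charged in full.
        demand += ceil_div(min(x, w), t_old) * c_old
        # New-mode subtask: assumed to be released exactly at the RR.
        demand += ceil_div(max(w - x, 0), t_new) * c_new
        if demand > limit:
            return None
        if demand == w:
            return w
        w = demand


def worst_case_transition_response(c_i: int, deadline: int,
                                   hp_unchanged: List[Task],
                                   old_ft: Task, new_ft: Task) -> Optional[int]:
    """Largest window over all integer RR offsets, or None if a deadline is missed."""
    worst = 0
    for x in range(deadline + 1):
        w = transition_window(c_i, deadline, hp_unchanged, old_ft, new_ft, x)
        if w is None:
            return None
        worst = max(worst, w)
    return worst


# Lowest-priority task (3, 30) below an unchanged task (2, 4) and a
# hypothetical FT-enabled task with subtasks (1, 6) and (1, 3).
r_low = worst_case_transition_response(3, 30, [(2, 4)],
                                       old_ft=(1, 6), new_ft=(1, 3))
```

The sketch scans every integer offset for simplicity; restricting the scan to the offsets at which a ceiling term changes value, as the text suggests, would give the same maximum with fewer iterations.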
Schedulability of New-Mode Tasks With Transitional Scheduling
Overhead: In the new-mode, first, it is obvious that the
schedulability of each task with higher priority than the
FT-enabled task is not affected by the mode-change. So this case
has already been covered in the steady state schedulability
test [i.e., case (I)].
Secondly, we consider the HAC subtask, which is the newly
released task at the RR. It cannot be affected by the old-mode
HPC subtask in the new-mode, hence its schedulability has also
been covered by the steady state schedulability analysis
[i.e., case (I)].
Finally, we consider every new-mode unchanged task that has
lower priority than the FT-enabled task. As with the analysis
of old-mode tasks, we define a temporal window to enclose the
response time.
In the new-mode, the task can be affected by the new HAC
subtask. The interference is
(23)
Also, it can be affected by the unchanged tasks whose
priorities are higher. These interferences are
(24)
Summing up all the sources of interference, we get the
window size
(25)
Again, the recurrence calculation of the window size should be
performed until it converges or until the recurrence value
exceeds the task's deadline. Then the response time for the task
in the new-mode can be obtained as
(26)
The whole task set is schedulable even with random recoveries
if it passes the schedulability tests (I)-(III), i.e., it is
FT-schedulable for recovery. Similar tests can be done to ensure
the task set is still schedulable with the switch-back. Then we
can determine if the task set is FT-schedulable.
Fig. 7. Inverted pendulum (IP), showing the angular deviation
from the vertical position.
We implemented the schedulability test presented above.
Below we give a numerical example.
3) Example 1: Consider three tasks. Two of them are ordinary
real-time tasks with timing parameters (2, 4) and (3, 30),
respectively. The third is an FT-enabled task protected under
ORTEGA. Its timing parameters for the HPC are
. We vary the execution time of the HAC
over different values to simulate different versions of the HAC
design. For each execution time, we find the smallest period of
the HAC such that the whole task set is still schedulable based
on the schedulability analysis presented above. We refer to such
a period as the minimum HAC period. The results are shown in the
following:
• if , we have ;
• if , we have ;
• if , we have ;
• if , we have .
We have two observations here. First, the HAC's period can
be smaller than the HPC's period while all real-time tasks
under ORTEGA remain schedulable. Second, as the HAC's execution
time decreases, the minimum HAC period drops significantly. This
means the HAC can run at a much faster rate than the HPC during
the recovery when a fault occurs.
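The search for the smallest HAC period can be sketched as below. The sketch makes several assumptions that go beyond the text: it applies only the steady state response time test to the new-mode task set (the full test also includes the transitional cases), it places the FT-enabled task between the two ordinary tasks in priority, it tries integer periods only, and the function names are hypothetical. Only the (2, 4) and (3, 30) parameters come from the example above.

```python
from typing import List, Optional, Tuple

Task = Tuple[int, int]  # (execution time, period), deadline = period


def schedulable(tasks: List[Task]) -> bool:
    """Steady state response time test; tasks sorted by decreasing priority."""
    for i, (c_i, t_i) in enumerate(tasks):
        r = c_i
        while True:
            nxt = c_i + sum(-(-r // t) * c for c, t in tasks[:i])
            if nxt > t_i:
                return False     # this task misses its deadline
            if nxt == r:
                break            # converged within the period
            r = nxt
    return True


def smallest_hac_period(c_hac: int, max_period: int = 30) -> Optional[int]:
    """Smallest integer HAC period that keeps the new-mode set schedulable."""
    for period in range(c_hac, max_period + 1):
        # Assumed priority order: (2, 4) > FT-enabled task > (3, 30).
        if schedulable([(2, 4), (c_hac, period), (3, 30)]):
            return period
    return None


# Minimum periods for two hypothetical HAC execution times.
min_periods = {c: smallest_hac_period(c) for c in (1, 2)}
```

Under these assumptions a shorter HAC execution time yields a shorter minimum period, which matches the direction of the second observation above.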
V. IMPLEMENTATION AND EVALUATION OF ORTEGA
We implemented and evaluated ORTEGA on an inverted pendulum
(IP) testbed [37]. Fig. 7 explains the concept of IP. An IP
is a metal rod with one end hinged onto a cart and the other end
free. The cart moves back and forth along its track to keep the
rod (i.e., the IP) standing vertically, i.e., to keep the
angular deviation around 0 degrees (see Fig. 7).
The IP is inherently unstable. A couple of missed control
outputs are enough to make it fall, even when the angular
deviation and angular velocity are small. As a result, the fault
detection and recovery must be carried out in a timely and
predictable manner [21].
In our testbed, ORTEGA runs on a computer with a Pentium
II 350-MHz processor, a 66-MHz memory bus, and 32 KB of
level 1 cache memory on chip. The IP uses a Quadrature optical
encoder interface for sensing input and a digital to analog con-
verter for control output.
We implemented ORTEGA in C and ran it on Linux
kernel version 2.4.18-3 with RT scheduling and kernel preemption
enhancements. ORTEGA uses rate monotonic scheduling [38], a
fixed priority scheduling scheme widely supported
by contemporary systems and standards, including the POSIX
real-time extension [39].
TABLE I
EXECUTION STATISTICS FOR THE NONFAULTY HPC AND THE HAC
In our testbed, we use a field-tested state feedback controller
as the HAC. It was designed to be simple so that it can easily
be checked to be free of bugs. For the HPC, users can upload
arbitrary controllers dynamically to our testbed. These
controllers may contain faults or bugs. In fact, we selected
many faulty controllers as HPCs to evaluate the fault tolerance
performance of ORTEGA. We also provide a default nonfaulty HPC.
It serves two purposes. First, it is used as a benchmark in the
evaluation of the CPU resource savings under ORTEGA
(Section V-A). Second, it can be used as a controller template
into which users insert various bugs for testing (Section V-B).
The following are our evaluation results.
A. CPU Resource Savings of ORTEGA
In order to measure the CPU savings of ORTEGA compared
with the original Simplex, we collected the controllers’ running
temporal data from the testbed.
We accurately measure the nonfaulty HPC and the HAC con-
troller’s execution times by using the call [40]. Table I
shows the mean, variance, minimum, and maximum of execu-
tion times of the HPC and the HAC, respectively.
When the HPC and the HAC are running at the same fre-
quency, as we can see from Table I, the percentage of CPU
usage savings (in terms of controllers’ execution times) using
ORTEGA compared with that of using Simplex [cf. (1) and (2)]
is
where is the percentage of time used for recovery, which is
assumed to be small.
ORTEGA can run the HPC and the HAC at different rates. For
example, in our tests, the HPC runs at 33.3 Hz and the HAC
runs at 50 Hz. Considering the difference in control frequencies
of the HPC and the HAC, the percentage of CPU usage savings
(in terms of controllers' execution times) using ORTEGA
compared to Simplex is
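The arithmetic behind such a savings figure can be sketched as follows. The model here is an assumption made for illustration: Simplex is taken to execute both controllers in every control period at a single rate, while ORTEGA executes the HPC for the fraction of time spent in normal operation and the HAC for the remaining recovery fraction `rho`. The function names and the execution times used below are hypothetical; only the 33.3 Hz and 50 Hz rates come from the text.

```python
def simplex_load(c_hpc: float, c_hac: float, rate_hz: float) -> float:
    """CPU time per second when both controllers run every period (assumed model)."""
    return rate_hz * (c_hpc + c_hac)


def ortega_load(c_hpc: float, c_hac: float,
                hpc_rate_hz: float, hac_rate_hz: float, rho: float) -> float:
    """CPU time per second when only one controller runs at a time.

    rho is the fraction of time spent in recovery, i.e., under the HAC.
    """
    return (1.0 - rho) * hpc_rate_hz * c_hpc + rho * hac_rate_hz * c_hac


def savings(c_hpc: float, c_hac: float,
            hpc_rate_hz: float, hac_rate_hz: float, rho: float) -> float:
    """Fraction of controller CPU time saved relative to the assumed Simplex model."""
    baseline = simplex_load(c_hpc, c_hac, hpc_rate_hz)
    return 1.0 - ortega_load(c_hpc, c_hac, hpc_rate_hz, hac_rate_hz, rho) / baseline


# Hypothetical execution times in seconds, with a small recovery fraction.
same_rate = savings(1.0e-3, 0.6e-3, 33.3, 33.3, rho=0.01)
mixed_rate = savings(1.0e-3, 0.6e-3, 33.3, 50.0, rho=0.01)
```

With a small `rho`, the result is dominated by the ratio of the HAC execution time to the combined execution time, so running the HAC at a higher rate during recovery changes the savings only slightly.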
B. Fault Tolerance Performance Under ORTEGA
For any fault tolerance scheme, two important evaluation
criteria are the false negative ratio (i.e., type II errors) and
the false positive ratio (i.e., type I errors) [41]. To minimize
the false negative ratio, a good scheme needs to tolerate as
many faults as possible; to minimize the false positive ratio,
it should avoid classifying nonfaulty behaviors as faulty.
In order to evaluate the fault tolerance capabilities of
ORTEGA, we conducted extensive tests on our IP testbed. The
testing procedure uses various nonfaulty and faulty HPCs to
control the inverted pendulum, and checks whether ORTEGA
lets the nonfaulty HPCs keep control and handles the faulty
HPCs by switching to the HAC.
The way to introduce faulty HPCs is by inserting different
faults/bugs into the default nonfaulty HPC controller. Below
we list a subset of the benchmark faults/bugs tested against
ORTEGA. Not all faults/bugs of the benchmark are listed due
to the space limits of this paper. The benchmark is not
intended to be complete; rather, it is intended to show the broad
range of faults/bugs that ORTEGA can tolerate. According
to our results, ORTEGA tolerates all the faults/bugs in the
benchmark. When the HPC controlled system fails the decision
module’s test, ORTEGA replaces the HPC with the HAC.
Hence the inverted pendulum can keep running without falling
down. It is worth noting that some of the tests (such as the
“tricky design bug” discussed below) can also evaluate OR-
TEGA’s false positive ratio. A video demo of all the tests listed
here is available at http://www.cs.mcgill.ca/~xueliu/Demos/.
The video demo shows the details of the dynamic movements
of the cart and inverted pendulum before and after a bug was
introduced. It captures the negative effect imposed by each bug
on the pendulum, illustrates the transition when the HAC was
activated, and shows how the HAC can keep the pendulum
from falling down.
We also use other faulty HPCs to test the robustness of
ORTEGA. The tested faults/bugs include (but are not limited to)
the bang-bang (BangBang) bug, divide-by-zero (DivZero) bug,
infinite loop (InfLoop) bug, maximum control output (MaxCtrl)
bug, nonperforming (NonPerf) bug, positive feedback control
(PosFdbk) bug, and tricky designer (TrickyDsg) bug (see [23]
for more details).
For each of the above bugs, we ran ten trials using
ORTEGA and another ten trials using Simplex. In each trial, the
corresponding bug is injected at time 0 s, and we track the IP
angular deviation (see Fig. 7) to evaluate the performance of
ORTEGA and Simplex.
Fig. 8. Comparison on stabilization time cost.
Fig. 9. Comparison on peak deviation (i.e., the maximum angular deviation after bug injection).
Figs. 8 and 9 compare the statistics between ORTEGA and
Simplex using the results of our 140 trials. Fig. 8 compares
ORTEGA's and Simplex's performance on stabilization time cost.
Empirically, we can regard the IP as stabilized once the angular
deviation never again exceeds 1.5 degrees. Stabilization time
cost refers to the time from when the bug takes place until the
IP is stabilized. Fig. 9 compares ORTEGA's and Simplex's
performance on peak deviation, i.e., the maximum angular
deviation after the bug takes place.
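The two metrics can be computed from a recorded trial as sketched below. The trace format (a time-ordered list of `(time_s, angle_deg)` samples), the use of the absolute value of the angle, and the function names are assumptions made for this illustration; the 1.5 degree threshold and the injection at 0 s come from the text.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (time in seconds, angular deviation in degrees)


def stabilization_time(trace: List[Sample], injected_at: float = 0.0,
                       threshold_deg: float = 1.5) -> Optional[float]:
    """Time from bug injection until the deviation stays within the threshold.

    Returns None if the trace ends before the pendulum is stabilized.
    """
    after = [(t, a) for t, a in trace if t >= injected_at]
    last_violation = None
    for t, a in after:
        if abs(a) > threshold_deg:
            last_violation = t
    if last_violation is None:
        return 0.0  # the deviation never exceeded the threshold
    for t, _ in after:
        if t > last_violation:
            return t - injected_at  # first sample after the last excursion
    return None


def peak_deviation(trace: List[Sample], injected_at: float = 0.0) -> float:
    """Largest absolute angular deviation observed after bug injection."""
    return max(abs(a) for t, a in trace if t >= injected_at)


# Made-up trace, for illustration only.
trial = [(0.0, 0.2), (0.1, 3.0), (0.2, -2.0), (0.3, 1.0), (0.4, 0.5)]
t_stab = stabilization_time(trial)
peak = peak_deviation(trial)
```

Measuring from the last excursion beyond the threshold, rather than the first return inside it, matches the requirement that the deviation never exceeds the threshold again.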
According to Figs. 8 and 9, ORTEGA can effectively tolerate
all the faults/bugs in its 70 trials, and its performance is not
unacceptably worse than that of Simplex. Given the 38.34% saving
on CPU utilization compared to Simplex (see Section V-A), ORTEGA
qualifies as a useful scheme.
VI. CONCLUSIONS
In this paper we presented a new fault tolerance architecture,
ORTEGA, for real-time control systems. Similar to Simplex,
ORTEGA is reliable and achieves high fault coverage. Compared
with Simplex, ORTEGA allows more efficient resource utilization
and enhances flexibility. This is achieved by running the
high-assurance controller in an on-demand fashion instead of
running it in parallel. We implemented ORTEGA on an inverted
pendulum control testbed and carried out extensive benchmark
tests to evaluate the performance of ORTEGA. Results demonstrate
the efficiency and effectiveness of ORTEGA. We believe ORTEGA is
a promising real-time fault tolerance architecture and is
applicable to a wider range of industrial applications where
both efficient resource usage and high fault coverage are
desired.
In the future, we plan to make ORTEGA available to the in-
dustry and test its performance in more complex real-world de-
ployments.
REFERENCES
[1] National Research Council, Embedded, Everywhere: A Research
Agenda for Networked Systems of Embedded Computers, National
Academies Press, 2001.
[2] J. A. Stankovic, “Real-time computing systems: The next generation,”
in Tutorial: Hard Real-Time Systems, J. A. Stankovic and K. Ramam-
ritham, Eds. Washington, DC: IEEE Comput. Soc., 1998, pp. 14–37.
[3] G. Buttazzo, Hard Real-Time Computing Systems: Predictable Sched-
uling Algorithms and Applications, 2nd ed. New York: Springer,
2005.
[4] F. Cristian, “Understanding fault-tolerant distributed systems,”
Commun. ACM, vol. 34, no. 2, pp. 56–78, 1991.
[5] S. Graham, G. Baliga, and P. R. Kumar, “Issues in the convergence
of control with communication and computing: Proliferation, architec-
ture, design, services, and middleware,” in Proc. 43rd IEEE Conf. De-
cision and Control, Dec. 2004.
[6] L. Sha, “Dependable system upgrade,” in Proc. IEEE Real-Time Sys-
tems Symp. (IEEE Computer Society; RTSS ’98), 1998, p. 440.
[7] D. Seto, B. H. Krogh, L. Sha, and A. Chutinan, “Dynamic control
system upgrade using the simplex architecture,” IEEE Control System
Mag., vol. 18, no. 4, pp. 72–80, Aug. 1998.
[8] L. Sha, R. Rajkumar, and S. Sathaye, “Generalized rate monotonic
scheduling theory: A framework of developing real-time systems,”
Proc. IEEE, vol. 82, no. 1, pp. 68–82, Jan. 1994.
[9] D. Seto, E. Ferreira, and T. Marz, Case Study: Development of a
Baseline Controller for Automatic Landing of an F-16 Aircraft Using LMIs,
Carnegie Mellon Univ., Softw. Eng. Inst., 2000, Tech. Rep.
CMU/SEI-99-TR-020.
[10] D. Seto, J. P. Lehoczky, L. Sha, and K. G. Shin, “On task schedulability
in real-time control system,” in Proc. 17th IEEE Real-Time Systems
Symp., 1996, pp. 13–21.
[11] L. Sha, X. Liu, M. Caccamo, and G. Buttazzo, “Online control opti-
mization using load driven scheduling,” in Proc. Conf. Decision and
Control, Sydney, Australia, 2000.
[12] Software Fault Tolerance (Trends in Software, No. 3), M. Lyu, Ed.
New York: Wiley, 1995.
[13] A. Avizienis, “The methodology of n-version programming,” in Soft-
ware Fault Tolerance, M. R. Lyu, Ed. New York: Wiley, 1995.
[14] B. Randell and J. Xu, “The evolution of the recovery block concept,”
in Software Fault Tolerance. New York: Wiley, 1995.
[15] R. Ramakrishnan and J. Gehrke, Database Management Systems, 3rd
ed. New York: McGraw-Hill, 2003.
[16] D. Pradhan and N. Vaidya, “Roll-forward checkpointing scheme: A
novel fault-tolerant architecture,” IEEE Trans. Comput., vol. 43, no.
10, pp. 1163–1174, Oct. 1994.
[17] P. M. Melliar-Smith and L. E. Moser, “Progress in real-time fault
tolerance,” in Proc. 23rd IEEE Int. Symp. Reliable Distributed Systems
(SRDS’04), Florianópolis, Brazil, Oct. 2004, pp. 109–111.
[18] K. H. Kim, “Slow advances in fault-tolerant real-time distributed
computing,” in Proc. 23rd IEEE Int. Symp. Reliable Distributed Systems
(SRDS’04), Florianópolis, Brazil, Oct. 2004, pp. 106–108.
[19] H. Aydin, R. Melhem, and D. Mosse, “Optimal scheduling of imprecise
computation tasks in the presence of multiple faults,” in Proc. 7th
Int. Conf. Real-Time Computing Systems and Applications (RTCSA00),
2000.
[20] H. Aydin et al., “Tolerating faults while maximizing reward,” in Proc.
12th Euromicro Conf. Real-Time Systems (Euromicro’00), Stockholm,
Sweden, 2000.
[21] K. Lee and L. Sha, “Process resurrection: A fast recovery mechanism
for real-time embedded systems,” in Proc. 11th IEEE Real-Time and
Embedded Technology and Applications Symp., San Francisco, CA,
2005.
[22] L. Sha, “Using simplicity to control complexity,” IEEE Software, vol.
18, no. 4, pp. 20–28, Jul.-Aug. 2001.
[23] X. Liu, H. Ding, K. Lee, Q. Wang, and L. Sha, “ORTEGA: An efﬁcient
and ﬂexible software fault tolerance architecture for real-time control
systems,” in Proc. 20th Euromicro Conf. Real-Time Systems (ECRTS
2008), 2008.
[24] K. Tindell and A. Alonso, “A Very Simple Protocol for Mode Changes
in Priority Preemptive Systems,” Tech. Rep., Univ. Politecnica de
Madrid, Madrid, Spain, 1996.
[25] J. Liu, Real-Time Systems. Englewood Cliffs, NJ: Prentice-Hall,
2000.
[26] K. J. Astrom and B. Wittenmark, Computer-Controlled Systems:
Theory and Design, 3rd ed. Reading, MA: Addison-Wesley, 1994.
[27] M. Roozbehani, A. Megretski, and E. Feron, “Safety veriﬁcation of
iterative algorithms over polynomial vector ﬁelds,” in Proc. 45th IEEE
Conf. Decision and Control, 2006, pp. 6061–6067.
[28] M. Roozbehani et al., “Convex optimization proves software correct-
ness,” in Proc. 2005 Amer. Control Conf., 2005, vol. 2, pp. 1395–1400.
[29] D. Seto and L. Sha, An Engineering Method for Safety Region Devel-
opment, CMU SEI, 1999, Tech. Rep. 18.
[30] S. Boyd, L. E. Ghaoui, E. Feron, and V. Balakrishnan, Linear Matrix
Inequalities in System and Control Theory. Philadelphia, PA: SIAM,
1994.
[31] S.-P. Wu and S. Boyd, “Sdpsol: A parser/solver for sdp and maxdet
problems with matrix structure,” in Recent Advances in LMI Methods
for Control, L. E. Ghaoui and S.-I. Niculescu, Eds. Philadelphia, PA:
SIAM, 1999.
[32] J. Lofberg, “YALMIP : A toolbox for modeling and optimization in
matlab,” in Proc. 2004 IEEE Int. Symp. Computer Aided Control Sys-
tems Design, Taipei, Taiwan, 2004.
[33] J. Real and A. Crespo, “Mode change protocols for real-time systems:
A survey and a new proposal,” Real-Time Syst., vol. 26, pp. 161–197,
2004.
[34] P. Pedro and A. Burns, “Schedulability analysis for mode changes in
real-time systems,” in Proc. 10th Euromicro Workshop Real-Time Sys-
tems, Berlin, Germany, 1998.
[35] M. Joseph and P. Pandya, “Finding response times in a real-time
system,” Comput. J., vol. 29, no. 5, pp. 390–395, 1986.
[36] N. Audsley, A. Burns, M. Richardson, K. Tindell, and A. Wellings,
“Applying new scheduling theory to static priority preemptive sched-
uling,” Softw. Eng. J., vol. 8, no. 5, pp. 284–292, 1993.
[37] Quanser, 2008. [Online]. Available: http://www.quanser.com.
[38] C. L. Liu and J. W. Layland, “Scheduling algorithms for
multiprogramming in hard real time environment,” J. ACM, vol. 20, no. 1,
pp. 46–61, 1973.
[39] Real-Time System Services Working Group, IEEE Std 1003.1, 1998,
1996 ed.
[40] M. Rajagopalan, S. K. Debray, M. A. Hiltunen, and R. D. Schlichting,
“Proﬁle-directed optimization of event-based programs,” in Proc. SIG-
PLAN ’02 Conf. Programming Language Design and Implementation
(PLDI 02), 2002.
[41] P. Pelliccione, N. Guelfi, H. Muccini, and A. Romanovsky, Software
Engineering and Fault Tolerance. Singapore: World Scientific, 2007.
Xue Liu (M’06) received the B.S. degree in applied
mathematics and the M.Eng. degree in control theory
and applications from Tsinghua University, Beijing,
China, in 1996 and 1999, respectively, and the Ph.D.
degree in computer science from the University of
Illinois at Urbana-Champaign in 2006.
He is currently an Assistant Professor in the School
of Computer Science at McGill University, Montreal,
QC, Canada. His research interests include real-time
and embedded computing, performance and power
management of server systems, cyber-physical sys-
tems, real-time and embedded systems, fault tolerance, and control. He has au-
thored/coauthored more than 40 refereed publications in leading conferences
and journals in these ﬁelds.
Dr. Liu is a member of the ACM.
Qixin Wang (M’08) received the B.E. and M.E. de-
grees from the Department of Computer Science and
Technology, Tsinghua University, Beijing, China, in
1999 and 2001, respectively, and the Ph.D. degree
from the Department of Computer Science, Univer-
sity of Illinois at Urbana-Champaign in 2008.
He will join the Department of Computing in the
Hong Kong Polytechnic University in 2009 as an
Assistant Professor. His research interests include
real-time/embedded systems and networking, wire-
less technology, and their applications in industrial
control, medicine, and assisted living.
Dr. Wang is a member of the ACM.
Sathish Gopalakrishnan (M’06) received the Ph.D. degree in computer science
from the University of Illinois at Urbana-Champaign.
He is an Assistant Professor of electrical and computer engineering at the
University of British Columbia, Vancouver, BC, Canada. His research activities
encompass several aspects concerning the design and implementation of highly
reliable and predictable computer systems.
Dr. Gopalakrishnan is a member of the ACM.
Wenbo He (M’08) received the Ph.D. degree from
the Department of Computer Science, University of
Illinois at Urbana-Champaign in 2008.
She is currently an Assistant Professor in the Computer
Science Department of the University of New
Mexico, Albuquerque. Her research interests include
pervasive and ubiquitous computing, cyber-physical
systems, security, trust, and privacy.
Lui Sha (F’98) received the Ph.D. degree from
Carnegie Mellon University, Pittsburgh, PA, in 1985.
He is currently Donald B. Gillies Chair Professor
of Computer Science at the University of Illinois at
Urbana-Champaign. His research area is dependable
real-time and embedded systems.
Dr. Sha was elected a Fellow of ACM in 2005.
Hui Ding received the Ph.D. degree from the Department of Computer Science
at the University of Illinois at Urbana-Champaign in 2006.
Kihwal Lee received the Ph.D. degree from the Department of Computer Sci-
ence at the University of Illinois at Urbana-Champaign in 2006.