Fault recovery for real-time, multi-tasking computer system by Hess, Richard et al.
(12) United States Patent
Hess et al.
(54) FAULT RECOVERY FOR REAL-TIME,
MULTI-TASKING COMPUTER SYSTEM
(75) Inventors: Richard Hess, Glendale, AZ (US);
Gerald B. Kelly, Glendale, AZ (US);
Randy Rogers, Phoenix, AZ (US); Kent
A. Stange, Phoenix, AZ (US)
(73) Assignee: Honeywell International Inc.,
Morristown, NJ (US)
(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 776 days.
(21) Appl. No.: 11/058,764
(22) Filed:	 Feb. 16, 2005
(65)	 Prior Publication Data
US 2006/0195751 A1	 Aug. 31, 2006
(51) Int. Cl.
G06F 11/00	 (2006.01)
(52) U.S. Cl. 714/15
(58) Field of Classification Search 714/1-52
See application file for complete search history.
(10) Patent No.: US 7,971,095 B2
(45) Date of Patent: Jun. 28, 2011
(56) References Cited
U.S. PATENT DOCUMENTS
4,345,327 A 8/1982 Thuy
4,453,215 A 6/1984 Reid
4,751,670 A 6/1988 Hess
4,996,687 A * 2/1991 Hess et al. 714/15
5,086,429 A 2/1992 Gray et al.
5,313,625 A 5/1994 Hess et al.
5,550,736 A 8/1996 Hay et al.
5,732,074 A * 3/1998 Spaur et al. 370/313
5,757,641 A 5/1998 Minto
5,903,717 A 5/1999 Wardrop
5,909,541 A 6/1999 Sampson et al.
5,915,082 A 6/1999 Marshall et al.
5,949,685 A 9/1999 Greenwood et al.
6,058,491 A 5/2000 Bossen et al.
6,065,135 A 5/2000 Marshall et al.
6,115,829 A 9/2000 Slegel et al.
6,134,673 A 10/2000 Chrabaszcz
6,141,770 A 10/2000 Fuchs et al.
6,163,480 A 12/2000 Hess et al.
6,185,695 B1 2/2001 Murphy et al.
(Continued)
FOREIGN PATENT DOCUMENTS
EP	 0363863 4/1990
(Continued)
OTHER PUBLICATIONS
Lee, "Design and Evaluation of a Fault-Tolerant Multiprocessor
Using Hardware Recovery Blocks", Aug. 1982, pp. 1-19, Publisher:
University of Michigan Computing Research Laboratory, Published
in: Ann Arbor, MI.
(Continued)
Primary Examiner: Scott T Baderman
Assistant Examiner: Jigar Patel
(74) Attorney, Agent, or Firm: Fogg & Powers LLC
(57)	 ABSTRACT
System and methods for providing a recoverable real time
multi-tasking computer system are disclosed. In one embodi-
ment, a system comprises a real time computing environment,
wherein the real time computing environment is adapted to
execute one or more applications and wherein each applica-
tion is time and space partitioned. The system further com-
prises a fault detection system adapted to detect one or more
faults affecting the real time computing environment and a
fault recovery system, wherein upon the detection of a fault
the fault recovery system is adapted to restore a backup set of
state variables.
31 Claims, 6 Drawing Sheets
[Representative drawing (FIG. 3): recoverable computer platform 300, showing computer system 350 with processor 352, memory 356, monitor 354 and I/O port 358, together with recovery control logic, even and odd frame memories, duplicate memory 368 and a variable identity array. Drawing not reproduced.]
https://ntrs.nasa.gov/search.jsp?R=20110013148 2019-08-30T16:01:07+00:00Z
U.S. PATENT DOCUMENTS
6,189,112 B1 2/2001 Slegel et al.
6,279,119 B1 8/2001 Bissett
6,367,031 B1 4/2002 Yount
6,393,582 B1 5/2002 [inventor name illegible] et al.
6,467,003 B1 10/2002 Doerenberg et al.
6,560,617 B1 5/2003 Winger et al.
6,574,748 B1 5/2003 Andress et al.
6,600,963 B1 7/2003 Loire et al.
6,625,749 B1 9/2003 Quach
6,751,749 B2 6/2004 Hofstee et al.
6,772,368 B2 8/2004 Dhong et al.
6,789,214 B1 9/2004 De Monis-Hamelin et al.
6,813,527 B2 11/2004 Hess
6,990,320 B2 1/2006 LeCren
7,003,688 B1 2/2006 Pittelkow et al.
7,062,676 B2 6/2006 Shinohara et al.
7,065,672 B2 6/2006 Long et al.
7,178,050 B2 2/2007 Fung
7,320,088 B1 1/2008 Gawali et al.
7,334,154 B2 2/2008 Lorch et al.
7,401,254 B2 7/2008 Davies
2001/0025338 A1 * 9/2001 Zumkehr et al. 712/228
2002/0099753 A1 * 7/2002 Hardin et al. 709/1
2002/0144177 A1 10/2002 Kondo et al.
2003/0126498 A1 7/2003 Bigbee et al.
2003/0177411 A1 9/2003 Dinker et al.
2003/0208704 A1 11/2003 Bartels et al.
2004/0019771 A1 1/2004 Quach
2004/0098140 A1 5/2004 Hess
2004/0221193 A1 11/2004 Armstrong et al.
2005/0022048 A1 1/2005 Crouch
2005/0138485 A1 * 6/2005 Osecky et al. 714/48
2005/0138517 A1 6/2005 Monitzer
2006/0041776 A1 * 2/2006 Agrawal et al. 714/2
2006/0085669 A1 4/2006 Rostron et al.
2008/0016386 A1 1/2008 Dror et al.
FOREIGN PATENT DOCUMENTS
EP 0754990 1/1997
EP 1014237 A1 6/2000
EP 1014237 A1 6/2000
OTHER PUBLICATIONS
Racine, "Design of a Fault-Tolerant Parallel Processor", 2002, pp. 13.D.2-1-13.D.2-10, Publisher: IEEE, Published in: US.
Dolezal, "Resource Sharing in a Complex Fault-Tolerant System", 1988, pp. 129-136, Publisher: IEEE.
Ku, "Systematic Design of Fault-Tolerant Multiprocessors With Shared Buses", "IEEE Transactions on Computers", Apr. 1997, pp. 439-455, vol. 46, No. 4, Publisher: IEEE.
* cited by examiner
[Drawing sheets 1 to 6 of 6 (U.S. Patent, Jun. 28, 2011, US 7,971,095 B2): FIGS. 1A, 1B, 1C, 2, 3 and 4. Drawings not reproduced.]
FAULT RECOVERY FOR REAL-TIME,
MULTI-TASKING COMPUTER SYSTEM
GOVERNMENT LICENSE RIGHTS
The U.S. Government may have certain rights in the
present invention as provided for by the terms of Contract No.
NCC-1-393 NASA CRA awarded by NASA.
TECHNICAL FIELD
The present invention generally relates to multi-tasking
computer platforms and more specifically to fault detection
and recovery for software applications executing in real time
multi-tasking environments.
BACKGROUND
The automation of aircraft functions implemented in avionics systems, specifically flight critical systems, is migrating towards real-time multi-tasking computers. Rather than performing one aircraft function on a single computer, multiple functions, potentially of different criticality significance, are integrated into a single system. Flight critical display functions, but not flight critical control (for example, fly-by-wire) functions, have been implemented using multi-tasking computers. Another trend is that digital electronics built for consumer products are getting continually smaller. As the digital devices become smaller, it takes less energy to corrupt those devices by placing individual bits in an unintended state. Miniaturization has increased the susceptibility of computer electronics and processor hardware elements to various upset events. Miniaturization has reached the point where atmospheric neutrons now pose a threat of corrupting these devices, as do intense electromagnetic fields produced by environmental events such as lightning. In the military world, weapons that deliberately create high powered microwave threats are also a concern. Using only commercially available parts to build safety critical systems, it is difficult to design computer hardware which is immune from faults caused by these, as well as other, threats.
For the reasons stated above and for other reasons stated
below which will become apparent to those skilled in the art
upon reading and understanding the specification, there is a
need in the art for sufficiently robust systems and methods for
executing safety critical applications (such as those imple-
menting fly-by-wire functions) on real-time multi-tasking
computers that use commercially available parts.
SUMMARY
Embodiments of the present invention provide systems and methods for executing safety critical applications on real-time multi-tasking computers and will be understood by reading and studying the following specification.
In one embodiment, a recoverable real time multi-tasking
computer system is presented. The system comprises a real
time computing platform, wherein the real time computing
platform is adapted to execute one or more applications,
wherein each application is time and space partitioned. The
system further comprises a fault detection system adapted to
detect one or more faults affecting the real time computing
environment and a fault recovery system, wherein upon the
detection of a fault by the fault detection system, the fault
recovery system is adapted to restore a backup set of state
variables.
In another embodiment, another recoverable real time multi-tasking computer system is presented. The system comprises one or more applications and one or more processors. The one or more processors execute the one or more applications, wherein each application is time and space partitioned. The system further comprises one or more scratchpad memories, wherein the one or more processors store state variables for the one or more applications in the one or more scratchpad memories; one or more fault monitors, the one or more fault monitors adapted to detect one or more system faults occurring during the execution of a first application of the one or more applications; and a fault recovery system adapted to duplicate state variables that are stored in the one or more scratchpad memories. Upon the detection of a fault, the one or more fault monitors are further adapted to notify the fault recovery system, wherein the fault recovery system is further adapted to restore a backup set of state variables into the one or more scratchpad memories. The one or more processors are adapted to resume processing of the first application using the backup set of state variables.
In another embodiment, a method for fault recovery for applications executing on real time multi-tasking computer systems, wherein each application is time and space partitioned, is presented. The method comprises duplicating state variables for one or more computational frames; detecting a fault from an upset event within the computational frame in which the upset event occurred; and recovering state variable data duplicated during a computational frame prior to the upset event.
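The three steps of this method (duplicate, detect, recover) can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the `FaultDetected` exception stands in for a hardware fault monitor, and `run_frames`, `integrator` and the dictionary used as a scratchpad are hypothetical names introduced here.

```python
import copy


class FaultDetected(Exception):
    """Hypothetical signal raised when a fault monitor flags an upset event."""


def run_frames(step, scratchpad, num_frames):
    """Run `step` once per computational frame, restoring the last good state on a fault.

    Returns the list of frame indices in which a fault was detected.
    """
    backup = copy.deepcopy(scratchpad)  # duplicate of the last completed frame
    faulted = []
    for frame in range(num_frames):
        try:
            step(scratchpad, frame)  # the application updates its state in place
        except FaultDetected:
            # Fault detected within the frame: discard the possibly corrupted
            # state and recover the duplicate taken after the prior frame.
            scratchpad.clear()
            scratchpad.update(copy.deepcopy(backup))
            faulted.append(frame)
        else:
            # Frame completed cleanly: duplicate its state variables.
            backup = copy.deepcopy(scratchpad)
    return faulted


def integrator(state, frame):
    """Toy application: counts frames, with a simulated upset in frame 2."""
    if frame == 2:
        state["x"] = float("nan")  # simulated corruption of a state variable
        raise FaultDetected()
    state["x"] += 1.0


scratchpad = {"x": 0.0}
faulted_frames = run_frames(integrator, scratchpad, 4)
```

In this toy run the application loses only the work of the faulted frame and continues from the state duplicated one frame earlier.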
In yet another embodiment, a computer-readable medium having program instructions for a method for fault recovery for applications executing on real time multi-tasking computer systems wherein each application is time and space partitioned is presented. The method comprises duplicating state variables for one or more computational frames; detecting a fault from an upset event within the computational frame in which the upset event occurred; and recovering state variable data duplicated during a computational frame prior to the upset event.
In yet another embodiment, a rapid recovery mechanism for a self-checking lock-step computing lane including two or more processors, two or more scratchpad memories and two or more fault monitors, the self-checking lock-step computing lane adapted to execute one or more applications, wherein each application is time and space partitioned, wherein each application of the one or more applications is executed by the two or more processors during one or more computational frames, wherein the two or more fault monitors are further adapted to detect one or more system faults within the computational frame in which the fault occurred, is presented. The rapid recovery mechanism comprises a first duplicate memory adapted to store state variables duplicated from the one or more scratchpad memories; and a recovery control logic module adapted to receive fault detection signals from the two or more fault monitors. Upon the detection of a fault, the recovery control logic module is adapted to restore a backup set of state variables into the two or more scratchpad memories.
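The fault detection signals in a lock-step lane come from monitors that compare the outputs of processors executing the same instructions. A rough Python sketch of that comparison follows. It is an assumption-laden illustration: each clock's bus output is modeled as a hypothetical `(address, data, control)` tuple, and `first_miscompare` is a name introduced here, not a component of the patent.

```python
def first_miscompare(lane_a, lane_b):
    """Return the first clock index at which the two lanes' outputs differ, else None."""
    for clock, (out_a, out_b) in enumerate(zip(lane_a, lane_b)):
        if out_a != out_b:
            return clock  # outputs fail to correlate: a fault is identified
    return None


# Hypothetical per-clock bus outputs: (address, data, control).
lane_a = [(0x100, 0x0A, "RD"), (0x104, 0x0B, "WR"), (0x108, 0x0C, "WR")]
lane_b = list(lane_a)
lane_b[1] = (0x104, 0x0B ^ 0x01, "WR")  # simulated single flipped data bit

fault_clock = first_miscompare(lane_a, lane_b)
```

Because the comparison happens on every clock, the deviation is flagged in the same computational frame in which it arises, which is what lets the recovery logic restore a backup that is only one frame old.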
In still another embodiment, another recoverable real time multi-tasking computer system is presented. The system comprises means for executing two or more time and space partitioned software applications; means for detecting one or more faults affecting at least one of the two or more time and space partitioned software applications; and means for restoring a backup set of state variables upon the detection of a fault affecting the at least one of the two or more time and space partitioned software applications.
DRAWINGS
The present invention can be more easily understood and further advantages and uses thereof more readily apparent, when considered in view of the detailed description and the following figures in which:
FIG. 1A is a time line diagram illustrating the real time execution of applications on real-time multi-tasking computers of one embodiment of the present invention;
FIG. 1B is a time line diagram illustrating an upset event during the real time execution of applications on real-time multi-tasking computers of one embodiment of the present invention;
FIG. 1C is a time line diagram illustrating fault detection and state variable recovery of one embodiment of the present invention;
FIG. 2 is a block diagram illustrating a fault recovery system of one embodiment of the present invention;
FIG. 3 is a block diagram illustrating another fault recovery system of one embodiment of the present invention; and
FIG. 4 is a flow diagram illustrating a method of fault recovery of one embodiment of the present invention.
In accordance with common practice, the various described features are not drawn to scale but are drawn to emphasize features relevant to the present invention. Reference characters denote like elements throughout Figures and text.
DETAILED DESCRIPTION
Fast fault recovery is important in safety critical systems, such as avionic computer systems, which perform real time computations necessary to control or stabilize dynamic systems, such as aircraft in flight. Embodiments of the present invention increase a computer system's tolerance for faults by providing methods and systems that allow a very fast recovery from system faults.
Embodiments of the present invention have three elements. The first element involves a real time computing environment utilizing time and space partitioning. The second element provides fault detection. The third element provides fault recovery.
Computer systems implementing time and space partitioning are adept at supporting real time computing recovery capabilities. As provided by embodiments of the present invention, time and space partitioning when combined with state variable recovery provides a higher level of computational integrity than either achieves independently.
Real Time Computing Environment. Embodiments of the present invention employ high integrity computer systems utilizing time and space partitioning which allows hosting of multiple pieces of software on a single piece of hardware. Each piece of software is resident in hardware and can perform a multitude of computational functions including but not limited to operating systems, monitoring systems, and application programs.
Embodiments of the present invention can be used in safety critical applications such as a primary flight control application that must robustly execute in real time. Safety critical applications, such as a primary flight control application, must execute in real time to maintain the stability and control of an aircraft in flight and during landing. Typically, real time systems are designed to control physical devices (e.g. valves, servos, motors, heaters) that require timely processing to perform their designated task correctly. As used in this application, real time execution of applications refers to a computer system performing calculations at the current time based on current parameters. In one embodiment, current parameters include current inputs from sensors. A multi-tasking computer system is a computer system adapted to perform multiple tasks, also known as processes, using shared common processing resources. A multi-tasking computer system is adapted to execute two or more software applications simultaneously by scheduling computer processing resources between the two or more software applications. In one embodiment of the present invention, a multi-tasking computer system is adapted to schedule computer processing resources to support execution of at least one application in real time.
Embodiments of the present invention employ high integrity processing systems utilizing space partitioning. Accordingly, when multiple pieces of software are executed by a single hardware platform, it is problematic if the operation of one piece of software contaminates the operation of another piece of software running on the same platform. Thus when the same hardware platform is used to run both safety critical applications and other applications, care must be taken to prevent the contamination of a safety critical application by any other application.
Computer systems implementing time and space partitioning are adept at supporting real time computing recovery capabilities. Time and space partitioning of processor resources guarantees that one application will not corrupt the memory or execution space of any other application run in computational frames before or after it. No application can corrupt the timeline such that the application would overrun its processing time thus starving out the next application running in the next computational frame. As used in this application, the term computer system includes those elements of an overall system that perform processing or computational functions for the overall system. In one embodiment, the computer system is a subsystem integrated into the overall system.
FIG. 1A illustrates a normal execution timeline in a real time computing environment of one embodiment of the present invention. In the example illustrated in FIG. 1A, a single hardware platform is executing multiple applications. The processor cycles through each computational frame, executing applications only within their designated computational frame. For example, the processor executes Application 1 in computational frame 1-a in order to perform computations resulting in a set of state variables N. The processor then switches to performing applications 2, 3 and 4 in computational frames 2-a, 3-a and 4-a respectively, each producing their own sets of state variables N. Application 1 is again executed in frame 1-b to perform its next frame of computation resulting in the set of state variables N+1. FIG. 1A illustrates a multi-tasking hardware platform utilizing time and space partitioning. That is, each application is executed only during its own computational frame and separately stores state variables relevant to its computations.
FIG. 1B illustrates an upset event occurring during computational frame 1-b causing the corruption of Application 1's state variable set N. Because of time and space partitioning, the repercussions of the upset event are limited to affecting Application 1 because the processor will switch to executing Application 2 at the start of computational frame 2-b.
Although FIGS. 1A and 1B illustrate time and space partitioning with four applications, one skilled in the art upon reading this specification would appreciate that a computer system executing four applications is only presented as an example and is not a limitation of the present invention. Additionally, it would be understood by one skilled in the art upon reading this specification that software, such as Applications 1 to 4, executing on a computer system with time and space partitioning can include one or more pieces of operating system software, wherein one or more of the state variables illustrated in FIGS. 1A and 1B pertain to the state of the computer system itself. It would also be appreciated by one skilled in the art upon reading this specification that computational frames for one application, such as computational frames 1-a, 1-b and 1-c for Application 1, are not necessarily periodic or equal in time duration as computational frames for another application.
Fault Detection. In one embodiment, lock-step fault detection allows a system to detect upset events almost immediately. One example of lock-step fault detection is provided by the self-checking lock-step computing lane provided in U.S. Pat. No. 5,909,541.
Traditional lock step processing implies that two or more processors are executing the same instructions at the same time. Self-checking lock-step computing provides the cross feeding of signals from one processing lane to the other processing lane and then comparing them for deviations on every single clock edge. FIG. 2 illustrates one embodiment 200 of a self-checking lock-step computing lane 210 of one embodiment of the present invention. Self-checking lock-step computing lane 210 comprises at least two sets of duplicate processors (212 and 214), memories (220 and 222), and fault detection monitors (216 and 218). On every single system clock edge, monitors 216 and 218 both compare the data bus signal and control bus signal output of processors 212 and 214 against each other. When the output signals fail to correlate, monitors 216 and 218 identify a fault. This guarantees that if one processor deviates (e.g. because it retrieves a wrong address or is provided a wrong data bit) one or both of monitors 216 and 218 will detect the fault on the next clock edge. The fault is thus detected in the same computational frame in which it was generated. In one embodiment, when either monitor 216 or monitor 218 detects a fault, the monitor notifies processors 212 and 214. In embodiments of the present invention, upon notification of a fault, processors 212 and 214 shut off further processing of the application which was executing in the faulted computational frame and the fault recovery system is invoked.
Fault Recovery. Fault detection allows the recovery technology of the present invention to restore state variables in the event of an upset. The advantage for avionics systems is that a computer error is not propagated to the pilot level or the airplane motion level, but is detected quickly within the computational frame in which the error occurred. State variable data is typically the type of data that changes slowly relative to the processing speed of the hardware platform calculating the state variables. By restoring state variables which are only a relatively few computational frames old and restarting the processing element, the resulting computational results will contain only a negligible error due to the upset. In an embodiment where the affected application is a primary flight control application, aircraft response time is not jeopardized because the computations are restarted and recalculated in such a fast fashion.
FIG. 1C illustrates the same timeline as FIG. 1B with the addition of fault recovery as provided by embodiments of the present invention. In one embodiment, when a fault detection monitor, such as one of monitors 216 or 218, detects the fault affecting Application 1 during computational frame 1-b, the monitor notifies processors 212 and 214 to shut off processing of Application 1, and notifies recovery control logic 232. Meanwhile, the execution of unaffected Applications 2-4 continues during their assigned computational frames 2-b, 3-b and 4-b. Recovery control logic 232 invokes fault recovery which restores Application 1's state variables as they existed for Application 1 after the computational frame just before the upset event occurred. In the illustration provided in FIG. 1C, the upset event occurs in computational frame 1-b, corrupting Application 1's state variable set N. Fault recovery system 230 restores Application 1's state variables N-1 into memories 220 and 222. Execution of Application 1 then resumes in computational frame 1-c, using the last known uncorrupted set of state variables from frame 1-a. One distinct advantage of this embodiment of the present invention is that fault recovery system 230 only needs to maintain copies of state variable data sets that are one computational frame old.
In operation, in one embodiment, processors 212 and 214 hold state variables for applications in respective memories 220 and 222. The memory locations in memories 220 and 222 used by each application to store state variables as the applications are executed in their respective computational frame are referred to as "scratchpad memories". Fault recovery system 230 creates a duplicate copy of the state variables stored in memories 220 and 222, creating a repository of recent state variable data sets. Fault recovery system 230 stores off the state variables in real time, as processors 212 and 214 are executing and storing the state variables in memories 220 and 222.
In one embodiment, as state variable values are produced by processors 212 and 214 and stored in memories 220 and 222, there is a redundant copy made in duplicate memory 238. In one embodiment, duplicate memory 238 is contained in a highly isolated location to ensure the robustness of the data stored in duplicate memory 238. In one embodiment, duplicate memory 238 is protected from corruption by one or more of a metal enclosure, signal buffers (such as buffers 244 and 246) and power isolation.
In another embodiment, the redundant copy of state variables can be stored on a hardened memory device. As used in this application, a hardened memory device refers to a memory device which is itself inherently immune to corruption due to environmental factors.
In addition to protecting applications against the corruption of state variables, embodiments of the present invention further protect against undesirable consequences from applications that stall during their computational frame, or enter into infinite loops. For example, in one embodiment, if an application executing within its computational frame stalls and never completes its frame, this fault will be detected by one of monitors 216 or 218. In one embodiment, one of monitors 216 or 218 then notifies recovery control logic 232 to initiate a recovery.
One skilled in the art upon reading this specification would recognize that it is undesirable to load duplicate memory 238 with state variable data in situations where the system only partially completed a computing frame when the fault occurred. This is because duplicate memory 238 could end up storing corrupted data for that computing frame. Instead, to ensure that a complete valid frame of state variable data is in the duplicate memory and available for restoration, embodiments of the present invention provide intermediate memories. In one embodiment, a duplicate of memories 220 and 222 for even computational frames is loaded into even frame memory 234. A duplicate of memories 220 and 222 for odd computational frames is loaded into odd frame memory 236. The even frame memory 234 and odd frame memory 236 toggle back and forth copying data into the duplicate memory 238 to ensure that a complete valid backup memory is maintained. Even frame memory 234 and odd frame memory 236 will only copy their contents to duplicate memory 238 if the intermediate memories themselves contain a complete valid state variable backup for a computing frame that successfully completes its execution.
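The role of the even and odd intermediate memories can be sketched in Python as follows. This is a simplified software model under stated assumptions, not the hardware described above: `FrameBackup` and its methods are hypothetical names, memories are modeled as dictionaries keyed by variable name, and clearing an intermediate memory at the start of its frame is a design choice of this sketch rather than something the text specifies.

```python
class FrameBackup:
    """Toy model of even/odd intermediate memories feeding a duplicate memory."""

    def __init__(self, initial_state):
        self.duplicate = dict(initial_state)  # last complete, valid backup
        self.intermediate = [{}, {}]          # [even frame memory, odd frame memory]

    def begin_frame(self, frame):
        # Assumption of this sketch: each intermediate memory holds only the
        # writes made during its current frame.
        self.intermediate[frame % 2].clear()

    def record(self, frame, name, value):
        # Mirror a scratchpad write into the intermediate memory for this frame.
        self.intermediate[frame % 2][name] = value

    def end_frame(self, frame, completed):
        # Copy to the duplicate memory only if the frame finished successfully,
        # so a partially completed frame can never contaminate the backup.
        if completed:
            self.duplicate.update(self.intermediate[frame % 2])
        return completed

    def restore(self):
        return dict(self.duplicate)


backup = FrameBackup({"altitude": 100})

backup.begin_frame(0)
backup.record(0, "altitude", 101)
backup.end_frame(0, completed=True)    # even frame commits

backup.begin_frame(1)
backup.record(1, "altitude", -1)       # write made during a faulted frame
backup.end_frame(1, completed=False)   # odd frame is discarded

restored = backup.restore()
```

The value written during the faulted odd frame never reaches the duplicate memory, so the restored state is the one committed by the preceding even frame.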
In another embodiment, during the normal computer ini-
tialization sequence of computer system 210, duplicate
memory 238, even frame memory 234 and odd frame
memory 236 are each adapted to copy all state variables
written to memories 220 and 222 by processors 212 and 214
in order to set the initial state variables saved in all memories
to the same condition. In one embodiment, after initialization
the alternating operation of even frame memory 234 and odd
frame memory 236 begins as described above.
In one embodiment, fault recovery system 230 also
includes variable identity array 242 which provides for the
efficient use of memory storage. In one embodiment, instead
of creating backup copies of every state variable for every
application, variable identity array 242 identifies a subset of
predefined state variables which allows recovery control
logic 232 to back up only those state variables desired for
certain applications into duplicate memory 238. In one
embodiment, only state variables for predefined applications
are included in the predefined subset of state variables that are
duplicated into duplicate memory 238. In one embodiment,
variable identity array 242 contains predefined state variable
locations on an address-by-address basis. In one embodi-
ment, variable identity array 242 allows only the desired state
variable data to load into the intermediate memories.
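As a rough illustration of address-by-address filtering, the Python sketch below mirrors a write into an intermediate memory only when its address appears in a predefined set. The addresses, the `mirror_write` helper and the use of a `frozenset` are assumptions made for this example; the text does not describe how the variable identity array is encoded.

```python
def mirror_write(identity_addresses, intermediate, address, value):
    """Mirror a scratchpad write only if its address is a predefined state variable."""
    if address in identity_addresses:
        intermediate[address] = value
        return True
    return False  # not a tracked state variable: no backup copy is made


# Hypothetical addresses of the state variables selected for backup.
identity_addresses = frozenset({0x2000, 0x2004})
intermediate = {}

writes = [(0x2000, 11), (0x3000, 22), (0x2004, 33)]
mirrored = [mirror_write(identity_addresses, intermediate, a, v) for a, v in writes]
```

Only the two tracked addresses end up in the intermediate memory, which is how a filter of this kind would keep the backup storage smaller than the full scratchpad.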
When recovery control logic 232 is notified of a detected
fault, recovery control logic 232 retrieves the duplicate state
variables for an upset application from duplicate memory 238
and restores those state variables into the upset application's
scratchpad memory area of memories 220 and 222. In one
embodiment, once the duplicate state variables are restored
into memories 220 and 222, recovery control logic 232 noti-
fies monitors 216 and 218 and processors 212 and 214 resume
execution of the upset application using the restored state
variables.
In another embodiment of the present invention, monitors
216 and 218 are adapted to notify the faulted application of
the occurrence of a fault, instead of notifying recovery control
logic 232. In operation, in one embodiment, upon detection of
a fault affecting an application, the monitor notifies proces-
sors 212 and 214 which shut off processing of the upset
application. On the upset application's next processing
frame, at least one of processors 212 and 214 notifies the
faulted application of the occurrence of the fault. In one
embodiment, upon notification of the fault, the upset appli-
cation is adapted to request the recovery of state variables by
notifying recovery control logic 232. In one embodiment,
once the duplicate state variables are restored into memories
220 and 222, recovery control logic 232 notifies monitors 216
and 218 and processors 212 and 214 resume execution of the
upset application using the restored state variables.
It will be appreciated by one skilled in the art upon
reading this specification that the present invention is not
limited only to embodiments with self-checking lock-step
computing lanes. In other embodiments the recovery system
of the present invention can be adapted to accommodate
slower fault detection systems, which may allow several com-
putational frames to elapse before they can identify a fault
condition. In those circumstances, the duplicate memory is
adapted to hold not only the state variables of the most recent
computing frame, but also state variables for one or more
previous computing frames. In one embodiment, the recovery
system is adapted to restore the N-z backup frame state variables
to the scratchpad memory when the fault detection
system is known to take up to z computation frames to detect
a fault.
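One way to picture the N-z restore is a bounded history of per-frame backups. The following is a minimal sketch under assumptions, not the patented mechanism: the class `FrameHistory`, its methods, and the use of a dictionary for the state variables are all hypothetical, and the history depth of z + 1 frames is chosen only so that frame N-z is still retained when the fault is reported in frame N.

```python
from collections import deque


class FrameHistory:
    """Hypothetical duplicate memory that retains the last z + 1 frame backups."""

    def __init__(self, z):
        self.z = z
        # Each entry is (frame_number, state); the oldest entry is dropped automatically.
        self.snapshots = deque(maxlen=z + 1)

    def commit(self, frame, state):
        """Store a copy of the state variables produced in the given frame."""
        self.snapshots.append((frame, dict(state)))

    def restore(self, detected_in_frame):
        """Return the backup for frame N - z, where N is the frame of detection."""
        target = detected_in_frame - self.z
        for frame, state in self.snapshots:
            if frame == target:
                return dict(state)
        raise LookupError(f"no backup retained for frame {target}")


history = FrameHistory(z=2)
for frame in range(6):                 # frames 0..5
    history.commit(frame, {"x": frame * 10})

restored = history.restore(detected_in_frame=5)   # backup from frame 3
```

A detector with a longer worst-case latency simply needs a larger z, at the cost of more duplicate memory.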
FIG. 3 illustrates another embodiment of a recoverable
computer platform 300 of one embodiment of the present
invention. Computer system 350 includes one or more processors
352, memory 356 and fault monitor 354. For state
variable values produced by processors 352 and stored in
memory 356, there is a redundant copy made in duplicate
memory 368. In one embodiment, a duplicate of memory 356
for even computational frames is loaded into even frame
memory 364 and a duplicate of memory 356 for odd computational
frames is loaded into odd frame memory 366. As
described in the embodiment for FIG. 2, even frame memory
364 and odd frame memory 366 toggle back and forth copying
data into duplicate memory 368. Thus duplicate memory
368 always contains a backup of state variables for the most
recent non-faulted computational frame. In one embodiment,
fault recovery mechanism 360 comprises one or more duplicate
memories 370, in which are maintained valid state variable
data sets for one or more computational frames previous
to the most recent computational frame. When recovery control
logic 362 is notified of a fault by monitor 354, fault
recovery mechanism 360 restores the z'th frame prior state
variable data set into memory 356, when monitor 354 is
known to take up to z computational frames to detect a fault.
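The even/odd toggling can be sketched as follows. This is a simplified illustration rather than the hardware behavior: the class `ToggleBackup` and its method names are hypothetical, memories are modeled as dictionaries, and the copy into the duplicate memory is performed at the end of the frame instead of overlapping with the next frame.

```python
class ToggleBackup:
    """Hypothetical model of even/odd frame memories feeding one duplicate memory."""

    def __init__(self):
        self.frame_memory = {0: {}, 1: {}}   # 0 = even frame memory, 1 = odd frame memory
        self.duplicate = {}                  # last non-faulted frame's state variables

    def end_of_frame(self, frame, memory, fault_detected):
        """Stage this frame's state; promote it to the duplicate only if no fault."""
        staging = self.frame_memory[frame % 2]
        staging.clear()
        staging.update(memory)
        if not fault_detected:
            self.duplicate = dict(staging)

    def recover(self):
        """Return the state variables of the most recent non-faulted frame."""
        return dict(self.duplicate)


backup = ToggleBackup()
backup.end_of_frame(0, {"alt": 1000}, fault_detected=False)
backup.end_of_frame(1, {"alt": 1001}, fault_detected=False)
backup.end_of_frame(2, {"alt": -1}, fault_detected=True)   # upset during frame 2
recovered = backup.recover()
```

Because the faulted frame's data stays in its staging memory and is never promoted, the duplicate memory in this sketch still holds the frame 1 values after the upset.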
In another embodiment of the present invention, one or
more externally located fault detection monitors, such as
monitor 319, are adapted to identify one or more faults affecting
one or more applications executing on computer system
350 and notify recovery control logic 362 to initiate a fault
recovery as described above. In one embodiment, monitor
319 monitors and communicates with computer system 350
via one or more input/output ports 358.
In one embodiment, fault recovery system 360 also
includes variable identity array 372 which provides for the
efficient use of memory storage. In one embodiment, instead
of creating backup copies of every state variable for every
application, variable identity array 372 identifies predefined
state variables, which allows fault recovery mechanism 360 to
back up only those state variables desired for certain applications.
In one embodiment, variable identity array 372 contains
predefined state variable locations on an address-by-address
basis. In one embodiment, variable identity array 372
allows only the desired state variable data to load into the
intermediate memories.
FIG. 4 provides a flow chart illustrating a method 400 of
one embodiment of the present invention. The method comprises
duplicating state variables for a computational frame
(410); detecting a fault within a computational frame of an
upset event (420); and recovering state variable data from a
computational frame prior to the upset event (430). In other
embodiments, the method further comprises halting the
execution of an application affected by an upset event (425)
and resuming processing after recovering state variable data
(435). When processing is restarted, the processor is able to
resume calculations at a point very close to where the disruption
occurred.
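The steps of method 400 can be traced in a small software sketch. It is offered only as an illustration under assumptions: the function `run_frames`, the `step` and `detect_fault` callbacks, and the single-frame backup are hypothetical, and the fault is assumed to be detected within the frame in which the upset occurs.

```python
def run_frames(step, detect_fault, state, num_frames):
    """Run `step` once per frame, rolling back to the last good state on a fault."""
    backup = dict(state)                      # 410: duplicate state variables
    log = []
    for frame in range(num_frames):
        candidate = step(frame, dict(state))
        if detect_fault(frame, candidate):    # 420: fault detected in this frame
            # 425: halt the affected application by discarding this frame's results
            state = dict(backup)              # 430: recover the prior frame's state
            log.append((frame, "recovered"))
            # 435: processing resumes on the next loop iteration
        else:
            state = candidate
            backup = dict(state)              # 410: duplicate the new good state
            log.append((frame, "ok"))
    return state, log


def step(frame, s):
    s["count"] += 1
    if frame == 2:
        s["count"] = -999                     # simulated upset in frame 2
    return s


def detect_fault(frame, s):
    return s["count"] < 0


final_state, log = run_frames(step, detect_fault, {"count": 0}, 4)
```

In this run the application loses only the work of the faulted frame and continues from the value it held at the end of frame 1.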
Several means are available to implement the fault recovery
systems and methods of the current invention. These
means include, but are not limited to, digital computer systems,
programmable controllers, or field programmable gate
arrays. Therefore other embodiments of the present invention
are program instructions resident on computer readable
media which, when implemented by such controllers, enable
the controllers to implement embodiments of the present
invention. Computer readable media include any form of
computer memory, including but not limited to punch cards,
magnetic disk or tape, any optical data storage system, flash
read only memory (ROM), non-volatile ROM, program-
mable ROM (PROM), erasable-programmable ROM
(E-PROM), random access memory (RAM), or any other
form of permanent, semi-permanent, or temporary memory
storage system or device. Program instructions include, but
are not limited to computer-executable instructions executed
by computer system processors and hardware description
languages such as Very High Speed Integrated Circuit (VH-
SIC) Hardware Description Language (VHDL).
Embodiments of the present invention do not preclude
other fault detection and recovery methods for a computer
system from being utilized.
Although specific embodiments have been illustrated and
described herein, it will be appreciated by those of ordinary
skill in the art that any arrangement, which is calculated to
achieve the same purpose, may be substituted for the specific
embodiment shown. This application is intended to cover any
adaptations or variations of the present invention. Therefore,
it is manifestly intended that this invention be limited only by
the claims and the equivalents thereof.
What is claimed is:
1. A recoverable real time multi-tasking computer system
comprising:
a real time avionics computing platform adapted to execute
two or more avionics applications simultaneously,
wherein each avionics application is time and space
partitioned;
a fault detection system adapted to detect one or more
faults affecting the real time avionics computing plat-
form; and
a fault recovery system, wherein upon the detection of a
fault by the fault detection system, the fault recovery
system is adapted to restore a duplicate set of state vari-
ables, wherein the fault recovery system is further
adapted to:
store, duplicate, and recover only selected state variables
from one or more frame times; and
recover state variables pertaining to any one or more of
the avionics applications simultaneously;
wherein the fault recovery system operates without any
involvement from the avionics applications, and
wherein when a recovery of the one or more avionics
applications occurs, the other avionics applications con-
tinue to operate without disturbance.
2. The system of claim 1, wherein each application of the
two or more avionics applications is executed by the real time
avionics computing platform during one or more computa-
tional frames, wherein the fault detection system is further
adapted to detect the one or more faults.
3. The system of claim 2, wherein the fault recovery system
is further adapted to restore the duplicate set of state variables
from a computational frame occurring more than one frame
before the computational frame in which the fault occurred.
4. The system of claim 3, wherein the fault recovery system
is further adapted to restore a duplicate set of state variables
from a computational frame occurring one computational
frame before the computational frame in which the fault
occurred.
5. The system of claim 1, wherein the fault recovery system
further comprises:
a first duplicate memory;
an even frame memory, wherein the fault recovery system
is adapted to duplicate state variables computed by the
real time avionics computing platform during even com-
putational frames into the even frame memory; and
an odd frame memory, wherein the fault recovery system is
adapted to duplicate state variables computed by the real
time avionics computing platform during odd computa-
tional frames into the odd frame memory;
wherein, the even frame memory and odd frame memory
toggle back and forth duplicating state variables into the
first duplicate memory for computational frames in
which no fault was detected by the fault detection sys-
tem.
6. The system of claim 5, wherein the even frame memory
and odd frame memory are further adapted to not duplicate
into the first duplicate memory state variables for computational
frames in which a fault was detected by the fault detection
system.
7. The system of claim 5, wherein the first duplicate
memory, the even frame memory and the odd frame memory
are further adapted to duplicate state variables computed by
the real time avionics computing platform during initialization
of the real time avionics computing platform.
8. The system of claim 5, further comprising:
a second duplicate memory, wherein the fault recovery
system stores duplicate sets of state variables for a plu-
rality of computational frames.
9. The system of claim 5, wherein the first duplicate
memory is protected from corruption due to environmental
factors by one or more of shielding from a metal enclosure,
signal buffers, isolated power supplies and hardened memory.
10. The system of claim 1, the fault recovery system further
comprising:
a variable identity array, adapted to identify a predefined
subset of state variables, wherein the fault recovery sys-
tem duplicates only the subset of state variables.
11. A recoverable real time multi-tasking computer system
comprising:
two or more avionics applications;
an avionics computing platform comprising one or more
processors, the one or more processors executing the
two or more avionics applications simultaneously,
wherein each application is time and space partitioned;
one or more scratchpad memories, wherein the one or more
processors store state variables for the two or more avi-
onics applications in the one or more scratchpad memo-
ries;
one or more fault monitors, the one or more fault monitors
adapted to detect one or more system faults occurring
during the execution of a first application of the two or
more avionics applications; and
a fault recovery system adapted to duplicate state variables
stored in the one or more scratchpad memories, wherein
the fault recovery system is further adapted to:
store, duplicate, and recover only selected state variables
from one or more frame times; and
recover state variables pertaining to any one or more of
the avionics applications simultaneously;
wherein the fault recovery system operates without any
involvement from the avionics applications,
wherein upon the detection of a fault, the fault recovery
system is further adapted to restore a duplicate set of
state variables into the one or more scratchpad memories,
wherein the one or more processors are adapted to resume
processing of the first application using the duplicate set
of state variables, and
wherein when a recovery of the one or more avionics
applications occurs, the other avionics applications con-
tinue to operate without disturbance.
12. The system of claim 11, wherein upon the detection of
the fault, the one or more fault monitors are further adapted to
notify the fault recovery system.
13. The system of claim 11, wherein upon the detection of
the fault, the one or more fault monitors are further adapted to
notify a first application affected by the fault, wherein the first
application is adapted to notify the fault recovery system.
14. The system of claim 11, wherein each application of the
two or more avionics applications is executed by the one or
more processors during one or more computational frames,
wherein the one or more fault monitors are further adapted to
detect one or more system faults within the computational
frame in which the fault occurred.
15. The system of claim 14, the fault recovery system
further comprising:
a first duplicate memory;
an even frame memory, wherein the fault recovery system
is adapted to duplicate state variables stored in the one or
more scratchpad memories during even computational
frames into the even frame memory; and
an odd frame memory, wherein the fault recovery system is
adapted to duplicate state variables stored in the one or
more scratchpad memories during odd computational
frames into the odd frame memory;
wherein the even frame memory and odd frame memory
toggle back and forth duplicating state variables into the
first duplicate memory for computational frames in
which no fault was detected by the one or more fault
monitors.
16. The system of claim 15, further comprising:
a second duplicate memory, wherein the fault recovery
system stores duplicate sets of state variables for a plurality
of computational frames.
17. A method for fault recovery, the method comprising:
executing a plurality of avionics applications simultaneously
on a real time multi-tasking avionics computer
system wherein each avionics application is time and
space partitioned;
duplicating state variables for one or more computational
frames;
detecting a fault from an upset event within the computational
frame of one of the applications in which the upset
event occurred;
recovering state variable data duplicated during a computational
frame prior to the upset event; and
restoring the duplicated state variable data to a computational
frame of the one of the applications that occurs
immediately after the computational frame in which the
upset event occurred, wherein the duplicated state variable
data is restored without any involvement from the
avionics applications, and wherein during recovery of
the one of the applications, the other applications continue
to operate without disturbance.
18. The method of claim 17, further comprising:
halting the execution of an application affected by the upset
event; and
resuming processing the application affected by the upset
event after recovering state variable data.
19. The method of claim 17, wherein duplicating state
variables for one or more computational frames further comprises:
duplicating state variables from an even computational
frame into a first memory;
duplicating state variables from an odd computational
frame into a second memory; and
alternately duplicating state variables from the first
memory and the second memory into a third memory.
20. The method of claim 19, wherein recovering state variable
data from the computational frame duplicated prior to
the upset event further comprises:
duplicating state variables from the third memory into one
or more scratchpad memories.
21. A computer-readable medium having program instructions
for a method for fault recovery, the method comprising:
executing a plurality of avionics applications simultaneously
on a real time multi-tasking avionics computer
system wherein each avionics application is time and
space partitioned;
duplicating state variables for one or more computational
frames;
detecting a fault from an upset event within the computational
frame of one of the applications in which the upset
event occurred;
recovering state variable data duplicated during a computational
frame prior to the upset event; and
restoring the duplicated state variable data to a computational
frame of the one of the applications that occurs
immediately after the computational frame in which the
upset event occurred, wherein the duplicated state variable
data is restored without any involvement from the
avionics applications, and wherein during recovery of
the one of the applications, the other applications continue
to operate without disturbance.
22. The computer-readable medium of claim 21, the
method further comprising:
halting the execution of an application affected by the upset
event; and
resuming processing the application affected by the upset
event after recovering state variable data.
23. The computer-readable medium of claim 21, wherein
duplicating state variables for one or more computational
frames further comprises:
duplicating state variables from an even computational
frame into a first memory;
duplicating state variables from an odd computational
frame into a second memory; and
alternately duplicating state variables from the first
memory and the second memory into a third memory.
24. The computer-readable medium of claim 23, wherein
recovering state variable data from the computational frame
duplicated prior to the upset event further comprises:
duplicating state variables from the third memory into one
or more scratchpad memories.
25. A system comprising:
a self-checking lock-step avionics lane including two or
more processors;
two or more scratchpad memories and two or more fault
monitors, the self-checking lock-step avionics lane
adapted to execute two or more avionics applications
simultaneously, wherein each application is time and
space partitioned, wherein each application of the two or
more avionics applications is executed by the two or
more processors during one or more computational
frames, wherein the two or more fault monitors are further
adapted to detect one or more system faults within
the computational frame in which the fault occurred;
a rapid recovery mechanism comprising:
a first duplicate memory adapted to store state variables
duplicated from the two or more scratchpad memories; and
a recovery control logic module adapted to receive fault
detection signals from the two or more fault monitors;
wherein the rapid recovery mechanism is further adapted
to:
store, duplicate, and recover only selected state variables
from one or more frame times; and
recover state variables pertaining to any one or more of
the avionics applications simultaneously;
wherein the rapid recovery mechanism is further adapted
to:
store, duplicate, and recover only selected state variables
from one or more frame times; and
recover state variables pertaining to any one or more of
the avionics applications simultaneously;
wherein the rapid recovery mechanism operates without
any involvement from the avionics applications,
wherein upon the detection of a fault, the recovery control
logic module is adapted to restore a duplicate set of state
variables into the two or more scratchpad memories, and
wherein when a recovery of the one or more avionics
applications occurs, the other avionics applications con-
tinue to operate without disturbance.
26. The system of claim 25, wherein the recovery control
logic module is further adapted to restore a duplicate set of
state variables from a computational frame occurring more
than one frame before the computational frame in which the
fault occurred.
27. The system of claim 26, the rapid recovery mechanism
further comprising:
an even frame memory adapted to duplicate state variables
stored in the two or more scratchpad memories during
even computational frames into the even frame memory;
and
an odd frame memory adapted to duplicate state variables
stored in the two or more scratchpad memories during
odd computational frames into the odd frame memory;
wherein the even frame memory and odd frame memory
toggle back and forth duplicating state variables into the
first duplicate memory for computational frames in
which no fault was detected by the two or more fault
monitors.
28. The system of claim 27, wherein the even frame
memory and odd frame memory are further adapted to discard
state variables for computational frames in which a fault
was detected by the one or more fault monitors.
29. A recoverable real time multi-tasking computer system
comprising:
means for executing two or more time and space parti-
tioned avionics applications simultaneously;
means for detecting one or more faults affecting at least one
of the two or more time and space partitioned avionics
applications; and
means for restoring a duplicate set of selected state vari-
ables upon the detection of a fault affecting the at least
one of the two or more time and space partitioned avi-
onics applications;
wherein the means for restoring operates without any
involvement from the avionics applications, and
wherein when a recovery of the one or more avionics
applications occurs, the other avionics applications con-
tinue to operate without disturbance.
30. The system of claim 29, wherein the means for restor-
ing a duplicate set of state variables further comprises:
a first means for storing state variables;
a second means for storing state variables adapted to dupli-
cate state variables computed during even computa-
tional frames; and
a third means for storing state variables adapted to dupli-
cate state variables computed during odd computational
frames;
wherein the second means for storing state variables and
the third means for storing state variables toggle back
and forth duplicating state variables into the first means
for storing state variables for computational frames in
which no fault was detected;
wherein the means for restoring a duplicate set of state
variables is further adapted to restore the state variables
from the first means for storing state variables.
31. The system of claim 5, wherein the first duplicate
memory, the even frame memory, and the odd frame memory
are adapted to duplicate state variables computed by a real
time operating system of the real time avionics computing
platform.
