The implementation and use of Ada on distributed systems with high reliability requirements by Knight, J. C. et al.
Produced by the NASA Center for Aerospace Information (CASI) 
https://ntrs.nasa.gov/search.jsp?R=19840014160 2020-03-20T23:57:14+00:00Z
A Final Report
THE IMPLEMENTATION AND USE OF ADA ON DISTRIBUTED SYSTEMS
WITH HIGH RELIABILITY REQUIREMENTS
Submitted to:
National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia	 23665
Submitted by:
John C. Knight
Samuel T. Gregory
John I. A. Urquhart
Report No. UVA/528213/AMCS84/104
February 1984
(NASA-CR-175413) THE IMPLEMENTATION AND USE OF ADA ON DISTRIBUTED
SYSTEMS WITH HIGH RELIABILITY REQUIREMENTS (Virginia Univ.) 174 p
HC A08/MF A01 CSCL 09B
N84-22228 Unclas G3/60 00621
SCHOOL OF ENGINEERING AND
APPLIED SCIENCE
DEPARTMENT OF APPLIED MATHEMATICS
AND COMPUTER SCIENCE
UNIVERSITY OF VIRGINIA
CHARLOTTESVILLE, VIRGINIA 22901
The Implementation and Use of Ada On Distributed Systems
With High Reliability Requirements
Final Report
John C. Knight
Samuel T. Gregory
John I. A. Urquhart
Department of Applied Mathematics and Computer Science
University of Virginia
Charlottesville
Virginia, 22901
February 1984
CONTENTS
1 Introduction
2 Hardware Configurations
  2.1 Physical Multicomputers
  2.2 Logical Multicomputers
3 Software System Design
4 Ada Subset Translator
5 Language Issues
6 Sequencer
Appendix 1 -- Ada Virtual Processor Hardware Manual
Appendix 2 -- Paper Presented To AIAA Conference
Appendix 3 -- First Paper Submitted To FTCS14
Appendix 4 -- Second Paper Submitted To FTCS14
Appendix 5 -- General Paper In Preparation

1. INTRODUCTION

The purpose of this grant is to investigate the use and
implementation of Ada (a trademark of the US Dept. of Defense) in
distributed environments in which the hardware components are assumed to
be unreliable. In particular, we are concerned with the possibility
that a distributed system may be programmed entirely in Ada so that the
individual tasks of the system are unconcerned with which processors
they are executing on, and that failures may occur in the underlying
hardware.

Over the next decade, it is expected that many aerospace systems
will use Ada as the primary implementation language. This is a logical
choice because the language has been designed for embedded systems.
Also, Ada has received such great care in its design and implementation
that it is unlikely that there will be any practical alternative in
selecting a programming language for embedded software.

The reduced cost of computer hardware and the expected advantages
of distributed processing (for example, increased reliability through
redundancy and greater flexibility) indicate that many aerospace
computer systems will be distributed. The use of Ada and distributed
systems seems like a good combination for advanced aerospace embedded
systems.

During the twelve-month period covered by this grant, our primary
activities have been designing our fault-tolerant Ada system and
implementing an initial version of it. We have completed much of the
design work although some details remain to be worked out. The design
has been influenced by our desire to use this system for
experimentation. Many features have been included to facilitate
analysis and verify results. Consequently we have chosen to ignore
efficient execution and to stress simplicity and flexibility.

A first version of the implementation has also been completed and
is being tested. At present all the major features of the tasking and
exception mechanisms of Ada have been implemented. The interface that
the testbed presents to the user is still extremely crude and is
presently the subject of revision. Our original intention was to
implement only an execution time system for Ada and not to bother
building a translator. We felt that whatever programs were needed for
demonstration or experiment could be compiled by hand. We discovered
how wrong this was when the hand compilation of the first test program
took a whole day. Consequently, we have begun the development of a
translator for the subset of Ada that pertains to this research.

In Section 2 of this report, the hardware configurations we are
using and intend to use are described. An overview of the system design
is presented in Section 3. A brief description of the translator
mentioned above and its present status is given in Section 4.
report as Appendix 2. The second paper is still in draft form and
will be submitted to a journal when complete. It is included in this
report as Appendix 5. The reader is cautioned that Appendix 5 will not
be the final version of that paper, and that inevitably there is some
overlap between that paper and the one in Appendix 2.

A consequence of our analysis of Ada is a set of general
impressions about what features are needed in languages for
fault-tolerant distributed processing. We have documented these
impressions in a paper that has been submitted to the Fourteenth Annual
Fault-Tolerant Systems Conference. A copy of that paper is included in
this report as Appendix 3. We have not received the decision of the
conference program committee about that paper.

Our distributed Ada implementation will constitute an incomplete,
but useful, operational semantic definition of Ada tasking. The purpose
of a semantic definition is to answer questions of language meaning. An
operational definition does this by allowing programs to be executed and
their actions to be observed.

To avoid this problem we have designed a sequence control system
which will allow the progress of individual tasks to be adjusted so that
a program can be forced into any particular state. This is discussed in
depth in Section 6. The sequence control mechanism is a part of the
testbed and a paper describing the testbed has been submitted to the
Fourteenth Annual Fault-Tolerant Systems Conference. A copy of that
paper is included in this report as Appendix 4. We have not received
the decision of the conference program committee about that paper
either.

Our work on Ada has attracted attention from outside the University
of Virginia. During the grant reporting period seminars have been given
describing the work at:
(1) The Research Triangle Institute (twice)
(2) The University of North Carolina at Chapel Hill.

2. HARDWARE CONFIGURATIONS

One of the criticisms frequently made of demonstrations is that
asynchronous hardware is often simulated and, consequently, the results
obtained cannot be relied upon. Since we are attempting to demonstrate
reliability of a system involving several computers, we will be using a
hardware configuration with several computers in it. However, for the
purposes of software development we will be using a simulation of a
multicomputer configuration running on a DEC VAX 11/780. Both of these
configurations are described in this section.

2.1 Physical Multicomputers

We have purchased two IBM Personal Computers (PCs) to be used in
conjunction with three other PCs owned by another research project. We
will also be using the department's DEC VAX 11/780 computer. The PCs
will be connected to the VAX by low-speed serial lines and these lines
will allow software and data files developed on the VAX to be
transferred to the PCs. Our original intention was to route all
PC-to-PC communication through the VAX using these serial lines. This
would have allowed the VAX to monitor all PC-to-PC communication and to
control the PCs by sending messages to them which it originated. This
mechanism would have been used for debugging, and for initiating and
monitoring reconfiguration experiments. Careful review of this plan
showed that it would be very difficult to implement. The reason is that
since all messages would be manipulated by both PCs and the VAX, it
would be necessary to have access to them on both
computers. All of our software is implemented in Pascal and all
messages are defined as Pascal records. Since VAX Pascal and IBM PC
Pascal implement records differently, this would make symbolic access to
the messages on both machines extremely difficult. This is such a
substantial problem that we chose to abandon the approach.

    OPERATOR'S
    TERMINAL ----------------|  VAX  |

    Serial         |      |      |      |      |
    Lines -->      |      |      |      |      |
                | PC | | PC | | PC | | PC | | PC |
                <----------- Ethernet ----------->

Figure 1 - Hardware Configuration.

The PCs are also connected by a high-speed Ethernet system that is
not routed through the VAX. The necessary Ethernet hardware for the two
PCs purchased under this grant has been provided by the University of
Virginia's Department of Applied Mathematics and Computer Science. The
equipment was manufactured by TECMAR Inc.

A major part of this grant period has been spent in implementing
the software necessary to allow the PCs to communicate using the
Ethernet hardware. Although the equipment was manufactured by TECMAR
specifically for the IBM Personal Computer, it leaves a great deal to be
desired. The hardware is poorly designed. For example, only a single
buffer is provided so that, in principle, each hardware unit can only be
used for transmission or reception, not both, at any given time. We
have circumvented this problem in software at the cost of some loss of
performance. The documentation provided by TECMAR is extremely poor.
Not only is it incomplete but it contains numerous errors in the
detailed description of how the hardware works. Consequently very
substantial delays were incurred by relying on the documentation and not
understanding why the system would not work correctly.

The necessary software to support our testbed has now been
completed and tested. It works to our satisfaction, and, in our
opinion, is far better than any software available from TECMAR Inc.

The proposed configuration is shown in Figure 1.

2.2 Logical Multicomputers

We have been using UNIX to develop our testbed software and we felt
it was important that it be possible to test the software under UNIX.
Consequently, we have put substantial effort into constructing a
software analogy of the IBM PC/Ethernet configuration. This analogy
uses UNIX processes to simulate the PC processors, UNIX pipes to
simulate the Ethernet communications facility, and UNIX terminals to
simulate the monitors and keyboards of the PCs. Thus we are able to
execute the testbed on the VAX in an asynchronous environment that is
reasonably realistic. This mechanism is described in more detail below.
In the remainder of this report, any capability described as running on
the IBM PCs will also run on the UNIX process/pipe implementation.

3. SOFTWARE SYSTEM DESIGN

The software system is in two parts. One part, called the
sequencer, runs on one PC and the other, called the interpreter, on the
remaining PCs (one copy on each).

The sequencer controls the entire system. It communicates with the
experimenter via an interactive terminal and processes a command
language. Commands are then implemented by sending special-purpose
messages to the remaining PCs. These commands allow programs to be
loaded, parts of the system to be deliberately failed, and so on. The
sequencer also implements the sequence control system (see below).

The software running on the remaining PCs actually executes the
distributed Ada program. We have chosen not to generate native Intel
8088 code for the PCs but to generate code for a synthetic machine which
will be interpreted. Our reasons are:
(1) It would be difficult to retain complete control of the program if
the PCs were executing it directly.
(2) Generating code for a synthetic machine will make code generation
very much simpler.
(3) The synthetic machine architecture can itself be the subject of
experimentation. This will allow investigation of hardware designs
which can support distribution.
(4) The software can be moved to different physical computers very
easily.
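
The first reason can be illustrated with a minimal sketch, written in
Python rather than the Pascal of the testbed. The three-instruction
stack machine and the names SyntheticMachine, step, and run_until are
hypothetical and are not the instruction set of Appendix 1. The sketch
only shows that an interpreter hands control back to its caller after
every instruction, which is what makes single stepping and breakpoints
straightforward.

```python
from dataclasses import dataclass, field


@dataclass
class SyntheticMachine:
    """Toy stack machine; the opcodes are illustrative assumptions."""
    code: list                      # list of (opcode, operand) pairs
    pc: int = 0                     # index of the next instruction
    stack: list = field(default_factory=list)
    halted: bool = False

    def step(self):
        """Interpret exactly one instruction, then return to the caller."""
        if self.halted:
            return
        op, arg = self.code[self.pc]
        self.pc += 1
        if op == "push":
            self.stack.append(arg)
        elif op == "add":
            right = self.stack.pop()
            left = self.stack.pop()
            self.stack.append(left + right)
        elif op == "halt":
            self.halted = True
        else:
            raise ValueError(f"unknown opcode {op!r}")

    def run_until(self, stop_pc):
        """Run until the machine halts or is about to execute stop_pc."""
        while not self.halted and self.pc != stop_pc:
            self.step()


program = [("push", 2), ("push", 3), ("add", None), ("halt", None)]
machine = SyntheticMachine(program)
machine.run_until(2)                    # stop just before the add
stack_before_add = list(machine.stack)
machine.step()                          # single step the add
```

Because the controlling program regains control between any two
instructions, it can inspect or alter the machine state at that point.
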

The PC software is organized as three major layers. The first
layer provides communication facilities. It accepts and delivers
complete messages from the rest of the system, and interfaces at the
character level with the serial line and the Ethernet. Since the
knowledge of the two communications lines is hidden in this layer, it
will be relatively simple to use either as desired.
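
One common way for such a layer to turn a character stream into
complete messages is length-prefixed framing. The report does not say
which framing the testbed uses, so the 2-byte length header, the names
frame and Deframer, and the message contents below are all assumptions
made for this Python sketch.

```python
import struct


def frame(payload: bytes) -> bytes:
    """Prefix a message with a 2-byte big-endian length (assumed format)."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too long for a 2-byte length header")
    return struct.pack(">H", len(payload)) + payload


class Deframer:
    """Accepts received characters and returns complete messages."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add received bytes; return every message completed by them."""
        self._buffer.extend(chunk)
        messages = []
        while len(self._buffer) >= 2:
            (length,) = struct.unpack(">H", self._buffer[:2])
            if len(self._buffer) < 2 + length:
                break                   # message still incomplete
            messages.append(bytes(self._buffer[2:2 + length]))
            del self._buffer[:2 + length]
        return messages


# The deframer sees only bytes, so it could sit above either line.
stream = frame(b"ENTRY_CALL") + frame(b"ACCEPT")
receiver = Deframer()
received = []
for i in range(len(stream)):
    received.extend(receiver.feed(stream[i:i + 1]))  # one character at a time
```

The upper layers would then deal only with whole messages, whichever
line carried the characters.
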
	
Although we will have up to five IBM PCs available for
experimentation, we find it desirable to be able to model arbitrary
distributed systems. We have defined the concept of an "abstract
processor" (AP) that is a generic processor suitable for use as a
general node in a distributed system. A set of abstract processors will
be the distributed system that is presented to the Ada program.

Each PC will implement an arbitrary number of APs. This is the
function of the second layer of the PC software. Thus a distributed
system comprising a set of any number of processors could be run on any
number of PCs, from one to five. In addition, since the system can be
run on the VAX, a single physical processor can appear to be any desired
distributed configuration.
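
The separation of APs from physical machines can be sketched as a simple
placement table. This Python fragment is illustrative only: the
round-robin policy, the function names assign_aps and aps_on, and the
machine names are assumptions, since the report leaves the placement to
the experimenter.

```python
def assign_aps(num_aps, pcs):
    """Spread AP numbers 1..num_aps over the PCs round-robin (assumed policy)."""
    return {ap: pcs[(ap - 1) % len(pcs)] for ap in range(1, num_aps + 1)}


def aps_on(pc, ap_to_pc):
    """APs that a given physical machine must implement."""
    return sorted(ap for ap, host in ap_to_pc.items() if host == pc)


# The same seven-node system placed on five PCs, or on one machine.
on_five = assign_aps(7, ["PC1", "PC2", "PC3", "PC4", "PC5"])
on_one = assign_aps(7, ["VAX"])
```

The Ada program sees seven processors in both cases; only the table
changes.
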

This use of the abstract processors will allow experimentation with
implementation of multiple tasks on a single processor is to multiplex
the real processor and give each task the impression that it has its
own, rather slow, processor. Ada tasks will be implemented this way and
each AP will support an arbitrary number of virtual processors (VPs)
with one for each Ada task. The provision of the VPs is the function of
the third layer of PC software.
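
Multiplexing one processor among several tasks can be sketched with
Python generators standing in for virtual processors. The round-robin
order and the names virtual_processor and run_abstract_processor are
assumptions for illustration; the report does not describe the
scheduling policy used by an AP.

```python
from collections import deque


def virtual_processor(name, steps, trace):
    """A stand-in for one Ada task: records each step, then yields the AP."""
    for i in range(steps):
        trace.append((name, i))
        yield                       # give the abstract processor back


def run_abstract_processor(vps):
    """Multiplex the given VPs round-robin until all have finished."""
    ready = deque(vps)
    while ready:
        vp = ready.popleft()
        try:
            next(vp)                # let this VP execute one step
        except StopIteration:
            continue                # task finished; drop it
        ready.append(vp)


trace = []
run_abstract_processor([
    virtual_processor("A", 2, trace),
    virtual_processor("B", 3, trace),
])
```

Each task runs as if it had its own processor, although only one of
them advances at any moment.
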

The virtual processors are designed to make Ada task execution
fairly easy. Their instruction sets are tailored to Ada tasking and
they have special "hardware" features such as built-in entry queues.
The software therefore does not have to implement these queues.

The virtual processors must support the entire semantics of Ada
tasks and so their implementation is quite complicated. As we discover
more about the language semantics so the complexity of the VPs
increases. The "hardware reference manual" for the virtual processors
is included in this report as Appendix 1.

All of the processors in the system communicate via a set of
messages. Thus, for example, a rendezvous is implemented as a series of
messages even when the VPs involved are executing on the same AP. Some
messages (those between APs on a single PC) do not get transmitted
through the Ethernet. The PC communications software reflects or
"mirrors" these messages back to the appropriate AP.

In our original discussion of fault detection, we proposed a system
of software heartbeats which would allow detection of failed equipment.
In our design, we have included the heartbeat mechanism at the higher
levels of abstraction, but we will not include the heartbeats in our
initial implementation. The reason is that they add nothing to the
experiments that we wish to perform. We are concerned with the events
occurring following a fault and these are best observed if the fault is
deliberately injected. This will be done from the command language
supported by the sequencer software. Failure of an AP will be
communicated to the PC responsible for running that AP by a message.
The AP will then cease being scheduled by the PC. The remainder of the
PCs will be informed by a broadcast message. Thus instead of the
heartbeat mechanism, the experimenter will be able to cause any of the

4. ADA SUBSET TRANSLATOR

In order to allow us to debug our testbed and to perform
experiments we need to be able to execute a variety of Ada programs.
Preparing these programs for execution requires a translator and a major
portion of the grant reporting period has been spent designing and
implementing this translator. It is important to understand that the
target of this translator is the virtual processors described above and
not the IBM PC or the DEC VAX 11/780.

Our goal in this project does not include compiler research and so
we sought the most timely manner of producing the translator we needed
rather than spending a lot of time developing fast, efficient, or
otherwise noteworthy compilation techniques. Consequently, our approach
has been to modify an existing translator for a language that was in
some ways similar to Ada. This translator had originally been built
using the UNIX compiler construction tools - YACC and LEX. The
translator is written in C.

The translator for our subset of Ada is essentially complete and
generates code for the testbed's virtual processors. The translator is
presently being tested.

5. LANGUAGE ISSUES

As a part of designing the software for the VP software layer we
had to look closely at the semantics of task termination. This had been
discussed on several occasions and thought to be well understood. The
current discussions centered around efficient methods of implementation.
The most efficient, and we believe the usual, method of handling
termination is for a master task to count the number of its dependents
who are ready to terminate, and to terminate the group when the number
counted equals the number of dependents. The count may decrease as well
as increase because a task may become "unready" if one of its entries is
called by another task.

On a uniprocessor this is not a problem and yields a valid
implementation. On a distributed system it may not. The Ada Language
Reference Manual (LRM) states that a necessary and sufficient condition
for termination is:
"Each task that depends on the master is either already terminated
or similarly waiting on an open terminate alternative of a select
statement".
In a distributed system, a task's state on a remote machine has to be
determined by message passing. The above condition uses the present
tense and therefore a task may not change its state once asked about
termination until the master has made its decision to terminate or not.
This means that if termination is not possible, the master must send a
second message indicating that a task may resume execution.
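
The two-message exchange described above can be sketched as follows.
This is a hypothetical Python model, not the testbed's protocol: the
class Dependent, its methods, and try_terminate are names invented for
illustration, and message passing is replaced by direct method calls.

```python
class Dependent:
    """A dependent task as seen by its master (hypothetical model)."""

    def __init__(self, name, at_open_terminate):
        self.name = name
        self.at_open_terminate = at_open_terminate
        self.frozen = False
        self.terminated = False

    def query(self):
        """First message: hold the task still and report its present state."""
        self.frozen = True
        return self.terminated or self.at_open_terminate

    def resume(self):
        """Second message: termination was not possible, so carry on."""
        self.frozen = False

    def terminate(self):
        self.terminated = True
        self.frozen = False


def try_terminate(dependents):
    """Terminate the group only if every held dependent is ready."""
    answers = [d.query() for d in dependents]   # all are now held still
    if all(answers):
        for d in dependents:
            d.terminate()
        return True
    for d in dependents:
        d.resume()                               # the required second message
    return False


a = Dependent("A", at_open_terminate=True)
b = Dependent("B", at_open_terminate=False)
first_attempt = try_terminate([a, b])
frozen_after_first = (a.frozen, b.frozen)
b.at_open_terminate = True
second_attempt = try_terminate([a, b])
```

In this model every dependent stays held between query and resume, so
the master's decision is taken on states that cannot change under it.
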

There are two things to note here. The first is that failure of
the processor running the master between the two required messages will
suspend the dependent task permanently. This is similar to many other
conditions we have noted before. The second thing to note is that
counting dependents is not a valid implementation because it records
task states as they were. Termination has to be based on a "snapshot"
of task states.

The reason that this is not a major problem on a uniprocessor is
that a snapshot can trivially be obtained by a master since while it is
executing, all its dependents are suspended.

The Ada text shown in Figure 2 is a set of tasks which should never
terminate. Despite the TERMINATE alternatives in the SELECT statements,
any implementation which terminates these tasks is wrong. Task X is
unable to terminate since tasks A and B are in an infinite loop of
alternating instigations of rendezvous. Any termination-check algorithm
which does not stop both A and B, such as a dependent counting
algorithm, allows the possibility for the combination of old and new
information to indicate that BOTH A and B are waiting at select
statements with open terminate alternatives, which is clearly
impossible.

Also note that, with an algorithm which stops all dependents
periodically for polling, task X will actually be interfering with the
progress of tasks A and B throughout their lives.
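
The difference between counting and a snapshot can be seen in a small
Python simulation. The timeline is a deliberate simplification of the
tasks of Figure 2 and is an assumption of this sketch: B is taken to be
waiting at its select on even ticks and A on odd ticks, so the two are
never waiting at the same instant. The function names are hypothetical.

```python
def waiting_at_select(task, t):
    """Simplified timeline: B waits on even ticks, A on odd ticks."""
    return (t % 2 == 1) if task == "A" else (t % 2 == 0)


def counting_check(observations):
    """Dependent counting: each task is sampled at its own time."""
    ready = sum(1 for task, t in observations if waiting_at_select(task, t))
    return ready == len(observations)


def snapshot_check(tasks, t):
    """Snapshot: every task is examined at the same instant t."""
    return all(waiting_at_select(task, t) for task in tasks)


# B reported at tick 0 and A at tick 1: old and new information are mixed.
counting_says_terminate = counting_check([("B", 0), ("A", 1)])
snapshot_ever_terminates = any(snapshot_check(["A", "B"], t) for t in range(100))
```

Under these assumptions the counting check concludes that both tasks
are ready, while no single instant ever satisfies the condition.
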

We feel that we understand the language issues involved with Ada
operating on distributed systems at this point. Papers describing the
different aspects of this are included in this report in Appendix 2,
Appendix 3, and Appendix 5.

procedure DEMO is
   task X;
   task body X is
      task A is entry E; end A;
      task B is entry E; end B;
      task body A is begin
         loop
            B.E;
            select
               accept E;
            or
               terminate;
            end select;
         end loop;
      end A;
      task body B is begin
         loop
            select
               accept E;
            or
               terminate;
            end select;
            A.E;
         end loop;
      end B;
   begin null; end X;
begin null; end DEMO;

Figure 2 - Tasks Which Should Never Terminate

6. SEQUENCER

The sequence of actions within the testbed will be controlled from
the sequencer. In most simulators, there is a single step mode whereby
effects of individual instructions within a program can be studied in
detail. The case of simulating parallel programs, particularly when
distributed over many machines, is more complicated. Not only is it
necessary to single step individual tasks, but it is also necessary to
single step them in relation to each other. Further, our interests in
this research are such that we want to deal with the tasks through a
perspective which is more microscopic than the Ada source language
level. A typical experiment is expected to arrange for an accepting
task to send a CHECK_CALLER message just after the calling task's timed
entry call times out. Both tasks must be held at points in Ada
statements so as to force the required interaction.

The sequencer is a part of the testbed software and its purpose is
to control the parallel activities of the tasks within an Ada program.
It deals with the program at the interpreted intermediate code level.
It provides breakpoints and allows tasks to be single stepped in terms
of individual messages as well as the interpreted instructions.

A scheme has been worked out by which an actual distributed system
could establish communication and start up without the sequencer.
However, in order for the sequencer to maintain control it must
establish its control at start-up; thus it handles the assignment of APs
to PCs and PC names to ports.

The sequencer is interactive and is able to display all relevant
tables owned by the PCs and by the message switch. It also permits a
fast-forward interpretation mode to allow the system to set up the
desired experiment without requiring a great deal of the experimenter's
time. Since code is shared among tasks, breakpoints by code location
only are insufficient. A breakpoint is named by source-level task name
(task id), code location, and number of times the task must execute that
code location before "hitting" the breakpoint. This count is due to the
experimenter's desire to perform experiments within loops or at
(trailing) end conditions. Implications of the source level task id are
that:
(1) There will be no more than one (1) execution of allocators of tasks
for any access variable.
(2) There will be no two identical simple names for tasks within any
experimental program.
These are not serious restrictions.
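
A breakpoint with the three components named above might be modelled as
in the following Python sketch. The class Breakpoint, its field names,
and the check method are hypothetical; the report does not give the
testbed's data structure.

```python
from dataclasses import dataclass


@dataclass
class Breakpoint:
    task_id: str        # source-level task name
    location: int       # address in the shared code
    count: int          # executions required before the breakpoint hits
    seen: int = 0       # executions observed so far

    def check(self, task_id, location):
        """Call before a task executes an instruction; True means stop."""
        if task_id != self.task_id or location != self.location:
            return False
        self.seen += 1
        return self.seen == self.count


bp = Breakpoint(task_id="A", location=40, count=3)
hits = [
    bp.check("B", 40),   # same code, different task: ignored
    bp.check("A", 40),   # first pass through the loop
    bp.check("A", 40),   # second pass
    bp.check("A", 40),   # third pass: breakpoint hit
]
```

Keying on the task id as well as the code location is what lets two
tasks share code while only one of them is stopped.
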

As part of its control of the PCs, the sequencer must have
extensive communications with them. All of these messages are copied
into a log file to allow for later detailed study. Further, in order to
direct individual tasks' (i.e. VPs') actions, the sequencer must also
maintain copies of all the APs' VP maps.

Brief descriptions of the sequencer commands are:
(1) Load AP to PC map information into both the sequencer and the
individual PCs.
(2) Load program to be run. The compiler's output has been stored in a
file on the VAX and the appropriate file for this experiment is
copied into the individual PCs' memories.
(3) Inform all PCs of the demise of a particular AP. This causes the
PC owning the subject AP to cease to schedule it.
(4) Allow the named tasks to run until they encounter breakpoints or
terminate. The absence of task names indicates that all tasks
should be run (none should be artificially suspended).
(5) Stop or artificially suspend the named tasks no matter what they
are doing. Absence of parameters means stop all tasks.
(6) Single step the named task through the interpretation of one
instruction.
(7) Single step the named task through the handling of one message.
This has been separated from the command to single step an
instruction to allow better control of the order of events within a
VP.
(8) Display the sequencer's tables. Due to table vs. terminal screen
size, these may have to be individually selected. The tables
needed by the sequencer are task id to VP name, VP name to AP
number, AP number to PP number, PP number to port, and relevant
breakpoints.
(9) Remove breakpoint. Inverse operation of the set breakpoint
command.
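
The chain of tables listed under command (8) can be sketched as four
Python dictionaries. All table contents, the port names, and the helper
functions port_for_task and tasks_on_ap are invented for illustration;
"PP" is kept as it appears in the list above and is not interpreted
further here.

```python
task_to_vp = {"X": "VP0", "A": "VP1", "B": "VP2"}     # task id -> VP name
vp_to_ap = {"VP0": 1, "VP1": 1, "VP2": 2}             # VP name -> AP number
ap_to_pp = {1: 1, 2: 2}                               # AP number -> PP number
pp_to_port = {1: "port0", 2: "port1"}                 # PP number -> port


def port_for_task(task_id):
    """Follow the four tables from a source-level task id to a port."""
    vp = task_to_vp[task_id]
    ap = vp_to_ap[vp]
    pp = ap_to_pp[ap]
    return pp_to_port[pp]


def tasks_on_ap(ap_number):
    """Tasks whose VPs run on the given AP, in name order."""
    return sorted(t for t, vp in task_to_vp.items() if vp_to_ap[vp] == ap_number)
```

With such tables a command naming a task can be turned into a message
on the right port, and a command failing an AP can be related back to
the tasks it affects.
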

Part of the sequencer's monitoring interface provided for the
experimenter is actually provided by the individual PCs on their own
terminal screens. This is a set of displays selectable at the PCs'
keyboards. The displays are a summary with one line per VP including
minimal status information and any simulated output control signals,
full VP status for one VP utilizing the entire screen, full AP status
for one AP, and full status of the PC itself. These status reports
include contents of message buffers as well as internal tables.

APPENDIX 1

This is a description of the testbed's virtual processors. Names
taken from the Pascal-like declarations below and appearing in the text
are delimited by the character "|".

Certain sample instruction sequences are given toward the end of
this document as aids in the compiler author's task. As a further
aid, two simple Ada programs are shown along with their translations to
instruction sequences and template sequences.

The thread of control associated with an Ada task (including the
environment task) represents the execution of exactly one target
machine.

The target machine has the following kinds of memories:
(1) A static memory for instructions and string constants.
(2) A static memory for templates.
(3) A set of entry queues.
(4) A tagged expression stack.
(5) A set of lists of addressing information for dependent tasks.
(6) A set of lists of addressing information for locally declared
tasks.
(7) A set of lists of addressing information for locally allocated
tasks.
(8) A set of "arms" which contain template indices, continuation
addresses, entry indices, and delay intervals, as appropriate, for
use by the |select| instruction in implementing the semantics of
the Ada select statement.
(9) A set of tagged memories for allocation of space for variables.
(10) A set of tagged memories for allocation of space for formal
subprogram parameters.
(11) A set of tagged memories for allocation of space for formal entry
parameters.
(12) A connection to each of the effectors (output ports) of the
abstract processor on which the virtual processor is running.

The contents of the target machine's two static memories are
downloaded from the compiler or copied in toto from a parent to a child
during execution of a |createtask| instruction (think of the process as
budding). The one containing instructions and string constants is
called the codespace. The other is called the templatespace and
contains |unittemplate|s. A |unittemplate| will be generated by the
compiler for the environment task and for each subprogram, block, accept
statement, task, and package found in an Ada program. These will later
be referred to as "units". All |unittemplate|s contain certain items of
information in common. These are, in order of their appearance within a
|unittemplate|:
(1)	 I,stringmax!	 characters	 containing	 the	 source	 name	 of	 the
subprogram,
	
block,	 task,	 package,	 or	 accept	 statement
	 whose.
'.	 a
t.
24
occurrence caused the compiler to ,generate this lunittemplatel.
The name 9:s left justi faed with blank fill. The name of the
environment task is "ET". The name of an accept statement is
	 tl^
composed of the word "accept It followed by the entry mate up to the
left parenthesis of the entry index.
(2) The absolute address in eodespace of the first instruction
(sometimes called . an entry poi nt) of the task, subprogram, block,
package body's sequence of statements (explicit or implicit) ? or
accept statement.	 A. lreturn! instruction is generated by the
compiler for an accept statement without a "do" part.	 t 1 1
s
(3) The static nesting level of the subprogram, task, block, accept 	 i
statement, or package. The static nesting level of the enviroment
task is defined to be Iminnestingdepthl * All compilations witbi.n
the Ada program are declarations of units immediately within the
envirornent task; thus, the outermost units of these compilations 	 J
are defined to have static nesting level .Iminnestingdepthl+9. The
	
,.	 order of these declarations is defined to be a topological sort of	 i
the declarations based on the partial, order _ given in the .source .by
the occurrence of "wi.tV clauses. Units nested at greater depths
IL 
i
}
than. compilations have greater static nesting levels. The static	 #
	
i	
s
nesting level of an accept statement or a block is one greater than
that of the immediately.: surroundi_ng..subprogram, task, : block, accept
statementf or package. 	 {
_	
1
(4) A list of absolute addresses of instructions in codespace. Each of
these addresses corresponds to an exception declared explicitly or
implicitly. Any particular address is that of the first
instruction of the exception handler declared by the subprogram,
task, block, accept statement, or package to handle the
corresponding exception. If the subprogram, task, block, accept
statement, or package declares no handler for an exception, the
corresponding address is |nullcodeaddress|. The occurrence of the
word "others" in a declaration of exception handlers implies that
the addresses corresponding to all exceptions not explicitly listed
in that declaration sequence will be of the same instruction.
(5) A keyword of the Pascal type |unittype| specifying for which of a
subprogram, task, block, accept statement, or package body this
|unittemplate| was generated.
A |unittemplate| generated for a task also contains the value of the
task's priority as explicitly given in the Ada source or as assigned by
the compiler. |unittemplate|s generated for blocks contain no other
information. A |unittemplate| generated for a subprogram or for an
accept statement also contains a boolean map for the in (in out) and a
boolean map for the out (in out) formal parameters declared explicitly
or implicitly (the return value of a function is an implicit parameter)
in that subprogram's specification, or in the specification of the entry
family (an entry declared without an index is a family of one) being
accepted. The boolean maps indicate (by true elements) how many and
which formal parameter tagged memory |arbnode|s (see below) will be
initialized with values from the caller's expression stack and will be
pushed back onto that stack upon return. A |unittemplate| generated for
a package also contains a value of the Pascal type boolean indicating
the veracity of the statement "this package is a library package." The
|unittemplate| for the environment task is always at template address
|ETtemplate|.
The target machine's codespace is occupied by |instructionunit|s as
defined by the Pascal type |instructionunit| below. A very high level
description of the semantics in Ada terms of each instruction is also
given below. Fields in the type |instructionunit| other than the field
opcode represent operands of the appropriate instructions. The unused
portion of the instruction memory is initialized to the |arrest|
instruction, which terminates execution of one or more Ada virtual
machines and is not to be generated by the compiler. An
|instructionunit| represented by the opcode |dw| is not an instruction
but a string constant of 10 characters found in the Ada source in calls
to the predefined inline machine code procedure named send_control (see
Appendix 1). The target machine's instruction fetch operates so as to
always fetch the instruction having the next greater address unless a
branch has taken place using the |destaddr| (or other) operand of an
instruction. A return address from a subprogram, accept statement,
block, package body's sequence of statements, or entry call is referred
to as a continuation address and is explicitly loaded before the call.
Other continuations may be automatically substituted for these by the
target machine's execution. These continuations may be retrieved from
the establishment of delays or else parts within select statements, or
from exception handler addresses during exception propagation. An
instruction address, like a template index, is an integer value which is
not less than |nullcodeaddress| (|nulltemplateaddress|). Instruction
addresses and template indices can therefore be loaded onto the target
machine's expression stack via the |loadintconst| instruction.
Items on the expression stack are described by the Pascal type
|arbnode|. Items retain their memory |tag|s while on the expression
stack. An item can be placed on the expression stack by a |ref|
instruction followed by zero or more |load| instructions, by a
|loadcount| or |loadintconst| or |createtask| instruction, or as a
result of another instruction which uses other values already on the
stack. Many instructions consume values already on the expression stack
and leave their results, if any, on the expression stack.
In (in out) actual parameters (explicit or implicit) for entry
calls and subprogram calls must be placed on the expression stack prior
to the call and results (out or in out parameters or function return
values) must be popped off after control has returned normally.
Abnormal returns result in parameter space not being occupied on the
expression stack.
The target machine has a set of tagged memories (containing
|arbnode|s) each of which can be used for storing variables, addresses
of variables, the addresses of parameters, the values of task variables,
and the values of variables declared as access to task types. Formal
parameters of either subprograms or entries do not reside in these
or indirectly, in the following proportions: A task has one tagged
memory within which the compiler may assign space for the task's local
variables, loop indices, temporaries, and task variables. A subprogram
has two tagged memories: one for the subprogram's local variables, loop
indices, temporaries, and task variables, and the other for the explicit
or implicit formal parameters, if any. An accept statement has two
tagged memories: one for the accept statement's loop indices and
temporaries, and the other for the formal parameters of the entry family
accepted. A block has one tagged memory for its local variables, loop
indices, temporaries, and task variables. A package has one tagged
memory for the loop indices and temporaries needed by the explicit or
implicit sequence of statements of the package body. The variables
declared in a package are assigned space by the compiler in the variables
tagged memory of the nearest task, subprogram, or block textually
surrounding the declaration of the package. The word "textually" here
includes the imagined placement of compilations in the discussion of
static nesting levels above.
Allocation of space within these tagged memories is assumed to
begin at the lowest allowable |offset| and proceed to greater |offset|s.
The lowest and highest |offset|s for the variable, etc. tagged memories
are |minuserdata| and |maxuserdata|, respectively. The lowest and
highest |offset|s for the formal subprogram and entry parameter tagged
memories are |lowparmoffset| and |maxparameters|, respectively.
When a new task is created or a new instance of a subprogram,
block, accept statement, or package (body's explicit or implicit sequence
of statements) is entered, all |arbnode|s in that task's, subprogram's,
block's, accept statement's, or package's variable tagged memory have
the |tag| |nul|. Thus, initialization of variables is the
responsibility of the compiler through instructions generated upon
encountering the corresponding "begin".
There are three ways for the compiler to address |arbnode|s in the
tagged memories. One is the use of the |index| instruction, which adds
an integer value to an already existing address. The second method is
the use of the |load| instruction where the value loaded is a previously
|store|d address. The third method is the use of the |ref| instruction,
which creates an address from the instruction's operands. The |ref|
instruction's |staticnestinglevel| operand specifies the static nesting
level of the task, subprogram, block, accept statement, or package one
of whose tagged memories contains the location being addressed. The
|ref| instruction's |tag| operand specifies which of the tagged memories
of the indicated task, subprogram, block, accept statement, or package
contains the location being addressed. The only valid values for the
|ref| instruction's |tag| operand are |vara|, which specifies the tagged
memory for variables, loop indices, temporaries, and task variables,
|eprm|, which specifies the tagged memory for formal entry parameters,
and |sprm|, which specifies the tagged memory for formal subprogram
parameters. The |ref| instruction's |offset| operand supplies the
|offset| of the location being addressed within the indicated tagged
memory. The |ref|, |load|, and |store| instructions, when used for
non-local references, implement the use of shared variables as described
with "pragma shared" in the LRM. It is the compiler's duty to make and
update local copies of shared variables which are not the targets of
that pragma.
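The three addressing methods can be pictured with a small Python model. This is a sketch under simplifying assumptions, not the run-time system's implementation: the class names `Address` and `Machine` are hypothetical, all tagged memories are folded into one dictionary keyed by address, and there is one set of memories per static nesting level, which ignores multiple simultaneous activations of the same unit. The operand order on the stack follows the instruction descriptions given later in this manual.

```python
from dataclasses import dataclass

VALID_REF_TAGS = ("vara", "eprm", "sprm")


@dataclass(frozen=True)
class Address:
    level: int    # static nesting level of the owning unit
    tag: str      # which tagged memory of that unit
    offset: int   # location within that tagged memory


class Machine:
    def __init__(self):
        self.stack = []     # expression stack, top of stack is the last item
        self.memory = {}    # Address -> stored value; absent means |nul|

    def ref(self, staticnestinglevel, offset, tag):
        """Third method: create an address from the instruction's operands."""
        if tag not in VALID_REF_TAGS:
            raise ValueError("invalid tag operand for ref")
        self.stack.append(Address(staticnestinglevel, tag, offset))

    def loadintconst(self, theconst):
        self.stack.append(theconst)

    def index(self):
        """First method: add an integer to an already existing address."""
        increment = self.stack.pop()
        base = self.stack.pop()
        self.stack.append(Address(base.level, base.tag, base.offset + increment))

    def store(self):
        value = self.stack.pop()
        address = self.stack.pop()
        self.memory[address] = value

    def load(self):
        """Second method when the loaded value is itself a stored address."""
        address = self.stack.pop()
        self.stack.append(self.memory.get(address))  # None stands for |nul|
```

A store through an indexed address followed by a direct reference to the same location reads back the stored value: `ref(2, 3, "vara")`, `loadintconst(2)`, `index()`, `loadintconst(42)`, `store()`, then `ref(2, 5, "vara")`, `load()` leaves 42 on the stack.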
The values of task variables and of access variables to task types
are identical. These values have |tag| |tska| and are created only by
the |createtask| instruction.
The target machine has a set of entry queues which can be
referenced by the |entrycallNORMAL|, |entrycallCONDITIONAL|,
|entrycallTIMED|, |loadcount| and |setarmaccept| instructions using
integer indices from |nullentryindex| to |maxentries|. The association
of an entry name in the Ada source with an entry queue index in the
target machine is the responsibility of the compiler.
Each task, subprogram, block, accept statement, and package has a
list of dependent tasks and a currently accessible dependent task on the
list. The |getdependent| instruction changes, to the next dependent on
the list, the concept of which dependent is currently accessible. The
|resetdependentlist| instruction ensures that the next execution of the
|getdependent| instruction will make the first task on the dependent
list become the currently accessible dependent. There are two similar
lists, one of allocated tasks and one of declared tasks. The concept of
currently accessible child task applies at any time to only one of these
lists and is moved along that list via the |getownedtask| instruction.
To which list the concept of currently accessible child task applies at
any time depends on which of the |resetdeclaredtasklist| or the
|resetallocatedtasklist| instruction was most recently executed.
A |select| instruction performs most of the actions required by the
Ada selective wait. In order to perform this feat, the target machine
must be informed of which alternatives of the select statement are open
or applicable. The |setarmaccept|, |setarmdelay|, |setarmterminate|,
and |setarmelse| instructions establish for the target machine the
presence of open accept alternatives, delay alternative, terminate
alternative, and an else part, respectively. The |select| instruction
acts upon all of the alternatives (and else parts) established since the
execution of the most recent |cleararms| instruction. The upper limit
of open accept arms is |maxarms|.
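The "currently accessible dependent" behaves like a cursor over the dependent list. The Python sketch below illustrates only that cursor behaviour, as used by the |resetdependentlist| and |getdependent| pair; the class `DependentList` and the function `visit_all` are hypothetical names, and the boolean result of `get_next` stands in for the machine's branch to the |altpath| operand when the list is exhausted.

```python
class DependentList:
    """Cursor over a unit's dependents (illustrative model only)."""

    def __init__(self, dependents):
        self._dependents = list(dependents)
        self._position = -1          # before the first dependent

    def reset(self):
        """Models resetdependentlist: the next get_next yields the first task."""
        self._position = -1

    def get_next(self):
        """Models getdependent.

        Returns True when a dependent has become current, and False when
        there are no more dependents (where the machine would branch to
        the instruction's altpath operand).
        """
        if self._position + 1 < len(self._dependents):
            self._position += 1
            return True
        return False

    @property
    def current(self):
        if self._position < 0:
            return None
        return self._dependents[self._position]


def visit_all(dependent_list):
    """The reset-then-loop pattern used by the code sequences in this manual."""
    visited = []
    dependent_list.reset()
    while dependent_list.get_next():
        visited.append(dependent_list.current)
    return visited
```

The same pattern applies to the declared and allocated child lists, with |resetdeclaredtasklist| or |resetallocatedtasklist| selecting the list and |getownedtask| advancing along it.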
The format of the code and template text file produced by the
compiler should be that which would be produced by procedures
|writelinstruction| and |writeltemplate| (see jckNASA/rts/src/prat.i
at CSNET address uvacs). All |instructionunit|s appear before the first
|unittemplate|, the last |instructionunit| has |opcode| |arrest|, and
the last |unittemplate| has |kindofunit| |deadtmplat|. Addresses are
implied by the rule that the first instruction in the file is the
instruction at |nullcodeaddress|, and the first template in the template
file is the template at |nulltemplateaddress|, and that addresses
increase by 1.
The dependent clean-up code provided in the code sequence examples
below must occur beginning at |nullcodeaddress| in order for certain
parts of the run-time semantics of Ada to be modeled by the Ada virtual
machine.
(* DISCLAIMER:                                                      *)
(* These constant and type declarations are for organizational      *)
(* reference within the context of the text of this manual and      *)
(* are not necessarily the operative versions at any given time.    *)
(* The operative versions are in the Pascal source files            *)
(*    jckNASA/rts/src/const.i                                       *)
(* and jckNASA/rts/src/type.i, at CSNET address uvacs               *)
const
   lowparmoffset       = 1;    maxparameters     = 5;
   stringlow           = 1;    stringmax         = 80;
   minnestingdepth     = 1;    maxnestingdepth   = 20;
   nullcodeaddress     = 0;    codespacesize     = 1000;
   nulltemplateaddress = 0;    numberoftemplates = 30;
   nullap              = 0;    maxnumberofaps    = 32;
   nullentryindex      = 0;    maxentries        = 10;
   minuserdata         = 0;    maxuserdata       = 20;
   loweffector         = 0;    higheffector      = 49;
   nullexception       = 0;    maxexceptions     = 10;
   taskingerror        = 1;
   ETtemplate          = nulltemplateaddress;
   maxarms             = 5;
type
   string        = array[stringlow..stringmax] of char;
   nestinglevel  = minnestingdepth..maxnestingdepth;
   codespace     = nullcodeaddress..codespacesize;
   templatespace = nulltemplateaddress..numberoftemplates;
   apnumber      = nullap..maxnumberofaps;
   exceptionname = nullexception..maxexceptions;
   effectorrange = loweffector..higheffector;
   priority      = integer;
   memorytag     =
      ( nul    (* null *)
      , intg   (* integer value *)
      , bool   (* boolean value *)
      , vara   (* variable address *)
      , eprm   (* formal entry parameter address *)
      , sprm   (* formal subprogram parameter address *)
      , tska   (* task address (access value) *)
      );
   opcodes       =
      ( cleararms
      , arrest
      , abortdependent
      , aborttask
      , ALLOCATE
      , delay
      , GOTOFROMB
      , index
      , ref
      , return
      , RETINBLOC
      , awaitactivationdoneforall
      , call
      , select
      , createtask
      , dw
      , effector
      , enableexceptions
      , entrycallCONDITIONAL
      , entrycallNORMAL
      , entrycallTIMED
      , getdependent
      , getownedtask
      , resetdependentlist
      , resetdeclaredtasklist
      , resetallocatedtasklist
      , myactivationisdone
      , checkdependent
      , letdependentproceed
      , iamterminable
      , letchildbegin
      , removedependent
      , activatechild
      , setarmaccept
      , setarmdelay
      , setarmterminate
      , setarmelse
      , jmp
      , brfalse
      , brtrue
      , raisex
      , reraise
      , loadcount
      , loadintconst
      , load
      , store
      , addinstr
      , subinstr
      , mulinstr
      , divinstr
      , modinstr
      , eqinstr
      , neinstr
      , ltinstr
      , gtinstr
      , leinstr
      , geinstr
      , andinstr
      , orinstr
      , xorinstr
      , notinstr
      );
   instructionunit = record
      case opcode : opcodes of
         dw                     : (s : alfa;);
         effector               : (aarea : effectorrange;
                                   isastring : boolean;);
         ref                    : (staticnestinglevel : nestinglevel;
                                   offset : integer;
                                   tag : memorytag;);
         entrycallCONDITIONAL,
         entrycallTIMED,
         setarmaccept           : (tmplate : templatespace;
                                   continu : codespace;);
         entrycallNORMAL        : (templat : templatespace;);
         getdependent,
         getownedtask,
         setarmdelay,
         setarmelse             : (altpath : codespace;);
         checkdependent,
         abortdependent,
         call,
         letdependentproceed,
         letchildbegin,
         removedependent,
         activatechild          : (continue : codespace;);
         setarmterminate        : (chkaddr : codespace;);
         jmp,
         brfalse,
         brtrue                 : (destaddr : codespace;);
         GOTOFROMB              : (numlevelstoexit : integer;
                                   destaddress : codespace;);
         ALLOCATE               : (size : integer;);
         raisex,
         loadintconst           : (theconst : integer;);
         arrest,
         reraise,
         aborttask,
         index,
         return,
         cleararms,
         enableexceptions,
         loadcount,
         resetdependentlist,
         resetdeclaredtasklist,
         resetallocatedtasklist,
         myactivationisdone,
         iamterminable,
         delay,
         load,
         store,
         addinstr,
         subinstr,
         mulinstr,
         divinstr,
         modinstr,
         eqinstr,
         neinstr,
         ltinstr,
         gtinstr,
         leinstr,
         geinstr,
         andinstr,
         xorinstr,
         notinstr               : ();
         createtask             : (template : templatespace;
                                   masterindex : nestinglevel;
                                   isallocated : boolean;);
   end;
   arbnode = record
      case tag : memorytag of
         nul  : ();
         intg : (j : integer;);
         bool : (b : boolean;);
         vara,
         eprm,
         sprm : (a : variableaddress;);   (* compiler cannot generate *)
         tska : (t : vpname;);            (* values of these types    *)
   end;
   unittype = ( task
              , accept
              , block
              , subprogram
              , packagebody
   unittemplate = record
      sourcename         : string;
      firstinstruction   : codespace;
      staticnestinglevel : nestinglevel;
Assuming that the symbol currentinstruction represents the
instruction to be interpreted in the form of a value of type
|instructionunit|, and that the symbol TOS stands for the |arbnode| on
top of the expression stack at any particular point in time, the meaning
of the instructions is as follows:
case currentinstruction.opcode of
raisex
    Corresponds to the Ada statement "raise x;" where x is the name of
    some exception and is mapped by the compiler to the integer operand
    "theconst".
reraise
    Corresponds to the Ada statement "raise;".
cleararms
    Initiate processing of an accept or selective wait.
setarmaccept
    Establishes an open accept alternative for this selective wait.
    Currentinstruction.continu is the address of the statements
    following the rendezvous. Currentinstruction.tmplate is the index
    of the template for the accept statement. Pops from TOS the index
    of the entry being accepted.
setarmdelay
    Establishes an open delay alternative for this selective wait.
    Currentinstruction.altpath is the address of the code for the delay
    alternative. Pops from TOS the delay interval.
setarmterminate
    Establishes an open terminate alternative for this selective wait.
    Currentinstruction.chkaddr is the address of the code for the
    dependent check.
setarmelse
    Establishes an else part for this selective wait.
    Currentinstruction.altpath is the address of the code for the else
    part.
select
    Performs a selective wait using the alternatives established since
    the most recent |cleararms| instruction and according to Ada's
    semantics. If there were no open alternatives, raises exception.
    If there are entries on an open alternative, chooses such an
    alternative and calls the accept for that alternative. If there
    are no entries on an open alternative:
        If there are only accept alternatives, waits for an entry call.
        If there is an else part, branches to the else part.
        If there is a terminate alternative, branches to the code for
        the dependent check (if not all dependents are already known to
        be terminable) or waits for an entry call or task removal.
        If there is a delay alternative which has expired, branches to
        that (some) delay alternative.
        If there is a delay alternative which has not expired, waits
        for an entry call or expiration of the delay.
jmp
    Branch to currentinstruction.destaddr.
brfalse
    Pops TOS. If false then branch to currentinstruction.destaddr.
brtrue
    Pops TOS. If true then branch to currentinstruction.destaddr.
ref
    Creates an address from currentinstruction.tag (kind of memory
    containing the object addressed),
    currentinstruction.staticnestinglevel (of the object addressed),
    and currentinstruction.offset (within the object's local
    environment) and pushes it onto TOS.
loadcount
    Pops entry index from TOS. Pushes E'count onto TOS.
loadintconst
    Pushes an integer value made from currentinstruction.theconst onto
    TOS.
load
    Pops an address from TOS. Pushes the value found at that address
    onto TOS.
store
    Pops a value from TOS. Pops an address from TOS. Stores the value
    at that address.
entrycallNORMAL
    Uses the |inn| parameter map in the template indexed by
    currentinstruction.templat to pop the actual parameters. That
    template must be for SOME accept statement which accepts the
    appropriate entry family; there is no guarantee that that accept
    statement will be the point of rendezvous in the callee.
entrycallCONDITIONAL
    Uses the |inn| parameter map in the template indexed by
    currentinstruction.tmplate to pop the actual parameters. That
    template must be for SOME accept statement which accepts the
    appropriate entry family; there is no guarantee that that accept
    statement will be the point of rendezvous in the callee. Pops from
    TOS the in (in out) actual parameters. The bottommost actual gets
    the lowest in (in out) formal address. Pops from TOS the entry
    index. Pops from TOS the called task's address. Initiates a
    conditional entry call. If no rendezvous, branches to
    currentinstruction.continu.
entrycallTIMED
    Pops from TOS the delay interval. Uses the |inn| parameter map in
    the template indexed by currentinstruction.tmplate to pop the
    actual parameters. That template must be for SOME accept statement
    which accepts the appropriate entry family; there is no guarantee
    that that accept statement will be the point of rendezvous in the
    callee. Pops from TOS the in (in out) actual parameters. The
    bottommost actual gets the lowest in (in out) formal address. Pops
    from TOS the entry index. Pops from TOS the called task's address.
    Initiates a timed entry call. If no rendezvous, branches to
    currentinstruction.continu.
getdependent
    Makes the current dependent the next dependent. If no more
    dependents, branch to currentinstruction.altpath.
getownedtask
    Makes the current child the next child (declared or allocated
    depending on most recent reset). If no more children, branch to
    currentinstruction.altpath.
resetdependentlist
    Makes the current dependent the first dependent.
resetdeclaredtasklist
    Makes the current child the first declared child.
resetallocatedtasklist
    Makes the current child the first allocated child.
abortdependent
    Aborts the current dependent and branches to
    currentinstruction.continue.
aborttask
    Pops task address from TOS. Aborts that task.
activatechild
    If the current child has been created but not yet told to do its
    activation, tells it to do its activation. Branches to
    currentinstruction.continue.
awaitactivationdoneforall
    Waits until all children have announced they are activated or
    encountered errors.
letchildbegin
    If the current child has announced it is activated, tells it to
    proceed with its begin statement. Branches to
    currentinstruction.continue.
myactivationisdone
    Announces to parent (not master) that this child is activated.
    Awaits permission to proceed with begin statement.
checkdependent
    Checks whether the current dependent is ready to terminate. If so,
    holds the dependent in limbo and branches to
    currentinstruction.continue.
letdependentproceed
    Takes the current dependent out of limbo due to a |checkdependent|
    instruction. Branches to currentinstruction.continue.
iamterminable
    Sets an internal flag announcing the task may terminate.
removedependent
    Eradicates the current dependent. Branches to
    currentinstruction.continue.
delay
    Pops delay interval from TOS. Delays at least that long.
ALLOCATE
    NOT IMPLEMENTED YET (used for dynamic allocation of other than
    tasks).
GOTOFROMB
    NOT IMPLEMENTED YET (used for a goto within a block whose
    destination is outside of that block).
RETINBLOC
    NOT IMPLEMENTED YET (used for a return statement which is textually
    within a block).
return
    Propagates any pending/unhandled exceptions to the calling unit
    and/or surrounding unit or nowhere as appropriate for Ada. If
    returning from a task, becomes terminated and awaits removal by its
    master. If returning from an accept, either raises tasking_error
    in the caller of the entry or copies the out (in out) formal entry
    parameters back onto the TOS of the caller of the entry, leaving
    lowest formal addressed bottommost. If returning from an accept,
    allows the caller of the entry to proceed. If returning from a
    subprogram, pushes the out (in out) formal parameters onto TOS,
    leaving lowest formal addressed bottommost. If not returning from
    a task, resumes at the continuation address saved at the call or
    select.
call
    Pops from TOS index of the template for the unit being called.
    (The index is not a fixed operand in order that generic subprogram
    parameters may be implemented.) Uses as continuation address
    (return address) currentinstruction.continue. If the unit is a
    subprogram (determined from the template), pops from TOS the actual
    parameters into the in (in out) formal parameters. The bottommost
    actual gets the lowest formal address.
createtask
    Pops from TOS the integer number of the abstract processor on which
    the instruction creates a task with the unit at nesting level
    currentinstruction.masterindex as master and using the template
    indexed by currentinstruction.template.
    Currentinstruction.isallocated determines whether the new task
    becomes a declared or an allocated child. The task's address is
    left on TOS.
enableexceptions
    On entry to a unit, exceptions are inhibited. This allows
    subsequent exceptions to be raised. Any exceptions which would
    otherwise have been raised previously in this unit are raised here.
effector
    Pops from TOS an intg or bool arbnode. If the arbnode is bool or
    not currentinstruction.isastring, uses the arbnode's value;
    otherwise considers the arbnode's value to be a constant string
    address. Writes the value or the first 10 characters of the string
    onto the RP's port addressed by currentinstruction.aarea.
index
    Pops an index from TOS. Pops an address from TOS. Pushes onto TOS
    a new address formed from the first increased by the index.
addinstr
    Pops an integer from TOS. Pops an integer from TOS. Pushes their
    sum onto TOS.
subinstr
    Pops an integer minuend from TOS. Pops an integer subtrahend from
    TOS. Pushes the difference onto TOS.
mulinstr
    Pops an integer from TOS. Pops an integer from TOS. Pushes their
    product onto TOS.
divinstr
    Pops an integer divisor from TOS. Pops an integer dividend from
    TOS. Pushes the integer quotient onto TOS.
modinstr
    Pops an integer divisor from TOS. Pops an integer dividend from
    TOS. Pushes the integer remainder onto TOS.
eqinstr
    Pops a value from TOS. Pops a value from TOS. If their tags are
    different, pushes a boolean false onto TOS, else if they are
    integer or boolean, pushes the boolean result of comparing them for
    equality onto TOS, else raises exception.
neinstr
    Pops a value from TOS. Pops a value from TOS. If their tags are
    different, pushes a boolean true onto TOS, else if they are integer
    or boolean, pushes the boolean result of comparing them for
    inequality onto TOS, else raises exception.
ltinstr
    Pops an integer from TOS. Pops an integer from TOS. Pushes onto
    TOS the boolean result of comparing them for (the second less than
    the first).
gtinstr
    Pops an integer from TOS. Pops an integer from TOS. Pushes onto
    TOS the boolean result of comparing them for (the second greater
    than the first).
leinstr
    Pops an integer from TOS. Pops an integer from TOS. Pushes onto
    TOS the boolean result of comparing them for (the second less than
    or equal to the first).
andinstr
    Pops a boolean from TOS. Pops a boolean from TOS. Pushes onto TOS
    the boolean result of comparing them for (both true).
orinstr
    Pops a boolean from TOS. Pops a boolean from TOS. Pushes onto TOS
    the boolean result of comparing them for (at least one true).
xorinstr
    Pops a boolean from TOS. Pops a boolean from TOS. Pushes onto TOS
    the boolean result of comparing them for (exactly one true).
notinstr
    Pops a boolean from TOS. Pushes its logical complement onto TOS.
end of case
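The operand order of the arithmetic and comparison instructions is easy to get wrong, so the following Python sketch executes them on a list used as the expression stack. It is an illustration only, not the interpreter's source. Stack items are modelled as `(tag, value)` pairs; the function names `execute` and `trunc_div` are hypothetical; truncation of the quotient toward zero and the matching remainder are assumptions, since the text says only "integer quotient" and "integer remainder". The |geinstr| instruction is omitted because its description does not appear in the list above.

```python
def trunc_div(dividend, divisor):
    """Integer quotient truncated toward zero (an assumption of this sketch)."""
    quotient = abs(dividend) // abs(divisor)
    return quotient if (dividend >= 0) == (divisor >= 0) else -quotient


def execute(stack, opcode):
    """Apply one arithmetic, comparison, or boolean instruction to the stack."""
    if opcode == "notinstr":
        _, value = stack.pop()
        stack.append(("bool", not value))
        return

    first_tag, first = stack.pop()     # popped first: was on top of the stack
    second_tag, second = stack.pop()   # popped second

    if opcode in ("eqinstr", "neinstr"):
        if first_tag != second_tag:
            result = opcode == "neinstr"
        elif first_tag in ("intg", "bool"):
            result = (first == second) == (opcode == "eqinstr")
        else:
            raise RuntimeError("exception raised: operands are not comparable")
        stack.append(("bool", result))
        return

    # Each entry follows the wording of the instruction descriptions.
    binary = {
        "addinstr": ("intg", lambda: first + second),
        "subinstr": ("intg", lambda: first - second),        # first is the minuend
        "mulinstr": ("intg", lambda: first * second),
        "divinstr": ("intg", lambda: trunc_div(second, first)),  # first is the divisor
        "modinstr": ("intg", lambda: second - first * trunc_div(second, first)),
        "ltinstr": ("bool", lambda: second < first),
        "gtinstr": ("bool", lambda: second > first),
        "leinstr": ("bool", lambda: second <= first),
        "andinstr": ("bool", lambda: first and second),
        "orinstr": ("bool", lambda: first or second),
        "xorinstr": ("bool", lambda: first != second),
    }
    tag, operation = binary[opcode]
    stack.append((tag, operation()))
```

Note the asymmetry that follows from the descriptions as written: for |subinstr| the value on top of the stack is the minuend, while for |divinstr| and |modinstr| the value on top is the divisor. A compiler must therefore push the operands of a subtraction in the opposite order from those of a division.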
Sample uses of the instructions:
---------------------------------------------------------------------------
BY USING THESE CODE SEQUENCES, THE COMPILER MAKES THE VP LOOK LIKE AN Ada
MACHINE. NOTE THAT THE SEQUENCES ARE GIVEN IN A SYMBOLIC PSEUDO-ASSEMBLY
CODE FORM WHICH THE COMPILER IS *NOT* TO SEND TO THE INTERPRETER.
A legend follows on this page.
---------------------------------------------------------------------------
Ada source example
---------------------------------------------------------------------------
Assembly language translation where      | Pertinent information
   digits: is a label                    | in template space, i.e.,
   {}  surrounds an indication that      | symbolic templates
       other code should be expanded     |
       at that location                  |
   <>  surrounds a reference to a label  |
---------------------------------------------------------------------------

---------------------------------------------------------------------------
accept e1(i)(p);
nxtstmt;
---------------------------------------------------------------------------
    cleararms                            | ts1:
    {eval e1(i)}                         | "accept e1(i)"
    setarmaccept (<ts1>,<2>)             | <1>
    select                               | 1+(nesting level of unit
1:                                       |    containing this accept)
    return                               | (exceptionhandlerlist
2:                                       |    contains all zeros)
    jmp <3>                              | ACCEPT
3:                                       | (parameter maps depend on (p))
    {nxtstmt}                            |
---------------------------------------------------------------------------

---------------------------------------------------------------------------
accept e1(i)(p) do
   stmt1;
end e1;
---------------------------------------------------------------------------
    cleararms                            | ts1:
    {eval e1(i)}                         | "accept e1(i)"
    setarmaccept (<ts1>,<2>)             | <1>
    select                               | 1+(nesting level of unit
1:  {stmt1}                              |    containing this accept)
    return                               | (exceptionhandlerlist
2:                                       |    depends on text in stmt1)
    jmp <3>                              | ACCEPT
3:                                       | (parameter maps depend on (p))
---------------------------------------------------------------------------
---------------------------------------------------------------------------
select
   when b1 => accept e1(i)(p) do
                 stmt1;
              end e1;
              stmt2;
or
   when b2 => accept e2(i)(p) do
                 stmt3;
              end e2;
              stmt4;
or
   when b3 => delay x1;
              stmt5;
or
   when b4 => delay x2;
              stmt6;
or
   when b5 => terminate;
else
   stmt7;
end select;
nxtstmt;
---------------------------------------------------------------------------
    cleararms                            | ts1:
    {eval b1}                            | "accept e1(i)"
    brfalse <1>                          | <12>
    {eval e1(i)}                         | 1+(nesting level of unit
    setarmaccept (<ts1>,<13>)            |    containing this accept)
1:                                       | (exceptionhandlerlist
    {eval b2}                            |    depends on text in stmt1)
    brfalse <2>                          | ACCEPT
    {eval e2(i)}                         | (parameter maps depend on (p))
    setarmaccept (<ts3>,<15>)            |
2:                                       | ts3:
    {eval b3}                            | "accept e2(i)"
    brfalse <3>                          | <14>
    {eval x1}                            | 1+(nesting level of unit
    setarmdelay <16>                     |    containing this accept)
3:                                       | (exceptionhandlerlist
    {eval b4}                            |    depends on text in stmt3)
    brfalse <4>                          | ACCEPT
    {eval x2}                            | (parameter maps depend on (p))
    setarmdelay <17>                     |
4:
    {eval b5}
    brfalse <5>
    setarmterminate <7>
5:  setarmelse <18>
6:  select
7:  resetdependentlist
8:  getdependent <11>
    checkdependent <8>
9:  resetdependentlist
10: getdependent <6>
    letdependentproceed <10>
11: iamterminable
    jmp <9>
12: {stmt1}
    return
13: {stmt2}
    jmp <19>
14: {stmt3}
    return
15: {stmt4}
    jmp <19>
16: {stmt5}
    jmp <19>
17: {stmt6}
    jmp <19>
18: {stmt7}
    jmp <19>
19: {nxtstmt}
---------------------------------------------------------------------------
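---------------------------------------------------------------------------
delay x;
nxtstmt;
---------------------------------------------------------------------------
    {eval x}
    delay
    {nxtstmt}
---------------------------------------------------------------------------

---------------------------------------------------------------------------
t.e(i)(p);
nxtstmt;
---------------------------------------------------------------------------
    {eval t}
    {eval e(i)}
    {eval p}
    entrycallNORMAL <some template accepting t.e()>
    {pop resulting parameters pr from TOS}
    {nxtstmt}
---------------------------------------------------------------------------

---------------------------------------------------------------------------
select
   t.e(i)(p);
   stmt1;
else
   stmt2;
end select;
nxtstmt;
---------------------------------------------------------------------------
    {eval t}
    {eval e(i)}
    {eval p}
    entrycallCONDITIONAL (<some template accepting t.e()>,<1>)
    {pop resulting parameters pr from TOS}
    {stmt1}
    jmp <2>
1:  {stmt2}
2:  {nxtstmt}
---------------------------------------------------------------------------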
---------------------------------------------------------------------------
select
   t.e(i)(p);
   stmt1;
or
   delay x;
   stmt2;
end select;
nxtstmt;
---------------------------------------------------------------------------
    {eval t}
    {eval e(i)}
    {eval p}
    {eval x}
    entrycallTIMED (<some template accepting t.e()>,<1>)
    {pop resulting parameters pr from TOS}
    {stmt1}
    jmp <2>
1:  {stmt2}
2:  {nxtstmt}
---------------------------------------------------------------------------

---------------------------------------------------------------------------
{block, subprogram, or sequence_of_statements_of_a_package_body call};
nxtstmt;
---------------------------------------------------------------------------
    {eval p}                             -- if a subprogram
    loadintconst (template index)
    call <1>
    {code for the block goes here if calling a block}
---------------------------------------------------------------------------
{unit prologue}
---------------------------------------------------------------------------
-- entry point
-- The next 3 instructions are repeated for each task declared
    ref (task variable)
    loadintconst (AP number)
    createtask (template index, master's nesting_level, false)
    store
-- The next 3 instructions are repeated for each task allocated
    ref (access variable)
    loadintconst (AP number)
    createtask (template index, master's nesting_level, true)
    store
    {Whatever else counts as activation}
-- the following instruction occurs only in task bodies
    myactivationisdone
-- here up to the enableexceptions instruction is optional
-- if the declarations declare/allocate no tasks
-- NOTE: a compiler is permitted to use more intelligence here
--       omitting sections which don't apply if tasks are only
--       declared or only allocated or neither in the unit.
    resetdeclaredtasklist
2:  getownedtask <3>
    activatechild <2>
3:  resetallocatedtasklist
4:  getownedtask <5>
    activatechild <4>
5:  awaitactivationdoneforall
    resetdeclaredtasklist
6:  getownedtask <7>
    letchildbegin <6>
7:  resetallocatedtasklist
8:  getownedtask <9>
    letchildbegin <8>
9:  enableexceptions
    {set up initial values of variables}
---------------------------------------------------------------------------
53
-....^^--------------.E.

t := new tasktype;
nxtstmt;
----------------------------------------
    ref                  (t)
    loadintconst         {AP number}
    createtask           (template index for tasktype, master's nesting_level, true)
    store
    resetallocatedtasklist
1:  getownedtask         <2>
    activatechild        <1>
2:  awaitactivationdoneforall
    resetallocatedtasklist
3:  getownedtask         <4>
    letchildbegin        <3>
4:  {nxtstmt}

{dependent clean-up code at {nullcodeaddress}}
----------------------------------------
    -- all units (except accepts) jump here rather than return
1:  resetdependentlist
2:  getdependent         <4>
    checkdependent       <2>
    resetdependentlist
3:  getdependent         <1>
    letdependentproceed  <3>
4:  resetdependentlist
5:  getdependent         <6>
    removedependent      <5>
6:  return

-- hand coded example program
task body ET is
   procedure main is
      task a is
      end a;
      task b is
         entry e;
      end b;
      task body b is
      begin -- b
         loop
            accept e do
               send_control(2,"B");
            end e;
         end loop;
      end b;
      task body a is
      begin -- a
         loop
            b.e;
            send_control(1,"A");
         end loop;
      end a;
   begin -- main
      null;
   end main;
begin -- ET
   main;
end ET;

-- -------------
-- HAND-CODED
-- -------------
-- almost all bodies are going to jmp here
5:  resetdependentlist
6:  getdependent         <8>
    checkdependent       <6>
    resetdependentlist
7:  getdependent         <5>
    letdependentproceed  <7>
8:  resetdependentlist
9:  getdependent         <10>
    removedependent      <9>
10: return
0:  enableexceptions
    loadintconst         <ts17>
    call                 <5>
-----------------------------------
11: myactivationisdone
    enableexceptions
12: cleararms
    loadintconst         1
    setarmaccept         (<ts13>,<14>)
    select
-----------------------------------
13: loadintconst         <23>
    effector             (2,true)
    return
-----------------------------------
14: jmp                  <12>
    jmp                  <5>
-----------------------------------
15: myactivationisdone
    enableexceptions
16: ref                  (2,3,vara)  -- {task variable b}
    load
    loadintconst         1
    entrycallNORMAL      <ts13>
    loadintconst         <22>
    effector             (1,true)
    jmp                  <16>
    jmp                  <5>
-----------------------------------
17: ref                  (2,2,vara)  -- {task variable a}
    loadintconst         1
    createtask           (2,<ts11>,false)
    store
    ref                  (2,3,vara)  -- {task variable b}
    loadintconst         2
    createtask           (2,<ts13>,false)
    store
    resetdeclaredtasklist
18: getownedtask         <19>
    activatechild        <18>
19: awaitactivationdoneforall
    resetdeclaredtasklist
20: getownedtask         <21>
    letchildbegin        <20>
21: enableexceptions
    jmp                  <5>
-----------------------------------
22: dw                   "A"
23: dw                   "B"
-----------------------------------
-- templates
ts0:  "ET"        <0>   1   {lots of zeros}  TASK
ts17: "main"      <17>  12  {lots of zeros}  SUBPROGRAM  {maps all false}
ts15: "a"         <15>  13  {lots of zeros}  TASK
ts11: "b"         <11>  3   {lots of zeros}  TASK
ts13: "accept e"  <13>      {lots of zeros}  ACCEPT      {maps all false}

-- hand coded example program
task body ET is
   procedure main is
   begin -- main
      send_control(1,
   end main;
begin -- ET
   main;
end ET;

-- -------------
-- HAND-CODED
-- -------------
-- almost all bodies are going to jmp here
5:  resetdependentlist
6:  getdependent         <8>
    checkdependent       <6>
    resetdependentlist
7:  getdependent         <5>
    letdependentproceed  <7>
8:  resetdependentlist
9:  getdependent         <10>
    removedependent      <9>
10: return
0:  enableexceptions
    loadintconst         <ts11>
    call                 <5>
-----------------------------------
11: enableexceptions
    loadintconst         <12>
    effector             (1,true)
    return
    -- It is not required that a subprogram do
    --     jmp           <5>
    -- if it *cannot* have dependents.
-----------------------------------
12: dw
-----------------------------------
-- templates
ts0:  "ET"        <0>   1   {lots of zeros}  TASK
ts11: "main"      <11>  12  {lots of zeros}  SUBPROGRAM  {maps all false}

APPENDIX 2
FAULT TOLERANT DISTRIBUTED SYSTEMS USING Ada
John C. Knight	 John I. A. Urquhart
Department of Applied Mathematics and Computer Science
University of Virginia
Charlottesville
Virginia, 22901

Abstract
This paper discusses the use of Ada on distributed systems in which
failure of processors has to be tolerated. We assume that communication
between tasks on separate processors will take place using the
facilities of the Ada language, primarily the rendezvous. We show that
there are numerous aspects of the language which make its use on a
distributed system very difficult. These issues arise from the
desire to be able to recover, reconfigure, and provide continued service
in the presence of hardware failure. For example, if a rendezvous takes
place between two tasks on different processors, failure of the
processor executing the serving task will cause the calling task to be
permanently suspended because the rendezvous will never end. Extensive
modifications to the execution support required for Ada are proposed
which provide all the necessary facilities for programs written in Ada
to withstand arbitrary processor failure. Mechanisms are suggested to
allow processor failure to be detected and for tasks which would be
permanently suspended to be released. Provided the required program
structures are used, continued processing can be provided.

Introduction

Over the next decade, it is expected that many aerospace systems
will use Ada as the primary implementation language. This is a logical
choice because the language has been designed for embedded systems.
Also, Ada has received such great care in its design and implementation
that it is unlikely that there will be any practical alternative in
selecting a programming language for embedded software.

The reduced cost of computer hardware and the expected advantages
of distributed processing (for example, increased reliability through
redundancy and greater flexibility) indicate that many aerospace
computer systems will be distributed. The use of Ada and distributed
systems seems like a good combination for advanced aerospace embedded
systems.

In this paper we discuss the possibility that a distributed system
may be programmed entirely in Ada so that the individual tasks of the
system are unconcerned with which processor they are executing on. We
assume that communication between tasks on separate processors will take
place using the facilities of the Ada language, primarily the
rendezvous. It would be possible to build a separate set of facilities
for communication between processors and treat the software on each
machine as a separate program. This is pointless, however, since such
facilities would necessarily duplicate the existing facilities of the
rendezvous.

following such failures. It is shown that this causes considerable
difficulties for Ada programs. Solutions to these problems are
suggested.

The Need To Cope With Hardware Failure

The kind of architecture we expect to be in common use for embedded
systems in the future is shown in figure 1. It is based on the use of a
high-performance data bus which links several processors. Each
processor is equipped with its own memory. Devices such as displays,
sensors, and actuators would be connected to the bus via dedicated
microprocessors. Thus these devices would be accessible from each
processor.

  Memory        Memory        Sensor            Actuator
    |             |             |                  |
 Processor     Processor    Microprocessor    Microprocessor
    |             |             |                  |
 --------------- Communications Network ---------------

Figure 1  Distributed Architecture

A great deal of research has been undertaken in recent years to
produce computer architectures with high reliability such as the SIFT¹
and FTMP² machines. Why then should there be any concern for software
structures which are able to cope with hardware failure? There are
several reasons:

(1) The architectures of highly reliable systems are very complex. Such
    machines are, in effect, highly-parallel multiprocessors and their
    reliability is achieved by parallelism. These architectures are
    the subject of current experimentation and are still unproven.

(2) Even though designed for reliability, these machines may still
    fail.

(3) Physical damage could cause a processor to fail no matter how
    carefully the processor was built. Fire, structural failure,
    excess or unexpected vibration, and so on, could cause enough
    damage that even a highly-parallel machine would be unable to
    continue.

(4) Electrical damage from unexpected lightning effects could cause a
    processor to fail.

(5) In a situation where a major power failure occurred, reserve power
    might only be provided for some subset of the processors. The
    switch from full power to limited reserve power might be orderly,
    in which case very sophisticated reconfiguration might be possible.
    However, it might be preferable to use a single, consistent
    mechanism for recovery to cope with all cases.

(6) Unmanned spacecraft frequently make extensive use of computers but
    are usually unable to pay the weight and power costs of extensive
    redundancy (such as in quad redundancy). Reconfigurable
    distributed systems designed to cope with processor failure are an
    attractive alternative. If the design includes higher processing
    power than is absolutely needed, and tasks exist which are not
    essential to mission success, then some loss of hardware followed
    by reconfiguration may allow the mission to continue successfully.

Thus, although great care may be taken with the construction of a
digital computer system, failure may still occur. At least with a
distributed system there is the possibility that if part of the system
is lost, what remains could continue to provide service.

Initially, we assume that communication between processors on a
distributed system will be implemented using layers of software that
conform for the most part to the ISO standard seven-layer Reference
Model³. The hardware topology that is used for a distributed system
need have very little impact on the programming of the system at the
application-layer level. In principle, provided the implementation
knows how tasks are distributed to processors and how communication is
to be achieved, the various tasks can synchronize and communicate at
will with no knowledge of their location.

The kinds of hardware failure that we are concerned with are not
addressed by the ISO protocol. The ISO protocol is concerned with
communications failures such as dropped bits caused by noise, loss of
messages or parts of messages, etc. Also, situations such as a
processor "slowing down" or incorrectly computing results are not of
interest here (though they are important nevertheless). We assume that
such events are taken care of by hardware checking within the processor.
The only class of faults not dealt with elsewhere is the total loss of a
processor or bus with no warning. These are the difficulties we will
attempt to deal with.

Ada Issues And Difficulties

In this section, some of the difficulties with the use of Ada on a
distributed system are described. We examine only the simple rendezvous
and the timed entry call. Lack of space precludes examination of the
entire language but this has been done elsewhere⁴. Proposed solutions to
some of the problems raised here are given in a later section.

Simple Rendezvous

A simple rendezvous in Ada consists of a calling task C making an
entry call, S.E, to a serving task S which contains an accept statement
for the entry E. The syntax is shown in figure 2. The semantics of the
language require that if the call is made by C before the accept is
reached by S, C is suspended until the accept is reached. If S reaches
the accept before the call is made by C, S is suspended until the call
is made. In either case, C remains suspended until the rendezvous
itself is complete.

Calling Task C          Serving Task S

                        ACCEPT E DO
    S.E;
                        END E;

Figure 2  Syntax Of A Simple Rendezvous

In order to look at the issues arising from a rendezvous in which
the tasks involved are on different processors, it is necessary to
specify an implementation of the rendezvous at the message passing
level. Only the simple case of a task C calling an entry E in a serving
task S will be considered. Further, it will be assumed that the call is
made before S has reached the corresponding accept; the case where the
server waits at its accept is similar. One possible message sequence is
shown in figure 3. The numbers inside brackets represent points of
interest in the message sequence.

Caller C            Messages                   Server S

S.E;                                             [4]
           ------PUT_ONTO_QUEUE------>
  [1]                                            [5]
                                             ACCEPT E DO
           <-------CHECK_CALLER-------
  [2]                                            [6]
           ----CHECK_CALLER_REPLY---->
  [3]                                            [7]
                                             END E;
           <----RENDEZVOUS_COMPLETE---

Figure 3  Rendezvous Message Sequence

The calling task C asks to be put onto the queue for entry E. When
S reaches its accept for E, it sees that C is on the queue. At this
point S checks to see if C has been aborted. When the CHECK_CALLER
message arrives at C, C can be considered to be engaged in the
rendezvous. When the reply reaches S, S will start to execute the
rendezvous code. When it is completed the RENDEZVOUS_COMPLETE message
would awaken C which would continue.
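
The message sequence just described can be traced with a short simulation. This is only a sketch, written in Python rather than Ada so that it can be run directly. The message names are those of figure 3; the function name, the `caller_aborted` flag, and the behaviour when the caller has been aborted (the server simply does not start the rendezvous) are assumptions made for illustration.

```python
from collections import deque

# Message names follow figure 3; everything else here is illustrative.
PUT_ONTO_QUEUE = "PUT_ONTO_QUEUE"
CHECK_CALLER = "CHECK_CALLER"
CHECK_CALLER_REPLY = "CHECK_CALLER_REPLY"
RENDEZVOUS_COMPLETE = "RENDEZVOUS_COMPLETE"


def simple_rendezvous(caller_aborted=False):
    """Trace one entry call S.E as a list of (sender, message) pairs."""
    trace = []
    entry_queue = deque()

    # C asks to be put onto the queue for entry E and is suspended.
    trace.append(("C", PUT_ONTO_QUEUE))
    entry_queue.append("C")

    # S reaches its accept, takes the first queued caller, and checks
    # whether that caller has been aborted.
    caller = entry_queue.popleft()
    trace.append(("S", CHECK_CALLER))
    if caller_aborted:
        # Assumption: with no reply, S does not start the rendezvous.
        return trace

    # The caller is now engaged in the rendezvous and replies.
    trace.append((caller, CHECK_CALLER_REPLY))

    # S executes the accept body, then awakens the caller.
    trace.append(("S", RENDEZVOUS_COMPLETE))
    return trace
```

Calling `simple_rendezvous()` yields the four messages in the order of figure 3, alternating between C and S as sender.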

Effects Of Processor Failure On Rendezvous

Using this implementation of a simple rendezvous, what happens if
either processor fails? There are seven cases of interest and they are
discussed below. The numbers refer to the points marked in figure 3.

CALLER      EFFECT ON SERVER
FAILS AT

[3]   When the caller's processor fails during the rendezvous, the
      situation is similar to the case where the caller is aborted
      during the rendezvous. In both cases the server can continue.
      At the end of the rendezvous the RENDEZVOUS_COMPLETE message
      cannot arrive; as before, if the server can detect that there
      has been a failure, the server can continue.

SERVER      EFFECT ON CALLER
FAILS AT

[4]   The message PUT_ONTO_QUEUE cannot arrive. The situation is
      similar to the case where the server is abnormal.

[5]   Here the caller is on an entry queue when the server's
      processor fails. As before, if the failure cannot be detected
      the caller will be trapped.

[6]   The message CHECK_CALLER has arrived at the caller who now
      considers that the rendezvous has started; the reply cannot
      arrive. Again, without failure detection the caller will be
      trapped. (Note that even if the caller were using a timed
      entry call, the timer would have been turned off by the
      message CHECK_CALLER.)

[7]   The server's processor fails during the rendezvous. (Timed and
      conditional entry calls give no protection as they time the
      delay to the start of the rendezvous.) Again the caller is
      trapped unless the failure can be detected.

The serving task is not seriously affected when the calling task's
processor fails. At worst, time is lost doing work for a task that is
not there to receive it. The calling task is in a much worse situation
when the server's processor fails. If the rendezvous has not already
started the caller will wait on the entry queue for ever. In principle,
timed entry calls (discussed below) can handle this situation, if they
are implemented by having the calling task do the timing. If the serving
task's processor fails after the rendezvous has started, even a caller
who has made a timed or conditional entry call will be trapped for ever.
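
The server-failure cases [4] to [7] can be condensed into a small decision function. This is a sketch of the reasoning above, not part of any Ada run-time system. The function name and both flags are hypothetical, and `failure_detected` stands for a detection and release capability that the text has not yet introduced.

```python
def caller_trapped(point, failure_detected=False, caller_side_timer=False):
    """True if the caller is suspended for ever after a server failure.

    point: server-side point of interest from figure 3 (4, 5, 6 or 7).
    failure_detected: the run-time system notices the failure and
        releases the caller (an assumed capability).
    caller_side_timer: the call is a timed entry call whose timing is
        done by the calling task.
    """
    if point not in (4, 5, 6, 7):
        raise ValueError("server points of interest are 4..7")
    if failure_detected:
        return False
    # At [6] and [7] the caller believes the rendezvous has started, so
    # its timer (if any) is already switched off.
    rendezvous_started = point >= 6
    if rendezvous_started:
        return True
    # At [4] and [5] only a timer run by the caller can free it.
    return not caller_side_timer
```

For example, `caller_trapped(7, caller_side_timer=True)` is true: once the rendezvous has started, a timed entry call gives no protection.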

What the caller would like to have, and what even timed and
conditional entry calls do not give, is a guarantee that after a certain
time it will be possible to proceed. The rules of the language imply
that once a rendezvous has started the caller cannot withdraw until it
is completed. Clearly withdrawal is necessary when the server's

task on one processor to a task on another. Even if the underlying
message passing system can guarantee that a message will eventually
arrive correctly, this will be implemented at a lower level by a
protocol which may involve acknowledgement of messages, and the
resending of messages that have been lost. A message can certainly be
delayed for some arbitrary length of time. Even physical separation of
the processors may impose a significant delay.

One possible interpretation of the timed entry call would be to
count the total time until the rendezvous is started. Message passing
time and time on the entry queue would be included. This interpretation
probably has to be ruled out because the language definition states that
a timed entry call with a delay of zero is the same as a conditional
entry call. If the delay included both message passing time and time on
the queue, a delay of zero would be impossible and a timed entry call
with a delay of zero would never succeed.
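
The argument against counting message passing time can be checked with a line of arithmetic. The sketch below is illustrative only: the function, its parameters, and the 2 ms message time are hypothetical values chosen to show why a zero delay can never be met once any non-zero message time is counted.

```python
def timed_call_accepted(delay, message_time, queue_time, count_message_time):
    """Would a timed entry call start its rendezvous within `delay`?

    All times are in seconds. `count_message_time` selects the first
    interpretation (total time) rather than queue time alone.
    """
    measured = queue_time
    if count_message_time:
        measured += message_time
    return measured <= delay


# Server already waiting at its accept, so no time is spent on the queue.
total_time_view = timed_call_accepted(0.0, 0.002, 0.0, count_message_time=True)
queue_only_view = timed_call_accepted(0.0, 0.002, 0.0, count_message_time=False)
```

Under the total-time interpretation the zero-delay call fails even though the server is ready, so it cannot behave like a conditional entry call; counting queue time alone lets it succeed.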

Another interpretation of the delay in a timed entry call is that
it is just the delay on the entry queue. This has a meaning when the
specified delay is zero but the important implementation question
becomes "who is to do the timing". The calling task cannot do the
timing. It is impossible for it to measure waiting time on the entry
queue accurately since the message passing time can vary. Thus it is
essential that the serving task does the timing.

A timed entry call gives protection against having to wait too long
on the entry queue. However, what the task issuing the call needs is
some guarantee that it will not be trapped in an attempt to communicate,
and forced to miss a deadline. In principle, it does not matter to the
task whether the time is spent waiting on a queue or sending messages.

If the timed entry call is implemented by having the server do the
timing and the server's processor fails before a rendezvous is started
the caller will be trapped. Even if there is no failure the calling
task must wait for a message from the server. If the server is doing
the timing, that message may need to be re-sent several times, so the
calling task may have to wait an arbitrary time.

If the calling task were able to do the timing then an infinite
wait could be avoided when the server's processor failed. As we have
noted however, this method of timing is unrealistic when queue time is
being measured.

For a distributed system, we conclude that there are many problems
with the timed entry call. It does not provide the kind of protection
that is desirable, the semantics are unclear, and it is very difficult
to implement. An analysis of the message traffic necessary for the
timed entry call can be performed that is similar to that shown in
figure 3. The issues which arise when considering failure are similar
but more extensive than those arising in the simple rendezvous.

Other Issues

When the possibility of processor failure is considered, many other
aspects of Ada present difficulties similar to those outlined above.
For example, problems arise with conditional entry calls, accessing
global variables, task elaboration, and task termination. The problems
have a similar cause; namely the prospect that no reply will ever be
received to a message sent because of the failure of the processor
expected to generate the reply.

Other areas which cause difficulty are global variables and the
task master/dependent relationship. If data which is global to more
than one task is ever used to share data, loss of the processor
containing the data causes great difficulty. Finally, when the
processor executing a master task is lost, the dependents of the master
should be aborted since loss of the processor executing the master is
equivalent to the master being aborted. This raises special
difficulties when the master is the main program.

Failure Detection And Signalling

Processor failure cannot be dealt with unless it can be detected.
Details of the failure must also be supplied to the software which is to
cope with the reconfiguration. How can Ada programs detect hardware
failure and what information is needed for reconfiguration? In this
section, we present an approach to hardware failure detection and the
rationale for its choice.

Failure Detection

Failure detection could be performed by hardware facilities over
and above those provided for normal system operation. Alternatively,
failure could be detected by system software. The hardware option is
less desirable because it requires additions to existing or planned
systems and the detection hardware itself could fail. We suggest
therefore the use of software failure detection.

Software failure detection can be either passive or active. A
passive system might rely on tasks assuming that failure had occurred if
actions did not take place within a "reasonable" period of time, i.e.
timing out. Alternatively, a passive system could require that all
messages passed between tasks on separate processors be routinely
acknowledged. Thus the sender would be sure that the receiver had the
message and presumably would act on it. This is a particularly simple
case of timing out since failure has to be assumed if no acknowledgement
is received.

The disadvantages of passive detection are:

(1) Timing out assumes an agreed-upon upper limit for response time.

(2) A failed processor will not be detected until communication is
    attempted and this may be long after the failure has occurred.

Upper bounds on response time may be hard to determine. Very
complex situations can arise from an incorrect choice. The reason for a
lack of response from a task on another processor may not be failure of
that processor but merely a temporary rise in its workload. The
consequences could be an assumption by one processor that another had
failed, followed by reconfiguration to cope with the loss. Clearly, if
this assumption is wrong, two processors could begin trying to provide
the same service.
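
The danger of a badly chosen timeout can be made concrete. In the sketch below, which is purely illustrative, a "primary" processor provides a service and an "alternate" takes over whenever the primary is assumed to have failed. The function names, the provider labels, and the timing values are all hypothetical.

```python
def assumed_failed(response_time, timeout):
    """Passive detection by timing out.

    response_time is in seconds; None means no reply ever arrives.
    """
    return response_time is None or response_time > timeout


def providers_after_check(primary_alive, response_time, timeout):
    """List which processors end up providing the service."""
    providers = []
    if primary_alive:
        providers.append("primary")
    if assumed_failed(response_time, timeout):
        # Reconfiguration starts the alternate on another processor.
        providers.append("alternate")
    return providers
```

With a 0.5 second timeout, a live but heavily loaded primary that answers after 0.8 seconds gives `["primary", "alternate"]`: two processors trying to provide the same service, which is the situation the text warns against.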

Being unaware that a processor had failed will lead to a loss of
the service it was providing until the failure is noticed. In a system
with many processors each providing relatively few services, the amount
of inter-processor communication might be quite low. Thus, a failed
processor may go unnoticed for so long a time that damage to the
equipment being controlled might result from its lack of service.

It is for these reasons that we reject passive software failure
detection and suggest the use of active software failure detection. In
an active system, some kind of inter-processor activity is required
"periodically" and if it ceases, failure is assumed. The messages which
are passed are usually referred to as heartbeats.
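
A heartbeat monitor of the kind described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the mechanism of any particular run-time system: the class name, the `missed_allowed` tolerance, and the use of explicit timestamps instead of a real clock are all choices made here for the example.

```python
class HeartbeatMonitor:
    """Active failure detection: silence beyond a limit means failure."""

    def __init__(self, processors, period, missed_allowed=1):
        # A processor is assumed failed once it has been silent for
        # longer than (missed_allowed + 1) heartbeat periods.
        self.limit = period * (missed_allowed + 1)
        self.last_seen = {name: 0.0 for name in processors}

    def beat(self, processor, now):
        """Record a heartbeat received from `processor` at time `now`."""
        self.last_seen[processor] = now

    def failed(self, now):
        """Return the processors whose heartbeat has disappeared."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.limit
        )


monitor = HeartbeatMonitor(["CPU_1", "CPU_2"], period=1.0)
monitor.beat("CPU_1", 1.0)
monitor.beat("CPU_2", 1.0)
monitor.beat("CPU_1", 2.0)      # CPU_2 has stopped sending
early = monitor.failed(2.5)     # one missed beat is still tolerated
monitor.beat("CPU_1", 3.0)
late = monitor.failed(3.5)      # CPU_2 silent for 2.5 s, limit is 2.0 s
```

Unlike passive detection, the failure is noticed within a bounded time whether or not any task happens to communicate with the failed processor.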

As soon as a heartbeat disappears, the remaining processors in the
system will be aware that a failure has occurred and they will know
which processor has failed. This information must be transmitted to the
software running on each remaining processor so that reconfiguration can
take place. The information is available to the run-time support
software in some internal format, but how should it be transmitted to

reconfiguration could be present on each processor and suspended at an
accept statement for the entry which will be called when a failure
occurs. This allows each processor to have a "focal point" for
reconfiguration. If exceptions are used, the correct placement of the
necessary handler is difficult to determine because it will be
impossible to know what tasks will be engaged in what activities when
the exception is generated.

Thus we propose that a special task be defined on each processor
which will contain entries for each hardware component whose failure
requires some processing. This task will be normally suspended on the
accept statements for the special entries. When a failure occurs, an
entry call will be generated and the task will then be activated. It
will contain code following each accept statement to handle
reconfiguration.

It is not sufficient to detect failure and inform the software of
the failure using the methods described above. As discussed above, many
Ada language elements (particularly the rendezvous) can lead to
situations in which one task is permanently suspended if the processor
on which another task is executing fails. These tasks which would be
permanently suspended must somehow be released.

The mechanism which we propose to cope with this situation is shown
in figure 4. Whenever any communication takes place between tasks on
different processors, the run-time support systems on the processors
involved record the details of the communication in message logs.
Whenever a failure is detected, each processor checks its message log to
see if any of its tasks would be permanently suspended by the failure.
If any are found, they are sent "fake" messages. They are called fake
because they are constructed to appear to come from the failed processor
but clearly do not. The message content is usually equivalent to that
which would be received if the task on the failed processor had been
aborted. Thus for example, if a simple rendezvous is taking place and
the processor executing the server fails, the exception TASKING_ERROR is
raised in the caller. In this way, each processor is able to ensure
that none of its tasks is permanently suspended. However, it is the

actions are appropriate.
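
The message log and fake message idea can be illustrated as follows. This is a sketch only: the class, its method names, and the representation of a fake message as a `(task, apparent sender, content)` tuple are assumptions made for the example, and the task and processor names are hypothetical.

```python
class MessageLog:
    """Per-processor record of local tasks awaiting a remote reply."""

    def __init__(self):
        self.awaiting = {}  # local task name -> remote processor name

    def record_call(self, task, remote_processor):
        """Log that `task` is now waiting on `remote_processor`."""
        self.awaiting[task] = remote_processor

    def record_reply(self, task):
        """The genuine reply arrived, so `task` is no longer waiting."""
        self.awaiting.pop(task, None)

    def fake_messages(self, failed_processor):
        """Build fake messages for every task the failure would strand.

        Each message appears to come from the failed processor and
        carries the content the task would see had its partner been
        aborted, here TASKING_ERROR.
        """
        stranded = sorted(
            task
            for task, cpu in self.awaiting.items()
            if cpu == failed_processor
        )
        for task in stranded:
            del self.awaiting[task]
        return [(task, failed_processor, "TASKING_ERROR") for task in stranded]


log = MessageLog()
log.record_call("CALLER", "CPU_2")
log.record_call("LOGGER", "CPU_3")
log.record_call("DISPLAY", "CPU_2")
log.record_reply("DISPLAY")
released = log.fake_messages("CPU_2")
```

Only CALLER is released here: DISPLAY already had its reply, and LOGGER is waiting on a processor that has not failed.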

Fault Tolerant Programming

This is a very simple example designed to illustrate some of the
ideas discussed above. In a typical Ada application, the program would
be much larger and would have to take into account all the language
features mentioned.

The example consists of a calling task CALLER which operates on one
processor (CPU 1) and a serving task SERVER which operates on another
(CPU 2). The calling task does some real-time processing and calls an
entry in the serving task in order to get some kind of service. The
program is written to cope with failure of either processor.
Alternates are provided for the calling and the serving tasks and a
reconfiguration task is present on each processor.

Normally only the calling and serving tasks are executing, and a
fault-intolerant version of this example would consist of just these two
tasks. If processor one fails then it is necessary to start an
alternate calling task on processor two. Similarly, if processor two
fails it is necessary to start an alternate serving task on processor
one.

The alternates are present on the required machines when the
program starts execution. Each alternate is waiting on an entry named
ABNORMAL_START so that they do no processing while both processors are
operational. When one processor fails, the run-time system generates an
entry call on the other processor to an entry in its task
RECONFIGURE_CPU_i (where i is the processor number). This task then
calls the ABNORMAL_START entry for the alternate which is needed and
processing is able to continue. Entries are defined in
RECONFIGURE_CPU_i for each component that might fail. In this example,
each machine is only interested in the failure of the other so only one
entry is defined in each reconfiguration task.

If a rendezvous is in progress when the failure occurs, then the
serving task need not care that the calling task has been lost, and the
rendezvous can complete. The calling task will care if the serving task
has been lost because this will indefinitely suspend the caller. Thus
TASKING_ERROR is raised by the run-time system in the calling task.
This frees the calling task and allows it to prepare itself to use the
alternate server.

-- Code Resident on CPU 1.

task CALLER is
   entry TICK;
   -- TICK is an entry that is called
   -- periodically somehow and keeps the
   -- program synchronized in real time
end CALLER;

task body CALLER is
   type STATE is (NORMAL, ABNORMAL);
   SYSTEM_STATE : STATE := NORMAL;
begin
   loop
      accept TICK;
      begin
         case SYSTEM_STATE is
            when NORMAL =>
               -- normal pre-rendezvous
               -- processing
               SERVER.E;
               -- normal post-rendezvous
               -- processing
            when ABNORMAL =>
               -- abnormal pre-rendezvous
               -- processing
               ALTERNATE_SERVER.E;
               -- abnormal post-rendezvous
               -- processing
         end case;
      exception
         when TASKING_ERROR =>
            SYSTEM_STATE := ABNORMAL;
            loop
               -- failure has occurred
               -- since reconfiguration may
               -- take time, output
               -- default values to keep
               -- physical devices happy
               OUTPUT_DEFAULTS;
               select
                  -- rendezvous with the
                  -- reconfiguration task
                  -- to get data this task
                  -- needs to operate
                  RECONFIGURE_CPU_1.DATA(...);
                  exit;
               or
                  delay DELTA;
               end select;
            end loop;
         when others =>
            -- handle other exceptions
            null;
      end;
   end loop;
end CALLER;

task ALTERNATE_SERVER is
   entry ABNORMAL_START(...);
   entry E;
end ALTERNATE_SERVER;

task body ALTERNATE_SERVER is
begin
   accept ABNORMAL_START(...);
   loop
      -- pre-rendezvous processing
      accept E;
      -- post-rendezvous processing
   end loop;
end ALTERNATE_SERVER;

task RECONFIGURE_CPU_1 is
   entry CPU_2_FAIL;
   entry DATA(...);
end RECONFIGURE_CPU_1;

task body RECONFIGURE_CPU_1 is
begin
   -- run-time system calls the following
   -- entry automatically when a failure
   -- of CPU 2 is detected
   accept CPU_2_FAIL do
      -- this call will start the alternate
      -- server on CPU 1 - the parameters
      -- will contain the data task needs
      ALTERNATE_SERVER.ABNORMAL_START(...);
   end CPU_2_FAIL;
   accept DATA(...) do
      -- prepare data for CALLER task
      -- when operating in the ABNORMAL
      -- system state
      null;
   end DATA;
end RECONFIGURE_CPU_1;

-- Code Resident On CPU 2.
task ALTERNATE_CALLER is
   entry ABNORMAL_START(...);
end ALTERNATE_CALLER;
task body ALTERNATE_CALLER is
begin
   -- initialization code
   accept ABNORMAL_START(...);
   loop
      accept TICK;
      -- alternate processing
   end loop;
end ALTERNATE_CALLER;
task SERVER is
   entry E;
end SERVER;
task body SERVER is
begin
   loop
      -- pre-rendezvous processing
      accept E;
      -- post-rendezvous processing
   end loop;
end SERVER;
distributed system. It can be argued that this is not part of the
language's responsibility and that Ada should not address the issue.
However, a distributed system that cannot cope with processor failure is
no better than a uniprocessor system. Thus, we feel that the issue has
APPENDIX 3
Programming Language Requirements For Distributed Real-Time Systems
Which Tolerate Processor Failure
J. C. Knight
In this paper, we discuss the programming of distributed systems that
execute applications in which it is essential that continued service be
provided after failure of some subset of the system's processors. We
assume that the applications operate in real-time. The programming of
such systems has usually been done on an ad hoc basis. Many different
languages have been used; most were low-level, providing few facilities
for task communication, scheduling or reconfiguration. We suggest that
it is essential that the programmer have control over the actions which
occur following a failure. This implies that there must be facilities
in the programming language to allow the programmer to specify his
needs. In this paper we discuss deficiencies in existing languages,
such as Ada†, and propose a set of requirements for languages to be used
for programming crucial real-time applications on distributed systems.
†Ada is a trademark of the U.S. Department of Defense.

1. INTRODUCTION
One of the advantages of distributed processing is that a hardware
failure need not remove all the computing facilities. If one processor
fails, it is possible (at least in principle) for the others to continue
to provide service. This fault-tolerant characteristic is very
desirable for applications requiring high reliability. The use of
distributed processing is further encouraged by the decreasing cost of
computer hardware.
In this paper, we discuss distributed systems that execute crucial
applications. By this we mean applications for which it is essential
that continued service be provided after a failure. In general,
responding to a failure by stopping the system and replacing the faulty
component will not be acceptable.
A distributed system that is to be highly reliable will be built
with a redundant bus structure. Redundancy usually includes replicating
the bus along different routes as well as replication of the bus
hardware itself on a particular route. Loss of a complete bus need be
of little consequence if it is replicated and can be coped with by the
low-level communications software. A complete break in the bus system
that isolates some subset of the processors (i.e. the network becomes
partitioned) is much more serious though very unlikely given multiple
routes and replication. The issues that arise in that case are
different from those arising from processor failure and will not be
dealt with here. We consider only processor failures.
We assume that the applications operate in real-time. Thus a
program may have to meet external deadlines, and success or failure of a
program may depend on processor speeds and scheduling algorithms.
The programming of the kinds of systems we describe has usually
been done on an ad hoc basis. Many different languages have been used;
most were low-level, providing few facilities for task communication,
scheduling or reconfiguration. This was one of the situations that the
Department of Defense sought to improve by the introduction of Ada [1].
Although Ada was carefully designed over several years with input from

2. DISTRIBUTED SYSTEMS
The kind of architecture we expect to be in common use for real-time
control systems in the future is shown in Figure 1. It is based on
the use of a high-performance data bus that links several processors.
Each processor is equipped with its own memory. Devices such as
displays, sensors, and actuators are connected to the bus via dedicated
microprocessors. Thus these devices would be accessible from each
processor. An example is a digital avionics system for a military
aircraft. In these systems, separate computers may be used for flight
control, navigation, displays, weapons management, and so on. The
overall system requires some coordination and so the various computers
communicate via a data bus. A typical system is described in [4].
[Figure 1: Distributed Architecture. Two processors, each with its own
memory, and a sensor and an actuator, each attached through a
microprocessor, are connected to a common communications network.]
Much research has been undertaken in recent years to produce
computer architectures of great reliability. There are, however,
several reasons for employing software structures able to cope with
partial hardware failure. Even though designed for reliability, any
processor may still fail. Also, lightning, fire or physical damage
could cause a processor to fail no matter how carefully the processor
was built. At least with a distributed system there is the possibility
that if part of the system was lost, what remained could continue to
provide service.
A processor will be assumed to fail by stopping and remaining
stopped. All data in the local memory of the processor will be assumed
lost. Thus the case of a processor failing by continuing to process
instructions in an incorrect manner and providing possibly incorrect
data to other processors will not be considered. We assume that such
events are taken care of by hardware checking within the processor. An
alternative method using the Byzantine Generals algorithm is suggested
by Schlichting and Schneider [5].
While this may seem a severe restriction, at least three arguments
can be made in its favor:
(1) Faults of the assumed kind must be taken into consideration anyway
since a processor might fail in this way.
(2) Either by hardware checking within a single processor or by
checking between a dual pair of processors, it is possible for an
underlying system to simulate the assumed processor failure mode.
(3) If such a failure mode is not assumed, error recovery becomes
extremely difficult. It becomes possible for a processor to fail,
and for the resulting errors to remain undetected until all data is
compromised.
	
Given this assumption, error detection reduces to detecting that a
processor has stopped. Error recovery is simplified by the knowledge
that although data in the failed processor's memory is lost, data on the
remaining processors is correct.
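
The reduction of error detection to "detecting that a processor has
stopped" can be illustrated with a small sketch. The paper does not
prescribe a mechanism; the periodic heartbeat scheme, the class name
HeartbeatMonitor and the timeout value below are all assumptions made
only for illustration.

```python
# Hypothetical sketch: a fail-stop processor is one that has gone silent.
# HeartbeatMonitor, record() and failed() are illustrative names only.

class HeartbeatMonitor:
    """Tracks the last time each processor was heard from."""

    def __init__(self, processors, timeout):
        self.timeout = timeout  # silence longer than this means "stopped"
        self.last_seen = {p: 0.0 for p in processors}

    def record(self, processor, now):
        """Note that a heartbeat from `processor` arrived at time `now`."""
        self.last_seen[processor] = now

    def failed(self, now):
        """Return the processors silent for more than `timeout`."""
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.timeout)


monitor = HeartbeatMonitor(["cpu1", "cpu2", "cpu3"], timeout=2.0)
for cpu in ("cpu1", "cpu2", "cpu3"):
    monitor.record(cpu, now=1.0)
monitor.record("cpu1", now=3.0)   # cpu1 and cpu3 keep reporting
monitor.record("cpu3", now=3.5)
print(monitor.failed(now=4.0))    # cpu2 has been silent for 3.0 time units
```

Because a stopped processor never sends incorrect data under the
assumed failure mode, silence is the only symptom the monitor has to
look for.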

3. APPROACHES TO FAULT TOLERANCE
	
If a distributed system is to provide continued service after one
or more processor failures, then facilities must be provided over and
above those needed for normal service. We will refer to these as
continuation facilities. If there is a single continuation facility for
the entire system then the system is centralized. If the processor
providing the continuation facility fails, the system stops and this is
unacceptable. To prevent this, continuation facilities must exist on
all the processors.
However, difficulties can still arise if, following the loss of a
processor, a single continuation facility is chosen to perform fault
tolerance for the entire system. For example, since the processor
performing the fault tolerance may fail at any point, all other
recovery so that they can take over if necessary. This is unacceptable
and, in what follows, it will be assumed that each processor will have a
continuation facility which independently assesses damage and effects
whatever local changes are necessary for recovery in that processor.
On each processor remaining after a failure, the continuation
facility must take the following actions:
Detect Failure
Some mechanism must detect processor failure and communicate its
occurrence to the other parts of the continuation facility.
Assess Damage
Information must be provided so that a sensible choice of a
response can be made. Certainly it must be known what processes
were executing on the failed processor and what processes and
processors remain. Further, in many applications the response will
depend on other variables and these would also have to be known.
The height of an aircraft, for example, might determine what actions
should be taken when part of the avionics system is lost.
Processes executing on processors which survive the failure may
still be affected by the failure. For example, their execution may
depend on processes or contexts that were lost with the failed
processor. If anything is to be done about such processes, they
must be known and there must be some way of communicating with
them.
Choose a Response
After a failure is reported to the software on a particular
processor, the local continuation facility will independently
decide on a response and put into effect any changes required on
that processor. The choice of a response depends on a
reconfiguration strategy and on information available. The
information used might include both system information such as
which processors are left and external information such as fuel
level for example.
It is important that the information which the reconfiguration
strategy uses be consistent across processors, since if it is not,
the continuation facilities on different processors could decide on
different responses and thus work at cross-purposes.
Effect the Response
Once a response has been decided on, it must be possible to carry
it out. The continuation facilities should be able to abort
processes, be able to communicate with processes so that the
processes can take appropriate action on their own, and be able to
start new processes. In many cases the new processes will have to
be provided with data, and a consistent set of such data would have
to be available to the continuation facilities.
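
The four actions listed above can be sketched as a small pipeline that
each surviving processor runs on its own. This is only an illustration
under stated assumptions: the placement table, the altitude rule and
every function name are hypothetical, and the strategy shown is not the
authors' strategy.

```python
# Hypothetical sketch of one processor's continuation facility.
# The placement map, the strategy rule and all names are assumptions.

PLACEMENT = {                      # consistent process-to-processor map
    "flight_control": "cpu1",
    "navigation": "cpu2",
    "displays": "cpu2",
    "weapons": "cpu3",
}

def assess_damage(placement, failed_cpu):
    """Return (lost processes, surviving processors)."""
    lost = sorted(p for p, cpu in placement.items() if cpu == failed_cpu)
    survivors = sorted(set(placement.values()) - {failed_cpu})
    return lost, survivors

def choose_response(lost, survivors, altitude_ft):
    """Deterministic strategy: the same inputs give the same plan on
    every processor, so independent facilities do not conflict."""
    plan = {}
    for i, proc in enumerate(lost):
        if proc == "displays" and altitude_ft < 1000:
            plan[proc] = None      # degraded service: do not restart
        else:
            plan[proc] = survivors[i % len(survivors)]
    return plan

def effect_response(plan, my_cpu):
    """Return the processes this processor must start locally."""
    return sorted(p for p, cpu in plan.items() if cpu == my_cpu)

lost, survivors = assess_damage(PLACEMENT, "cpu2")
plan = choose_response(lost, survivors, altitude_ft=500)
print(plan, effect_response(plan, "cpu3"))
```

The sketch depends on every processor holding the same PLACEMENT table
and the same altitude reading, which is the consistency requirement
stated in the text.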
Various difficulties are raised by the need for these actions.
Firstly, two of the actions depend strongly on consistent data and,
without making quite unrealistic assumptions about the underlying
message passing system, it cannot be assumed that data is consistent
when a processor fails. A two-phase protocol [6] can be used in this
situation. In this protocol, a process executing on processor A sends
copies of its data to processors B and C. B and C then each send A an
acknowledgement but do not store the data in their consistent databases.
When A has received acknowledgements from B and C it sends them commit
messages. After receiving a commit message, B and C store the data in
their consistent databases. It can be shown that with some additional
processing [7] this allows processors in a distributed system to either
all have copies of the new data or all know that old copies have to be
used following a failure.
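
The exchange between A, B and C described above can be sketched as
follows. This is a minimal model, not the protocol of [6] or [7]: the
Replica class, the `reachable` flag standing in for a stopped processor,
and the abort step are assumptions, and failure of the sender between
the two phases (the "additional processing" mentioned in the text) is
not handled.

```python
# Hypothetical sketch of the two-phase update described in the text.
# Replica, two_phase_update and `reachable` are illustrative assumptions.

class Replica:
    def __init__(self, name):
        self.name = name
        self.committed = {}      # the consistent database
        self.pending = {}        # data received but not yet committed
        self.reachable = True    # False models a stopped processor

    def prepare(self, key, value):
        """Phase 1: hold the data and acknowledge, without storing it."""
        if not self.reachable:
            return False
        self.pending[key] = value
        return True

    def commit(self, key):
        """Phase 2: move the held data into the consistent database."""
        self.committed[key] = self.pending.pop(key)

    def abort(self, key):
        """Discard held data so that the old copy stays in force."""
        self.pending.pop(key, None)


def two_phase_update(replicas, key, value):
    """Return True if every replica committed, False if none did."""
    acked = [r for r in replicas if r.prepare(key, value)]
    if len(acked) < len(replicas):
        for r in acked:
            r.abort(key)
        return False
    for r in replicas:
        r.commit(key)
    return True


b, c = Replica("B"), Replica("C")
first = two_phase_update([b, c], "fuel", 42)    # both acknowledge
c.reachable = False                             # C stops
second = two_phase_update([b, c], "fuel", 40)   # B keeps the old value
print(first, second, b.committed)
```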
A second problem with the view of continuation taken above is the
treatment of unrecoverable objects [8]. If an unrecoverable object has
been modified, backward error recovery is not possible following a
failure. The problem is no different on a distributed system than on a
uniprocessor system. One apparent difference is that all the processors
in a distributed system need to be informed of changes to unrecoverable
objects and this has to be done in the presence of failures. This is
actually a manifestation of the data consistency problem discussed
above.
Given that tolerance to hardware faults is required, two completely
different approaches can be considered. In the first approach, the loss
of a processor is dealt with totally by the execution-time support
software. Any processes which were lost are restarted on remaining
processors, and all data is preserved by ensuring that multiple copies
always exist in the memories of the various machines. We refer to this
approach as transparent since, in principle, the programmer is unaware
of its existence. Transparent continuation has several advantages:
(1) The programmer need not be concerned with these aspects of fault
tolerance.
(2) The programmer need not know about the distribution. Thus the
distribution can be done by the system.
(3) The same program can be executed on different systems with
different distributions.
However, since the continuation of service is transparent to the
programmer, the programmer cannot specify degraded or 'safe' [9] service
to be used following processor failure. The system cannot specify it
either, and so transparent continuation must always provide identical
service. If identical service is impossible, the system stops.
In crucial systems this is not acceptable. Situations will occur
where identical service cannot be provided (due to physical damage, say)
and yet degraded service is essential if some catastrophe is to be
avoided. For example, a nuclear power plant may be unable to provide
power but nonetheless must be able to shut down safely.
There are also many technical difficulties in implementing
transparent continued service. Since failures can occur at arbitrary
times, the support software must always be ready to reconfigure.
Duplicate code must exist on all machines and up-to-date copies of data
must always be available on all machines. The overhead involved in
ensuring that all data is consistent on all machines all the time will
be substantial. Even if transparent continuation could be offered
without massive duplication of computing resources, it would be rejected
for many applications because of its inability to offer alternate
service.
In the second approach to dealing with the loss of a processor only
minimal facilities are provided by the execution-support software. The
fact that equipment has been lost is made known to the program and it is
expected to deal with the situation. We refer to this approach as
programmer-controlled or non-transparent.
Programmer-controlled continuation has several disadvantages:
(1) The programmer must be concerned with all aspects of fault
tolerance.
(2) The programmer must either specify the distribution or be prepared
to deal with any distribution provided by the system.
(3) The program depends on the hardware system; at least the fault
tolerant parts do.
The disadvantages are outweighed by the fact that the service
provided following failure need not be identical to the service provided
before failure. Alternate, degraded service or 'safe' service can be
offered if circumstances so dictate. In what follows only the non-
transparent approach will be considered. We assume that the actions to
be taken by each processor following a failure are specified within the
software executing on each processor.

4. REQUIRED LANGUAGE FACILITIES
As in other fault-tolerant situations, fault tolerance in a
distributed system requires facilities to allow failure to be detected,
damage assessed, a response chosen and recovery effected. In addition,
however, consistent data is required across all processors so that the
facilities may be distributed and work towards the same end.

4.1. Source Of Facilities
The necessary continuation facilities can be provided in different
ways:
(1) By using mechanisms in the programming language specifically
designed for that purpose.
(2) By using mechanisms in the programming language which were designed
for another purpose. If this were done, it would be a coincidence
if the mechanisms worked satisfactorily since they were not
designed to support fault tolerance.
(3) By using mechanisms outside the programming language such as
modifications to the execution-time environment or software written
in some other language, perhaps an assembly language.
The usual approach is the third where nothing specific is provided
in the programming language. Such an approach is reasonable when only
transparent fault tolerance is offered since in that case all facilities
will be provided by the underlying system. However, it is not suitable
for non-transparent fault tolerance. If nothing is provided in the
language, each system will develop its own methods. Programs will be
non-portable and difficult to write and maintain. For example, if more
data is required for a new reconfiguration strategy, and the part of the
system that distributes data has been written in a way that is highly
dependent on the particular data collected, then that part of the system
will probably have to be rewritten.
The need to provide non-transparent fault tolerance leads to a
general programming language requirement:
A programming language which is to be used to program distributed
systems must provide mechanisms to allow programs to be written
which can cope with processor failure.

4.2. Failure Detection
The first facility that is required is detection of the loss of a
processor. This can be done in several ways and may require hardware
distribution. If not, the programmer must be prepared to deal with the
configuration chosen by the system. In a crucial system, this is not
acceptable. For example, the system may choose to place both primary
and alternate software for some service on the same machine.
The question of what to distribute then arises. It is clear that
distribution must be relatively coarse-grained (tasks or packages in
Ada, for example) or the job of specifying reconfiguration will become
too difficult. A fundamental question is whether or not there should be
units defined in the language specifically for distribution. A
distribution unit, for example, could be restricted to use only local
variables. Thus a third programming language requirement is:
A programming language that is to be used to program distributed
systems must specify what units can be distributed, how the
distribution is to be done, and must include a syntax for expressing
distribution.
Various problems arise if this requirement is not met. For
example, an implementation may distribute five similar processes by
providing a single copy of the code (to save space) and arranging for it
to be transported in blocks as needed. The specification of
distribution may have been suitably stated but the implementation of the
distribution has not been adequately defined. Failure of the machine
holding the code will cause failure of the other processes which have
been executing on machines which did not fail.
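
Both hazards mentioned in this section (primary and alternate software
placed on one machine, and code held on a machine other than the one
executing it) are properties that could be checked mechanically once a
distribution is stated explicitly. The sketch below is hypothetical:
the three dictionaries stand in for whatever distribution syntax a
language might offer, and the two rules are assumptions chosen to match
the examples in the text.

```python
# Hypothetical sketch: checking a declared distribution for single
# points of failure. The dictionaries and rules are assumptions.

def check_distribution(placement, alternates, code_host):
    """Return a list of human-readable problems in a distribution."""
    problems = []
    for primary, alternate in sorted(alternates.items()):
        if placement[primary] == placement[alternate]:
            problems.append(
                f"{primary} and {alternate} share {placement[primary]}")
    for unit, cpu in sorted(placement.items()):
        if code_host[unit] != cpu:
            problems.append(
                f"{unit} runs on {cpu} but its code lives on {code_host[unit]}")
    return problems


placement = {"server": "cpu2", "alternate_server": "cpu2", "caller": "cpu1"}
alternates = {"server": "alternate_server"}
code_host = {"server": "cpu2", "alternate_server": "cpu2", "caller": "cpu3"}

for problem in check_distribution(placement, alternates, code_host):
    print(problem)
```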

4.4. Context Dependence
In block structured languages, a program unit can assume the
existence of an instance of all objects in the surrounding lexical
blocks. When a system is distributed, it is possible to have a given
program unit on one processor and one of its surrounding lexical blocks
on another processor. If the latter processor fails, we must decide
what to do with the surviving inner program unit.
In most languages it is possible to create objects that survive
their surrounding contexts. Usually this happens when objects are
created by some form of dynamic storage allocation. In Ada, for
example, such objects exist until the context which contains the
definition of the access type is left.
	
A first possibility is to treat surviving inner program units by
applying rules of survival which already exist in the language.
Alternatively new rules could be made for survival when a context is
lost by processor failure, or the problem could be passed on to the
programmer, who would have to specify survival conditions for each
distributed unit.
The first possibility seems simplest, since it involves just using
the language as usual. Unfortunately if, as is typical, everything
depends on a 'main' process, failure of the processor executing the
'main' process will result in everything stopping. In Ada, for example,
all the tasks defined in a main program depend on the main program.
Thus, in fact, such systems are centralized as a result of the lexical
structure of the language.
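
The centralizing effect of applying the existing survival rules can be
made concrete with a small sketch. It assumes, purely for illustration,
a rule that a unit cannot outlive its master; the MASTER and PLACEMENT
tables and the unit names are hypothetical.

```python
# Hypothetical sketch: which units are lost when a processor fails, if
# a unit cannot outlive its master. MASTER and PLACEMENT are made up.

def lost_units(master, placement, failed_cpu):
    """Units on the failed processor plus everything depending on them."""
    lost = {u for u, cpu in placement.items() if cpu == failed_cpu}
    changed = True
    while changed:                 # propagate loss down the dependence tree
        changed = False
        for unit, parent in master.items():
            if parent in lost and unit not in lost:
                lost.add(unit)
                changed = True
    return sorted(lost)


MASTER = {"main": None, "caller": "main", "server": "main", "logger": "server"}
PLACEMENT = {"main": "cpu1", "caller": "cpu1", "server": "cpu2", "logger": "cpu3"}

print(lost_units(MASTER, PLACEMENT, "cpu1"))   # every unit is lost
print(lost_units(MASTER, PLACEMENT, "cpu3"))   # only the leaf is lost
```

Losing the processor that holds 'main' takes every unit with it, even
units running on processors that did not fail.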
A further complication is that a program text describes a compile-
time structure. Usually, no distinction is made between objects which
exist at execution time and objects which exist only at compile time.
In a distributed system such a distinction is of the utmost importance.
Unfortunately which objects are compile-time and which execution-time
depends on the implementation. Under most implementations a program
unit which uses a type definition from a surrounding context bears a
completely different relationship to that context than a program unit
which uses a variable defined in the context. If, for example, a unit
uses a type definition in a surrounding scope, the unit would depend on
the surrounding scope at compile-time but not at execution-time. In
effect the type definition could be given a unique name and be copied
into the unit. The unit could then survive the loss of the surrounding
context. Similarly a program unit may (by the rules of the language)
depend on another unit which will not exist at execution-time. An Ada
library package containing only type definitions may be the master of an
Ada task for example.
This leads to the following programming language requirement:
A programming language which is to be used to program distributed
systems must distinguish between compile-time and execution-time
objects, and define survival rules for distributed program units.

4.5. Process Communication
Communication between processes on different processors is risky.
For example, messages might be lost so that they must be re-sent and
communication time becomes long, or one of the communicating processes
might be on a processor that fails. In a crucial system, unless an
arbitrarily long wait is acceptable at each communication, a process
will need some way of withdrawing from every communication. Thus,
another programming language requirement is:
It should be possible to ensure that a process can meet a deadline
no matter what happens to processes with which it is communicating.
Failure of a processor should never cause a process executing on a
surviving processor to be trapped while communicating.
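
A bounded wait of the kind this requirement calls for can be sketched
with a request and a reply queue. This is an illustration only, using
Python's standard `queue` and `threading` modules; the function names
and the deadline values are assumptions. In Ada the timed entry call
(a select statement with a delay alternative, as in the CALLER listing
earlier) plays a similar role.

```python
# Hypothetical sketch: a caller that can always withdraw from a
# communication after a deadline. All names are illustrative.
import queue
import threading

def call_with_deadline(request_q, reply_q, message, deadline_s):
    """Send a request, then wait at most `deadline_s` seconds for a reply."""
    request_q.put(message)
    try:
        return ("ok", reply_q.get(timeout=deadline_s))
    except queue.Empty:
        return ("timed_out", None)

def echo_server(request_q, reply_q):
    """Serve exactly one request by replying with its upper-case form."""
    reply_q.put(request_q.get().upper())

# Case 1: the partner is alive and answers in time.
req, rep = queue.Queue(), queue.Queue()
threading.Thread(target=echo_server, args=(req, rep), daemon=True).start()
live = call_with_deadline(req, rep, "ping", deadline_s=2.0)

# Case 2: the partner's processor has stopped, so nobody ever replies.
req2, rep2 = queue.Queue(), queue.Queue()
dead = call_with_deadline(req2, rep2, "ping", deadline_s=0.05)

print(live, dead)
```

A time-out alone cannot tell a slow partner from a failed one; it only
guarantees that the caller is never trapped.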
While a time-out mechanism could deal with both slow response and
lack of response because the responder is on a failed processor, a
better scheme is to deal with the latter case differently. When a
(1) Replace the lost process by an identical substitute and
transparently re-route calls.
(2) Replace the lost process by a substitute and inform callers.
(3) Do not replace the lost process and inform callers.
As before, the choice depends on the reconfiguration strategy. Note
that the above requires that the continuation facilities know which
processes are communicating, that is, some form of communications log is
necessary on each processor. This leads to the following programming
language requirement:
A programming language which is to be used to program distributed
systems must provide facilities to deal with the loss of a
communicating process. In particular, knowledge of which processes
were communicating with the lost process and an ability to redirect
communication may be needed.
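
One possible shape for the communications log mentioned above is
sketched here. It is a hypothetical illustration: the class
CommunicationsLog, its methods, and the task names are assumptions, and
the sketch covers only option (2) of the list (substitute the lost
process and inform its callers).

```python
# Hypothetical sketch of a per-processor communications log.
# CommunicationsLog and its methods are illustrative assumptions.

class CommunicationsLog:
    def __init__(self):
        self.partners = {}     # callee -> set of callers seen so far
        self.redirects = {}    # lost callee -> substitute

    def record_call(self, caller, callee):
        self.partners.setdefault(callee, set()).add(caller)

    def callers_of(self, callee):
        return sorted(self.partners.get(callee, ()))

    def replace(self, lost, substitute):
        """Redirect future calls; return the callers to be informed."""
        self.redirects[lost] = substitute
        affected = self.callers_of(lost)
        self.partners.setdefault(substitute, set()).update(affected)
        self.partners.pop(lost, None)
        return affected

    def resolve(self, callee):
        """Name of the process that should now receive calls to `callee`."""
        return self.redirects.get(callee, callee)


log = CommunicationsLog()
log.record_call("caller", "server")
log.record_call("display", "server")
informed = log.replace("server", "alternate_server")
print(informed, log.resolve("server"))
```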

4.6. Consistent Data
It is clear that some consistent data will always be required by
the continuation facilities on each processor. At a minimum, when a
processor fails, all continuation facilities must come to the same
conclusion about which processes have been lost. In general, consistent
data will also be required to restart processes or to start replacement
processes.
Rather than have facilities for providing consistent data available
to each applications programmer, we believe that such a facility should
be provided by the programming language. One possible scheme would
provide a process-to-processor map for use by continuation facilities,
and allow a programmer to specify data that was to be distributed to a
consistent database in all processors. The language requirement is:
A programming language that is to be used to program distributed
systems must provide a mechanism to allow consistent data to be
distributed across processors.

5. CONCLUSIONS
Almost all languages allowing processes to execute in parallel have
taken the view that processes sharing a uniprocessor or executing on
have no concept of degraded or safe service. The system is either
providing full service or no service.
The ability to provide alternate service is so important that a
non-transparent approach must be taken where the programmer has to deal
with reconfiguration. Existing high-level languages provide little or
no aid for this; everything must be provided by the execution-time
system, usually in a way that is strongly dependent on a particular
implementation.
We have suggested a set of requirements for programming languages
REFERENCES
(1) Reference Manual For The Ada Programming Language, U. S. Department
of Defense, 1983.
(2) Knight, J. C., and J. I. A. Urquhart, "Fault-Tolerant Distributed
Systems Using Ada", Proceedings Of The AIAA Computers In Aerospace
Conference, Hartford, Connecticut, October 1983.
(3) Liskov, B., "On Linguistic Support For Distributed Programs", IEEE
Transactions On Software Engineering, Vol. SE-8, No. 3, 1982.
(4) McTigue, T. V., "F/A-18 Software Development -- A Case Study",
Proceedings Of The AGARD Conference On Software For Avionics, The
Hague, Netherlands, September 1982.
(5) Schlichting, R. D., and F. B. Schneider, "Fail-Stop Processors: An
Approach To Designing Fault-Tolerant Computing Systems", ACM
Transactions On Computer Systems, Vol. 1, No. 3, 1983.
(6) Alsberg, P. A., and J. D. Day, "A Principle For Resilient Sharing
Of Distributed Resources", Proceedings Of The International
Conference On Software Engineering, San Francisco, October 1976.
(7) Gray, J. N., "Notes On Database Operating Systems", in Operating
Systems: An Advanced Course, Springer-Verlag, New York, 1978.
(8) Lee, P. A., "A Reconsideration Of The Recovery Block Scheme",
Computer Journal, Vol. 21, No. 4, 1978.
(9) Leveson, N. G., and P. R. Harvey, "Analyzing Software Safety", IEEE
Transactions On Software Engineering, Vol. SE-9, No. 5, 1983.

APPENDIX 4
A Testbed for Evaluating Fault-Tolerant Distributed Systems
John C. Knight
Samuel T. Gregory
Department of Applied Mathematics and Computer Science
University of Virginia	
Charlottesville, Virginia, U.S.A. 22901
(804) 924-720
ABSTRACT
When a strategy is proposed for enabling distributed software to
tolerate hardware faults, its adequacy should be demonstrated
experimentally in a scientifically believable way. Simply building and
executing a program employing that strategy is not a scientifically
believable demonstration. The difficulty lies in the inherent
concurrency of distributed programs. It is unlikely that a situation
that reveals a deficiency of the strategy will occur during a functional
test. What is needed is an experimental testbed which can model any
network topology and any software strategy designed to provide fault
tolerance. Also, it must allow any achievable software state to be
precisely established, allow arbitrary parts of the distributed system
to be failed, and allow the software to continue from that situation, if
it can. Detailed requirements of such a testbed are given, and an
implementation is described.

1. Introduction
We have been examining ways of providing tolerance to hardware
faults in distributed programs written in Ada* [1]. We have suggested
a strategy that we claim enables an application program to detect,
survive and recover from failure of one or more of the machines of the
system [2]. Given this proposal, we needed a testbed with which to
validate or find difficulties in the claims of the proposed strategy. A
typical question which has been asked about the strategy is: "What will
happen in an Ada program if two tasks wish to rendezvous and the machine
executing the accepting task fails after the caller has sent a
rendezvous request message but before the accepting task has examined
its entry queue? Will the proposed strategy actually handle this
situation?" Another question involves task creation. The semantics of
task creation seem to require that the task executing the unit declaring
the new task wait for the new task's creation to complete. If the
machine on which the new task is to execute fails during the task
creation, it is necessary to demonstrate that the strategy is able to
extricate the declaring task.
In the remainder of this paper, the problem is generalized from the
verification of our Ada strategy, and the problem is analyzed for the
requirements of a solution. The testbed which is our solution is then
described.

2. The Problem
When a strategy is proposed for enabling distributed software to
tolerate hardware failures, the adequacy of that strategy should be
demonstrated experimentally in a scientifically believable way.
Although there are exceptions, in practice this is rarely done.
Instead, informal logical arguments are often given in an attempt to
show that all cases have been considered, or experiments are carried out
in which failures are deliberately introduced. These failures are
usually either random or determined by the external functions that the
system is to provide. The intent is to answer questions about failures
as they relate to the required external system behavior. For example,
will the fault-tolerance strategy be able to cope with loss of
processor(i) just after stimulus(j) is received?
Although fault tolerance strategies are well thought out, like any
software design, they may contain weak points. Simply building and
executing an instance of a program employing the proposed strategy is
not a scientifically believable demonstration. The difficulty lies in
the inherent concurrency of distributed programs. It is unlikely that a
situation that reveals a deficiency of the strategy will occur during an
operational test. What is needed is a testbed that allows such unlikely
situations to be precisely established, and then allows observation of
the software as it attempts to continue from that situation. A
comprehensive set of such tests constitutes a scientific demonstration.
This problem extends beyond fault tolerance into the general area
of semantic definitions of programming languages. Where a distributed
system is to be built using a high-level language, it is often difficult
to determine what a correct program is supposed to do even in the
absence of faults. Even when semantic definitions of languages exist,
and implementations are faithful to the definitions, the definitions
tend to become vague with respect to the creation, deletion, and
especially communications of processes. Where the definitions are less
vague, they are often phrased in terms of intricate yet instantaneous
operations. For example, the conditional entry call in Ada involves a
check of the entry queues of the accepting task and of its ability to
begin the rendezvous "immediately". For a system implemented on a
single processor, the execution-time support for a language may achieve
effects that appear to be instantaneous by disabling interrupts during
most of, or all of, the operation. This prevents interference from
other processes. On a distributed system, total suspension of parallel
activity is either not possible or not permissible.
Requirements for the Solution
A distributed system is concurrent because parts of a distributed
system execute on separate computers communicating in real time. One
cannot test the parts separately and draw reasonable conclusions, so any
experiments with it must be done in a concurrent environment. A useful
experimental testbed must be able to model any network topology and
implement any software strategy designed to provide fault-tolerance.
Also, it must be possible to deliberately fail arbitrary parts of a
distributed system under test when its software is in any achievable
state.
For sequential programs, methods have been available for a long
time for setting up a desired program state and machine state
combination, and stepping the program through its handling of the
situation. These methods have been implemented in programs variously
called "simulators" or "interpreters". In a simulator, the desired
state is achieved by executing or single-stepping the program being
tested to a breakpoint, or by forcing a particular value into the
simulated program counter. Similarly, the desired machine state is
achieved by instructing the simulator to force desired values into
simulated registers or memory locations. Once the experiment is set up,
results are usually obtained by single-stepping the subject program from
that point while displaying the simulated machine state at each step.
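The sequential technique described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the testbed or any particular simulator: the `Simulator` class, its two-instruction set (`load`, `add`), and the register name `r0` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Simulator:
    """Toy sequential simulator; each instruction is (op, register, value)."""
    program: list
    pc: int = 0                                   # simulated program counter
    registers: dict = field(default_factory=dict)
    breakpoints: set = field(default_factory=set)

    def step(self):
        """Execute one instruction and return a snapshot of the machine state."""
        op, reg, value = self.program[self.pc]
        if op == "load":
            self.registers[reg] = value
        elif op == "add":
            self.registers[reg] = self.registers.get(reg, 0) + value
        else:
            raise ValueError(f"unknown op: {op}")
        self.pc += 1
        return self.pc, dict(self.registers)

    def run_to_breakpoint(self):
        """Step until the program counter reaches a breakpoint or the end."""
        while self.pc < len(self.program) and self.pc not in self.breakpoints:
            self.step()
        return self.pc

    def force_register(self, reg, value):
        """Force a value into a simulated register to set up an experiment."""
        self.registers[reg] = value


sim = Simulator(
    program=[("load", "r0", 1), ("add", "r0", 2), ("add", "r0", 3)],
    breakpoints={2},
)
stopped_at = sim.run_to_breakpoint()   # executes instructions 0 and 1
sim.force_register("r0", 10)           # establish the state of interest
trace = [sim.step()]                   # single-step and record the state
```

The experiment is set up by running to a breakpoint and forcing a register value, and the result is read from the state snapshot returned by each single step.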
	
A testbed which is analogous to a sequential program simulator, but
which can simulate distributed (parallel) programs and can test
fault-tolerance strategies, is required to meet the stated problem.
Specifically, the testbed must meet the following criteria:
(1) It must be able to model an arbitrary physical architecture. That
is, any number of processors in any network topology. This is not
to imply that its capabilities be infinite, but that there should
not be small and arbitrary hard limits, nor should only one or two
configurations of processors be supportable.
(2) The testbed should provide at least the illusion (to the program
executing on the processors) of parallel execution. It must be
able to present a distributed software system with all of the
problems the system will actually encounter on separate machines.
For example, one of the problems is the inability of the
fault-tolerance strategy to avoid interference by temporarily suspending
execution of all but one process.
(3) Arbitrary logical architectures must be representable. The
assignment of processes to processors must not be constrained by
the design of the testbed. Any such constraint would limit the
usefulness of the testbed in that it would prohibit many
fault-tolerance strategies from being exercised on it.
(4) The testbed should be able to control inter-processor
communication. It must allow the experimenter to introduce
failures in the communications part of the modeled system, and to
enforce the implications of the simulated processor topology. This
involves maintaining the visibility and accessibility of messages
anywhere in the system. To accompany this control, the level of
instructions simulated should not be such that the issuance of
multiple messages is combined.
(5) In order to set up the precise situations desired for experiments,
the testbed must provide explicit control over process execution.
The experimenter must be able to stop a process on a specific
instruction (i.e. a breakpoint), and to single-step instruction
execution for particular processes. Since an experiment may entail
arranging for many processes to be in many different states,
control over suspension and release of one process should not be
dependent upon control over suspension and release of any other
process.
(6) Since a major point of the testbed is to see if software strategies
can tolerate processor failures, the experimenter must be provided
with the ability to fail and to restart processors at any desired
point. The ability to restart is important because many strategies
call for automatic replacement of defective hardware.
(7) The testbed must maintain simulated time. Since distributed
systems often deal with the concept of time, the testbed's actions
must not violate the abstracted machines' simulated clocks. For
instance, the fact that any or all virtual processes "executing" on
a particular abstracted machine might be held at breakpoints should
have no effect on the relative progress of simulated time in the
various other abstracted machines.
(8) The testbed must provide means of monitoring whether the
fault-tolerant strategy under test worked or not. Since such results may
manifest themselves in subtle ways, the entirety of the states of
the simulated processes and processors must be able to be displayed
for examination. For the same reason, the displays should be
selectable so as not to hide in a mass of irrelevant details those
results considered important for a particular experiment.
These requirements have several implications. The testbed must
maintain or represent a virtual state for each process in the
distributed system being simulated. It must also maintain some minimal
state information for each processor of the simulated system, if only
that processor's idea of "current" time and which processes are
considered to be executing on that processor. This helps both in
organizing information for display, and in establishing loci of control
over process execution and over inter-processor communication and
processor failure. As a consequence of requirements five and seven, the
testbed must be able to affect the simulated system's process scheduling
algorithm without violating that algorithm's requirements. Each process
can be brought to the desired state by executing to a breakpoint set for
that process, and single-stepping for fine adjustment from there. The
sequential simulator method of placing an arbitrary value into the
process's instruction pointer to bring it to a desired state could cause
invalid results from experiments, since that might prohibit the
fault-tolerance strategy from gathering, along the way, whatever information
it needed for a proper recovery. In short, the testbed should not allow
the experimenter to set up truly impossible combinations of process and
system states, but it should allow every other combination to be reached
and held as a point for the introduction of a failure.
The motivation behind this distributed system testbed lies in the
desire to validate a proposed strategy for having distributed Ada
programs survive machine failures [2]. The fault-tolerance strategy is
expressed in Ada, and the testbed we are constructing has an interpreter
of an intermediate code at the individual message level.
Overview
The organization of the testbed is illustrated in the accompanying
figure. The part of the testbed which supports the execution of
individual Ada processes in a system under test is the set of virtual
processors. The term comes from the operating system concept that every
process in a system is to have the illusion that it is executing on its
own processor with an instruction set enhanced by special "instructions"
usually known as supervisor calls.
The testbed provides a user-specifiable number of processors
referred to as abstract processors. Each is capable of multiprocessing,
the execution of from zero to all processes (i.e. virtual processors) in
the system to be simulated. An abstract processor employs a
user-supplied process scheduling algorithm which defaults to a round-robin
scheme. A system being tested is intended to view an abstract processor
as an actual machine with some execution-time support code providing the
multiprogramming illusion. Thus a complete system under test will
consist of a set of abstract processors connected by an abstract
communications facility.
The correspondence between physical processors and abstract
processors is similar to that between abstract processors and virtual
processors. Each physical processor is capable of executing from zero
to all possible abstract processors, with the distribution controlled by
an experimenter-supplied map. A physical processor multiprocesses a
group of abstract processors, using a fair scheduling algorithm which,
again, can be altered by the experimenter. One physical processor is
called the controller and executes the command interpreter which serves
as a user interface for the experimenter. All other physical processors
provide an underlying portability structure for the testbed as will be
described below. Note that physical processors are not necessarily real
processors.
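The three-level correspondence (virtual processors on abstract processors, abstract processors on physical processors) can be pictured as two experimenter-supplied maps. The following Python sketch is only illustrative: the task names, processor numbers, and helper functions are assumptions, and the testbed itself stores this information in its own data structures.

```python
from collections import defaultdict

# Hypothetical experimenter-supplied maps; every name here is illustrative.
virtual_to_abstract = {"task_a": 1, "task_b": 1, "task_c": 2}
# Abstract processor 3 currently hosts no virtual processors ("zero to all").
abstract_to_physical = {1: "controller", 2: "pp1", 3: "pp1"}


def hosted_on(mapping):
    """Invert a child -> parent map into parent -> sorted list of children."""
    grouped = defaultdict(list)
    for child, parent in mapping.items():
        grouped[parent].append(child)
    return {parent: sorted(children) for parent, children in grouped.items()}


def physical_host(task):
    """Find the physical processor that ultimately executes a virtual processor."""
    return abstract_to_physical[virtual_to_abstract[task]]


def tasks_lost(failed_abstract):
    """List the virtual processors that vanish when an abstract processor fails."""
    return hosted_on(virtual_to_abstract).get(failed_abstract, [])
```

Keeping the two maps separate mirrors the separation in the text: failing abstract processor 1 removes `task_a` and `task_b` from the system under test regardless of which physical processor happens to be hosting it.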
Abstract Processors
The instruction set that an abstract processor provides its
processes is easily modifiable, and both a "crystal" frequency and
instruction and message "execution" times for automatically updating the
abstract processor's clock are accessible to the experimenter. The
abstract processor also provides a place for user code which may modify
message traffic, if that should be part of a fault-tolerance strategy to
be tested. Thus, the abstract processor part of the testbed represents
what would typically be found in both the processing hardware and the
execution-time support code and "system level" code of the distributed
system to be simulated.
message traffic, if required. The experimenter can also dynamically
influence the delivery of messages from any source (such as the Ada
program or its support). The default network topology for abstract
processors is modeled after Ethernet.
The message handling facilities also aid the abstract processors in
enforcing time. An abstract processor removes messages from the set of
messages destined for it (or its processes) in a variation of first-in
first-out order. For any particular retrieval, no message which is
execution-time support or an operating system. The form of the virtual
processor desired for a particular system to be tested is expected to
vary widely. Hence, the instruction set is easily modifiable; its
present form is designed only to support Ada.
A virtual processor's state is represented as a Pascal record which
is accessible to the experimenter. As well as the data structures
necessary to support its instruction set, a virtual processor's state
contains the accounting and scheduling information used by the abstract
processor in supporting the multiprogramming illusion. For our problem,
the virtual processor's state includes information which is updated
during interpretation of certain instructions and messages, and employed
by the scheduling algorithm. It also contains the task's stack of local
variables (referencing environments) and its list of dependent tasks,
and so on.
instruction, we provide several supervisor call instructions which are
always used together and in the same order. This allows the
experimenter to hold a virtual processor at any interesting point within
the message exchange while he deliberately fails an abstract processor.
Experiment Control
During an experiment, the experimenter can set and remove
breakpoints for individual virtual processors (i.e. Ada tasks), can
direct that an abstract processor's crystal frequency be increased or
decreased, can direct that an abstract processor cease to exist
(introduction of a machine failure), or be restored (simulate standby
spares), can single-step or execute any one or a group of virtual
processors, and can alter delay intervals in message transmissions (for
different processor configurations). This is a small number of
capabilities, but it turns out to be all that is really needed. The
experimenter could easily be allowed to alter simulated memory locations
as in a sequential simulator, but hardware redundancy technologies seem
to have that kind of hardware fault under control, making it
uninteresting to software strategies for tolerating hardware faults.
The speed of each abstract processor is adjustable and a clock
which tells that abstract processor's view of wall clock time is
maintained. Each instruction and each message of the system under test
has associated with it the number of machine cycles needed for its
processing. As instructions are executed and messages are handled, the
appropriate abstract processor's clock is incremented by the product of
its speed and the appropriate machine cycle count. An abstract
processor which has nothing to do, such as is the case when all of its
virtual processors are held at breakpoints, continues to be scheduled
and its clock is adjusted by a minimal amount. This prevents conflicts
between the artificial suspension of time by breakpoints and the
previously-mentioned refusal of an abstract processor to accept a
message time-stamped in what it considers to be the future.
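A minimal Python sketch of this clock rule follows. It is an assumption-laden illustration, not the testbed's code: the class and method names are hypothetical, "speed" is interpreted here as simulated time per machine cycle, and the size of the idle adjustment (`idle_tick`) is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass
class AbstractProcessor:
    speed: float            # assumed meaning: simulated time per machine cycle
    clock: float = 0.0      # this processor's view of wall clock time
    idle_tick: float = 1.0  # hypothetical "minimal amount" added when idle

    def charge(self, cycles):
        """Advance the clock by speed * cycles for an instruction or message."""
        self.clock += self.speed * cycles

    def idle(self):
        """Scheduled with nothing to do (e.g. all tasks held at breakpoints)."""
        self.clock += self.idle_tick

    def can_accept(self, timestamp):
        """Refuse any message stamped later than this processor's own clock."""
        return timestamp <= self.clock


sender = AbstractProcessor(speed=2.0)
receiver = AbstractProcessor(speed=1.0)   # imagine all its tasks at breakpoints
sender.charge(25)                         # 25 cycles at speed 2.0
stamp = sender.clock                      # time stamp carried by the message

idle_passes = 0
while not receiver.can_accept(stamp):     # the message is "in the future"
    receiver.idle()                       # idle scheduling still moves time on
    idle_passes += 1
```

Without the idle adjustment the receiver's clock would stay at zero while its tasks sit at breakpoints, and the message could never be accepted; with it, the receiver eventually catches up to the time stamp.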
	
The algorithm scheduling virtual processors can no more violate
accurate simulation of time than it could violate wall clock time if
used in an actual distributed implementation. A bad scheduling
algorithm in the system under test would be equally bad in actual
operation, so the testbed should not be responsible for "correcting" it.
The scheduling of abstract processors within the testbed, however, could
violate time. For example, suppose there are three abstract processors
each executing a virtual processor and that the virtual processors wish
to communicate with each other. Further, let the communications be
initiated by the virtual processors on abstract processors 1 and 2 and
be directed toward the virtual processor on abstract processor 3. A
simple round-robin scheduling scheme would always bias the load of
pending messages for abstract processor 3. Whenever abstract processor
3 was scheduled, it would have a backlog of pending messages, the
handling of which might use up all of its allotted time, causing it
never to make progress in executing its virtual processors. Our
implementation of the testbed avoids this problem of message biases
violating simulated time by providing a particular (default) abstract
processor scheduling algorithm. A random choice without replacement is
made from available abstract processors, replacing all of them into the
scheduling pool only when it becomes empty. As will be seen, our
implementation is also capable of providing actual parallel execution of
the abstract processors.
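The default scheduling rule described above (random choice without replacement, refilling the pool only when it is empty) might be sketched as follows. The `PoolScheduler` class and its method names are hypothetical, and the sketch ignores processors that have been failed by the experimenter.

```python
import random


class PoolScheduler:
    """Pick abstract processors at random without replacement."""

    def __init__(self, processors, rng=None):
        self._all = list(processors)          # every schedulable processor
        self._pool = []                       # those not yet chosen this round
        self._rng = rng or random.Random()

    def next(self):
        """Return the next processor to run, refilling the pool when empty."""
        if not self._pool:
            self._pool = list(self._all)
        index = self._rng.randrange(len(self._pool))
        return self._pool.pop(index)


scheduler = PoolScheduler([1, 2, 3], rng=random.Random(0))
picks = [scheduler.next() for _ in range(6)]   # two complete rounds
```

Every processor runs exactly once per round, so the scheme remains fair, but the order within a round varies, so abstract processor 3 in the example is not always scheduled behind both of its senders.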
Physical Realization
The testbed's lowest communication layers are modular so that a
simple substitution and re-compilation allows the testbed to execute
either on a single minicomputer running UNIX* or on an actual
distributed system consisting of a set of IBM Personal Computers linked
together via Ethernet. When executing on the minicomputer, a set of
UNIX processes are created which correspond to the IBM Personal
Computers and their associated software. The processes communicate
using "pipes" and these correspond to the Ethernet. That part of the
testbed executing as a UNIX process or on its own Personal Computer is a
physical processor. The experimenter's interface, called the
multitasking operating system if that is the limit of one's environment.
However, the testbed can be configured with a single abstract processor
implemented on each physical processor, and, when executing on the IBM
Personal Computers, the testbed then implements a true distributed
system.
Displays
Each of the physical processors has a monitor associated with it.
If a physical processor is a UNIX process, it has its own terminal at
which the experimenter may select displays of simulated activities. If
a physical processor is a program on a Personal Computer, the Personal
Computer's keyboard and monitor serve in the same capacity. The
experimenter can choose among any of several displays to be continuously
updated on the monitors. The displays include
(1) values output from the Ada program being tested,
(2) virtual processor numbers to abstract processors,
(3) abstract processors to physical processors.
Implementation Status
As of this writing, the structures implementing the physical
processors, abstract processors, and virtual processors as described
above, and the implementation of message traffic are all in place. A
"friendly" user interface for the controller is being developed.
Commands for altering network topology are not being implemented at the
present time, because such concerns are not pertinent to our current
problem of investigating the use of Ada on distributed systems. The
default abstract processor scheduling algorithm will not be changed for
our problem, and we do not need priorities for Ada tasks, so the default
round-robin virtual processor scheduling is also sufficient.
The fault-tolerance strategy to be tested calls for code to
manipulate and alter the sequences of messages between virtual
processors at the abstract processor level, so we will be using that
feature. That code is being built incrementally as the virtual
processor message handler coding progresses. A preliminary version of
the testbed that has three physical processors but with only a minimal
implementation of the virtual processors has been executed as three
processes on a DEC VAX 11/780 running UNIX. In those trials, the
ability to fail abstract processors, to transport messages to their
proper destinations, and to monitor various parts of the state data
VAX is a trademark of Digital Equipment Corporation.
REFERENCES
[1] Reference Manual For The Ada Programming Language, U. S. Department
of Defense, 1983.
[2] J. C. Knight and J. I. A. Urquhart, "Fault-Tolerant Distributed
Systems Using Ada", Proceedings of the AIAA Computers in Aerospace
Conference, Hartford, Connecticut, October 1983.
[3] R. B. K. Dewar, G. A. Fisher, E. Schonberg, R. Froehlich,
S. Bryant, C. F. Goss, and M. Burke, "The NYU Ada Translator and
Interpreter," Proceedings of the ACM-SIGPLAN Symposium on the Ada
Programming Language, November 1980.
Digital computers are being used increasingly in dedicated control
applications that require high reliability. These systems are usually
embedded and frequently distributed. Several processors may be used
that communicate using a high-speed bus even though they are
geographically close. An example is a digital avionics system for a
military aircraft in which separate computers may be used for flight
control, navigation, displays, weapons management, and so on. The
overall system requires some coordination and so the various computers
communicate via a data bus. A typical system is described by
McTigue [1].
One of the advantages of distributed processing is that a hardware
failure need not remove all the computing facilities. If one processor
fails, it is possible (at least in principle) for the others to continue
to provide service. This is a desirable characteristic for applications
requiring high reliability. The use of distributed processing is
further encouraged by the decreasing cost of computer hardware.
Ada [2] was designed for the programming of embedded systems (such
as those mentioned above) and has many characteristics designed to
promote the development of reliable software. In this paper we examine
the problem of programming distributed systems in Ada. In particular,
we are concerned with the issues that arise where some form of
acceptable processing must be provided using the hardware facilities
remaining after a failure.
In section II we present some motivation for considering
distributed systems where hardware failure must be tolerated, and define
in detail the failures we will consider. In section III we look at the
general problem of providing service after processor failure, and we
present the general facilities needed to support fault tolerance in
section IV. The considerable difficulties that arise when such systems
are programmed in Ada are discussed in section V. We show in sections
VI and VII that these difficulties can be overcome by careful
programming and by making extensive additions to the execution-time
support system that would normally be needed to support Ada. These
additions make no changes to the language itself and their use in Ada is
discussed in section VIII. An example of an application program using
such mechanisms is given in appendix 1, and a detailed examination of
task communication difficulties is presented in appendix 2.
HARDWARE TOPOLOGY AND FAILURES
	
The kind of architecture we expect to be in common use for embedded
systems in the future is shown in figure 1. It is based on the use of a
high performance data bus that links several processors. Each processor
is equipped with its own memory. Devices such as displays, sensors, and
actuators would be connected to the bus via dedicated microprocessors.
Thus these devices would be accessible from each processor.
A great deal of research has been undertaken in recent years to
produce computer architectures of great reliability such as the SIFT [3]
and FTMP [4] machines. However, even though designed for reliability,
these machines may still fail. Lightning, fire or other physical damage
could cause a processor to fail no matter how carefully the processor
was built. There is, therefore, good reason for employing software
structures able to cope with partial hardware failure.
Figure 1 - Distributed Architecture (two processors, each with its own
memory, and a sensor and an actuator, each attached through a
microprocessor, all connected to a communications network)
It is clear that continued service after failure implies a system
distributed over two or more processors. Distribution, however, does
not necessarily imply continued service after failure. Several
processors sharing a computation that stops whenever any single
processor fails will be viewed as a single processing unit. This
distinction between distributed systems that allow continued service
after failure and those that do not is important. We will call a
distributed system that does not allow continued service after a
processor fails a centralized distributed system. From the perspective
of fault tolerance, such a system is no better than a uniprocessor.
We assume that communication between processors on a distributed
system will be implemented using software that conforms for the most
part to the lower layers of the ISO standard seven layer Reference
Model [5]. The kinds of hardware failure that we are concerned with are
not addressed by the ISO protocol. The ISO protocol is concerned with
communications failure such as dropped bits caused by noise, loss of
messages or parts of messages, etc. The only class of faults not dealt
with elsewhere is the total loss of a processor or a data bus with no
warning.
A processor will be assumed to fail by stopping and remaining
stopped. All data in the local memory of the processor will be assumed
lost. Thus the case of a processor failing by continuing to process
instructions in an incorrect manner and providing possibly incorrect
data to other processors will not be considered. We assume that such
events are taken care of by hardware checking within the processor or
the methods of Schlichting and Schneider [6].
While this may seem a severe restriction, at least three arguments
can be made in its favor:
(1) Faults of the assumed kind must be taken into consideration anyway
since a processor might fail in this way.
(2) Either by hardware checking within a single processor or by
checking between a dual pair of processors, it is possible for an
underlying system to simulate the assumed processor failure mode.
(3) If such a failure mode is not assumed, error recovery becomes
extremely difficult. It becomes possible for a processor to fail,
and for resulting errors to remain undetected until all data is
compromised.
Given this assumption, error detection reduces to detecting that a
processor has stopped. Error recovery is simplified by the knowledge
that although the data in the failed processor's memory is lost, data on
the remaining processors is correct.
A distributed system that is to be highly reliable will be built
with a redundant bus structure. Redundancy usually includes replicating
the bus along different routes as well as replication of the bus
hardware itself on a particular route. Loss of a complete bus need be
of little consequence if it is replicated and can be coped with by the
low-level communications software. A complete break in the bus system
that isolates some subset of the processors (i.e. the network becomes
partitioned) is much more serious though very unlikely given multiple
routes and replication. The issues that arise in that case are
different from those arising from processor failure. They are beyond
the scope of this paper and will not be dealt with here.
APPROACHES TO FAULT TOLERANCE
There are two completely different approaches that can be taken
when attempting to provide tolerance to hardware faults. In the first
approach, the loss of a processor is dealt with totally by the
execution-time support software. Any services that were lost are
assumed by remaining processors, and all data is preserved by ensuring
that multiple copies always exist in the memories of the various
machines. We describe this approach as transparent since, in principle,
the programmer is unaware of its existence. This is the approach being
pursued by Honeywell [7].
Transparent continuation has several advantages:
(1) The programmer need not be concerned with reconfiguration.
(2) The programmer need not know about the distribution. Thus the
distribution can be done by the system.
(3) The same program can be run on different systems with different
distributions.
However, as the continuation of service is transparent to the
programmer, the programmer cannot specify degraded or safe [8] service
to be used following processor failure. Since the system cannot specify
it either, transparent continuation must always provide identical
service. If identical service is impossible because insufficient
processing resources remain, the system stops.
In many crucial systems this is not acceptable. Situations will
There are also many difficulties in implementing transparent
continued service. Since failures can occur at arbitrary times, the
support software must always be ready to reconfigure. Duplicate code
must exist on all machines and up-to-date copies of data must always be
available on all machines. The overhead involved in this process is
considerable. Further, the overhead may not be obvious to the programmer
when the program is being written. A simple assignment statement, for
example, may take considerable time to execute in order to ensure that
the updated value of the variable has been distributed to all the
memories. However, even if these difficulties could be overcome and
transparent continuation could be offered without massive duplication of
computing resources, it would still be unsuitable for many applications
because of its inability to offer alternate service.
In the second approach to dealing with the loss of a processor only
minimal facilities are provided by the execution support software. The
fact that equipment has been lost is made known to the program and it is
expected to deal with the situation. We will refer to this approach as
programmer-controlled or non-transparent.
Programmer-controlled continuation has several disadvantages:
(1) The programmer must be concerned with reconfiguration.
(2) The programmer must either specify the distribution or be prepared
to deal with any distribution provided by the system.
(3) The program depends on the system; at least the reconfiguration
parts do.
The disadvantages are outweighed by the fact that the service
provided following failure need not be identical to the service provided
before failure. The programmer can have complete control over the
services provided by the software and the actions taken by the software
following failure. Alternate, degraded, or merely "safe" service can be
offered as circumstances dictate. Also, the various inefficiencies
associated with the handling of failure and the necessary preparations
for it are quite clear to the programmer.
In the remainder of this paper we will consider only the
non-transparent approach. We assume that the actions to be taken by each
processor following a failure are specified within the software
executing on that processor.
REQUIRED CONTINUATION FACILITIES
If a distributed system is to provide continued service after one
or more processor failures, then facilities must be provided over and
above those needed for normal service. We will refer to these as
continuation facilities. If there is a single continuation facility for
the entire system then the system is centralized. If the processor
providing the continuation facility fails, the system stops and this is
unacceptable. To prevent this, continuation facilities must exist on
all the processors.
However, difficulties can still arise if, following the loss of a
recovery so that they can take over if necessary. This is unacceptable
because of the resulting overhead. In what follows, we assume that each
processor will have a continuation facility that independently assesses
damage and effects whatever local changes are necessary for recovery in
that processor.
Continued service after one or more processor failures requires
that the following actions be performed:
(1) Processor failure must be detected and communicated to the software
on each of the remaining processors.
(2) It must be known what processes were running on the failed
processor, and what processes and processors remain. Further,
processes executing on processors that survive the failure may
still be affected by the failure. For example, their execution may
depend on processes or contexts that were lost with the failed
processor. If anything is to be done about such processes, they
must be known and there must be some way of communicating with
them.
(3) Selecting A Response
Information must be provided so that a sensible choice of a
response can be made. The response that is chosen will be
determined by which processors and processes remain, but in many
applications the response will also be determined by other
variables and these would also have to be known. The height of an
aircraft, for example, might determine what actions should be taken
when part of the avionics system is lost.
After a failure is reported to the software on a particular
processor, the local reconfiguration software will independently
decide on a response and put into effect the changes required on
that processor. The choice of a response depends on the
reconfiguration strategy provided by the programmer and on the
information provided. It is important that the information which
the reconfiguration strategy uses be consistent across processors,
since if it is not, reconfiguration processes on different
processors could decide on different responses and thus work at
cross-purposes.
(4) Effecting The Response
Once a response has been decided on, it must be possible to carry
it out. The reconfiguration software should be able to create and
remove processes, start and stop processes, and be able to
communicate with processes so that they can take appropriate action
on their own. In many cases the new processes will have to be
provided with data, and a consistent set of such data would have to
be available to the reconfiguration software.
Various difficulties are raised by these requirements. Firstly,
they depend strongly on data that is consistent across all machines.
Without making quite unrealistic assumptions about the underlying
message passing system, it cannot be assumed that data is consistent
[...]
phase protocol [9, 10] can be used in this situation to ensure that
consistent data (but perhaps not the most recent values) is available on
all machines.
A second problem with the view of continuation taken above is the
treatment of unrecoverable objects [11]. If an unrecoverable object has
been modified, backward error recovery is not possible following a
failure. The problem is no different on a distributed system than on a
uniprocessor system. An apparent difference is that all the processors
in a distributed system need to be informed of changes to unrecoverable
objects and this has to be done in the presence of failures. However,
distribution of status information about unrecoverable objects is just
an example of the data consistency problem discussed above.
V. DISTRIBUTION AND CONTINUATION IN Ada
We now consider the use of Ada for programming distributed systems
in which processor failure has to be tolerated. What is needed is a
distributed system that provides the continuation facilities discussed
in section IV.
Distribution
The choice of objects to be distributed is an important question in
the design of a distributed system. Ada has a tasking mechanism and,
according to the Ada Reference Manual [2], it is intended that tasks be
distributed in an Ada program:
[...]
on multicomputers, multiprocessors, or with interleaved execution
on a single physical processor.
Also, it is clear from the requirements for the language [12] and from
the Ada Reference Manual [2] that the tasking facilities are intended to
be used for all task communication and synchronization even when
different physical processors are executing the tasks involved. While
it would be possible to devise a separate mechanism for inter-task
activities between computers using some form of input and output, this
would be substantially less useful than existing facilities and probably
program specific.
No facilities are defined in Ada to control the distribution of
tasks. It is essential that software that is to be used following the
failure of a particular processor not be resident in the memory of that
processor (otherwise the system would be centralized). To achieve
this separation, it is essential that the programmer be able to control
the placement of both the primary and alternate software. Surprisingly,
there is no explicit mechanism for control of distribution in Ada
although there are representation clauses to control the bit-level
layout of records, to allow the placement of objects at particular
addresses within a memory, and to associate interrupt handlers with
specific machine addresses.
It is not sufficient to be able to control the allocation of tasks
to processors. The semantics of task distribution must take into
account the possibility of failures. For example, if there are multiple
tasks of a particular task type and they are executing on different
processors, a separate copy of the code must be required for each
processor. Otherwise, an implementation could provide a single copy of
the code that was shared by all processors, for example by fetching a
copy when a new task body is elaborated. This would be satisfactory if
there were no failures. However, failure of the processor containing
the original copy of the code would then suspend all subsequent
elaborations.
Continuation
For any particular programming language, the required continuation
facilities discussed in section IV could be provided in three different
ways:
(1) By using mechanisms in the programming language specifically
designed for that purpose.
(2) By using mechanisms in the programming language that were designed
for another purpose. If this were done, it would be a coincidence
if the mechanisms worked satisfactorily since they were not
designed to support fault tolerance.
(3) By using mechanisms outside the programming language such as
modifications to the execution-time environment or software written
in some other language, perhaps an assembly language.
Unfortunately, Ada makes no explicit provision for continuation.
Many features of the language raise substantial difficulties in damage
assessment, and in selecting and effecting a response.
Failure Detection
An execution-support system for Ada is not required to provide any
facilities for detection of processor failure. No specific interface is
provided by the language to allow software to be informed of processor
failure.
If failure could be detected, it might be possible to inform the
software by raising an exception or generating an interrupt, the latter
using an entry call as its interface. In either case, appropriate
placement of the corresponding exception handler or accept statement
becomes a problem since it must be assured that they will be executed
when required. Also, the necessary exception and entry names are not
predefined and so their use is neither standardized nor required.
Damage Assessment
Clearly, the damage sustained as a result of a processor failure
includes loss of the services that were provided by the software that
was executing on the processor that failed. It also includes loss of
the data contained in the memory of the failed processor. In addition,
in an Ada program the failure of a processor can cause damage to the
software that remains. Broadly speaking, two forms of damage can occur.
A task can be suspended waiting for a message that will never arrive,
and a task can lose part of its context. These will be discussed in
turn.
Task Communication
The problems that arise in task communication are best illustrated
by an example. Consider an Ada program that contains two tasks A and B
where A is executing on one processor and B on another. Suppose that
task A has made a call to an entry in task B, and that B has started the
corresponding rendezvous. If B's processor now fails, task A will
remain suspended forever, because the rendezvous will never end. Since
the failure takes place after the start of the rendezvous, a timed or
conditional entry call will not avoid the difficulty.
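The limitation of the timed entry call can be sketched in a small, self-contained program. This is an illustration under stated assumptions only: the procedure name TIMED_CALL_DEMO, the entry SERVICE and the delay values are hypothetical, both tasks run on one machine, and a long delay inside the accept statement stands in for a rendezvous that a processor failure would prevent from ever ending.

```ada
with TEXT_IO; use TEXT_IO;

procedure TIMED_CALL_DEMO is

   -- Hypothetical stand-in for task B of the example in the text.
   task B is
      entry SERVICE;
   end B;

   task body B is
   begin
      accept SERVICE do
         -- Models a rendezvous that lasts far longer than the caller
         -- is prepared to wait (after a failure it would never end).
         delay 0.5;
      end SERVICE;
   end B;

begin
   -- The main procedure plays the role of task A.
   select
      B.SERVICE;
      PUT_LINE ("rendezvous ended; caller was held for all of it");
   or
      delay 0.1;
      PUT_LINE ("call cancelled before the rendezvous started");
   end select;
end TIMED_CALL_DEMO;
```

The 0.1 second limit governs only the wait for the rendezvous to start. Once B has accepted the call the caller is held until the rendezvous ends, so the first message is the one to expect here even though the rendezvous outlasts the limit.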
Similar problems arise throughout Ada, both in explicit
communication such as the rendezvous and in implicit communications such
as task activation. A detailed examination of these situations is given
in appendix 2 and in [13].
Loss of Context
In block structured languages, a program unit can assume the
existence of an instance of all objects in the surrounding lexical
blocks. When a system is distributed, it is possible to have a given
program unit on one processor and one of its surrounding lexical blocks
on another processor. If the latter processor fails, it must be decided
what to do with the surviving inner program unit.
A task in Ada relies on several contexts:
(1) The context of the body. These are the lexical units enclosing the
body of the task.
(2) The creator. This is the program unit which creates the task.
(3) The masters [2, page 9-4].
All of these contexts may be different. Each of them may be lost
due to processor failure. Ada defines no semantics for these
situations.
Thus, the damage following processor failure will include lost
services, lost data, the permanent suspension of tasks on remaining
processors for a variety of reasons, and the loss of contexts of some
tasks. This damage could be quite extensive. As it is presently
defined, Ada provides no way of determining the extent of this damage.
Indeed, any attempt to assess the damage could cause the task enquiring
about the damage to be suspended itself. Suppose for example that the
attribute callable was used by a task to determine whether another task
was callable, the interpretation of the value true being that the second
task was still present and providing service. If the tasks were on
separate machines, the implementation of the attribute would require an
exchange of messages. Since no reply could ever be received after a
failure, the enquiring task would effectively be suspended.
Selecting and Effecting the Response
The purpose of effecting a response is to replace services that
were lost. The source of the new services will have to be software that
resides on machines that remain after the failure.
Even if replacement services could be started and distribution
could be controlled, it is still necessary for the replacement software
to communicate with the software remaining after the failure. This
means that communication paths used before the failure have to be
redirected. Communication will be primarily by rendezvous. The
rendezvous in Ada is asymmetric and so a calling task needs to know the
name of any task containing an entry it wishes to call, but a called
task need not know the names of tasks that will call it. If a calling
task has to be replaced because of a failure, the replacement can call
the same entry that was called by the lost task. The entry is still
available in the same task that was being called before the failure.
Thus redirection is trivial if a calling task is lost. However, if a
called task has to be replaced because of a failure, the replacement
cannot be given the same name as the task that was lost. This would
duplicate the definition of a task name in the same scope. Thus, in
this case, redirection is very involved. The replacement called task
will have to have a different name and, more importantly, all the
calling tasks (that may not have been replaced) will have to begin using
a different name in their entry calls.
This difficulty is not quite so serious for tasks created by
allocators. Since assignment is allowed for access variables,
communication can be redirected by assigning a value representing an
alternate task to an access variable used to make entry calls. Two
problems then arise. First, the entire program has to be written using
access variables to access tasks, and second, alternate tasks have to be
of the same type as the primary task, which may not be convenient.
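Redirection through an access variable can be sketched as follows. All names (SERVER, SERVER_REF, START, REQUEST, CURRENT) are hypothetical and chosen for illustration, and both server tasks run on a single machine. The terminate alternative is included only so that the demonstration ends cleanly.

```ada
with TEXT_IO; use TEXT_IO;

procedure REDIRECT_DEMO is

   -- Primary and alternate must be of the same task type.
   task type SERVER is
      entry START (ID : in INTEGER);
      entry REQUEST (REPLY : out INTEGER);
   end SERVER;

   type SERVER_REF is access SERVER;

   PRIMARY, ALTERNATE, CURRENT : SERVER_REF;
   ANSWER                      : INTEGER;

   task body SERVER is
      MY_ID : INTEGER := 0;
   begin
      accept START (ID : in INTEGER) do
         MY_ID := ID;
      end START;
      loop
         select
            accept REQUEST (REPLY : out INTEGER) do
               REPLY := MY_ID;   -- tells the caller which server replied
            end REQUEST;
         or
            terminate;
         end select;
      end loop;
   end SERVER;

begin
   PRIMARY   := new SERVER;
   ALTERNATE := new SERVER;
   PRIMARY.START (1);
   ALTERNATE.START (2);

   CURRENT := PRIMARY;
   CURRENT.REQUEST (ANSWER);
   PUT_LINE ("served by task" & INTEGER'IMAGE (ANSWER));

   -- Redirection: callers keep using CURRENT; only its value changes.
   CURRENT := ALTERNATE;
   CURRENT.REQUEST (ANSWER);
   PUT_LINE ("served by task" & INTEGER'IMAGE (ANSWER));
end REDIRECT_DEMO;
```

Both problems noted in the text are visible in the sketch: every call goes through an access variable, and the alternate has to be another object of type SERVER.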
Selecting and implementing a response relies on the availability of
data that is consistent across machines. Ada makes no provision for
ensuring that data can be reliably distributed across machines.
Finally, in effecting a response, it will be necessary to take care
of those tasks damaged by the failure. The only way that this can be
done in Ada is to abort them. Further, since some computing facilities
have been lost, the response that is chosen might also involve modifying
services that were not affected by the failure by aborting some tasks
and starting others. Because of the scope rules of the language,
aborting several tasks will be difficult to arrange if the program makes
use of nesting and a single piece of software is to contain all the
necessary abort statements.
VI. FAILURE SEMANTICS FOR Ada
The remaining sections of this paper show how Ada can be used in a
fault-tolerant distributed system. A first step in the construction of
such a system is to fill in the gaps in the semantics of Ada mentioned
in the previous section. In particular, the meaning of distribution and
the effect of damage to the remaining software caused by a processor
failure must be specified. While it is not difficult to choose a
reasonable meaning for distribution, the problem of what to do with
damaged tasks is much more difficult. It must be emphasized that the
semantics suggested in this section were chosen so as to follow the
existing language as much as possible.
Distribution
The primary aim of distribution semantics is to avoid the
possibility of a centralized distributed system. It will be assumed that
only tasks will be distributed. The distribution of a task T to a
processor P will be taken to mean that the task activation record for T
and all of the code for T will be resident on P.
Damaged Tasks
In section IV it was pointed out that the failure of a processor
may affect tasks running on the remaining processors, and that many
language features can cause these problems. The difficulties do not
arise because tasks were lost when the processor failed. Any task could
be removed from an Ada program at essentially any point without
processor failure by execution of an abort statement. Rather, the
difficulties arise because the semantics of the language fail to deal
with this situation. Ada semantics are precisely defined for tasks
being aborted and for the subsequent effects on other tasks, and the
execution-time system is required to cope with the situation. We
suggest therefore that damage following processor failure can be
handled as if the tasks that were lost had been aborted. This would
allow the language-dependent part of the damage following processor
failure to be treated using existing language facilities.
This choice of semantics leads to a final problem that needs to be
addressed: the status of the main program following failure. By
definition all non-library tasks in an Ada program are nested inside the
[...]
treated as if abort statements had been executed on the lost tasks, then
a serious problem arises with the main program. When a task is aborted,
all its dependents are aborted also. For any task lost through failure
this is reasonable. It means that all the dependents that were not lost
with the task have to be aborted by the system. However, if the main
program is lost, this implies that all the tasks that depend on the main
program (that is, almost all of them) will have to be aborted. This
effectively removes the entire program. Clearly this is unsatisfactory.
We suggest therefore that the main program has to be treated as a
special case. For the main program, and only for the main program, the
execution-support system will have to create an exact replacement if the
main program is lost through processor failure. To ease the overhead
that this involves, we suggest that the main program be limited to a
single null statement.
VII. SUPPORT SYSTEM STRUCTURE FOR Ada
Although Ada does not support the facilities required for
continuation explicitly, the semantics described in section VI can be
achieved if the execution-time support structure is suitably modified.
In this section we discuss the necessary modifications. In section VIII
we show how they are used with Ada.
[...]
less desirable because it requires additions to existing or planned
systems, and the detection hardware itself could fail. We suggest
therefore the use of software failure detection.
Software failure detection can be either passive or active. A
passive system might rely on tasks assuming that failure had occurred if
actions did not take place within a "reasonable" period of time, i.e.
timing out. Alternatively, a passive system could require that all
messages passed between tasks on separate processors be routinely
acknowledged. This is a particularly simple case of timing out since
failure has to be assumed if no acknowledgement is received.
The disadvantages of passive detection are:
(1) Timing out assumes an agreed-upon upper limit for response time.
(2) A failed processor will not be detected until communication is
attempted and this may be long after the failure has occurred.
Upper bounds on response time may be hard to determine. Very
complex situations can arise from an incorrect choice. The reason for a
lack of response from a task on another processor may not be failure of
that processor but merely a temporary rise in its workload. The
consequences of timing out could be an assumption by one processor that
another had failed, followed by reconfiguration to cope with the loss.
Clearly, if this assumption is wrong, two processors could begin trying
to provide the same service.
	
Being unaware that a processor had failed will lead to a loss of
the service it was providing until the failure is noticed. In a system
with many processors each providing relatively few services, the amount
of inter-processor communication might be quite low. A failed processor
may then go unnoticed for so long that damage to the system being
controlled might result.
It is for these reasons that we reject passive software failure
detection and suggest the use of active software failure detection. In
an active system, some kind of inter-processor activity is required
periodically and if it ceases, failure is assumed. The messages that
are passed are usually referred to as heartbeats. Multiple failures may
occur at essentially the same time but transmission times may vary.
Since it is important that machines surviving failure have a consistent
view of the system state before they begin reconfiguration, the
heartbeats must be organized so that each remaining processor gets the
same information about the failure. There are many ways to achieve
this. For example, all machines may be required to generate their
heartbeats at approximately the same time so that each machine will
receive all the heartbeats of the other machines in a given interval.
Any machine whose heartbeat is not received in this interval can be
assumed to have failed.
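The interval rule just described can be sketched in a few lines. This is a minimal illustration, not the support system itself: the four-processor configuration and the names RECORD_HEARTBEAT and END_OF_INTERVAL are hypothetical, and the arrival of heartbeats is simulated by direct procedure calls rather than by message traffic.

```ada
with TEXT_IO; use TEXT_IO;

procedure HEARTBEAT_DEMO is

   type PROCESSOR_ID is range 1 .. 4;      -- assumed system size
   type FLAGS is array (PROCESSOR_ID) of BOOLEAN;

   -- RECEIVED (P) is TRUE once a heartbeat from P has arrived
   -- during the current interval.
   RECEIVED : FLAGS := (others => FALSE);
   LOST     : FLAGS;

   procedure RECORD_HEARTBEAT (FROM : in PROCESSOR_ID) is
   begin
      RECEIVED (FROM) := TRUE;
   end RECORD_HEARTBEAT;

   -- Called at the end of each interval on processor SELF. Every
   -- other processor that stayed silent is reported as failed, and
   -- the flags are cleared for the next interval.
   procedure END_OF_INTERVAL (SELF   : in PROCESSOR_ID;
                              SILENT : out FLAGS) is
   begin
      for P in PROCESSOR_ID loop
         SILENT (P) := (P /= SELF) and (not RECEIVED (P));
      end loop;
      RECEIVED := (others => FALSE);
   end END_OF_INTERVAL;

begin
   -- Simulated interval on processor 1: heartbeats arrive from
   -- processors 2 and 4 only.
   RECORD_HEARTBEAT (2);
   RECORD_HEARTBEAT (4);
   END_OF_INTERVAL (SELF => 1, SILENT => LOST);

   for P in PROCESSOR_ID loop
      if LOST (P) then
         PUT_LINE ("processor" & PROCESSOR_ID'IMAGE (P) & " assumed failed");
      end if;
   end loop;
end HEARTBEAT_DEMO;
```

In this run only processor 3 is silent, so it is the only one reported. The sketch says nothing about how the surviving processors agree on the same set, which is the harder problem raised in the text.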
	
A final question of implementation is whether the generation and
monitoring of heartbeats should be the responsibility of the programmer
or the Ada execution-time support system. We favor the execution-time
support system for reasons discussed below.
[Figure: block diagram; recoverable labels: Ada program, exceptions,
entry calls, heartbeats, messages, message log, software signalling
system, ISO upper protocol layers, ISO lower protocol layers]
Figure 2. - Implementation Model
Whenever any communication takes place between tasks on different
processors, the execution-time support system on the processor starting
the communication records the details in a message log. Whenever a
failure is detected, each processor checks its message log to see if any
of its tasks would be damaged by the failure (permanently suspended for
example). If any are found, they are sent fake messages. They are
called "fake" because they are constructed to appear to come from the
failed processor but clearly do not. The message content is usually
equivalent to that which would be received if the lost task had been
aborted. In this way, each processor is able to ensure that none of its
tasks is permanently damaged, and the action following failure for each
remaining task is that which is associated with an abort. It often
takes the form of raising an exception.
Clearly it is possible for unsuspecting tasks to attempt to
rendezvous with tasks on the failed processor after failure has been
detected, signaled, and other communications terminated. This situation
can be dealt with easily if the execution-time support system returns a
fake message immediately indicating that the serving task has been
aborted and that rendezvous is not possible.
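The behaviour that a fake message is meant to reproduce can be observed with ordinary Ada on a single machine. In the sketch below (all names hypothetical) a server task is aborted and its entry is then called; the language rules raise TASKING_ERROR in the caller. A support system of the kind proposed here would have to produce the same effect for a server lost with a failed processor.

```ada
with TEXT_IO; use TEXT_IO;

procedure ABORT_SEMANTICS_DEMO is

   task SERVER is
      entry REQUEST;
   end SERVER;

   task body SERVER is
   begin
      accept REQUEST;
   end SERVER;

begin
   -- Stands in for the loss of the processor running SERVER.
   abort SERVER;

   begin
      SERVER.REQUEST;
      PUT_LINE ("rendezvous completed");
   exception
      when TASKING_ERROR =>
         -- The caller learns that the server is gone and can
         -- switch to an alternate instead of waiting forever.
         PUT_LINE ("TASKING_ERROR: server no longer available");
   end;
end ABORT_SEMANTICS_DEMO;
```

The handler is the point at which an application task would take its own recovery action, which is why the text leaves that action to the programmer.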
Because a fairly extensive set of facilities is
required in the execution-time system for fake messages, we suggest that
the heartbeats be handled here also. There is a clear need for
cooperation between the heartbeat monitoring system and the fake message
system. Operating both at the same level is probably the only practical
approach. This has the additional advantage that the programmer is not
burdened with the need to include the heartbeat system in his program.
Finally, the heartbeat system is so central to the reliability of the
entire system that it should operate at the lowest practical level of
the software system. Thus it relies for its operation on the correct
operation of the minimum amount of other software.
Selecting and Effecting the Response
Since consistent data across machines is essential to allow a
response to be chosen and implemented, the execution-time support system
for Ada must provide a mechanism for ensuring that data can be
distributed reliably. As mentioned in section IV, a two-phase protocol
can be used and we propose that an implementation of it be included in
the execution-time system along with the message log and heartbeat
mechanism.
VIII. [heading illegible in source]
In this section we show how a fault tolerant Ada system can be
built using the support mechanisms just described. As was shown in
section V, Ada does not provide any specific facilities to support this
type of fault tolerance. Existing features of the language that were
not designed for the purpose have to be used to interface with the
modified support system.
Failure Detection
When failure of a processor is detected by the heartbeat mechanism,
this information must be transmitted to the software running on each
remaining processor so that reconfiguration can take place. The
information is available to the execution-time support software in some
internal format, but it has to be transmitted to the Ada software using
an existing feature of the language.
As noted in section V, if the language is not to be changed one
approach is to make use of the language's exception mechanism, and have
the execution-time system generate a predefined exception on each
processor remaining after a failure. If this is done, it is not clear
where the handler for the exception should be placed. The handler will
be receiving extremely important information (namely that a failure has
occurred) and, in order to deal with the situation, it must be
guaranteed that the handler will be executed. Unless handlers for the
exception are placed in every task that might be running, execution of
the handler cannot be guaranteed. An alternative is to define a special
task to contain the handler and to raise the exception in this task
only. Clearly, the special task should not be executing until the
exception is raised. Unfortunately, there is no way to activate a task
with an exception in Ada.
Another approach to signaling failure is to view the required
signal as being very like an interrupt, and transmit the information to
the Ada software by a call to a predefined entry. Again there is the
problem of where the entry should be defined to ensure execution.
However, in this case, the solution of defining a special task and
defining the entry within it works very well. If the task is given a
very high priority, it will be suspended on the entry until the call
that signals failure, whereupon it will immediately begin execution.
We propose therefore that a special task (RECONFIGURATION_I) be
defined on each processor (I being the processor number) that will
contain a special entry with a single parameter. The accept statement
associated with the entry will be in an infinite loop. This task will
normally be suspended on the accept statement for the special entry and
when a failure is detected by the heartbeat mechanism, a call to the
entry will be generated. The parameter passed will designate which
element of the system's hardware has failed. The task will then be
activated and will contain code within the accept statement to handle
reconfiguration. A general form of the body of this task is shown in
figure 3. Since this task is in an infinite loop, it returns to the
accept statement once a particular failure has been dealt with. Thus
subsequent failures will be dealt with in the same way and, in
principle, any number of sequential failures can be dealt with.
Further, if physical damage removes more than one processor at the same
time, the remaining processors will notice the loss of heartbeats in
some order, and calls to the entry in the reconfiguration task will be
generated sequentially. Thus multiple failures occurring together will
be dealt with as if they had occurred in some sequence.

   task body RECONFIGURATION_I is
   begin
      -- initialization code
      loop
         accept FAILURE(WHICH : in failure_types) do
            -- code to handle hardware failures
         end FAILURE;
      end loop;
   end RECONFIGURATION_I;

Figure 3.
Damage Assessment
Given the support system described in section VII, damage will be
limited to lost tasks and lost data. No remaining tasks will be
suspended. Each task that could have been suspended will have received
fake messages giving it the impression that the task it was
communicating with had been aborted. It is the programmer's
responsibility to ensure that the subsequent actions by these tasks are
appropriate. In addition, all tasks losing contexts will have been
aborted.
The tasks and data that were lost need to be determined. Provided
there is control of distribution (see below), this is quite simple. The
information about which tasks are on which processors could be
maintained and referenced in three ways.
(1) It could be stored in a table within the program itself.
(2) It could be stored within a table maintained by the
execution-support system. This table will be needed in any case because
it is required to implement inter-task communication. Provision could
be made for the program to interrogate it.
(3) The information could be stored implicitly within the program. If
all task-to-processor assignments are known at compile time and do
not change, the code used for reconfiguration following failure can
be written with the distribution information as an assumption.
There is no clear advantage to any of these methods. The choice in any
particular case is implementation dependent. In the example given in
appendix 1 we use the third method.
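As an illustration of the first method only, the table can be an ordinary constant array indexed by a task name. The task names, the three-processor configuration and the procedure REPORT_LOST are hypothetical; a real program would act on the result rather than print it.

```ada
with TEXT_IO; use TEXT_IO;

procedure DAMAGE_TABLE_DEMO is

   type PROCESSOR_ID is range 1 .. 3;                        -- assumed
   type TASK_NAME is (NAVIGATION, DISPLAY, ENGINE_MONITOR);  -- assumed
   type PLACEMENT is array (TASK_NAME) of PROCESSOR_ID;

   -- The distribution table held within the program itself.
   LOCATION : constant PLACEMENT :=
     (NAVIGATION => 1, DISPLAY => 2, ENGINE_MONITOR => 2);

   -- Lists the tasks that were resident on the failed processor.
   procedure REPORT_LOST (FAILED : in PROCESSOR_ID) is
   begin
      for T in TASK_NAME loop
         if LOCATION (T) = FAILED then
            PUT_LINE (TASK_NAME'IMAGE (T) & " was lost");
         end if;
      end loop;
   end REPORT_LOST;

begin
   REPORT_LOST (2);
end DAMAGE_TABLE_DEMO;
```

With the table shown, the failure of processor 2 reports DISPLAY and ENGINE_MONITOR. The table is only as good as the control of distribution behind it, which is why the text makes that control a precondition.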
Selecting and Effecting the Response
Algorithms for the selection of a suitable response, and the
algorithms used in that response, depend for their correct operation on
having appropriate data available. Each piece of data being manipulated
by a program for a typical embedded application can be regarded as
either expendable or essential. Partial computations and sensor
readings are expendable whereas navigation information or weapons
status are essential.
Expendable data need not be preserved across machine failures. A
[...]
given no special attention, and that the replacement software be written
with the assumption that these data items are not available.
Essential data does need to be maintained across machine failures.
In an Ada program this could be implemented in two ways. First, data
items that the programmer considered to be essential could be marked as
such (perhaps by a pragma), and the system would then be required to
ensure that copies of these data items were maintained on all machines.
Each time the data item was modified, all the copies would be updated.
In the event of failure, one of the backup copies could be used
immediately. This is simple for the programmer but potentially
inefficient. Consider for example a large array that was designated as
essential. If it were being updated in a loop, as each element was
changed, it would be necessary to update all the copies of that element.
The entire overhead associated with maintaining consistent copies would
be incurred for each element change. In practice, in order to allow
reconfiguration, it would probably be adequate to wait until all the
elements of the array had been modified and then update all copies of
the array at once.
A second approach is to provide the programmer with the tools to
generate consistent copies across machines. In this way, not only the
data items to be preserved but also the times during execution when
copies will be made will be under the programmer's control. We suggest
that this could be done by providing a special task (DATA_CONSISTENCY_I)
on each processor that will contain an entry with a single parameter.
The parameter would be a record with a number of variant parts, one part
for each essential data item. Calling the entry and passing the latest
value of a data item in the record together with the appropriate
discriminant causes the necessary copies to be made and distributed
while the calling task waits. Completion of the rendezvous indicates
that this process has been satisfactorily completed. A general form of
the body of this task is shown in figure 4.
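The interface described above can be sketched as follows. This is not a reproduction of figure 4. The item kinds ALTITUDE and FUEL, the record type ESSENTIAL_ITEM and the entry name UPDATE are hypothetical, and the task merely keeps a local copy where a real support system would run the two-phase protocol across machines before ending the rendezvous.

```ada
with TEXT_IO; use TEXT_IO;

procedure CONSISTENCY_DEMO is

   type ITEM_KIND is (ALTITUDE, FUEL);      -- assumed essential items

   -- One variant part for each essential data item.
   type ESSENTIAL_ITEM (KIND : ITEM_KIND := ALTITUDE) is
      record
         case KIND is
            when ALTITUDE => FEET   : INTEGER;
            when FUEL     => POUNDS : INTEGER;
         end case;
      end record;

   task DATA_CONSISTENCY_I is
      entry UPDATE (ITEM : in ESSENTIAL_ITEM);
   end DATA_CONSISTENCY_I;

   task body DATA_CONSISTENCY_I is
      COPY_OF_ALTITUDE : INTEGER := 0;
      COPY_OF_FUEL     : INTEGER := 0;
   begin
      loop
         select
            accept UPDATE (ITEM : in ESSENTIAL_ITEM) do
               -- The caller waits here until the copy is made.
               case ITEM.KIND is
                  when ALTITUDE => COPY_OF_ALTITUDE := ITEM.FEET;
                  when FUEL     => COPY_OF_FUEL     := ITEM.POUNDS;
               end case;
            end UPDATE;
            PUT_LINE ("altitude" & INTEGER'IMAGE (COPY_OF_ALTITUDE) &
                      ", fuel" & INTEGER'IMAGE (COPY_OF_FUEL));
         or
            terminate;   -- lets the demonstration end cleanly
         end select;
      end loop;
   end DATA_CONSISTENCY_I;

begin
   DATA_CONSISTENCY_I.UPDATE ((KIND => ALTITUDE, FEET => 31000));
   DATA_CONSISTENCY_I.UPDATE ((KIND => FUEL, POUNDS => 12500));
end CONSISTENCY_DEMO;
```

Because the copy is made inside the accept statement, return from the entry call is the caller's indication that the update has been recorded, which matches the role given to completion of the rendezvous in the text.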
Another interface that could be used to provide access to the data
[...]
required to appear in the specification of the task or task type to
which it applies, as is done with the predefined pragma priority. This
pragma would require the compiler to generate instructions and loader
directives for the designated task to ensure that it is placed in the
required machine. This is the notation used in the example given in
appendix 1. Alternatively, for tasks created by allocators, the pragma
could be required to appear in a declarative part and it would apply
only to the block or body enclosing the declarative part (similar to the
predefined pragma optimize). Its effect would be to cause all tasks
created by allocators in the body or block to be distributed to the
machine designated by the pragma. Similarly, for a particular
implementation, the identifier parameter in the address clause could be
interpreted as a task name, and the expression parameter as a machine
designation.
As was pointed out in section V, the creation and deletion of tasks
that might be required as part of effecting a response is easily
achieved in Ada using allocators and the abort statement. Thus the
particular accept statement within the reconfiguration task that is
executed for a given failure can create and delete whatever tasks are
needed to provide alternate service. A simpler approach to providing
replacement software is to arrange for the required replacement task to
be present and executing before the failure, but suspended on an entry.
Such a task would not consume any processing resources although it would
use memory, but it could be started by the reconfiguration software very
quickly and easily by calling the entry upon which the replacement task
is suspended. A general form for a replacement task is shown in figure
5. This is the mechanism used in the example in appendix 1.
Redirection of communication to alternate software that has been
started following a failure has to be programmed ahead of time into
tasks that call entries in tasks that might fail. It will be necessary
for the reconfiguration task on each processor to make status
information about the system available to all tasks on that processor.
Each task must interrogate this information before making a call to
an entry on a remote machine in case the entry has changed because of
failure.
In summary, an Ada program that uses the support system described
in section VII to allow it to tolerate the loss of one or more
processors would have the following form:

(1) A main program consisting of a single null statement.

     task ALTERNATE_SERVICE is
        pragma distribute (PROCESSOR_I);
        entry ABNORMAL_START;
     end ALTERNATE_SERVICE;

     task body ALTERNATE_SERVICE is
     begin
        -- Code necessary to initialize this alternate service.
        accept ABNORMAL_START;
        -- This task will be suspended on this entry until it is
        -- called by RECONFIGURATION_I following failure.  The
        -- code following the accept statement provides the
        -- alternate service.
     end ALTERNATE_SERVICE;

                              Figure 5.

(2) A static structure in which there is little or no nesting of the
    application and alternate tasks themselves. They may define nested
    tasks for their own use. This is necessary to ensure that these
    tasks are visible to the reconfiguration and data consistency
    tasks.

(3) A set of tasks providing the various application services; the
    distribution of the tasks being controlled by an implementation-
    defined pragma or address clause. Each task would contain handlers
    for exceptions (such as TASKING_ERROR) that might be generated by
    the support system if failure occurred while that task was engaged
    in communication with a task on a remote machine.

(4) A set of tasks designed to provide any alternate service that the
    programmer chooses; each alternate task suspended on an accept
    statement that will be called to start it executing.

(5) A task on each processor designed to cope with reconfiguration on
    that processor; this task containing one entry for each hardware
    component that might fail. These entries would be called
    automatically by the support system following failure detection.

(6) A task on each processor designed to distribute copies of essential
    data for tasks on that processor. Rendezvous with this task allows
    any other task to distribute essential data at any time the
    programmer chooses.
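
A minimal single-machine sketch of items (4) and (5) is given below. It
is an illustration only: the entry name PROCESSOR_2_FAILED is
hypothetical, the distribution pragma is omitted, and the call made in
the main program stands in for the call that the support system would
make after detecting a failure (so the main program here is not the
single null statement of item (1)). The terminate alternatives are an
addition so that the sketch also ends cleanly when no failure is
signalled.

```ada
with Text_IO; use Text_IO;

procedure Reconfiguration_Sketch is

   -- Item (4): an alternate task, suspended until it is needed.
   task ALTERNATE_SERVER is
      entry ABNORMAL_START;
   end ALTERNATE_SERVER;

   -- Item (5): one entry per hardware component that might fail.
   task RECONFIGURE_1 is
      entry PROCESSOR_2_FAILED;
   end RECONFIGURE_1;

   task body ALTERNATE_SERVER is
   begin
      select
         accept ABNORMAL_START;
      or
         terminate;
      end select;
      -- Code providing the alternate service would follow here.
      Put_Line ("ALTERNATE_SERVER: alternate service started");
   end ALTERNATE_SERVER;

   task body RECONFIGURE_1 is
   begin
      select
         accept PROCESSOR_2_FAILED;
         -- Start the replacement by calling the entry on which it
         -- is suspended.
         ALTERNATE_SERVER.ABNORMAL_START;
      or
         terminate;
      end select;
   end RECONFIGURE_1;

begin
   -- Stand-in for the support system reporting loss of processor two.
   RECONFIGURE_1.PROCESSOR_2_FAILED;
end Reconfiguration_Sketch;
```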
CONCLUSION
Although the probability of failure per unit time for a modern
fault-tolerant processor is low, it is not zero. The loss of processors
in a distributed system is certain to occur and must be anticipated. In
order to benefit from the flexibility of distributed processing, crucial
systems must be able to deal with processor failures.
Ada was designed for the programming of embedded systems, many of
which are crucial and distributed. We have examined Ada's suitability
for programming distributed systems in which processor failure has to be
tolerated and found it to be inadequate. The difficulties have been
discussed and proposals to avoid them have been suggested. These
proposals involve extensive modification to the execution-time system
used by Ada and careful organization of the Ada program itself but no
language changes.

Although the discussion presented here is in terms of distributed
systems, similar problems can arise in shared-memory multiprocessor
systems where processor failure has to be anticipated. If the system is
organized so that different processors execute different tasks,
processor failure at an arbitrary point could produce exactly the same
damage as was discussed in section V.

We consider the non-transparent approach to be the only one that is
feasible and this requires language facilities for its support. The
fact that Ada makes no explicit provision for this type of fault
tolerance is unfortunate. The solution presented in this paper uses
existing features of the language and is far from ideal. Modifications ...
Appendix 1 of APPENDIX 5
A Programming Example
This is a very simple example designed to illustrate some of the
ideas discussed in this paper. In a typical Ada application, the
program would be much larger and would have to take into account all the
language features mentioned.

The example consists of a calling task CALLER that operates on one
processor (processor one) and a serving task SERVER that operates on
another processor (processor two). The calling task does some real-time
processing and calls an entry in the serving task in order to get some
kind of service. The program is written to cope with failure of either
processor. Alternates are provided for the calling and the serving
tasks and a reconfiguration task is present on each processor.

Normally only the calling and serving tasks are executing and a ...
... entry call, on the other processor to an entry in its task
RECONFIGURE_I (where I is the processor number). This task then calls
the ABNORMAL_START entry for the alternate that is needed and processing
is able to continue. Entries are defined in RECONFIGURE_I for each
component that might fail. In this example, each machine is only
interested in the failure of the other so only one entry is defined in
each reconfiguration task.
If a rendezvous is in progress when the failure occurs, then the
serving task need not care that the calling task has been lost, and the
rendezvous can complete. The calling task will care if the serving task
has been lost because this will indefinitely suspend the caller. Thus
TASKING_ERROR is raised by the run-time system in the calling task.
This frees the calling task and allows it to prepare itself to use the
alternate server.

Note that the server does not need to be aware that the caller has
been replaced by an alternate if the caller's machine fails because the
rendezvous is asymmetric. The entries in the server can be called by
any task, in particular both the caller and the alternate caller.
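
The caller's side of this recovery can be sketched on a single machine
as follows. The sketch rests on several assumptions that are not part
of the example described above: loss of processor two is imitated by a
SERVER task that completes without accepting its entry (standard Ada
then raises TASKING_ERROR in the caller), the entry name SERVICE is
hypothetical, and for brevity the caller starts the alternate itself
rather than leaving this to a reconfiguration task.

```ada
with Text_IO; use Text_IO;

procedure Failover_Sketch is

   task SERVER is
      entry SERVICE;
   end SERVER;

   task ALTERNATE_SERVER is
      entry ABNORMAL_START;
      entry SERVICE;
   end ALTERNATE_SERVER;

   task body SERVER is
   begin
      null;  -- completes at once: stands in for the lost processor
   end SERVER;

   task body ALTERNATE_SERVER is
   begin
      select
         accept ABNORMAL_START;
      or
         terminate;
      end select;
      accept SERVICE do
         Put_Line ("ALTERNATE_SERVER: request handled");
      end SERVICE;
   end ALTERNATE_SERVER;

begin
   -- This block plays the role of CALLER.
   begin
      SERVER.SERVICE;
   exception
      when TASKING_ERROR =>
         -- The caller is freed and redirects its communication.
         Put_Line ("CALLER: server lost, using alternate");
         ALTERNATE_SERVER.ABNORMAL_START;
         ALTERNATE_SERVER.SERVICE;
   end;
end Failover_Sketch;
```

The handler is what allows the caller to continue. Without it the
exception would propagate and the calling task would be lost as well.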
Appendix 2 of APPENDIX 5
Damage in Task Communication
In this section, the simple rendezvous will be examined in detail
as an example. Other language elements involving task communication are
considered only briefly since, for the most part, the difficulties that
arise from processor failure are similar to those that arise in the
simple rendezvous. The phrase "the processor executing task X fails"
will be abbreviated to "task X fails" whenever no confusion arises.

A simple rendezvous in Ada consists of a calling task C making an
entry call, S.E, to a serving task S, that contains an accept statement
for the entry E. The syntax is shown informally in figure 6. The
semantics of the language require that if the call is made by C before
the accept statement is reached by S, C is suspended until the accept
statement is reached. If S reaches the accept statement before the call
is made by C, S is suspended until the call is made. In either case, C
remains suspended until the rendezvous itself is complete.
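
For readers without the figure to hand, a small self-contained program
showing a simple rendezvous is sketched below. It is an illustration
rather than a reproduction of the authors' figure. The task names C
and S and the entry E follow the text, while the parameter VALUE and
the output lines are assumptions added so that the order of events is
visible.

```ada
with Text_IO; use Text_IO;

procedure Simple_Rendezvous is

   task S is
      entry E (VALUE : in INTEGER);
   end S;

   task C;

   task body S is
   begin
      -- S is suspended here if C has not yet made its call.
      accept E (VALUE : in INTEGER) do
         Put_Line ("S: serving request" & INTEGER'IMAGE (VALUE));
      end E;
   end S;

   task body C is
   begin
      -- C is suspended here until the accept body above has finished.
      S.E (42);
      Put_Line ("C: rendezvous complete");
   end C;

begin
   null;
end Simple_Rendezvous;
```

Whichever task arrives first waits for the other, and the line printed
by C can appear only after the line printed by S inside the accept
body.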
In order to look at the effects of processor failure on the
rendezvous, it is necessary to specify an implementation at the message
passing level. Only the simple case of a task C calling an entry E in a
serving task S will be considered. Further, we assume that the call is
made before S has reached the corresponding accept statement; the case
where the serving task waits at its accept statement is similar. One
possible message sequence is shown in figure 7.
The calling task C asks to be put onto the queue for entry E. When
S reaches its accept statement for E, it sees that C is on the queue. C
can be considered to be engaged in the rendezvous after the
RENDEZVOUS_START message arrives at C. When the rendezvous is completed
the RENDEZVOUS_COMPLETE message would be sent and C would continue.
We assume that all messages arrive safely.
Using this implementation of a simple entry call, what happens if
either the serving or the calling task fails? A detailed examination of
all possible cases has appeared elsewhere [13] and will not be repeated
here. However, it is clear that there are several situations in which
processor failure could cause one of the tasks to be trapped. For
example, if task S fails at any point after the RENDEZVOUS_START message
is sent but before the RENDEZVOUS_COMPLETE message is sent, the latter
will never be sent. Task C has no way of distinguishing this situation
from a long service time by task S, and so will wait forever. Although
the processor executing C is still working, task C is permanently
suspended by the loss of a different processor.

[Figure: message sequence between Calling Task C and Serving Task S]
It might appear that the timed entry call solves some of the
problems raised above but it does not. The semantics of the timed entry
call appear to be quite straightforward [2]:

     A timed entry call issues an entry call that is canceled if a
     rendezvous is not started within a given delay.

... definition that a timed entry call with a delay of zero is the same
as a conditional entry call:

     If a rendezvous can be started within the specified duration (or
     immediately, as for a conditional entry call, for a negative or
     zero delay), it is performed and the optional sequence of
     statements after the call is then executed.

If the delay included both message passing time and time on the queue, a
delay of zero would be impossible and a timed entry call with a delay of
zero would never succeed.

Another interpretation of the delay is that it is just the time
spent waiting on the entry queue. We assume that this is the delay
intended by the language definition since this has a meaning when the
specified delay is zero.

A timed entry call gives protection against having to wait too long
on the entry queue. Thus it could be used to provide protection against
processor failure before the rendezvous starts but not afterwards. An
analysis of the message traffic necessary for the timed entry call can
be performed that is similar to that shown in figure 7. The issues that
arise when considering failure are similar but more extensive than those
that arise with the simple rendezvous. What the task issuing the call
needs is some guarantee that it will not be trapped in an attempt to
communicate, and forced to miss a deadline. It does not matter to the
task whether the time is spent waiting on a queue, or attempting to
send a message, or any other activity.
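
The limited protection given by the timed entry call can be seen in
the following single-machine sketch. The delays are arbitrary values
chosen for illustration, and a long service time inside the accept body
stands in for a server whose processor has failed after the rendezvous
started. Neither is taken from the report.

```ada
with Text_IO; use Text_IO;

procedure Timed_Call is

   task S is
      entry E;
   end S;

   task body S is
   begin
      delay 1.0;        -- S reaches its accept statement late
      accept E do
         delay 2.0;     -- long service time inside the rendezvous
      end E;
   end S;

begin
   -- First call: no rendezvous can start within 0.5 seconds, so the
   -- call is cancelled while the caller waits on the entry queue.
   select
      S.E;
      Put_Line ("first call: rendezvous took place");
   or
      delay 0.5;
      Put_Line ("first call: cancelled while on the queue");
   end select;

   -- Second call: the rendezvous starts before the delay expires.
   -- From then on the delay gives no protection and the caller waits
   -- for as long as S takes to finish the accept body.
   select
      S.E;
      Put_Line ("second call: returned only after S finished");
   or
      delay 1.0;
      Put_Line ("second call: cancelled");
   end select;
end Timed_Call;
```

In the second call the caller is held for about two seconds even
though it asked for a one second limit. If S never finished, the
caller would never be released.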
The conditional entry call is no more helpful than the timed entry
call. Again, the semantics of the conditional entry call appear to be
quite straightforward [2]:

     A conditional entry call issues an entry call that is then
     canceled if a rendezvous is not immediately possible.

By a similar argument to that used with timed entry calls, we conclude
from the rules of the language that "immediately" must mean zero waiting
time on the entry queue. As message passing time can vary,
"immediately" may turn out to be an arbitrary delay. Apart from the
semantic difficulties arising in a distributed system, the possibility
of the caller being trapped indefinitely following processor failure
occurs with conditional entry calls as with the other rendezvous.
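
For comparison, a conditional entry call has the following shape in
Ada. This is a single-machine sketch in which the one second delay in S
is an arbitrary assumption used to make S unready when the call is
made. On one machine "immediately" involves no message passing, which
is exactly the part that becomes uncertain in a distributed system.

```ada
with Text_IO; use Text_IO;

procedure Conditional_Call is

   task S is
      entry E;
   end S;

   task body S is
   begin
      delay 1.0;        -- S is busy and not yet at its accept
      select
         accept E;
      or
         terminate;     -- lets S end if nobody calls again
      end select;
   end S;

begin
   select
      S.E;
      Put_Line ("rendezvous took place");
   else
      -- Taken when a rendezvous is not immediately possible.
      Put_Line ("S not ready: call cancelled at once");
   end select;
end Conditional_Call;
```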
We now consider the creation of nested tasks. Again, a detailed
examination of the difficulties arising with task creation has appeared
elsewhere [13]. Here, we give an example to show the potential
problems. Task creation by allocators will not be considered; the
difficulties that arise are similar.

A task is created in two steps. First, it is elaborated, at which
point entry calls can be made to it. Second, it is activated, that is,
the declarative part of its body is elaborated and it begins execution.
Elaboration of a task occurs as a part of the elaboration of the body of
the declarative part of the parent unit. Activation of a task occurs ...

To see the difficulties that task creation can raise, consider a
set of three tasks P, A, and B, with A and B nested inside P. A and B
have no other tasks nested within them and each task is to execute on a
different processor; P on processor one, A on processor two, and B on
processor three. The elaboration of the body of P includes the
elaboration of A and B and so messages will be sent from processor one
to the other processors requesting the elaboration of A and B. Once
this is done, tasks A and B can accept entry calls. When task P reaches
its begin, all of the objects that it declares have been elaborated, and
A and B are then activated. This requires an "activate A" message being
sent from processor one to processor two and an "activate B" message
being sent from processor one to processor three. The activation of P
requires that the activation of A and B be complete, and so P cannot
proceed until it has received responses to these activation messages
indicating that A and B have been activated. Clearly there are numerous
difficulties that can arise if any one of the three processors fails.
For example, P will be suspended forever if either processor two or
three fails at any time before both A and B have completed activation.
In that case, P could not proceed because one of its dependents would
never be activated. Similarly, A would be trapped if it called an entry
in B and processor three failed after B was elaborated but before it was
activated. Both A and B would be trapped if processor one failed after
the elaboration messages were sent to A and B, but before the activation
messages were sent.
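
The structure of P, A and B can be written down directly. The sketch
below runs on one machine, so no elaboration or activation messages are
involved and the placement of each task on its own processor is not
expressed. The entry E and the output lines are assumptions added for
illustration.

```ada
with Text_IO; use Text_IO;

procedure Nested_Activation is

   task P;

   task body P is

      task A;

      task B is
         entry E;
      end B;

      task body A is
      begin
         -- A would be trapped here if B were never activated.
         B.E;
         Put_Line ("A: call to B.E completed");
      end A;

      task body B is
      begin
         accept E;
      end B;

   begin
      -- P executes this statement only after the activation of both
      -- A and B has completed.
      Put_Line ("P: dependents A and B have been activated");
   end P;

begin
   null;
end Nested_Activation;
```

The relative order of the two output lines is not fixed by the
language, since A runs concurrently with P once activation is
complete.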
Task termination produces difficulties also. A task waiting at a
select statement with an open terminate arm can be terminated if its
master is completed and all other dependents of the master are either
terminated or waiting at select statements with open terminate arms. In
order to check this condition it is necessary to suspend all the
dependents as they are checked. If this is not done, it is possible for
two dependents each to be waiting at a select statement with an open
terminate when checked, but never to both be waiting at select
statements simultaneously. If the dependents are executing on a
different processor from the master, it is necessary for the master to
send messages to its dependents suspending them for the duration of the
termination check. If the master task's processor fails before a
message to restart is sent (assuming that termination is not possible), a
suspended dependent will remain suspended forever.
Even accessing a non-local or a shared variable could cause the
referencing task to be suspended. Access to a variable stored on a
remote machine requires that a request message be sent and that a reply
be received. Failure of the remote machine between the two messages
would cause suspension of the requesting task. Although the language
allows an implementation to use a copy of a shared variable, updating it
only at synchronization points, and the definition of a synchronization
point can vary depending on whether or not the variable is declared as
shared, there are still implied update messages at synchronization
points. Access to a shared variable on another processor requires that
a dialogue take place, and failure of the processor on which the shared
variable resides could trap the task attempting to reference the
variable.

REFERENCES
(1) McTigue, T. V., "F/A-18 Software Development - A Case Study",
    Proceedings Of The AGARD Conference On Software For Avionics, The
    Hague, Netherlands, September 1982.
(2) Reference Manual For The Ada Programming Language, U. S. Department
    of Defense, 1983.
(3) Department Of Defense Requirements For High-Order Computer
    Programming Languages - STEELMAN, U. S. Department of Defense,
    1978.
(4) Wensley, J. H., et al, "SIFT, The Design and Analysis of a Fault-
    Tolerant Computer for Aircraft Control", Proceedings of the IEEE,
    Vol. 66, No. 10, October 1978.
(5) Hopkins, A. L., et al, "FTMP - A Highly Reliable Fault-Tolerant
    Multiprocessor For Aircraft", Proceedings of the IEEE, Vol. 66, No.
    10, October 1978.
(6) Tanenbaum, A. S., "Network Protocols", ACM Computing Surveys, Vol.
    13, No. 4, December 1981.
(7) Schlichting, R. D., and F. B. Schneider, "Fail-Stop Processors: An
    Approach To Designing Fault-Tolerant Computing Systems", ACM
    Transactions On Computer Systems, Vol. 1, No. 3, 1983.
(8) Cornhill, D., "A Survivable Distributed Computing System For
    Embedded Applications Programs Written In Ada", ACM Ada Letters,
    Vol. 3, No. 3, December 1983.

