Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer.  Volume 4:  FTMP executive summary by Lala, J. H. & Smith, T. B., III
x.,,#_-c,f'-,.,g_,,:..o.@
NASA Contractor Report 172286
NASA-CR- 172286
19840017298
?
DEVELOPMENT AND EVALUATION
OF A FAULT-TOLERANT
MULTIPROCESSOR (FTMP) COMPUTER
Volume IV
FTMP Executive Summary
T. Basil Smith, III and Jaynarayan H. Lala
THE CHARLES STARK DRAPER LABORATORY, INC.
555 Technology Square
....... _.- --" "i',_'(" '''_::
Cambridge, Massachusetts 02139 "'_0_'_" _'_"_"I"_'_'I_'_'°'; '""CONTRACT NAS1-15336
FERUARY 1984 .._.._
r_ _ m_._A_ _o_'__' _
"-"_'J'": ,o_4
NASA _GLE_R_s,_rE_
National Aeronautics and LIBRARy, NASA
Space Administration HAMP[ON, VIRGINIA
Langley Research Center
Hampton, Virginia 23665
https://ntrs.nasa.gov/search.jsp?R=19840017298 2020-03-20T21:55:35+00:00Z

R-1603
DEVELOPMENT AND EVALUATION
OF A
FAULT-TOLERANT MULTIPROCESSOR (FTMP) COMPUTER
VOLUME IV
FTMP EXECUTIVE SUMMARY
by
T. Basil Smith, III
Jaynarayan H. Lala
February, 1984
Contract NASI-15336
with
The NASA Langley Research Center
Hampton, Virqlnia 23365
/
Approved :
N. E. Sears
Department Head
The Charles Stark Draper Laboratory, Inc.
Cambridge, Massachusetts 02139

CSDL-R-1603
DEVELOPMENTAND EVALUATION OF A
FAULT-TOLERANT MULTIPROCESSOR(FTMP)
COMPUTER
VOLUME IV FTMP EXECUTIVE SUMMARY
by
T. BasilSmith, III andJaynarayanH. Lala
February1984
FOREWORD
This report was authored by Dr. T. Basil Smith and Dr. Jaynarayan
H. Lala. Dr. Smith was the project engineer. Mr. Charles Meissner was
the NASA technical monitor for the period January-December 1982, and
Mr. Nicholas Murray was the technical monitor from August 1978 to
December 1981. Following are some of the people who contributed to the
success of this project.
Draper Laboratory
Dr. Albert Hopkins
Mr. Jack McKenna
Ms. Linda Alger
Mr. Kevin Koch
Mr. Alan Wimmergren
Mr. Robert Scott
Mr. Joseph Marino
Mr. David Hauger
Mr. Mario Santarelli
Collins Avionics
Mr. Ron Coffin
Mr. Charles Schulz
ii
TABLE OF CONTENTS
Chapter Page
I INTRODUCTION ....,..................,...................... I
2 THE FAULT-TOLERANT MULTIPROCESSOR CONCEPT ...........,..... 5
3 SYSTEM ARCHITECTURE AND EXECUTIVE DETAIL ,,.....,,,.,..,.,. 9
3.1 Overall Architecture Organization ..................... 9
3,2 Processor Module and System Bus Interaction .......... 12
3.3 Slave Region and System Bus Interaction .............. 13
3.4 Clock Generation Module .............................. 18
3.5 Bus Guardian Units ................................... 18
3.6 System Bus Interface Circuits ........................ 19
3.7 Power System ......................................... 20
3.8 Executive Software ................................... 21
4 RELIABILITY AND AVAILABILITY MODELS ....................... 25
4.1 Survival Probability Models .......................... 26
4.2 Dispatch Reliability of the FTMP Computer ............ 28
5 THROUGHPUT AND RESPONSE TIME PERFORMANCE .................. 31
6 TEST AND EVALUATION EXPERIMENTS ........................... 37
7 CONCLUSIONS ............................................... 43
8 REFERENCES ................................................. 47
iii
LIST OF FIGURES
Figure Page
I FTMP Physical Organization and Packaging ................... 6
2 Software Appearance of FTMP ................................ 10
3 Processor and Memory Redundancy ............................ 11
4 FTMP Probability of Failure ................................ 27
5 FTMP Dispatch Reliability .................................. 30
6 Frequency Distribution of Total Recovery Time
(I second scale) ........................................... 40
7 Frequency Distribution of Total Recovery Time
(100 msec scale) ........................................... 41
iv
CHAPTER I
INTRODUCTION
This report is the final volume of a multi-volume report on the
Fault-Tolerant Multiprocessor (FTMP) project sponsored by the Langley
Research Center of the National Aeronautics and Space Administration
under Contract NAS1-15336 covering the period from August 1978 to
December 1982. This volume is an executive summary of the project.
Previous reports produced under this contract are as follows:
Volume I - FTMP Principles of Operation
Volume II - FTMP Software
Volume III - FTMP Test and Evaluation
The FTMP architecture is an evolving concept which is directed
toward ultra-high reliability aerospace applications. The engineering
prototype was sized to serve as the information processing and control
system for an advanced civil transport application. Its projected
reliability, a probability of failure of 10-9 over a ten hour flight,
is suitable for full time fly-by-wire control in such an application.
The ultimate reliability and throughput of the FTMP can be varied by
addition or deletion of processing and memory modules and is also
affected by such parameters as packaging and circuit technologies. The
FTMP thus has an applicability which is broader than this single design
point, and it can be tailored to a wide range of demanding applications.
The flightworthy engineering model of the FTMP was constructed
under the present contract by the Collins Avionics Division of Rockwell
International Corporation per architectural specifications provided by
the Charles Stark Draper Laboratory (CSDL). The FTMP engineering model
was delivered by Collins to CSDL in June 1980. It was then integrated
with CSDL-developed software, MIL-STD 1553 remote terminal emulators,
special test hardware, and a real-time aircraft simulator.
About six months after delivery to Draper, hardware modifications,
which were required to correct implementation and design flaws, were
completed and an executive kernel capable of demonstrating fault toler-
ance and dynamic reconfigurability of the FTMP was operating. A display
console attached to the FTMP via the 1553 I/O interface provided system
status displays allowing convenient tracking of system reconfigurations
in response to fault insertion. In the following year, the executive and
flight control software was elaborated and the system was integrated into
a jet transport aircraft simulator with cockpit. During this phase of
the program, the fault-tolerance of the FTMP was demonstrated to hundreds
of visitors from NASA, all branches of the armed forces, the Department
of Defense, other government agencies, industry, and academia.
Once the software development was complete and the FTMP was fully
integrated into the simulation laboratory, it was subjected to over
20,000 faults as part of a test, evaluation, and validation plan. FTMP
response to each fault was recorded and were reduced to frequency histo-
grams. Fault insertion and data acquisition were done automatically
under control of a PDP-11/60, using a fault injection device designed and
built at Draper.
The FTMP engineering model and software along with the test
facility hardware and software were delivered to NASA Langley in August
1982.
Overall, the FTMP program was a notable success, demonstrating and
validating many of the underlying design concepts.
The first three volumes of this report dealt with various aspects
of the FTMP in great detail. This final volume summarizes the results of
this project. Chapter 2 highlights the FTMP architecture, hardware
implementation, and executive software. Chapter 3 summarizes reliability
modeling results. Chapter 4 is a brief discussion of performance bottle-
necks and advantages of the FTMP architecture. Results of experimental
test and evaluation are presented in Chapter 5, and Chapter 6 concludes
this report.

CHAPTER 2
THE FAULt-TOLERANT MULTIPROCESSOR CONCEPT
The two underlying philosophic assumptions of the Fault-Tolerant
Multiprocessor Concept are: I) that crucial elements of the fault-
tolerant computing system can be most economically implemented in
hardware, and 2) that the chosen architecture for a fault-tolerant system
should have performance, programmability, and useability attributes which
are as good as those of equivalent simplex or non-redundant systems.
The architectural organization which was selected for the FTMP was
that of a conventional homogeneous multiprocessor. It was felt that this
organization was particularly well adapted to the programming and
performance requirements of the projected commercial transport environ-
ment. Figure I illustrates such a system organization. The multiproces-
sot environment provides an easily expanded and general purpose computing
system which is able to handle the multiple concurrent tasks projected
for such an application. The use of a multiprocessor structure with its
shared memory also provides for the required high speed communications
and data buffering between the tasks. The multiprocessor organization
also has secondary reliability advantages. Degraded versions of such a
5
I C C C i
PROCESSOR o PROCESSOR o PROCESSOR o I
TRIAD m -- TRIAD m -- TRIAD m J--
I m 2 m I m ii
SYSTEM BUS
SYSTEM TIME CONTROL/ 'O O O O
MEMORY CLOCK/ STATUS
COUNTER
Figure I. Software Appearance of FTMP
system can function with reduced numbers of processors, but these advan-
tages would not have justified a multiprocessor organization had that
organization not been a natural choice for its intended application.
The second design choice was to provide redundancy and fault-
handling mechanisms which are transparent to the software organization,
both as regards to its structure and performance. This was, in fact,
critical. After matching the architecture to the application (redundancy
and fault-tolerance aside) it is essential that the additions required
for fault tolerance not compromise the rationale for the original archi-
tectural choices. The FTMP solution allows for the independent design
of applications and operating system code design, apart from the design
solutions to fault handling and fault recovery problems. The aggregate
complexity of these two problems is thus much less than if they are
handled jointly. A hardware implementation of the primary
fault-detection and fault-masking mechanisms achieved this goal with
minimal impact on system throughput. This compares to software-intensive
solutions which generally must devote over 50% of available processing
power to such functions.
Experience with the FTMP confirmed the soundness of these princi-
pal architectural decisions.
All of the software of the FTMP was developed, coded and debugged
as if it were to run on a simplex equivalent of the multiprocessor.
Code, debug and test was in fact actually conducted using a simplex
development station. Code was then routinely moved from this development
station to the redundant configuration without difficulty. The
programmability and software productivity advantages hoped for were thus
largely confirmed. Executive design was conventional and the
architecture allowed efficient control from a higher level language.
Programmer productivity exceeded 140 assembler level language statements
(or their equivalent) per programmer day, competitive with the best
practice today.
The choice of hardware mechanisms to provide primary fault
handling also proved to be sound. Performance degradation due to
fault-handling hardware was minimal and the implementation was indeed
largely transparent to the software. Despite implementation flaws (in
areas unrelated to fault-handllng hardware) which caused overall
performance to fall somewhat short of projections, comparisons with the
SIFT approach indicated the substantial advantage of this approach.
Basic fault-handling functions in the FTMP were handled by hardware and
were at least two orders of magnitude faster than the software
equivalent, in SIFT [I,2]. A basic paradox was that this specialized
voting and fault-handllng circuitry of the FTMP proved to be smaller and
less complex and much less trouble-prone than the specialized data
exchange circuitry of the SIFT machine. This can be attributed to the
efficiencies of synchronous (FTMP) versus asynchronous (SIFT) exchange
mechanizations and the trivial amount of circuitry which was in fact
required to implement the basic fault-handling mechanisms.
CHAPTER 3
SYSTEM ARCHITECTURE AND EXECUTIVE DETAIL
This chapter outlines the functional and design concepts of the
Fault-Tolerant Multiprocessor engineering model, FTMPo
3.1 Overall Architecture Or@anization
Figure I of the preceding chapter illustrates the functioning of
this system from a logical software or programmer's viewpoint. Three
processor triads function as the logical equivalent of simple processors
with shared access to a single shared system bus. Through this bus they
share access to common memory, I/O ports, a real-time clock/counter, and
control and status registers. Each processor triad can also directly
write to the communications registers of the other processor triads using
the system bus. This provides processor to processor interrupt
facilities which are required for efficient implementation of the
multiprocessor execution. The entire system is synchronized by the
equivalent of a single system-wide clock.
The actual redundancy underlying the buses, system memory triads,
processor triads, the clock quad, and the I/O system is invisible to the
programmer. This redundancy is shown conceptually in Figure 2 for the
processor and memory portion of a full up system configuration and in
more detail for the entire system in Figure 3. Any three processor
9
PROCESSOR TRIAD PROCESSOR TRIAD PROCESSOR TRIAD
" S ,,Y ,,S
o # 7
! I I
MEMORY TRIAD 1 MEMORY TRIAD 2 MEMORY TRIAD 3
ADD 0000-3FFFhe x ADD 4000-7FFFhe x ADD 8000-BFFFhe x
Figure 2. Processor and Memory Redundancy
PROCESSOR SLAVE
REGION REGION
/ /
_ _ _ LR_Ui_._ _ ,_. ..... LRU2 LRU10
'
I P/C i
L _.
....I _-'_ O0 0
i CONTROL I
=..... I
INTERFACES
BUS
1
F-J
2
3
4¸
5 = ---------
b
Figure 3. FTMP Physical Organization and Packaging
modules can be combined to form a processor triad, as can any three
memory modules.
The system is constructed of ten identical line replaceable units
(LRUs). Each LRU contains a Processor module, a Slave module (Common
System Memory and I/O), a Clock Generation module, two Bus Guardian
units, System Bus Interface circuits, and a Power subsystem. Figure 3
illustrates this physical organization and packaging.
3.2 Processor Module and System Bus Interaction
Processor modules operate in groups of three called processor
triads. Processor triads are formed by assigning the processor modules
of any three LRUs to work together in tight synchronism. In the FTMP it
is possible for up to three processor triads to be in operation simulta-
neously, utilizing processors from nine of the ten available LRUs. The
processor of the tenth LRU serves as a spare. A processor triad
functions as if it were a single processor executing a single instruction
stream. With three triads operating simultaneously, three instruction
streams are in parallel execution. The system then functions as a
conventional three-processor multiprocessor. The failure of a single
processor of a triad does not impact the correct execution of that
triad's instruction stream, because voting is used to mask the effects of
the failure. Comparison techniques also enable the failed processor to
be detected and identified. A spare processor region can then be used to
replace the failed element of the triad. If no spares are available, the
damaged triad is retired from service with the surviving functional
12
elements being used to replenish the spares pool, and thus available to
repair processor failures in the other triads.
The processor triads write data to, or read data from, locations
within the system bus address space by means of the System Bus. This bus
is a quintuply redundant, fully duplex, eight megabit per second serial
bus. During a block transfer, data can be written from a processor triad
at a peak rate of one word every 5 microseconds. Data can be read by a
processor triad at a peak rate of one word every 3 microseconds. The
error correction and error detection relies upon voting and comparison of
data and addresses appearing on redundant elements of the System Bus.
At any one time three of the five redundant bus lines are active.
These active lines are called the bus triad. Each element of a processor
triad transmits data and addresses on a different one of the bus triad's
lines. Since the elements of the processor triad are all operating in
tight synchronism, it is possible for any unit receiving a processor
triad read or write request to compare the separate versions of that
request by examining the separate copy of that request arriving on each
bus of the bus triad. The receiving unit can correct any errors caused
by a single processor region failure or single bus failure by using
majority voting.
3.3 Slave Module and System Bus Interaction
The Slave Module contains a number of subsystems, all of which are
addressed, read, and written as locat_ons in the System Bus address
space. These subsystems are the shared System Memory Module, the I/O
13
Port, LRU Control/Status and Communications Registers, and a Real-Time
Clock/Counter. Certain of these modules are normally operated within
triads of three; others are operated as single units.
Shared System Memory Modules are normally operated within a
triad. The system memory modules of three LRUs are assigned to function
together servicing a block of the system bus address space. Up to three
memory triads can be formed from the system memory modules of nine of the
ten LRUs. The tenth LRU's system memory module would then serve as a
spare. In the FTMP as programmed only two memory triads were required
and they were backed up by four spares. Each memory triad is assigned to
serve a different block of the system memory address space. The failure
of a memory module within a triad does not impact the integrity of data
stored in that block, as voting is used to mask the effects of the failed
module. Comparison techniques also enable the failed module to be
detected and identified. A spare memory module can then be used to
replace the failed element of the triad.
The real-time clock/counters of each LRU are also intended to
operate together as a triad. The real-time clocks are used to maintain
the absolute time base for the system. All real-time clocks are
addressed by the same system bus address. A processor triad write to
that location sets all real-time clocks to the same value. The real-time
clocks of three LRUs can be armed to respond to read requests. Only
those three LRUs respond to any processor triad real-time clock read
requests and as such they function as the real-time clock triad. A
failure of any element of that triad is masked by voting. Comparison
14
techniques enable the faulty unit to be identified. Any one of the
unarmed real-time clocks can be used to replace the failed element of the
real-time clock triad. Note that even the unarmed real-time clocks
respond to write commands from a processor triad. Thus they will always
agree with the elements of the real-time clock triad and can therefore be
used to replace an element of the triad without reinitialization.
Unlike the processor modules, the memory modules, and the real-
time clock/counters, the I/O ports operate independently of one another.
Each I/O port responds to its own unique set of system bus addresses.
Data and command words are transferred from a processor triad to an I/O
port over the System Bus, appearing to the processor triad as routine
system bus writes. As with any processor triad writes, voting at the
receiving end serves to mask the failure of any one of the processor
triad elements. The I/O port buffers any I/O transmissions, assembling
an entire message before initiating an I/O bus transaction. The I/O port
also buffers any incoming I/O transaction, assembling an entire remote
terminal message. The entire transaction is then transferred as a block
to a processor triad in response to a read request from that processor
triad. The I/O port utilizes MI_STD-1553A data bus protocols and
signalling standards in its communications with the exterior. Two
twisted shielded pairs can be used, one for transmitting and one for
reception, creating a fully duplex data link. If these two pairs are
tied together to a single twisted pair, then they conform to all specifi-
cations of MIL-STD-1553A. A MIL-STD-1553A avionics data bus is a I MHz
serial data bus employing Manchester encoding to send both clocking and
15
data information on a single shielded twisted pair bus line. Maximum I/O
transaction length can be 32 data words, one command word, and a status
word requiring up to 700 microseconds for the transaction to be made.
During this period the I/O port can act independently, and the processor
triad may release the System Bus for regular bus traffic. Since each I/O
port can operate independently, it is possible for the FTMP to be engaged
in up to ten I/O bus transactions simultaneously, one on each of the I/O
buses dedicated to each of the ten I/O ports.
The remaining elements of the Slave Region are System Control/
Status and Communications Registers. These elements are used to control
various parts of an LRU, to read the status of the error detection
circuitry of an LRU, and to provide direct processor triad to processor
triad communications.
The control registers are all write only. They are assigned fixed
locations within the system bus address space depending upon LRU identi-
fication number. These LRU control registers control which bus lines the
LRU uses for voting, triad assignment for the processor region, memory
relocation factor for the system memory, whether the real-time clock is
armed or not, and other LRU assignments or functions.
The status registers, or error latches, can only be read by a
processor triad. They report any bus errors observed by the error detec-
tion circuits of the LRU. Like the LRU control registers, the status
registers are assigned fixed system bus addresses dependent upon their
LRU identification number.
16
The communications registers are used to implement direct proces-
sor triad to processor triad communications. Each communications
register can only be written using the system bus. The communications
register can be read by the processor region of the LRU directly, appear-
ing as a local memory location on its internal processor module data
bus. The system bus address assignment of each communication register
within the LRU is keyed to the processor region triad assignment of that
LRU. This assignment is contained in one of the control registers of the
LRU. The local processor region transfer bus address of each communica-
tions register is fixed and is the same for all LRUs.
Only one LRU responds to control register writes or status
register reads, that LRU being determined by the system bus address of
the register being accessed. All LRUs with the appropriate processor
region triad assignment will respond to communication register writes.
When multiple LRUs are responding to a communication register write, they
act in tight synchronism with one another.
Each LRU's Slave region is assigned to transmit on only one
element of the redundant system bus. These assignments are made so that
each element of a system memory triad or real-time clock/counter triad is
assigned to a different bus. Each element of a processor triad can
therefore mask a fault within a responding triad by appropriate majority
voting circuitry. When reading from a simplex source, such as the I/O
port or status register, the processor triad does not receive redundant
information, but instead must accept the data from the single system bus
line on which it appears and verify its accuracy by other means, which
are described in Chapter 5.
17
3.4 Clock GenerationModule
All elements of th_ m_itiprocessoroperate using a common time
reference. This time base is providedby the Clock GenerationModules of
four LRUs which are phase locked to one another. The Clock Generation
Regions of the remaining LRUs are then phase locked to any three of
elements of the clock quad. Each clock generatorthus provides a timing
source for its LRU which is in synchronism with all other correctly
functioninqqenerators [3]. Such a system can tolerate the failure of
any one of the clock generators. All correctly functioning clock
generatorsremain synchronizeddespite such a failure. A failurewithin
the quad is detected and identifiedand another clock generator can be
assigned to replace the failed unit. Of the ten clock generators,four
are assigned to the quad clock and the remainingsix are either spare and
in standbymode, or failed.
3.5 Bus Guardian Units
The overall integrity of the system relies upon the ability to
reliably control the access that any element of an LRU has to the system
bus. Each LRU of the system has two Bus Guardian Units, BGUs, which
function to protect the system bus from a failed or malfunctioning LRU or
other element within the LRU. Each BGU has bus enabling registers which
control the LRU access to each individual line of the system bus. In
order that the LRU be enabled to transmit on any line of the system bus,
it is necessary that the enabling bit from both BGUs within that LRU be
set. Either BGU can block the LRU's ability to transmit on a line.
18
When the configuration control program creates a processor triad,
for example, it must first assign the processor regions of three LRUs to
the same processor triad. It does this by writing into the system
control registers of the selected LRUs. It must then assign each proces-
sor region to transmit on separate lines of the system bus triad. It
does this by writing to the enabling registers of both BGUs of each
selected LRU, assigning each LRU to the appropriate processor transmit
lines of the system bus.
The register loading mechanism of each BGU responds to a unique
system bus address keyed to that BGU's LRU and location within the LRU.
This register loading mechanism allows each BGU register to be written by
a normal system bus write transaction. Because it is important that a
BGU act independently of the LRU in which it is located, each BGU
receives all the redundant copies of all processor triad transmissions
appearing on the system bus. _AII registers within the BGU are nonvola-
tile so that the bus assignments and line select codes are remembered
through a power failure.
3.6 System Bus Interface Circuits
The actual connections to the system bus are made by the System
Bus Interface Circuits within an LRU. These circuits perform several
functions.
First, and most obviously, they provide the necessary drivers and
receivers for each individual bus line. Selection of these drivers and
19
bi
receivers is dictated by electrical constraints such as adequate power !}
and noise immunity.
Secondly, the interface circuits provide the means for controlling
the fault environment of the system bus. The driver circuit for each
line performs the necessary gating of LRU signals onto system bus lines
depending upon the values obtained from the BGU enabling registers. Thus
it is this driver circuit which actually functions to cut the connection
between an LRU and the system bus. The receiver circuits distribute
independent copies of the system bus data, as required to each BGU within
the LRU and to the other LRU circuitry. Each BGU and the input process-
ing circuitry of the slave region must receive independent copies of the
processor transmission lines so that they can each perform an independent
vote and act independently of one another in response to processor triad
commands. Partitioning within the interface circuitry is designed so
that a single fault can bring down no more than one element of the system
bus or pollute no more than one copy of the received data from the system
bus. The partitioning and fault containment aspects of these interface
circuits are critical to the overall fault tolerance of the FTMP.
3.7 Power System
Each LRU has its own power subsystem. This subsystem consists of
a power supply, which provides all the required voltages used within the
LRU, and a battery backup circuit, which provides low power battery power
for maintaining the CMOS system memory and the nonvolatile registers
2O
during primary power loss. This local power subsystem is overvoltage and
overcurrent protected.
The local power subsystem of each LRU draws power from four 28 VDC
power buses. Each LRU is fused at its connection point to this quad
redundant power bus so that internal shorts within an LRU can at most
only momentarily disrupt power on the primary power buses. The local
power subsystems have adequate energy storage to tolerate these interrup-
tions while the fuses blow. The local power supplies draw power evenly
from all of the power buses. Thus under most circumstances, each of the
power buses is fairly equally loaded. Any one of the power buses is
capable of fully supplying all power to operate the entire FTMP.
The four primary power buses are driven from four independent
primary power supplies. In this particular implementation, each of these
power supplies is identical and converts three phase 208 VAC, 400 Hz
input power to 28 VDC.
3.8 Executive Software
The Executive software is responsible for managing the hardware
and software resources, for maintaining system integrity and for timely
and orderly execution of applications tasks. It forms a link between the
FTMP hardware and the user programs. The FTMP hardware, along with the
Executive, project an image of a virtual machine to the user such that
the hardware redundancy and redundancy management become transparent to
the user. The user is only aware of a multiprocessing environment in
21
which several processors execute tasks in parallel and are linked togeth-
er by a single shared memory.
The part of the Executive software that is responsible for the
execution of the applications tasks is called the Dispatcher. The Dis-
patcher is at the heart of the Executive. The remaining executive or
systems tasks, that is, all systems tasks other than the Dispatcher, are
treated by the Dispatcher like applications or user tasks. These other
systems tasks include a system configuration controller, system status
displays and self-test programs. They are scheduled to run by the Dis-
patcher as user tasks.
All the user and systems tasks are repetitive in nature. They are
executed at one of three different iteration rates. The three rate
groups, called R4, R3, and RI, presently execute at 25, 12.5, and 3.125
Hertz, respectively. The highest frequency tasks, that is, those in the
R4 rate group, are given the highest priority, while the lowest frequency
tasks, that is,those in the RI rate group, are given the lowest priority
for execution. The configuration controller and display tasks are dis-
patched at the RI rate.
The Executive is a timer-interrupt driven "floating" executive.
Each iteration of the R4 rate group tasks is initiated by a hardware
timer interrupt. The Executive can run in any processor triad, no
processor triad is a master or slave triad and all triads have equal
authority. Only one triad is allowed to alter the Executive data bases
at any given time. Access to the shared data bases, such as task queues
etc., by different processor triads, is controlled by semaphores resident
22
in the shared memory. The timer interrupt initiates R4 task iteration or
frame in just one triad. The other triads follow the lead triad when
prompted to do so through an IPC (Inter-Processor Communication)
interrupt. At the completion of an R4 frame, the triad completing
execution of the last R4 task becomes responsible for initiating the next
R4 frame using its timer-interrupt. All I/O is performed in the R4
frame.
The second element of the Executive is the system configuration
controller. The configuration controller detects hardware faults by
analyzing information from the error latches associated with the bus
voting circuitry of each LRU. It is also responsible for identifying
faulty units and reconfiguring the system to replace them with spares, or
gracefully degrading the system if no spares are available. Faults are
isolated to the LRU sub-unit level such as processor, memory, clock, I/O
port, and system bus line. Weak intersections of LRU's and buses, that
is, an LRU unable to transmit or receive on a particular bus line, are
also identified. Faults that do not persist long enough to be isolated
are handled by a transient fault analysis algorithm. Demerits are
assigned to all possible sources of a transient fault. The accumulated
demerits are then analyzed for statistical significance to locate the
source(s) of transient faults. In addition to on-demand FDIR (Fault
Detection, Isolation and Recovery) the configuration controller period-
ically checks critical hardware elements using self-test programs.
Examples of such items are voters and error latches. Spare sub-units are
also constantly cycled into active state to ascertain their integrity.
23
These include processors, system memory units, fault-tolerant clock
elements, I/O ports, and system bus lines.
The third element of the Executive software is the bootstrap/
restart program. This program is responsible for bootstrapping the FTMP
when the system is "cold", that is the system memory has not been ini-
tialized. In the cold start case, the system memory is initialized using
an external device such as a cassette tape on a 1553 remote terminal. The
bootstrap program itself is resident in the cache PROM of each processor
unit. The restart program is also responsible for restoring the system to
a correct operational state after a power interruption. In this case the
system memory is already loaded with all the programs, data, and the
pre-power interrupt system configuration. The FTMP is brought up to this
configuration and all the relevant data bases such as task queues for the
dispatcher are re-initialized.
The fourth element of the Executive is the cache memory manage-
ment. The cache RAM in each processor unit is only 8k words, while the
total amount of data and programs in the system memory can be much
larger. Therefore some method is necessary to page programs/data in and
out of the cache. This is accomplished by an on-demand paging algorithm
that reads the required page of data/program from the system memory into
a cache page. The cache page is chosen on a round-robin basis.
The last element of the Executive is really a hardware enhancement
function. The CAPS-6 processor does not have vectored interrupts. This
function is now provided in the software. That is, all interrupts are
vectored to their respective interrupt handling routines through the
Executive software.
24
CHAPTER 4
RELIABILITYAND AVAILABILITYMODELS
The reliability and availability requirements of the FTMP are
extremely severe due to the intended life-critical application of the
multiprocessor. The desired catastrophic failure probability is less
than 10-9• The preferred maintenance interval is of the order of 200
to 300 flight hours and the likelihood of requiring unscheduled mainte-
nance during this period is to be at most a few percent.
The FTMP architecture to meet these stringent goals has evolved
over a period of time, and reliability modeling has played an important
role in it. Modeling tools were used at various stages to evaluate
alternate architectures and provide feedback in deciding upon degree of
redundancy for various components and revealing any other weaknesses in
chosen architectures from the viewpoint of meeting reliability and avail-
ability goals. An ideal arrangement of elements so arrived at was
modified slightly by the reality of packaging constraints.
The modeling effort concentrated on random hardware failures. It
did not address other classes of faults such as design faults, execution
errors and generic faults, either hardware or software.
The following two sections summarize the results of that effort.
25
4.I Survival Probability Models
The computation of survival probability of the FTMP for random
failures is divided into the following three submodels:
1) probability of failure due to the lack of perfect coverage
using a Markov process model;
2) probability of failure due to exhaustion of spares using a
combinatorial model;
3) probability of failure due to BGU failures in enable mode
using a combinatorial model.
In the FTMP some time is required to detect, isolate, and recover
from any failure. During this time a second failure may arrive in such a
place as to be catastrophic. Therefore, the coverage is imperfect. This
phenomenon is most conveniently modeled using a Markov model. The proba-
bility of failure due solely to exhaustion of equipment can be computed
independently using combinatorial methods. In addition to these, there
is a third failure mode peculiar to the FTMP architecture, which relates
to bus guardian unit (BGU) failures.
The results of the three models are shown in Figure 4. Time
taken by FTMP to recover from a fault is ass_ed to be exponentially
distributed with a mean value of 250 milliseconds. Component failures
are assumed to be of constant hazard type with the mean time between
failures for an LRU equal to 2,610 hours. This LRU MTBF is derived from
failure rates estimated for 18 LRU subunits.
The system failure probability due to near simultaneous occurrence
of multiple faults is seen to dominate the mission time of interest, that
26
-2 EXHAUSTION
10 SYSTEM FAILURE PROB. OF SPARES
-410
-6 BGU ENABLE10 FAILURES
DESIRED RELIABILITY GOAL
-810
bo
10 -9 _
-lo I10
LACK OF COVE_,AGE I
I
-1 I I10
! II
I I
I0 -14 ! , ,
• 1 1 10 100 1000 ilOURS
Figure 4. FTMP Probability of Failure
is, 0 to I0 hours. During this time period the likelihood that the
system will fail catastrophically Increases linearly with time. At one
hour it is found to be 2x10-I0 and at I0 hours it is 2x10-9.
4.2 Dispatch Reliability of the FTMP Computer
Availability of equipment, in general, is an important concern in
the commercial air transport industry. Availability of avionics equip-
ment, in particular, is more important for a central computer with
digital "fly-by-wlre" authority. It is imperative, therefore, that the
dispatch reliability of the FTMP computer be commensurate with its high
survival probability. A preliminary estimate of the dispatch reliability
is summarized here.
Let the "dispatch minimum complement" (DMC) denote the amount of
equipment (processors, memories, etc.) necessary to be operational before
take-off so that flight reliability will still meet goals for a ten-hour
flight. As long as the amount of operational equipment in-flight does
not fall below the "critical minimum complement" (CMC), the system is
said to be successful. The CMC is the amount of equipment necessary to
perform all the fllght-crltlcal functions. The CMC was estimated to be
two processor triads, a memory triad, a bus triad and one power supply.
In the critical minimum configuration, a triad need have only two
operating members to work correctly. The DMC was found to be as follows
[41:
28
Dispatch Minimum Critical Minimum
Complement Complement
Processors 8 5
Memories 6 2
Buses 4 2
Power Supplies 3 I
The question to be answered at this point is, how long would it
take an initially fully operational FTMP to degrade below the DMC and
thereby fail the dispatch criteria? The probability of this event at
time t, assuming no maintenance, is shown as a function of time in Figure
5. It is seen from this figure that there is a seven percent chance that
the computer will be below the dispatch minimums if the maintenance is
scheduled every 300 hours. The probability of requiring unscheduled
maintenance can be reduced to just over two percent by carrying an extra
LRU or by shortening the maintenance interval to 200 hours.
29
1.0 Dispatch i0 LRUs
Failure /
Probability II LRUs
l0-1
10 -2
10 -3.
W
o
10 -4
10 -5
-610 ' ' | ' ,
1 l0 100 500 1000 HOURS
Figure 5. FTMP Dispatch Reliability
CHAPTER 5
THROUGHPUT AND RESPONSE TIME PERFORMANCE
The real-time, life-critical applications for which FTMP is
designed not only place severe reliability requirements on the computer
but also demand quick response and adequate throughput to handle all the
critical tasks in a timely fashion. Performance of the multiprocessor
was modeled using GPSS (a general purpose modeling language) before the
computer was built. This model did not reveal any weaknesses or inade-
quacies in the expected performance of the machine. However, once the
actual hardware was programmed to run the executive and typical flight
control programs, the actual performance was somewhat short of the
desired goal. As it turned out, a critical bottleneck, unforeseen of
course, was left out from the model. The bottleneck became quite appar-
ent with the hands-on experience of programming and using the system.
Pleasantly, at the same time, a number of advantages of the FTMP archi-
tecture were also revealed. These two sides of the FTMP are briefly
discussed next.
The initial performance goal was to have enough throughput to
support execution of flight control, navigation and guidance, air data
displays, and other on-board systems management tasks. It was hoped that
31
some of these tasks would be able to run at as high as 40Hz with other
tasks executing at lower frequencies such as 20, 10, and 5 Hz. The
present version of the executive supports three rate group tasks running
at 25, 12.5, and 3.125 Hz. Additionally, the current application work
load is less than the intended levels. This shortfall in the multipro-
cessor throughput performance can be traced to a single bottleneck
relating to system bus access.
Presently, to access any device attached to the system such as
shared memory, I/O port, error latches, configuration control registers
and so on, it is necessary to load four registers. This is done by an
assembly language level subroutine and takes up to 80 microseconds for
each bus access. This is just the set-up time and does not include the
data transaction time itself. Given the large number of bus transac-
tions, this set-up time alone accounts for a third of all time spent by
the highest frequency task dispatcher. This bottleneck was not ade-
quately recognized at the outset.
However, there is a way to solve this problem. It is possible to
microcode system bus access routines. Adequate storage is available in
the CAPS-6 mlcrocode memory to add these instructions. If this step were
taken, the system bus access set-up time could be reduced from 80 to
about 10 microseconds. Over 50 Hz frequencies for highest rate group
tasks and proportionately higher frequencies for lower rate group tasks
as well could be easily achieved.
On the plus side, the FTMP architecture allows one to achieve a
relatively high level of 'useful' throughput for a given amount of raw
32
machine throughput. The useful throughput is defined here as the effec-
tive machine use available to the user or applications programmer. This
is the same as the total raw throughput less that used by the executive
software for managing and allocating system resources, contending with
system malfunctions, and managing system redundancy.
The real-time malfunction management in the FTMP is done extremely
efficiently in hardware. The microsecond to microsecond masking of
erroneous data is performed by hardware-implemented majority voters.
Hardware also provides immediate symptoms of faults by latching disagree-
ments between voter inputs in error registers. Correlating error symp-
toms to determine the identity of the faulty module is done in software
as are the system reconfiguration algorithms following fault identifica-
tion. This division of labor between hardware and software gives FTMP a
tremendous performance advantage over software-intensive fault-tolerant
systems. Error detection and masking must be performed on all redundant
data and this imposes a load that is proportional to the number of words
as well as frequencies at which they are read. In the FTMP, a majority
vote on three copies of a data word is taken in a few microseconds while
the same task in software would at least take many tens of microseconds.
So the load on the latter system is roughly ten to a hundred times
higher. The fault identification and system reconfiguration tasks which
are not time critical are performed in software. These tasks are tOO
complex to be implemented in hardware and since they are seldom executed,
they place a negligible load on the system. Note that hardware voting
maintains system integrity in the face of a fault and that the config-
33
uration control can therefore proceed with some leisure to effect the
repair.
The bit-by-bit error detection and masking done by hardware in the
FTMP is possible due to the tight synchronization of parallel computa-
tional channels. The tightly synchronous architecture was originally
selected and pursued for its elegance and simplicity in handling fault
detection and identification. By having parallel computers perform
identical operations on identical data, one can expect them to produce
outputs which also match each other bit for bit. Any disagreements not
only indicate a fault but also reveal the identity of the faulty unit
immediately. This makes the fault detection and identification task and
the error masking task much less complex. (The parallel-hybrid redun-
dancy employed in FTMP makes fault identification slightly more compli-
cated. ) These advantages of the FTMP architecture were known at the
outset. What was not appreciated fully at the time was the performance
gain realized by hardware error correction and the transparency of these
functions to the software. This is explained in the following.
Midway through the program it became apparent that the number of
words passing through voter functions had been severely underestimated
and would increase considerably due to 'simplex source congruency '
requirements. This requirement may be defined as the identical distri-
bution of data obtained from a simplex source (internal or external to
the computer) to redundant processing or memory elements. In the absence
of source congruency even redundant systems can suffer single point
catastrophic failure. In the FTMP, source congruency is simply accom-
34
plished by having the processor triad write the word to shared memory
(where each memory module receives three copies of the word, votes upon
it, and then stores it) and reading it back from the memory. Had the
voters been implemented in software and the processors only loosely
synchronized, the burden of source congruency for a reasonable I/O
loading would probably be unbearable.
Last, but not least, there are many features of the FTMP which
make it very attractive and appealing from a user's viewpoint. The
software appearance of the machine is such that the applications program-
mer is not aware of the hardware redundancy underlying each element. It
is as if one had an extremely reliable simplex multiprocessor. From a
user's viewpoint the FTMP architecture is that of a conventional multi-
processor. As a testament to this simplicity, the flight control soft-
ware running on the FTMP was written by someone totally unfamiliar with
the intricacies of the redundancy underlying the processor, memory, and
bus structure.
Complexities of the FTMP architecture are not only transparent to
the user under normal circumstances but also when there is a hardware
malfunction. Fault detection, diagnosis, and recovery is such that
applications programs do not need to be involved in this process. There
are no program rollback requirements, no checkpoints or other constraints
imposed upon the user for the sake of fault tolerance.
All of these comments apply to most of the Executive software as
well. That is, with the exception of the configuration control software
35
the Executive is unaware of the hardware redundancy and is not involved
in fault handling.
This implies that applications programs and most of the executive
can be written and debugged quite early on a simplex version of the
machine. In fact, all software for the FTMP was written and debugged
using only a single simplex processor. No facility was even available to
deP_Igsoftware in triplex operation.
Since hardware costs much less than software and the gap is likely
to continue to become wider, any feature of the architecture that simpli-
fies software requirements has a proportionately positive impact on the
cost as well.
All of these advantages add up to make the FTMP a more 'usable'
computer.
36
CHAPTER 6
TEST AND EVALUATION EXPERIMENTS
Reliability and availability models of the FTMP and their results
were presented in Chapter 3 of this report. Assumptions underlying these
models concerning failure and recovery rates of the FTMP were briefly
mentioned there. To verify these and other assumed characteristics of
the FTMP implicit in the modeling process, a series of test and evalua-
,0
tion experlments were undertaken. In these experiments the FTMP was
subjected to numerous artificially created faults while operating
routinely in a simulated aircraft environment, and its response in each
case was observed and recorded. Other factors that motivated FTMP test
and evaluation were expanding its validation envelope, building confi-
dence in the system, revealing any weaknesses in the architectural
concepts and/or their realization in hardware and software, and a general
stressing and shaking out of the fault detection hardware and the fault
identification and system configuration control software.
Towards this end a total of 21,055 pin level stuck-at class of
faults were injected into one LRU of the FTMP, and in each case time to
detect the fault, to isolate the fault to processor, memory, or bus, and
to remove the faulty module by reconfiguring the system were recorded.
37
The FTMP executed the normal repertoire of system and applications
software while being subjected to faults. This included the operating
system, self-test programs, displays and flight control. The flight
control program operated with dummy inputs obtained from the 1553 Remote
Terminals since the actual aircraft simulation was not operational during
fault injections. Faults were injected using a Draper-designed and
built fault injection device which was controlled by a PDP-11/60 support
computer. The process of fault injection was highly automated with the
fault injection software maintaining closed loop control of the FTMP
status through 1553 based communication with the configuration control
program. The results are summarized next.
A total of 21,055 pin level faults were injected into the FTMP.
Of these, 17418, or 83 percent, were detected. Of the 3,637 undetected
faults at least 80 percent were estimated to be on unused gates and
pins. A few of the remaining undetected faults were analyzed and found
to belong to the 'don't care' class. Further analysis of undetected
faults is required to arrive at a definitive detection coverage value.
Identification and reconfiquration coverages, on the other hand, were
found to be perfect. The system identified all detected faults correctly
and successfully recovered in each case.
The total time to recover from a fault was dominated by time spent
in the detection phase. Time to identify a fault and reconfigure the
system was found to be deterministic and bounded, as expected. Average
identification and reconfiguration times were found to be 88 and 82
38
milliseconds, respectively. Total recovery time averaged over all 17,418
(detected) faults was found to be 1.16 second although it would be only
549 milliseconds if BGU faults were excluded. In the absence of BGU
self-test programs, which was the case here, BGU faults were solely
uncovered by very low frequency routine system reconfigurations. BGU
diagnostics are straightforward to code but were left out due to limited
time and resources.
The FTMP reliability models described in Chapter 4 assumed an
average recovery time of 250 msec. The experimentally determined average
(assuming BGU diagnostics) is about twice as long. This would increase
the FTMP failure probability by a factor by two over the analytically
derived value.
The distribution of the total recovery time, though not exponen-
tial, was found to be favorably skewed. Over 95 percent of all faults
are recovered from in a second or less. Figures 6 and 7 show this
distribution on two different time scales. The 119 second maximum
recovery time was for BGU faults which were detected by slow system
reconfiguration. Assuming BGU diagnostic programs were coded for the
FTMP, the maximum recovery times would be of the order of several seconds
rather than a hundred seconds. This would make the tail of the distribu-
tion very short. An exponential distribution has an infinitely long
tai!. Hence, the experimentally determined fault recovery distribution
for the FTMP is thought to be more favorably skewed than an exponential
distribution.
39
The test and evaluation experiments did uncover a serious omission
in the FTMP Engineering Model. The Bus Gaurdian Units were found to have
no overvoltage protection. Although an overvoltage protection mechanism
was specified for BGUs, it was not built into the hardware. This is dis-
cussed in more detail in Volume III. In any case, it is not difficult to
add such a circuit on the BGU cards.
In addition to this hard data, a number of very important though
intangible results were obtained as well. The hardware and software, in
general, and the fault detection hardware and the fault identification
and system configuration control software, in particular, performed
extremely well under the stress of thousands of faults. In a sense the
FTMP architecture, the hardware, and the software have been validated
informally.
The test and evaluation experiments, their positive results, and
the 100 percent availability of the FTMP during 13,000 hours of routine
operation at the Draper Laboratory have all substantially bolstered con-
fidence in the FTMP concept as well as its realization in hardware and
software.
40
CARO: ALL 15:5q:00 11/18/82
FAULTS: 17418
.D
o AVERAGE: 1160
MAXIMUM: 1188q3
Q
0
0
L_ "
eDi-CT"
F--
Z
LUo(_)o
rr- •
LUg-O-
o
o
o_
o
o
o I I I I I I I I 1 I I
0 1000 2000 9000 _1000 5000 806: "7000 8000 9000 10000 11000 120000
TOTAL RECEVERT TIME (MSEC)
Figure 6. Frequency Distribution of Total Recovery Time
(i second scale)
41
CARD: ALL 15-514:00 11/1B/$2
. FAULTS: 17410
o RVE,_AGE: 1160
o MRXIHUM: I188L13
6_
o
o
0
0
CE
F--
Z
b-]o
(-)o
CI:•
tO_b-]_
0
E3
--I
0
0
--t
c:;! I I I i I I I I I I I
o loo 200 300 _oo soo 600 7o_ E._o 900 looo lloo 12oooo
TOTRL RECOVEq'r T'KE (HSEC)
Figure 7. Frequency Distribution of Total Recovery Time
(i00 msec scale)
42
CHAPTER 7
CONCL US IONS
The FTMP project has been a multi-year technology development.
The immediate focus was the demonstration of a concept for providing
generalized fault-tolerant computing for future transport aircraft. It
became clear during the course of this project that such a concept is
feasible. It also became clear that computing technology is a very broad
and swiftly moving field and that if fault-tolerant computing is to
mature it must mature in reasonable harmony with companion developments.
Fault-tolerant computing must therefore provide the throughput, perform-
ance, operating systems, and software environment of mainstream comput-
ing. A fault-tolerant machine must be as usable as any state-of-the-art
machine. Toward this end it must emulate or duplicate virtually all of
the widely accepted and desirable features of state-of-the-art, non-
fault-tolerant, general purpose machines. To suggest that fault-tolerant
computing can grow without acceptance of the bulk of experience and
practice from other computing technologies is arrogant, naive, and
errant.
The FTMP architecture sought to provide a software environment
which was equivalent to that of a general purpose homogeneous multipro-
43
cessor. The choice of a multiprocessor was dictated by throughput
requirements. In this regard the machine architecture mimics the archi-
tecture of state-of-the-art multiprocessor configurations. The use of
tightly synchronized parallel computation paths in all components of the
multiprocessor created a software environment which was identical to that
of a non-redundant multiprocessor of equivalent throughput. Hardware
synchronization and voting techniques were provided to manage all of the
moment to moment error masking and error correction independently of the
software.
The operating system of the FTMP, as a result, is quite ordinary.
The structure, coding, and organization of the operating system is almost
a textbook example of ,a 'simple' multiprocessor executive, as might be
covered in an undergraduate computer science course. The simplicity and
efficiency of the executive were dictated not by requirements related to
fault tolerance but by the time criticality and throughput requirements
of the real-time nature of the FTMP applications set. That the FTMP
executive is so ordinary was a significant achievement of this project.
The applications code of the FTMP similarly benefited from the
software environment. All applications were coded and run as if they
were being executed on a non-redundant multiprocessor. The organization
of such code is unencumbered by the fault tolerance of the machine. The
FTMP applications code organization is thought to be reasonably represen-
tative of that which might have been employed in real life, and was
adequate to demonstrate closed loop control of a simulated aircraft in
the CSDL Simulation Facility. This code probably represents a somewhat
44
idealized and sophomoric view of aircraft control and its complexity, but
the significant factor is that it is structured solely to address the
designer's perception of the real-time control problem and is not back-
ward engineered to meet an artificial or unnatural environment created by
a fault-tolerant architecture. An alternate structure could have been
demonstrated with equal ease, the chosen structure being an example.
The unique piece of software, the code which is linked to this
architecture and this implementation, is the configuration control
program. The FTMP employs dynamic redundancy to achieve its high degree
of reliability. Failed components are eventually replaced with spares
and hardware voting maintains system integrity until the repair is
effected. This process of reliability renewal or repair is directed by
software, embodied in the configuration control program. The complexity
of this configuration control was modest and well within the grasp of one
programmer.
Cast in the light of a systems problem, the FTMP can be seen as
problem oriented. The machine is a multiprocessor because it was felt
the problem demanded a multiprocessor. A standard executive structure
was chosen because it met the problem needs and was fairly well under-
stood. Applications code was developed as an example. The total design
hides fault tolerance at a lower implementation level, separate and
partitioned from the system problem. The techniques and technology which
enabled this partitioning proved to be tractable and successful. Over
" 20,000 faults injected during the test phase demonstrate this success.
Newer versions of the FTMP would no doubt benefit from this experience
45
base and provide even greater performance and a richer I/O environment.
Additionally, the technology is available to provide the same degree of
transparency for alternate architectures, single processor systems, or
attached processor systems, for example.
The success of the FTMP was that it shows that problem-oriented
solutions can be _%de fault tolerant and that the fault tolerance can be
hidden such that advances in operating systems, languages, instruction
sets, and applications code structures can be capitalized on and used.
46
CHAPTER 8
REFERENCES
I. Goldberg, Jack, et al., "Development and Analysis of the Software
Implemented Fault-Tolerance (SIFT) Computer," NASA Contractor Report
CR-172146, 1983.
2. Palumbo, D.L. and Butler, R.W., "SIFT - A Preliminary Evaluation,"
Proceedings of the 5th Digital Avionics Systems Conference, Seattle,
Washington, October, 1983.
3. Holt, A.W. and Meyers, J.M., "An Approach to the Analysis of Clock
Networks," C. S. Draper Laboratory Contract Report CR-1289, NASA
Contract Report CR-166028, August 1979.
4. Hopkins, A.L., Jr., Smith, T.B., III, and Lala, J.H., "FTMP -
A Highly Reliable Fault-Tolerant Multiprocessor for Aircraft,"
Proceedings of the IEEE, Vol. 66, No. 10, October 1978.
47
1. ReportNo. 2. GovernmentAccessionNo. 3. Recipient's Catalog No.
NASA CR-172286
4. Title and Subtitle 5. Report Date
Development and Evaluation of a Fault-Tolerant February, 1984
Multiprocessor (FTMP) Computer, Volume IV, FT_fP 6.Performi,gOrganizationCode
Executive Summary
7. Author(s) 8. PerformingOrganization Report No.
T. Basil Smith, III and Jaynarayan H. Lala R-1603
9. PerformingOrganizationName and Address 10. WorkUnit No.
The C. S. Draper Laboratory, Inc.
Cambridge, Massachusetts 02139 11.ContractorGrantNo.
NASI-15336
13. Type of Reportand Period Covered
i2. SponsoringAgencyName andAddress
Contractor Report
National Aeronautics and Space Administration
Washington, D.C. 20546
14. SponsoringAgencyCode
15. SupplementaryNotes
Langley Technical Monitor: Charles Meissner, Jr.
16. Abstract
This report is the final volume of a multi-volume report on the Fault-
Tolerant Multiprocessor (FTMP) project. It is an executive and technical
summary of the project.
The FTMP architecture is a high reliability computer concept modeled after
a homogeneous multiprocessor architecture. Elements of the FTMP are operated
in tight synchronism with one another and hardware fault-detection and fault-
masking is provided which is transparent to the software. Operating system
design and user software design is thus greatly simplified. Performance of
the FTMP is also comparable to that of a simplex equivalent due to the
efficiency of fault handling hardware.
The FTMP project constructed an engineering module of the FTMP, programmed
the machine and extensively tested the architecture through fault injection
an(!other stress testing. This testing confirmed the soundness of the
I;'TMPconcepts.
17. Key Words Suggested by Author 18. Distribution Statement
Fault-Tolerant
Distributed Processing Unclassified - Unlimited
Synchronous I
Distributed Executive Subject Category 62
Reconfigurable
19. Security Classif.(ofthisreport) 20. Security Classif.(ofthis page) 21. No. of Pages 22. Price
Unclassified Unclassified 52 A04
F_ salebytheNaUo_lTechnicallnf_mationService,Sprin£field,Vir£inia22161


