NASA Contractor Report 3920
Modeling and Measurement of
Fault-Tolerant Multiprocessors
Kang G. Shin, Michael H. Woodbury,
and Yann-Hang Lee
GRANTS NAG1-296, NAG1-492, and NGT 23-005-801
AUGUST 1985

NASA Contractor Report 3920
Modeling and Measurement of
Fault-Tolerant Multiprocessors
Kang G. Shin, Michael H. Woodbury,
and Yann-Hang Lee
University of Michigan
Ann Arbor, Michigan
Prepared for
Langley Research Center
under Grants NAG1-296, NAG1-492, and NGT 23-005-801
National Aeronautics
and Space Administration
Scientific and Technical
Information Branch
1985

TABLE OF CONTENTS
1. INTRODUCTION ........................................................................................ 1
2. PERFORMANCE MODELING OF REAL-TIME MULTIPROCESSORS ............. 2
2.1. Introduction ................................................................................. 2
2.2. System Architecture and Operation ............................................ 5
2.3. Stochastic Petri Net Model .......................................................... 9
2.4. Queueing Model Description ........................................................ 16
2.5. Solutions to the Queueing Model ................................................. 21
2.6. Description of Experimental System: FTMP ............................... 24
2.7. Queueing Model Representation of FTMP .................................. 28
3. MEASUREMENT OF FAULT LATENCY ................................................. 34
3.1. Introduction ................................................................................. 34
3.2. Methodology for Measurement of Fault Latency ........................ 40
3.3. Experimental Results and Analysis on FTMP ............................ 41
4. CONCLUSION AND DISCUSSION ............................................................ 51
ACKNOWLEDGMENT .............................................................................. 54
REFERENCES ............................................................................................ 55
LIST OF FIGURES
Figure 1. System Architecture .......................................................................... 6
Figure 2. Modified SPN Model ........................................................................... 11
Figure 3. Queueing Model ................................................................................. 17
Figure 4. A Block Diagram of FTMP (from [15]) ............................................. 27
Figure 5. Probability of Bus Contention vs. Bus Service Rate (μ_S) ................ 35
Figure 6. Probability of Bus Contention vs. Service Rate
of Job Class 1 (μ_1) .............................................................................. 36
Figure 7. Probability of an Idle Cluster vs. Idle Service
Rate (μ_I) ............................................................................................. 37
Figure 8. Probability of an Idle Cluster vs. P_I ................................................ 38
Figure 9. The Experimental Results and Estimated Distributions
for Stuck-at-0 Faults
Figure 10. The Experimental Results and Estimated Distributions
for Stuck-at-1 Faults .......................................................................... 47
Figure 11. The Experimental Results and Estimated Distributions
for Inverted Signal Faults .................................................................. 48
Figure 12. The Experimental Results and Estimated Distributions of Fault
Latencies at System Bus Controller ................................................... 49
LIST OF TABLES
Table 1. Place Descriptions ............................................................................... 13
Table 2. Transition Descriptions ....................................................................... 14
Table 3. Experimental Measurements ............................................................... 29
Table 4. Markov State Descriptions and Steady State Probabilities ............... 31
Table 5. Parameter Values ............................................................................... 32
Table 6. Idle Processors and Bus Contention Probabilities ............................. 33
Table 7. Experimental Results and Estimated h_f(t_j) ...................................... 44
Table 7. Experimental Results and Estimated h_f(t_j) (cont'd) ........................ 45
Table 8. Least-Squares Estimation of the Distributions of Fault Latencies .... 50

1. INTRODUCTION
This report deals with both the modeling and measurement of fault-tolerant mul-
tiprocessors. A detailed analysis of systems of this type is desired because of the increas-
ing number of mission-critical situations in which they are used. One would like to be
able to predict the performance of such systems for various workloads and how well they
recover from system errors. The speed and effectiveness of the recovery procedures for a
fault-tolerant multiprocessor have a direct effect on its performance.
In the first part of this report we present a model to analyze the performance of a
unibus¹ multiprocessor. A closed queueing network is developed to study the effects of
workload variation on bus contention, processor utilization, and performance. This
development entails representing the computer system with a modified Stochastic Petri
Net (SPN). This aids in illustrating the operation of the specific system and determining
which factors have the most significant effect on performance.
A second component of this report pertains to the measuring of fault latency in a
multiprocessor environment. This entails explicitly determining the distribution of fault
latency and its significance in system modeling and analysis. The result of this research
shows that fault latency is significant and that the common assumption of a negligible
fault latency may be incorrect.
An existing system, the Fault-Tolerant Multiprocessor (FTMP) located at the
NASA AIRLAB [17-20], is used as a modeling example. Many experiments have been
made on this system to measure fault latency and performance related factors, such as
bus contention and idle processors. It is the results of some of these experiments that
justify the conclusions drawn concerning fault latency.
¹ This unibus can consist of redundant buses which logically act as a unibus.
The rest of this report is organized as follows. Section 2 deals with the modeling of
fault-tolerant unibus multiprocessors and is divided into seven subsections. In Subsection
2.1 the performance modeling is introduced. Subsection 2.2 describes the specific archi-
tecture being addressed, a real-time unibus multiprocessor, and its operation. Subsec-
tions 2.3 and 2.4 describe the SPN model and the closed queueing network model,
respectively. The results of the queueing model and closed form solutions are presented
in Subsection 2.5. The experimental system, FTMP, is described briefly in Subsection
2.6. Subsection 2.7 shows the queueing model representation of FTMP and some meas-
ured experimental results pertaining to its performance.
In Section 3, we present the technique of characterizing fault latency, which is an
important system parameter for modeling computer systems. Subsection 3.1 introduces
the concept and approach to measuring fault latency. A methodology for measuring
fault latency is outlined in Subsection 3.2. An example of the application of the method
on FTMP is shown with experimental results in Subsection 3.3. Finally, the report con-
cludes with Section 4.
2. PERFORMANCE MODELING OF REAL-TIME MULTIPROCESSORS
2.1. Introduction
Representing the operation of a computer system by a structured model is a popu-
lar and natural approach to the study of a computer's performance. Many factors need
to be incorporated into the model so that it accurately describes the system that is being
modeled. The type of analysis desired dictates which factors of the computer's operation
need to be incorporated into the modeling framework. A factor that is almost always
included, especially in the study of computer performance, is the representation of the
workload handled by the computer system being analyzed. The workload is an essential
part of the performance evaluation of any computer system, because how well a com-
puter performs is directly related to the type of workload it is handling.
First, we present the beginning stages in the development of a model to study the
workload effects on performance for a specific computer architecture and application.
The type of system being addressed is a highly reliable unibus² multiprocessor that is
used in real-time control. Dealing exclusively with real-time systems in the evaluation of
multiprocessor performance is an approach that has not been largely addressed in the
literature. Usually, a general purpose multiprocessor is discussed, as in [1-3]. This type
of approach is difficult because of the largely varied workload general purpose systems
handle. Trying to represent a system of this type with its workload becomes unreason-
ably complex, if one wants to properly describe the workload effects on performance. It
appears that a number of interesting results can be obtained if one only considers the
structure of a real-time system and its workload.
The detailed analysis of this type of system is desired because of the increasing
number of critical situations it is used for, e.g., control of aircraft, spacecraft, nuclear
reactors, etc., where the failure of the controlling computer would result in catastrophic
losses. A failure could be the result of a physical malfunction or the result of the system
not reacting quickly enough as required [4].
Many authors have presented designs for synthetic workloads [5-8]. They have
usually relied on heuristic methods that seem to provide an adequate workload for a gen-
eral class of computing systems. Recently, Ferrari [9] has made the point that a more
systematic method is necessary, because of the fundamental correlation between work-
² As mentioned earlier, this can be redundant buses.
load modeling and any performance evaluation. Developing such a method is more com-
plicated than it might first appear. One first needs to define what the workload model
should cover in its representation, and what standard should be used to determine if a
workload model is a "good" model.
We view a real-time computer system as the combination of two closely dependent
components: the controlled process and the controlling computer [4]. Because of this
close dependency, we feel that the development of a synthetic workload for this type of
system should not only rely on the actual workload being modeled, but it should also
depend on the type of system handling the workload. It is this basic association that
sets our work apart from those of others. Having a specialized synthetic workload of
this type provides us with a means for producing more useful results relating to the per-
formance evaluation of real-time computing systems.
Typically, the workload of a real-time system is a fixed group of tasks that have to
be performed at certain intervals, repeatedly. There is usually a group of short, fre-
quently initiated tasks that monitor internal and external conditions and continually
compensate for their change. There are also tasks that are initiated less frequently that
require more computation time. The relative frequencies of the initiation of tasks, and
the numberof tasks that need to be completed in a certain time frame lead to strict per-
formance criteria.
It would be desirable to be able to determine if a computer system with the archi-
tecture mentioned above could handle a given workload and set of performance criteria.
If it can, one would like to know how this might be best accomplished. And finally, it
would be useful if this optimal performance could be measured. The model presented
here will hopefully aid in solving some of these problems.
Vital factors can be determined directly from the model, such as the amount of
processor idle time, the degree of contention for the single bus, and the tasks that have
the most significant effect on performance. These details will be discussed in a later sub-
section. The model can also be used as a tool for determining the optimal workload dis-
tribution to reach a certain level of performance.
2.2. System Architecture and Operation
As mentioned earlier, the hardware system addressed here is a highly reliable
unibus multiprocessor. The general structure of such a system is shown in Figure 1. It
consists of four major components: processing clusters, input/output links, a time-shared
system bus, and system memory. A description of each of these is given below, as well
as their assumed interdependencies.
A processing cluster is an entity that is capable of operating on one task at a time.
It consists of one or more pairs of a processing unit and its local memory. The degree of
redundancy is considered immaterial to the performance of the cluster for a given task.
The redundancy does, however, have a significant impact on the reliability and confi-
guration aspects of system operation. What is important is that regardless of how many
pairs there are in a cluster, they all work together on a single task. For example, a clus-
ter may represent a triple modular redundant (TMR) system of three processing units
and their local memories. It is also assumed that all the clusters in the system are of the
same type, i.e., they all contain the same number of processor-memory pairs.
An input/output link is a component that enables data to be transmitted to or from
an external device. These allow the system to read data from sensors and transmit data
to actuators and displays. These links are also the channels used for human interface
through terminals or other similar devices.
Figure 1. System Architecture (PC = processing cluster, LM = local memory, I/O = input/output link)
The time-shared system bus interconnects all the processing clusters, I/O links, and
system memory. It is the medium for exchanging all data and control signals. Again,
this bus may be redundant for reliability reasons, but only one cluster transmits and
receives data over all copies of the bus at a time. Therefore, the redundant system bus
logically acts as a unibus. A cluster communicating over the bus is said to control the
bus.
Finally, there exists a single system memory that is addressable over the system
bus. This memory usually consists of a collection of dynamic RAMs. The system
memory may be redundant with the restriction that only one system memory location
may be addressed at a time.
The basic operating principles of this multiprocessor system can be explained as fol-
lows. All tasks to be executed by the system are stored in system memory. These tasks
can be divided into n job classes, where a job class consists of tasks that are required to
repeatedly execute at the same relative frequency. More specifically, tasks of job class i
are executed every r_i seconds, where 1/r_i is the frequency of initiation of a task of job
class i. There may be more than one job class having the same relative frequency for its
tasks. The set of job classes is a partition of the set of system tasks, where a task is in
one and only one job class.
Each job class is given a priority. This priority is used to determine which process-
ing cluster may use the system bus when there is a contention among clusters for bus
control. A cluster about to work on or currently working on a task from job class i has
priority over another cluster to control the system bus, if the other cluster is about to
work on or is currently working on a task from job class j, where 1 ≤ i < j ≤ n.
Priority of clusters working on tasks of the same job class is determined by a first come
first served (FCFS) policy. Task queues are kept for each job class and these reside in
system memory also.
An idle cluster wishing to process a task from job class i must first gain control of
the system bus. It does this by waiting for inactivity on the bus and proceeds to partici-
pate in a polling sequence. A polling sequence is a decision process to determine which
cluster has the highest priority. This is conveniently done by requiring each of the clus-
ters to transmit their priority number over the system bus and having a voting mechan-
ism determine which cluster has the highest priority. As a result of the polling sequence,
the cluster with the highest priority is given control of the bus.
At this time the cluster reads the task queue for job class i from system memory
and determines the next task to be executed. It then reads in the task and all data
necessary to process that task. This data can be obtained from I/O link reads or more
system memory reads. After obtaining all the information necessary to internally exe-
cute the task, the cluster updates the job queue in system memory and releases the bus.
There are other mechanisms such as counters, queues, and interrupt timers to aid a clus-
ter in determining which job class to request. When a cluster completes a task, it will
again request bus control and transmit its results to the relevant addresses, determine
which job class to work on next, and proceed as before.
At any particular instant, all the clusters could be processing tasks simultaneously
resulting in peak performance. Performance dwindles when a cluster becomes idle wait-
ing for control of the system bus. There is also a penalty in performance, or system
failure, if all the clusters are not able to keep up with the required frequency of task exe-
cution for each job class.
A reasonable question to address is how are the job classes formed? More specifi-
cally, given a system workload, what is the best number of job classes and the distribu-
tion of the tasks among these classes? For a general purpose computing system's work-
load, this is difficult to determine [8]. Some of the main problems in representing the
workload in a general purpose multiprocessor system model are (1) showing the inter-
dependencies among tasks in the workload, (2) the fact that the workload may not be
stationary, i.e., tasks of one type might occur at different rates at different times, (3) the
unlimited number of tasks possible, and (4) the contention for physical components
needed to execute tasks operating concurrently. Providing a model that is able to
represent all these features would be extremely difficult, if not impossible. Fortunately,
when one only considers real-time applications on a unibus system these problems
become relatively easier to address. The workload of a real-time system is usually a
fixed set of tasks that have to be executed in a prescribed order at regular intervals.
This makes determining the physical and logical interdependencies more tractable. It
also implies a stationarity among the relative frequencies of different tasks. Therefore,
natural job classes can be formed and parameterized. However, this still is not an easy
task.
2.3. Stochastic Petri Net Model
In the development of the model, it was first necessary to represent the overall
operation of the system at some level of abstraction that would be amenable to the type
of performance analysis desired. This representation is needed to depict the various
states a processing cluster might be in. The features that have a significant effect on
performance are system bus contention, transmission delays, and possible idle periods of
a processing cluster. By modeling at the system level, where the components of concern
are the tasks, clusters, and system bus, we are able to describe the stages a processing
cluster will go through and how its actions affect the operation of the other clusters.
A useful tool for showing synchronization among system components is a Stochastic
Petri Net (SPN) [10-11]. Figure 2 is an example of a modified stochastic Petri net which
describes the synchronous actions of the system referred to in this report. This is a modi-
fied SPN because of the presence of the three function blocks F1, F2, and F3.
A SPN is a structure consisting of places, transitions, and directed arcs connecting
transitions and places. A place is usually represented in a net drawing as a circle, while
transitions are shown with bars. Directed arcs connect these places and transitions in a
way that there is no arc going directly from a place to another place, or from a transi-
tion to a transition. Tokens or dot markings in a place represent collectively the state of
the SPN.
A transition will fire when it becomes enabled. A transition is enabled when there
exists at least one token in each input place to the transition. The process of firing a
transition results in one token being removed from each place for each arc entering the
transition, and a single token placed in all of the places that have input arcs emanating
from that transition. A transition may fire instantaneously (such transitions are
represented by solid bars, T1 - T9 in Fig. 2), or have an exponentially distributed random
duration (such transitions are called timed transitions and are represented by hollow
vertical bars, T10 - T21). When an instantaneous transition is enabled, tokens are
immediately removed from input places and sent to output places. When a timed tran-
sition is enabled, there is an exponentially distributed delay before tokens are removed
from input places and immediately sent to output places.
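The firing rule above can be made concrete with a small simulation sketch. The code below is purely illustrative and is not part of the report's model; the representation of a transition as a dictionary with "in", "out", and "rate" fields, and the race among enabled timed transitions, are assumptions made only for this example.

```python
import random

def enabled(marking, transition):
    """A transition is enabled when every input place holds at least one token."""
    return all(marking.get(p, 0) >= 1 for p in transition["in"])

def fire(marking, transition):
    """Remove one token from each input place and add one to each output place."""
    new = dict(marking)
    for p in transition["in"]:
        new[p] -= 1
    for p in transition["out"]:
        new[p] = new.get(p, 0) + 1
    return new

def step(marking, immediate, timed):
    """One simulation step: an enabled instantaneous transition fires at once;
    otherwise the enabled timed transition with the smallest exponential sample
    fires (conflicts among instantaneous transitions are resolved arbitrarily here)."""
    for t in immediate:
        if enabled(marking, t):
            return fire(marking, t)
    races = [(random.expovariate(t["rate"]), t) for t in timed if enabled(marking, t)]
    if not races:
        return marking
    _, winner = min(races, key=lambda r: r[0])
    return fire(marking, winner)
```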
Figure 2. Modified SPN Model
The function blocks in Figure 2 are not defined components of a true SPN. They
are used here to simplify the appearance of the figure. The functions represented by
each block have been expressed as SPNs themselves. They act on the input arcs to the
block and produce tokens at the output arcs. The functions they represent are trivial in
nature, but the SPNs are complex and cloud their simplicity. For example, F3 has been
expressed using 13 places, 11 transitions, and 46 directed arcs.
Figure 2 is an SPN for a three cluster, unibus multiprocessor. How a single cluster
is incorporated in this model will now be explained. The extension from one cluster, to
three, and more will be simple to envision. There are seven places (P1, P2, PT, P8, P13,
P16, and P21), three instantaneous transitions (T1, T4, and T5), and four timed transi-
tions (T10, T11, T16, and T21) necessary to describe the operations of a single cluster.
What they represent is described in Table 1 and Table 2.
The F1 (Poll) function block is activated whenever there is a token present in
places 8, 10, or 12, i.e., when there is a poll request. It performs the action of removing
these tokens if they are present and deciding which of the requesting clusters should
obtain control of the bus. On output one token will be placed in either place 13, 14, or
15, depending on which cluster has gained control of the bus. There will also be tokens
placed in places 7, 9, and 11, if that cluster has lost the poll sequence. For example, sup-
pose cluster 1 and cluster 2 both initiate a poll sequence and cluster 1 is to succeed in
the poll. Initially, there would be a token in places 8 and 10. This indicates that cluster
1 (place 8) and cluster 2 (place 10) wish to initiate a poll sequence. The Poll function
would remove these tokens, and after a delay representing the time it takes to perform a
poll, will place a token in place 13 (cluster 1 has succeeded) and one in place 9 (cluster 2
has lost the poll).
Place   A token in this place means that ...
P1      a system bus request has been made by the cluster.
P2      the system bus is free as seen by the cluster.
P7      the cluster has lost a poll sequence.
P8      the cluster is initiating a poll sequence.
P13     the cluster has succeeded in a poll sequence and has been granted
        bus control.
P16     the cluster has completed its bus transactions and is to become idle.
P21     the cluster is ready to begin processing a task.

Table 1. Place Descriptions
Transition   The firing of this transition represents ...
T1           the cluster determining that the bus is busy.
T4           the cluster acknowledging that it has lost a poll sequence
             and must wait to make another request for the system bus.
T5           the cluster initiating a poll sequence.
T10          the cluster transmitting on the system bus.
T11          the cluster transmitting on the system bus.
T16          the cluster remaining in an idle state.
T21          the cluster internally executing a task.

Table 2. Transition Descriptions
Functions F2 (Bus Release) and F3 (Disable) act to indicate that the bus has
become free or is busy. Function F2 acts by keeping track of which clusters are in a poll
sequence, thus transmitting on the system bus, or which are communicating over the
system bus. When all activity is completed by all the relevant clusters, the F2 function
will indicate that the bus is free by placing a token in places 2, 4, and 6. Function F3
acts to disable other clusters from initiating a poll sequence if the bus is currently busy.
Therefore, when a poll request is made the F3 function will determine which of the clus-
ters should be disabled.
Figure 2 completely describes the system we are interested in. One is able to follow
the actions of a single processing cluster and observe the effects of these actions on the
rest of the system. The model serves the purpose of enabling us to see which actions of
a computing cluster have the greatest effect on system performance. For example, by
supplying transition rates for the timed transitions, one could determine how often a bus
request is made. Combining this with information on the duration of a typical transmis-
sion will give us an idea of how often the bus is busy. With this result, it can be intui-
tively stated that the higher the bus request frequency is, the greater the possibility of
bus contention.
It can be observed that this model, were it completely expressed with valid SPN
components, would be cumbersome and confusing. Molloy [10] has shown that SPNs are
isomorphic to continuous parameter Markov chains. An SPN can be converted to a
Markov chain and completely analyzed. One drawback of this method is that the state
space for such a Markov chain is large. It is unmanageably large for the example of Fig-
ure 2 (keep in mind that the SPN for a function block is larger than the rest of the
model shown). Therefore, it is obvious that using this model directly as a tool for per-
formance evaluation of the system of interest is inappropriate. A simpler model has to
be derived that expresses the same relationships. Such a model is introduced in the next
section.
2.4. Queueing Model Description
The model presented in this section is designed to represent the states of a unibus
multiprocessor system. The state of the system is defined by a combination of the states
of all the processing clusters. With the aid of the model outlined in the previous section,
the states that were determined to be relevant to system performance are when a pro-
cessing cluster is (1) competing in a poll sequence, (2) communicating on the system bus,
(3) processing a task from job class i, or (4) idle, i.e., not processing a task. The rela-
tionship between these states of a processing cluster and the relationship between clus-
ters can be inferred from Figure 2.
These relationships are incorporated into the closed queueing network shown in
Figure 3. This model has a number of advantages over the SPN model, besides the obvi-
ous one of being simpler to understand. First, it reduces all the actions of bus contention
and the polling sequence into a single non-preemptive priority queue. A non-preemptive
priority queue is one where each of the arriving customers has an associated priority. A
customer entering the queue will move ahead of all the customers in the queue that have
lower priorities, and behind those of equal or higher priority. In this manner, customers
of the highest priority in the queue are served first on a FCFS basis. The second advan-
tage is that the separate job classes can be explicitly parameterized in this model,
whereas in the SPN model they were all grouped together. Third and most importantly,
this model can be easily solved for a given set of parameters.
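As a side note, the queue discipline just described (priority between classes, FCFS within a class, no preemption) is easy to express in code. The sketch below is only an illustration of the discipline, not anything taken from the FTMP software; the class and method names are invented for the example.

```python
import heapq
from itertools import count

class NonPreemptivePriorityQueue:
    """FCFS within a class; a lower class number means a higher priority.
    A customer in service is never preempted: the server simply takes the
    highest-priority waiting customer whenever it becomes free."""

    def __init__(self):
        self._heap = []
        self._seq = count()   # arrival counter preserves FCFS order within a class

    def arrive(self, job_class, customer):
        heapq.heappush(self._heap, (job_class, next(self._seq), customer))

    def next_customer(self):
        if not self._heap:
            return None
        _, _, customer = heapq.heappop(self._heap)
        return customer

# A cluster holding a class-1 task moves ahead of one holding a class-2 task.
q = NonPreemptivePriorityQueue()
q.arrive(2, "cluster A")
q.arrive(1, "cluster B")
print(q.next_customer())   # -> cluster B
```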
Figure 3. Queueing Model
Before describing the details of this queueing model, it should first be noted that
the parameters and node representations of this model differ from those of conventional
queueing models. Typically, the nodes of a queueing model represent servers of some
type, e.g., processors, workers, etc. The associated parameter for each node usually
describes the exponential service rate for the server. The tokens or markings moving
about the model represent customers that desire service, e.g., programs, jobs, etc. The
actions of a closed queueing model can be described as a token arriving at a node, wait-
ing, if necessary, a certain length of time for service, being served for a length of time,
and moving on to the next node. The model described here reverses the conventional
meanings of node and token. In this model, a node represents a customer that needs ser-
vice, and the associated exponential service rate describes how long it takes to complete
that service. The tokens on the other hand represent servers, where all the servers are
identical. Therefore, this model represents servers moving from customer to customer
and performing the service requested by that customer. This unorthodox convention is
used because (1) it simplifies the model, and (2) it explicitly shows the state the system
is in by showing what state each processing cluster is in.
It is the goal to determine the steady state probabilities for the distribution of clus-
ters among the different states.³ Since it is safe to assume that the system will reach
steady state before a cluster fails, the number of clusters remains constant in the
analysis. Typical values for the mean time between failures (MTBF) are on the order of
10³-10⁴ hours, whereas steady state can be reached in a matter of minutes at most.
³ This will be shown in Section 2.5.
Once steady state is reached, a cluster may fail. At that point we have a system
with one less cluster, and it is reasonable to assume that this system will reach steady
state before another failure occurs. The performance of this degraded system will be less
than that of the previous system. To obtain the overall performance of the system
operating over a certain length of time, the performance contributions of each of the
configurations are combined, weighted by their relative time of operation. Therefore, in
the following analysis, we will assume that no cluster fails and that the number of clus-
ters remains constant.
In this model m equals the total number of homogeneous clusters in the system.
Since the number of tokens in this closed queueing model remains constant, it is justifi-
able to have each token represent a cluster. Therefore, there are exactly m tokens
present in the system at all times. The nodes represent the activities that are performed
by a cluster, e.g., a cluster is in the idle state if it is idle.
There are n + 2 nodes in this model. Again, n is the number of different job
classes in the workload. As stated before, tasks that belong to the same job class are
assumed to each have the same distribution of internal processing time. It is assumed
that this processing time is an exponentially distributed random variable. The number
of tasks in a job class has to be greater than or equal to one. Each of these job classes is
given a priority level, where all tasks of the same job class have the same priority and a
task from class i has priority over a task of class j when 1 ≤ i < j ≤ n.
Each of the nodes will be described below.
NODE 1 : This node represents the transmission activity over the system bus. It con-
sists of a non-preemptive priority queue and a transmission server. A token
at this node represents a cluster that is either waiting to transmit on the sys-
tem bus or currently transmitting. The parameter μ_S describes the
transmission rate of a cluster, i.e., 1/μ_S is the average transmission duration.
A non-preemptive priority queue is used to show that a cluster that has just
completed a task from class i is given priority to transmit over a cluster that
has completed a task of class j, where again 1 ≤ i < j ≤ n. Clusters
completing tasks of the same class are able to transmit on a FCFS basis.
NODE 2 : A token at this node represents a cluster that is idle, i.e., performing no
useful computations. It is a multiserver node with m servers. A node of this
type is used to indicate that all the clusters may be served at this node with
no queue forming. This is equivalent to saying that all the clusters may be
idle at the same time. The sojourn time in this idle state for a cluster is
assumed to be exponentially distributed with rate μ_I. The rate at which clus-
ters leave this node is kμ_I, where k is the number of tokens being served by
the node.
NODES 3 through n+2 : These n nodes represent the different job classes. Node
i+2 represents a processing activity on a task of class i. Again, as with
node 2, these are multiserver nodes with m servers. Thus, no queue forms at
any of the nodes. This type of node is used to indicate that all the clusters
could be working on tasks from the same job class. The parameter μ_i is the
rate describing the processing duration of a task of class i. Typically, μ_i ≥
μ_j when i < j. The rate at which clusters leave node i+2 is kμ_i,
where k is the number of tokens being served by the particular node.
The final parameters in the model that need explanation are the branch probabili-
ties. When a cluster completes a transmission, it either drops into the idle state or con-
tinues processing. The probability that the next state is the idle state is P_I and simi-
larly, the probability that the next state is a processing state is P_P. Obviously,
P_I + P_P = 1. When a processor is to enter a processing state, there is the probability P_i
of it being the processing of a job of class i, where Σ_{i=1}^{n} P_i = 1. Typically, P_i > P_j
when i < j.
2.5. Solutions to the Queueing Model
The common approach to solving for the steady state probabilities of a queueing
model is to convert the model to that of a continuous parameter Markov chain [12]. This
approach will be used to solve the queueing model presented here. For the construction
of a Markov chain, we make the following definitions.
Definition 1 : A cluster state is a pair (c_i, n_i), where c_i ∈ {1,2,...,m} is a number label-
ing a particular processing cluster, and n_i ∈ {1,2,...,n+2} is the number of
the node where the token representing the cluster is located. There are
m·(n+2) cluster states.
Definition 2 : A system state is an m-tuple (s_1, s_2, ..., s_m) ∈ S_1 × S_2 × ... × S_m,
where S_i is the set of all cluster states whose first component is c_i. There
are (n+2)^m system states.
An example of a system state for a system with three clusters and three job classes
is ((1,1),(2,3),(3,1)). This represents the configuration when clusters 1 and 3 are waiting
to communicate on the system bus or are currently communicating, and cluster 2 is pro-
cessing a task from job class 1.
From an analysis standpoint, a system state contains more information than is
necessary. We are only concerned with how many clusters there are at a particular
node. We do not need to know which they are, because they all require the same
amount of time to process the task at a particular node. It is the number of clusters
that determines how fast tasks are completed or delayed at a node. This motivates the
following definition.
Definition 3 : A reduced system state is the (n+2)-tuple (a_1, a_2, ..., a_{n+2}), where a_i ∈
{0,1,...,m} is the total number of tokens representing clusters at node i.
There are C(m+n+1, n+1) reduced system states.
We can define a formal mapping, Φ, from a system state to a reduced system state
as follows: Φ(s_1, s_2, ..., s_m) = (a_1, a_2, ..., a_{n+2}), where a_i is the number of s_j's
(j = 1,...,m) whose second component is i. Referring to the example above, we note that
the system state ((1,1),(2,3),(3,1)) is represented by the reduced system state (2,0,1,0,0).
It should also be noted that the system states ((1,1),(2,3),(3,1)), ((1,1),(2,1),(3,3)), and
((1,3),(2,1),(3,1)) are all represented by the same reduced system state.
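For the example of three clusters and three job classes, the mapping Φ and the count of reduced system states are easy to check numerically. The following sketch is only an illustrative aid; the function names are ours, not the report's.

```python
from collections import Counter
from itertools import combinations_with_replacement

def reduce_state(system_state, n_nodes):
    """Phi: count how many clusters sit at each node (nodes are numbered 1..n_nodes)."""
    at = Counter(node for _cluster, node in system_state)
    return tuple(at.get(node, 0) for node in range(1, n_nodes + 1))

def reduced_states(m, n_nodes):
    """All distinct ways of distributing m indistinguishable clusters over the nodes."""
    seen = set()
    for placement in combinations_with_replacement(range(1, n_nodes + 1), m):
        at = Counter(placement)
        seen.add(tuple(at.get(node, 0) for node in range(1, n_nodes + 1)))
    return sorted(seen, reverse=True)

print(reduce_state(((1, 1), (2, 3), (3, 1)), 5))   # -> (2, 0, 1, 0, 0), as in the text
print(len(reduced_states(3, 5)))                   # -> 35 reduced system states
```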
We use the reduced system states as the states of the Markov chain. The transi-
tions between these states are defined by the relevant service rates of each of the nodes in
the closed queueing network. It has been stated by Kleinrock [13] that a closed queueing
model of this type with K customers and N nodes has J = C(N+K-1, N-1) states in its
Markov chain representation. For our model, we have m customers (clusters) and n+2
nodes. From this Markov chain, a J × J transition rate matrix, A, can be formed and
used to derive the steady state probabilities for each state in the Markov chain. This
involves solving the matrix equation A·x = 0, where x = (x_1, x_2, ..., x_J)^T and x_i
represents the steady state probability of the system being in state i. A nontrivial solu-
tion results when the constraint Σ_{i=1}^{J} x_i = 1 is considered. The existence of such a solu-
tion is based on the fact that we have constructed a finite state, irreducible, and
recurrent Markov chain.⁴ Since it is possible for a token in the queueing model to move
from one node to any other node, either directly or through some intermediate nodes,
and there is a non-zero probability that a token leaving a node will return to that node,
the Markov chain is indeed irreducible and recurrent.
⁴ A unique steady state solution exists for this type of Markov chain.
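Numerically, the steady state vector can be obtained with any linear-algebra routine once the rate matrix has been built. The report does not give a program for this step; the sketch below is a generic illustration of solving A·x = 0 together with the normalization constraint, assuming the generator matrix (called A in the text, Q in the sketch) has been built elsewhere from the service rates and branch probabilities. For the FTMP example of Subsection 2.7 it would be 35 × 35.

```python
import numpy as np

def steady_state(Q):
    """Steady-state probabilities of an irreducible continuous-time Markov chain.

    Q is the J x J transition rate matrix (generator): Q[i, j] is the rate from
    state i to state j for i != j, and each row sums to zero.  One balance
    equation is replaced by the normalization constraint sum(x) = 1.
    """
    J = Q.shape[0]
    A = Q.T.copy()
    A[-1, :] = 1.0
    b = np.zeros(J)
    b[-1] = 1.0
    return np.linalg.solve(A, b)
```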
Once the steady state probabilities are determined, two useful results concerning
the multiprocessor system can be quickly obtained. One is the probability that a cluster
is idle. This is simply the sum of the probabilities for each of the Markov chain states
that represent having one or more clusters at node 2. The other result is the amount of
system bus contention. When there is more than one cluster at node 1, there is a cluster
waiting to obtain bus control. Again, all that has to be done is to sum the probabilities
for each of the Markov chain states that represent having more than one cluster at node
1. Recall that node 1 includes both the priority queue and the transmission server.
These two results are necessary to produce a performance measure of any type.
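In code, both quantities are simple sums over the steady state vector. The sketch below assumes the reduced states are kept as tuples (a_1, ..., a_{n+2}) in the same order as the probabilities returned above; it is an illustration, not the report's program.

```python
def idle_and_contention(states, x):
    """states[k] is the reduced state of Markov state k; x[k] is its probability."""
    p_idle = sum(p for s, p in zip(states, x) if s[1] >= 1)   # >= 1 cluster at node 2 (idle)
    p_cont = sum(p for s, p in zip(states, x) if s[0] >= 2)   # > 1 cluster at node 1 (bus)
    return p_idle, p_cont
```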
A third result can also be easily obtained. It would be interesting to know how
long a cluster would have to wait, on the average, if there is contention for the system
bus. It has been shown by a number of authors that the averagequeueingtime for cus-
tomers of a given priority class in a non-preemptivepriority queue can be determined
[14-16]. The averagequeueing time for a customerof prior'ty class i is
k
J E
2 i=1Wi----
[1-i=,/_ i ] [ 1-i___,_- ]
where
k = the numberof priority classes.
mj" _ the probability that an arriving customer is of class j.
4A unique steady state solution exists for this type of Markov chain.
23
Pi _ the mean service rate of a customer of class j.
cy _ the secondmomentof theservice-timedistributionforcustomersof
classi. k
The meanqueueingtimeof all customersis We -= _ cryIVy.
y--1
For the model described here, all clusters requesting service at node 1 require the
same amount of service time. Therefore, for this example we have k = n, μ_j = μ_S for
all j, and c_j = 2/μ_S² for all j. We then arrive at the average queueing time for a cluster
about to work on a task from job class i, W_i:

    W_i = ( Σ_{j=1}^{n} a_j ) / ( [μ_S - Σ_{j=1}^{i-1} a_j] [μ_S - Σ_{j=1}^{i} a_j] )

It should be noted that W_i is the average queueing time only. The total time a custo-
mer spends at node 1 is the sum of the queueing time and the service time.
The only difficult part of deriving W_i is determining the values of each of the a_i's.
To do this, let p(s) equal the steady state probability of being in state s of the Markov
chain. Let S_{ij} be the set of states of the Markov chain representing j clusters at node
i. The rest of the clusters, if any, may be at any of the remaining nodes. Then

    a_i = μ_i Σ_{j=1}^{m} j Σ_{s ∈ S_{i+2,j}} p(s),

i.e., a_i is the steady-state rate at which clusters complete class-i tasks and request the bus.
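A small numerical sketch of this last step follows. It reuses the steady state vector and the list of reduced states from the earlier sketches, applies the waiting-time formula as reconstructed above, and is only illustrative (the function names are ours).

```python
def class_request_rates(mu, states, x, n):
    """a_i: rate at which class-i tasks complete and the cluster requests the bus.

    mu[i-1] is the class-i service rate; states[k] = (a_1, ..., a_{n+2}) and x[k]
    is its steady state probability, so s[i+1] is the number of clusters at node i+2.
    """
    return [mu[i - 1] * sum(p * s[i + 1] for s, p in zip(states, x))
            for i in range(1, n + 1)]

def mean_wait(mu_s, a):
    """Non-preemptive priority waiting time W_i at node 1, all classes served at rate mu_s."""
    total = sum(a)
    waits = []
    for i in range(1, len(a) + 1):
        higher = sum(a[:i - 1])      # load from classes 1 .. i-1
        including = sum(a[:i])       # ... plus class i itself
        waits.append(total / ((mu_s - higher) * (mu_s - including)))
    return waits
```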
2.6. Description of Experimental System: FTMP
FTMP is a highly reliable multiprocessor installed in the AIRLAB at NASA Lang-
ley Research Center. This machine is intended to be used for real-time control of com-
mercial aircraft of the next decade. Because of the disastrous effects that could occur if
this computer should fail while in use, NASA has determined that the probability that
this system could fail should be less than 10⁻⁹ for a 10 hour flight. This obviously calls
for extremely rigid performance criteria.
The hardware structure of FTMP consists of ten identical Line Replaceable Units
(LRU's) [17]. Each LRU includes a processor module which contains local cache memory,
a shared 16K word memory, a 1553 I/O port, two bus guardian units, a clock generator,
and a power subsystem. Any three processors can be grouped together into a triad. The
processor remaining after forming three triads is reserved as a spare processor. Ten
memory modules are also formed into three triads and a spare. Communications
between processors and the shared memory are accomplished through serial system
buses: that is, a data transmit bus (T-bus), a data receive bus (R-bus), and a polling bus
(P-bus) for resolving bus contention. The system buses are also arranged as triads by
activating three out of five. Therefore, from the programmer's viewpoint, there is only
one system bus.
System configurations are controlled by bus guardians which assign the connections
between processors and the P-bus or T-bus, and between shared memory and the R-bus.
Two bus guardians at each LRU form a dyad such that any transmission to system
buses will be enabled only when both guardians agree. The bus guardians are also used
as a voter for any processor or memory triad. Since three processors in one triad are
operating in tight synchrony, their respective bus guardians should receive three identi-
cal data under a fault-free condition. When there is a disagreement, an error is con-
sidered to have occurred, but masked, and the task execution will continue. Meanwhile,
the disagreement will be recorded at an error latch for later identification of the faulty
module or bus. From the user's or software's standpoint, the FTMP is regarded as a
three processor system and has a shared 48K system memory among the three as shown
in Figure 4. The interested reader is referred to [17] for a complete architectural descrip-
tion of FTMP. What has been stated here is sufficient for the present discussion.

Figure 4. A block diagram of FTMP (from [15]).
The software of FTMP is divided into five groups. They are the Executive
Software, Facilities Software, Acceptance Test/Diagnostic Software, Applications
Software, and Support Software [18]. Most of the tasks in each of these software groups
have to be dispatched at regular intervals to handle repetitive applications such as flight
control, configuration control, fault detection, recovery, as well as system displays. To
do this, FTMP has a dispatch algorithm that initiates tasks at their required frequencies.
Taking into account the type of application, the FTMP developers determined that
tasks had to be executed at three different frequencies, and the type of action performed
by the task determined which rate group the task belonged in. They termed the three
rate groups R1, R3, and R4. Their respective nominal frequencies are 3.125, 12.5, and 25
Hz. Tasks required to execute at a particular frequency are given priority to access sys-
tem components over tasks that are initiated at lower frequencies. This implies that
tasks in the R4 rate group have priority for bus access over tasks from rate group R3,
etc.
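As a small illustration of how these rate groups map onto the job classes of the queueing model, the nominal frequencies above translate into fixed dispatch periods and a priority order. The code is ours, not part of the FTMP software.

```python
# Nominal FTMP rate groups (frequencies in Hz, from the text); R4 has the
# highest priority and therefore corresponds to job class 1 in the model.
RATE_GROUPS = {"R4": 25.0, "R3": 12.5, "R1": 3.125}

def dispatch_period_ms(group):
    """r_i = 1 / frequency, expressed in milliseconds."""
    return 1000.0 / RATE_GROUPS[group]

for job_class, group in enumerate(("R4", "R3", "R1"), start=1):
    print(f"job class {job_class}: {group}, period {dispatch_period_ms(group):.0f} ms")
# job class 1: R4, 40 ms;  job class 2: R3, 80 ms;  job class 3: R1, 320 ms
```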
Fault detection, identification, and system reconfiguration are handled by an execu-
tive program called the System Configuration Controller (SCC) which is dispatched at
the slowest rate R1. This is done so the execution of the SCC will have a minimal effect
on the system workload, and the errors generated by a single fault will have an
appropriate system response. For experimental purposes, there are two application tasks
installed on the FTMP: auto-pilot and display programs.
The associated fault injection system is controlled by a host VAX-11/750 computer.
The injection extenders can be inserted into any chips at LRU3 and their respective
socket holes such that the electrical connection between pins and the circuit board
becomes controllable. Thus, three types of faulty signals, i.e., inverted signal, stuck-at-1,
and stuck-at-0, can be injected at the pin level. Before any injection, the host computer
will signal the FTMP to activate LRU3 for the fault injection. The detection, identifica-
tion and reconfiguration intervals are measured by reading a real-time clock and the
responses from the FTMP. This information is then transferred from the FTMP to the
VAX-11/750 via a 1553 I/O port and a communication interface. Fault injection opera-
tions are processed by the FIS (Fault Injection System) on the VAX-11/750. The FIS
consists of a command interpreter, an injection handler, and an FTMP-VAX interface
program.
Recently, we have conducted some experiments on FTMP to measure some factors
relating to bus contention, and the polling sequence. The results are summarized in
Table 3. These results pertain to the fault free system with three operating triads. As
can be seen, with the software presently on the system, there is a large amount of bus
contention. Although a triad usually succeeds in its first poll sequence, it must wait
47% of the time for the bus to become free. However, it was noticed in performing the
measurements that the bus was usually busy for only a very short period. The busy
period was of a significant duration in only a few instances. It is also interesting to note
that the duration of a bus transaction is one quarter the time between bus requests.
This is probablywhy the bus is busy so often when a bus request is made.
P(Bus is busy when a bus request is made)          = 0.47
P(Bus is free when a bus request is made)          = 0.53
P(Succeed in first poll sequence)                  = 0.92
P(Lose first poll sequence)                        = 0.08
P(Succeed in second poll sequence)                 = 1.00
Ave. idle time waiting for free bus, if lost
  poll sequence                                    = 32.2 μs
Ave. idle time waiting for free bus, if busy
  when request was made                            = 21.0 μs
Ave. duration of bus transaction                   = 36.4 μs
Ave. time between bus requests                     = 140.9 μs

Table 3. Experimental Measurements

2.7. Queueing Model Representation of FTMP

It is obvious that the architecture and software structure of FTMP fit nicely into
our queueing model. One can represent the three triads as clusters, and each of the rate
groups as a job class. Job class 1 is rate group R4, because of the relative priorities of
the rate groups and job classes. Likewise, job class 2 is R3, and job class 3 is R1. There
is some dependence when tasks from a rate group are executed based on the state of
tasks of a higher priority rate group. However, these can be handled by the model by
increasing the number of job classes. For the purpose of illustration, these dependencies
are assumed to be negligible. In the queueing model representation of FTMP we, there-
fore, have five nodes and three tokens representing clusters, i.e., n = 3 and m = 3. By the
formula mentioned earlier, the Markov chain representation of this specific model has
J = C(7, 4) = 35 states. These states and their respective reduced system states are
described in Table 4.
To solve the Markov chain, the values for the parameters of the queueing model
have to be determined. Sample values are outlined in Table 5. The value for μ_S was
obtained from the experimental data. P_P was arrived at from the documentation on
FTMP [18]. The other parameters were arrived at through reasonable assumptions, or
realistic relations among the service rates. The computed steady state probabilities for
the states of the Markov chain using these parameter values are shown in column 3 of
Table 4. Columns 4, 5, and 6 of Table 4 are the steady state probabilities when the
parameter μ_S is varied, and the rest of the parameters remain constant.
Using the information supplied by Table 4, some simple results can be stated. The
probability that there is an idle cluster is the sum of the steady state probabilities for
the Markov states where there are one or more clusters at node 2. These are states 2,
6, 7, 8, 9, and 16 thru 25. The idle probabilities for the different values of μ_S are shown in
Table 6. These numbers are extremely low, implying that rarely is a triad idle. The
probability that there is bus contention is the sum of the steady state probabilities of
states 1 thru 5 (states representing more than one cluster at node 1). These results are
also shown in Table 6.
                                    Computed Steady State Prob.
Markov   Reduced System     μ_S =      μ_S =       μ_S =      μ_S =
State    State              0.0275     0.00275     0.0138     0.055
  1      (3, 0, 0, 0, 0)    0.022      0.589       0.096      0.004
  2      (2, 1, 0, 0, 0)    0          0.002       0.001      0
  3      (2, 0, 1, 0, 0)    0.039      0.106       0.087      0.013
  4      (2, 0, 0, 1, 0)    0.039      0.106       0.087      0.013
  5      (2, 0, 0, 0, 1)    0.037      0.099       0.081      0.013
  6      (1, 2, 0, 0, 0)    0          0           0          0
  7      (1, 1, 1, 0, 0)    0.001      0           0.001      0.001
  8      (1, 1, 0, 1, 0)    0.001      0           0.001      0.001
  9      (1, 1, 0, 0, 1)    0.001      0           0.001      0.001
 10      (1, 0, 2, 0, 0)    0.036      0.010       0.039      0.024
 11      (1, 0, 1, 1, 0)    0.071      0.019       0.079      0.048
 12      (1, 0, 1, 0, 1)    0.067      0.018       0.074      0.045
 13      (1, 0, 0, 2, 0)    0.036      0.010       0.039      0.024
 14      (1, 0, 0, 1, 1)    0.068      0.018       0.073      0.045
 15      (1, 0, 0, 0, 2)    0.031      0.008       0.034      0.021
 16      (0, 3, 0, 0, 0)    0          0           0          0
 17      (0, 2, 1, 0, 0)    0          0           0          0
 18      (0, 2, 0, 1, 0)    0          0           0          0
 19      (0, 2, 0, 0, 1)    0          0           0          0
 20      (0, 1, 2, 0, 0)    0.001      0           0.001      0.001
 21      (0, 1, 1, 1, 0)    0.002      0           0.001      0.003
 22      (0, 1, 1, 0, 1)    0.002      0           0.001      0.003
 23      (0, 1, 0, 2, 0)    0.001      0           0.001      0.001
 24      (0, 1, 0, 1, 1)    0.002      0           0.001      0.003
 25      (0, 1, 0, 0, 2)    0.001      0           0.001      0.001
 26      (0, 0, 3, 0, 0)    0.021      0.001       0.012      0.029
 27      (0, 0, 2, 1, 0)    0.064      0.002       0.035      0.087
 28      (0, 0, 2, 0, 1)    0.060      0.002       0.033      0.081
 29      (0, 0, 1, 2, 0)    0.064      0.002       0.035      0.087
 30      (0, 0, 1, 1, 1)    0.120      0.003       0.080      0.163
 31      (0, 0, 1, 0, 2)    0.056      0.002       0.031      0.076
 32      (0, 0, 0, 3, 0)    0.021      0.001       0.012      0.029
 33      (0, 0, 0, 2, 1)    0.060      0.002       0.033      0.081
 34      (0, 0, 0, 1, 2)    0.056      0.002       0.031      0.076
 35      (0, 0, 0, 0, 3)    0.018      0           0.010      0.024

Table 4. Markov State Descriptions and Steady State Probabilities
μ_S = 1/36.35 = 0.0275          P_P = 0.95
μ_I = 0.0458                    P_I = 0.05
μ_1 = 9.17 × 10⁻³               P_1 = 0.6
μ_2 = 4.58 × 10⁻³               P_2 = 0.3
μ_3 = 1.63 × 10⁻³               P_3 = 0.1

Table 5. Parameter Values
μ_S        Idle Prob.   Cont. Prob.
0.00275 0.002 0.902
0.0138 0.010 0.352
0.0275 0.012 0.139
0.055 0.015 0.043
Table 6. Idle Processors and Bus Contention Probabilities
Figure 5 shows the effect of changing the service rate at node 1
on bus contention. The probability of bus contention increases dramatically as the ser-
vice rate approaches zero, as expected. Figures 6 - 8 show the effects of varying μ_1, μ_I,
and P_I, respectively. One could derive from these graphs the sensitivity of performance
to a change in a service rate or branch probability.

Figure 5. Probability of Bus Contention vs. Bus Service Rate (μ_S)
Figure 6. Probability of Bus Contention vs. Service Rate of Job Class 1 (μ_1)
Figure 7. Probability of an Idle Cluster vs. Idle Service Rate (μ_I)
Figure 8. Probability of an Idle Cluster vs. P_I
There are numerous other conclusions that could be drawn from the results of this
example. The sensitivities of varying other parameters, average queueing times, and
degraded system performance are just a few. It can be seen that this model is useful in
analyzing many of the aspects that are vital to any performance evaluation. It is impor-
tant to note that all the parameters of the queueing model are ones that can be meas-
ured.
3. MEASUREMENT OF FAULT LATENCY
3.1. Introduction
A hardware fault is defined as an incorrect state caused by the physical change in a
component, whereas an error is defined to be the erroneous information/data resulting
from the manifestation of a fault. Even after a hardware fault occurs in a computer sys-
tem, the system will remain error-free until the fault manifests itself. Before its manifes-
tation, the fault is latent and is not harmful to any system operations. Thus, there are
two time intervals of interest between fault occurrence and error detection: fault latency
and error latency (see [21] for a detailed description of these). Obviously, error latency
depends on the detection mechanisms⁵ used. Fault latency is dependent on the location
and the type of the fault, and the degree of usage of the faulty unit. In other words,
fault latency is closely related to the physical property of a fault, whereas error latency
represents the efficiency of the detection mechanisms used.
⁵ Which we termed the function-level detection mechanisms in [21].
In a reliable computer system, the detection and isolation of faults and errors, and
the subsequent reconfiguration are provided to tolerate faults and errors. These steps
must be executed correctly by fault-free subsystems. In the face of multiple faults, the
fault-tolerance capability is reduced and the coverage of failure is incomplete. It has
been shown that an incomplete coverage is the major threat to a highly reliable system
[22-24]. Thus, the accumulation of latent faults and the near-coincident occurrence of
faults should be considered in the modeling and verification of a reliable system. How-
ever, the conventional modeling of a reliable system usually assumes that the system is
recovered from an extant fault if no new fault occurs during the recovery period; other-
wise, a coverage failure results. This is true only when there is no fault latency or a
negligible fault latency during which no new fault occurs. That is, the conventional
works have ignored the possibility of the accumulation of latent faults. Obviously, the
conventional approach becomes invalid if fault latency has the same order of magnitude
as the recovery period. Due to the reasons discussed above, it is essential to accurately
evaluate both fault and error latencies.
In addition to the analysis of the coverage failure, the knowledge of fault latency is
important to the study of transient faults. Clearly, a transient fault manifests itself only
when its active duration is greater than fault latency. If fault latency is long, it is possi-
ble that most transient faults will disappear before they harm the system. In such a case,
the transient faults captured by some detection mechanisms cannot represent the true
characteristics of all transient faults.
In the past, several researchers conducted experiments and simulations to investi-
gate faults' manifestations and subsequent error detections by injecting hardware faults
[25-34]. Results were observed through the detection mechanisms following the fault
injections. They measured the probability of detection and the distribution of detection
times which are the sums of fault and error latencies. Since there does not exist a direct
way to determine the moment of error generation, these experiments fail to indicate the
moment of error generation which divides the detection time into fault latency and error
latency. Instead, a combined effect of the inherent fault property and an associated
detection operation can be observed via these experiments. Thus, these experiments nei-
ther help us understand the behavior of fault and error generation, nor give an accurate
measure of the capabilities of detection mechanisms. In order to remove this inade-
quacy, we develop here a methodology to measure fault latency; with the measured fault
latency and detection time, error latency can also be computed.
3.2. Methodology for Measurement of Fault Latency
Suppose there are some detection mechanisms which are able to detect the error generated by a fault f. Let t_f represent the fault latency of this specific fault, which is a random variable with the distribution function F_f(t). We inject the fault f n_i times, and each injection is held active for the duration t_i. If t_f is greater than t_i, then no error will be generated. Otherwise, the fault manifests itself, inducing an error which will be captured later by the detection mechanisms. If there are d_i detections among these n_i injections, then the ratio d_i/n_i indicates the probability that an error is generated during the fault active duration t_i. This is equivalent to the probability that the fault latency is smaller than t_i. Thus, we obtain the distribution function of fault latency for the fault f as follows:

$$ F_f(t_i) = \mathrm{Prob}(t_f < t_i) = \frac{d_i}{n_i} \qquad (1) $$
Notice that this measurement of fault latency is not affected by error latency. This also implies that the result of the measurement is independent of the efficiency of the detection mechanisms. Thus, as long as the error induced by the fault f can be detected, we can obtain the distribution of fault latency for the fault f.
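To make Eq. (1) concrete, the following sketch computes the empirical distribution estimates from injection counts. The durations and counts used here are hypothetical illustration data, not values from our FTMP experiments.

    # Empirical fault-latency distribution per Eq. (1): F_f(t_i) ~= d_i / n_i.
    # All numbers below are hypothetical illustration data.

    durations = [0.01, 0.10, 0.50, 1.00, 5.00]   # active durations t_i (ms)
    injections = [40, 40, 30, 30, 20]            # n_i injections at each t_i
    detections = [8, 12, 10, 12, 12]             # d_i detections observed

    for t_i, n_i, d_i in zip(durations, injections, detections):
        F_i = d_i / n_i                          # estimate of Prob(t_f < t_i)
        print(f"F_f({t_i:5.2f} ms) = {F_i:.3f}")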
We cannot overemphasize the fact that the moment of error generation is not directly observable. Although the occurrence of a logic failure caused by a fault can be identified by voting, the logic failure does not always induce an error at the function level. In other words, there may not exist a sensitized path through which the faulty signal can propagate to the output stage. Consequently, we have proposed a new methodology to measure fault latency indirectly. Due to the "indirect" nature of our measurement, we obtain the distribution of fault latency instead of actual samples of fault latency. Clearly, this fact does not allow for any rigorous statistical analysis of our experimental data. However, to the best of our knowledge, the proposed indirect methodology is the first and only attempt to measure fault latency.
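As a sanity check on this indirect scheme, one can simulate it: draw "true" latencies from a known distribution, apply the injection-and-count procedure, and verify that d_i/n_i tracks the true distribution function. A minimal sketch, assuming a hypothetical exponential latency with a 2 ms mean (this distribution and its mean are our assumptions, purely for illustration):

    import math
    import random

    random.seed(1)
    true_mean = 2.0                                  # hypothetical mean latency (ms)
    durations = [0.1, 0.5, 1.0, 5.0, 10.0]           # active durations t_i (ms)
    n = 500                                          # injections per duration

    for t_i in durations:
        # an injection "detects" iff the drawn latency is shorter than t_i
        d = sum(random.expovariate(1.0 / true_mean) < t_i for _ in range(n))
        F_true = 1.0 - math.exp(-t_i / true_mean)    # true Prob(t_f < t_i)
        print(f"t_i={t_i:5.1f} ms  d/n={d / n:.3f}  true F={F_true:.3f}")

With enough injections per duration, the ratio d/n converges to the true distribution function at each t_i, which is exactly the property Eq. (1) relies on.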
3.3. Experimental Results and Analysis on FTMP
For our experiments, the original FIS (Fault Injection System) has been modified to enable us to inject transient faults.6 Additional features are added to the command interpreter such that the active duration of a transient fault can be specified and passed to the injection handler. Injection ends if either the response of the FTMP indicates the accomplishment of detection, identification, and reconfiguration, or the active duration becomes larger than the specified value. In the latter case, FIS is made to wait a few seconds for a possible response from the FTMP.

6. The original FIS is designed for injecting permanent faults only.
To measure fault latency and demonstrate the methodology proposed above, transient faults were injected into four circuit boards of the FTMP, i.e., the CPU Data Path, CPU Control Path, Cache Controller, and System Bus Controller. The first three boards are in the CAPS-6 processor/cache region, which is constructed with the AMD 2900 series bit-slice microprocessors. The System Bus Controller is responsible for transferring blocks of words between a local processor region and the shared memory. It also serves as a synchronizing mechanism such that the processors in a triad can be brought into full synchrony. On each board, several pins are selected for injecting transient faults. The selection of boards and pins is made arbitrarily. For each pin, stuck-at-0, stuck-at-1, and inverted signals are injected.
A prime test was applied to each selected pin to observe whether or not an error is generated after the injection of a permanent fault (which has an active duration of 3 seconds or more). In Wimmergren's experiments on the FTMP [33], undetected faults are reported to exist. Possible explanations for the existence of undetected faults are: (1) the circuits are not exercised, (2) there are "don't care" or redundant pins, and (3) the injected fault does not cause any logic failure. In our experiments, injection of transient faults is not made if there is no detection during the prime test. At certain pins, errors are detected when stuck-at-0 and inverted-signal faults are injected, but not stuck-at-1 faults. In such a case, injection of stuck-at-1 faults is omitted.7

7. Obviously, there is no use for such an injection.

For each pin, transient faults with different active durations are injected 10 to 40 times. In an early experiment, we found that d_i/n_i increases sharply when the transient durations are small. Thus, to obtain good resolution, the active durations of the injected transient faults, denoted by t_i, are not equally spaced. That is, we used a finer resolution for small t_i's and a coarser resolution for large t_i's. Moreover, since the fault latency at the System Bus Controller board is much larger than that at the other boards, the t_i's used for testing this board differ from those used for the others.
Among the more than 20,000 transient faults injected, only 15,111 results are used for the analysis. The other data are regarded as unreliable because: (1) the fault identified by the FTMP was not in the LRU where the fault was actually injected, (2) the FTMP crashed during the fault injection, or (3) one of the detection, identification, and reconfiguration times was negative. If the second case occurred, the injection was performed again. For every i and each type of fault at a pin, using the measured d_i/n_i, we obtained the averaged d_i/n_i = F_f(t_i) for each board, which are listed in Table 7. In addition, we present h_f(t_i) in the table, which is defined as

$$ h_f(t_i) = \frac{F_f(t_{i+1}) - F_f(t_i)}{(t_{i+1} - t_i)\left(1 - F_f(t_i)\right)} \qquad (2) $$

The function h_f(t_i) becomes the hazard rate of fault latency as t_{i+1} - t_i \to 0.
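The following sketch evaluates Eq. (2) on the stuck-at-0 column for the Cache Controller reported in Table 7(a); up to rounding, it reproduces the tabulated h_f(t_i) values (21.0, 1.27, 0.071, ...):

    # Finite-difference hazard estimate per Eq. (2); (t_i, F_f(t_i)) pairs
    # taken from Table 7(a), stuck-at-0 faults on the Cache Controller.

    t = [0.0, 0.01, 0.10, 0.50, 1.00, 5.00, 10.00, 20.00]   # t_i (ms)
    F = [0.0, 0.21, 0.30, 0.32, 0.40, 0.59, 0.74, 0.89]     # F_f(t_i)

    for i in range(len(t) - 1):
        h = (F[i + 1] - F[i]) / ((t[i + 1] - t[i]) * (1.0 - F[i]))
        print(f"h_f({t[i]:5.2f} ms) = {h:.3f}")             # decreasing sequence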
Despite the fact that negative numbers appear twice in Table 7, the functions h_f(t_i) in the table strongly suggest that the hazard rate of fault latency is monotone decreasing. Thus, two distributions with monotone decreasing hazard rates, i.e., the Weibull and Gamma distributions, are used to fit the experimental results. The estimated parameters are given in Table 8, where the least-squares errors are also included. The experimental results and the estimated Weibull distributions are plotted in Figures 9 through 12.
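As an illustration of this least-squares fitting step, the sketch below uses scipy's nonlinear least squares to fit a Weibull distribution function to the Cache Controller stuck-at-0 points of Table 7(a). The scale-shape parameterization and the starting guesses are ours; the report does not state the parameterization behind Table 8, so this sketch need not reproduce the tabulated 1/λ and c values exactly.

    import numpy as np
    from scipy.optimize import curve_fit

    def weibull_cdf(t, scale, shape):
        # F(t) = 1 - exp(-(t/scale)**shape); monotone decreasing hazard for shape < 1
        return 1.0 - np.exp(-(t / scale) ** shape)

    t = np.array([0.01, 0.10, 0.50, 1.00, 5.00, 10.00, 20.00])   # ms
    F = np.array([0.21, 0.30, 0.32, 0.40, 0.59, 0.74, 0.89])     # Table 7(a), s-a-0

    (scale, shape), _ = curve_fit(weibull_cdf, t, F, p0=[1.0, 0.5])
    err = np.sum((weibull_cdf(t, scale, shape) - F) ** 2)        # least-squares error
    print(f"scale={scale:.2f}  shape={shape:.2f}  error={err:.4f}")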
           s-a-0                s-a-1                inverted
t_i (ms)   F_f(t_i)  h_f(t_i)   F_f(t_i)  h_f(t_i)   F_f(t_i)  h_f(t_i)
  0.00     0.0       21.0       0.0       9.0        0.0       35.0
  0.01     0.21      1.27       0.09      0.36       0.35      3.59
  0.10     0.30      0.071      0.12      0.20       0.56      0.40
  0.50     0.32      0.23       0.19      0.074      0.63      0.11
  1.00     0.40      0.079      0.22      0.054      0.65      0.078
  5.00     0.59      0.073      0.39      0.049      0.76      0.025
 10.00     0.74      0.058      0.54      0.028      0.79      0.043
 20.00     0.89      -          0.67      -          0.88      -

(a) Experimental Results and h_f(t_i) on Cache Controller.
           s-a-0                s-a-1
t_i (ms)   F_f(t_i)  h_f(t_i)   F_f(t_i)  h_f(t_i)
  0.00     0.0       67.0       0.0       85.0
  0.01     0.67      9.09       0.85      9.63
  0.10     0.94      1.67       0.98      -3.75
  0.50     0.98      2.00       0.95      1.20
  1.00     1.00      -          0.98      0.11
 10.00     0.98      0.01       1.00      -
 20.00     1.00      -          1.00      -

(b) Experimental Results and h_f(t_i) on CPU Control Path. (The inverted-signal column is empty in the original table.)

Table 7. Experimental Results and Estimated h_f(t_i).
           s-a-0                s-a-1                inverted
t_i (ms)   F_f(t_i)  h_f(t_i)   F_f(t_i)  h_f(t_i)   F_f(t_i)  h_f(t_i)
  0.00     0.0       25.0       0.0       34.0       0.0       49.0
  0.01     0.25      5.67       0.34      2.65       0.49      7.35
  0.05     0.42      1.72       0.41      1.69       0.64      11.7
  0.10     0.47      0.566      0.46      0.42       0.85      0.83
  0.50     0.59      0.097      0.55      0.44       0.90      0.40
  1.00     0.61      0.038      0.65      0.086      0.91      0.28
  5.00     0.67      0.030      0.77      0.052      0.92      0.0
 10.00     0.72      0.025      0.83      0.041      0.92      0.0125
 20.00     0.79      -          0.90      -          0.93      -

(c) Experimental Results and h_f(t_i) on CPU Data Path.
           s-a-0                s-a-1                inverted
t_i (ms)   F_f(t_i)  h_f(t_i)   F_f(t_i)  h_f(t_i)   F_f(t_i)  h_f(t_i)
   0.0     0.0       0.040      0.0       0.032      0.0       0.036
   5.0     0.20      -0.013     0.16      0.050      0.35      0.036
  10.0     0.15      0.0106     0.37      0.0079     0.55      0.021
  20.0     0.24      0.0026     0.42      0.0138     0.63      0.016
  50.0     0.30      0.0037     0.66      0.0011     0.65      0.0074
 100.0     0.43      0.0068     0.68      0.0056     0.76      0.0053
 200.0     0.82      0.010      0.86      0.01       0.79      0.001
 300.0     1.00      -          1.00      -          0.88      -

(d) Experimental Results and h_f(t_i) on System Bus Controller.

Table 7. Experimental Results and Estimated h_f(t_i) (cont'd).
Figure 9. The Experimental Results and Estimated Distributions for Stuck-at-0 Faults. (Curves for CPU Control Path, CPU Data Path, and Cache Controller; latency period 0-20 ms.)
Figure 10. The Experimental Results and Estimated Distributions for Stuck-at-1 Faults. (Visible curve label: CPU Control Path; latency period 0-20 ms.)
Figure 11. The Experimental Results and Estimated Distributions for Inverted Signal Faults. (Curves for CPU Data Path and Cache Controller; latency period 0-20 ms.)
Figure 12. The Experimental Results and Estimated Distributions of Fault Latencies at System Bus Controller. (Curves for inverted signal, stuck-at-1, and stuck-at-0 faults; latency period 0-300 ms.)
                Exponential        Weibull                  Gamma
                1/λ     error      1/λ      c      error    1/λ      α      error
CC   s-a-0      4.78    0.24       4.35     0.35   0.03     45.89    0.24   0.02
     s-a-1      13.07   0.08       15.24    0.51   0.015    61.61    0.38   0.006
     inverted   0.46    0.41       0.56     0.20   0.009    82.90    0.11   0.007
CPUC s-a-0      0.009   0.004      0.0076   0.39   0.0006   0.117    0.19   0.0008
     s-a-1      0.005   0.003      0.001    0.27   0.0025   0.092    0.09   0.0029
CPUD s-a-0      0.515   0.539      1.488    0.21   0.021    153.9    0.12   0.018
     s-a-1      0.628   0.31       0.799    0.23   0.006    56.79    0.13   0.0013
     inverted   0.036   0.115      0.030    0.29   0.0026   0.648    0.18   0.032
SBC  s-a-0      125.2   0.063      124.9    0.89   0.061    173.2    0.77   0.057
     s-a-1      46.9    0.097      54.85    0.58   0.020    176.18   0.44   0.021
     inverted   34.4    0.029      39.10    0.70   0.0045   80.44    0.58   0.0066

CC - Cache Controller, CPUC - CPU Control Path
CPUD - CPU Data Path, SBC - System Bus Controller

Table 8. Least-Squares Estimation of the Distributions of Fault Latencies.
The estimated parameters for the exponential distribution are also presented in Table 8 for comparison with the Weibull and Gamma distributions. It can be seen that a constant error generation rate (i.e., the exponential distribution) does not model the error generation well. The mean fault latencies (the parameter 1/λ of the estimated exponential distribution) range from 0.005 ms for stuck-at-1 faults in the CPU Control Path to 125 ms for stuck-at-0 faults in the System Bus Controller. This is due to the different exercise rates of each board. Since an injected stuck-at-0 or stuck-at-1 fault does not always represent a logic failure at the moment of injection, a fault with an inverted signal should have a shorter fault latency; this is confirmed by the experimental results.
As pointed out earlier, fault latency is not directly observable. This fact has led us to the development of a new methodology which allows for indirect measurement of fault latency. Note, however, that our experimental results give the distribution function of fault latency instead of data samples of fault latency. Hence, statistical analyses or hypothesis testing are not applicable to these experimental data. The least-squares estimation with commonly used distributions gives only approximate values of the parameters; it cannot test whether an underlying model is (statistically) good or bad. Indeed, from the least-squares errors in Table 8 it is unclear which distribution has the best fit. However, since the hazard rate converges to 1/λ and 0 for the Gamma and Weibull distributions, respectively, it should be possible to distinguish between them once additional injections with larger active durations are performed.
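To make the distinction explicit, the limiting hazard rates can be written in a standard shape-scale parameterization (shape α < 1, scale θ; this notation is ours and may differ from the λ convention used in Table 8):

$$ h_{\mathrm{Weibull}}(t) = \frac{\alpha}{\theta}\left(\frac{t}{\theta}\right)^{\alpha-1} \xrightarrow[t \to \infty]{} 0, \qquad h_{\mathrm{Gamma}}(t) = \frac{f(t)}{1 - F(t)} \xrightarrow[t \to \infty]{} \frac{1}{\theta}. $$

Hence, if additional measurements at large t_i show h_f(t_i) leveling off at a positive constant, the Gamma model is indicated; if h_f(t_i) keeps decaying toward zero, the Weibull model is.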
4. CONCLUSION AND DISCUSSION

In this report, we have first presented a model to be used to study the workload effects on the performance of a highly reliable unibus multiprocessor used in critical real-time applications. Because of the strict performance criteria required for systems of this type, a detailed analysis is both desirable and necessary.
The operation of the computing system addressed has been illustrated using a modified Stochastic Petri Net (SPN). The purpose of this model was to graphically describe the synchronous operation of multiple processing clusters and to show which aspects of the computer's operation have the most significant effect on its performance. Most certainly, system bus contention, workload distribution, and idle processing periods have a marked effect on performance.
The modified SPN was useful for describing computer activity. However, as a tool for performance evaluation, it was shown to be too complex for worthwhile analysis. A simpler model has been presented that still describes the critical performance-related facets. This model is a closed queueing network consisting of multiserver nodes and a single non-preemptive priority queue.
The queueing model was shown to be easily solved for a given set of parameters. It was also observed that useful results pertaining to system performance could be obtained directly from the solution of the queueing model. The ease of obtaining these results and their overall importance demonstrate the usefulness of the model for performance evaluation.
The area that merits further research is the determination of the distribution of the workload among different job classes. A systematic method has not yet been developed to construct the various job classes from the workload of a real-time control system. Characterizing real-time workloads is a more restricted problem than dealing with the workloads of a general-purpose computer. This motivates continued research on the workload distribution problem. Once a characterization method is developed, one can then consider the possibility of obtaining an optimal workload distribution that provides optimal performance.
We have also developed a new methodology for indirectly measuring fault latency through the injection of faults. The methodology has been demonstrated by experiments on the FTMP. The FTMP experimental results show a large variation in fault latencies for different circuits. It has also been observed that the hazard rate of fault latency is monotone decreasing. This implies that a fault tends to remain latent if it did not generate an error at an early stage. The existence of long fault latencies should not be ignored in highly reliable systems. To reduce the accumulation of latent faults, additional on-line diagnostics must be incorporated into the areas where long fault latencies exist.8

8. Such areas can be identified by the methodology proposed in this report.
Although two candidate distributions were used to fit the experimental results, no underlying model for fault latency can be concluded, mainly because of the unobservability of error generation. More experiments should be designed to investigate the behavior of a fault and its effect on system execution. An immediate extension of our experiments is to perform the injections under different system workloads or during the execution of different application tasks. We expect to see some variation of fault latency in certain circuits.
During the FTMP experiments, some interesting points were observed, especially when faults were injected into the System Bus Controller. At certain pins, identification results differed with the active duration of the injection. For instance, with an active duration that was long relative to the fault latency, the SCC indicated that the whole LRU was faulty, but indicated that only a processor or memory was faulty when the active duration was short. This situation was sometimes reversed. In other words, the identification results produced by the SCC depend on both the location of the injection and the active duration of the fault. For the injections into the other boards, e.g., the Cache Controller and the CPU Data and Control Paths, a processor was identified as faulty.
Controller is the interface between the processor region and system buses. These obser-
vations show that the errors do not propagate out of the processor boundary. They also
suggest that an error easily propagates from interface circuits, but the identification of a
faulty interface circuit is more difficult.
In addition, we encountered several problems that were inconsistent with the FTMP's specification, which forced us to abandon some experimental results. Specifically, fault injections into the System Bus Controller caused the FTMP to crash frequently or to make wrong identifications. Certainly, the FTMP could not distinguish the injection of a fault from the true occurrence of a fault. These abnormalities occurred too frequently to be treated as random failures. In addition, only 210 responses from the FTMP indicated that the detected faults were transient, even when faults with a 10 microsecond active duration were injected. In fact, all injections of transient faults in the Cache Controller and the CPU Data and Control Paths were regarded as permanent. A thorough verification of the FTMP's detection and identification mechanisms is needed. This is a matter for our future research.
ACKNOWLEDGMENT

The authors are grateful to Carlos Liceaga, Ricky W. Butler, Milt Holt, Brian Lupton, and Peter Padilla at the NASA AIRLAB for their assistance in the FTMP experiments.
REFERENCES
[1] K. S. Trivedi, "Modeling and Analysis of Fault Tolerant Systems," CS-1984-9, Dept. of Computer Science, Duke University, 1984.

[2] M. A. Marsan, G. Balbo, and G. Conte, "Comparative Performance Analysis of Single Bus Multiprocessor Architectures," IEEE Trans. on Computers, vol. C-31, pp. 1179-1191, Dec. 1982.

[3] M. A. Marsan and M. Gerla, "Markov Models for Multiple Bus Multiprocessor Systems," IEEE Trans. on Computers, vol. C-31, pp. 239-248, Mar. 1982.

[4] C. M. Krishna and K. G. Shin, "Performance Measures for Multiprocessor Controllers," Performance '83, edited by A. K. Agrawala and S. K. Tripathi, North Holland, pp. 229-250, 1983.

[5] L. J. Miller, "A Heterogeneous Multiprocessor Design and the Distributed Scheduling of its Task Group Workload," Proc. 9th Symp. on Computer Architecture, pp. 283-290, 1982.

[6] A. Singh and Z. Segall, "Synthetic Workload Generation for Experimentation with Multiprocessors," Proc. 3rd Int'l Conf. on Distributed Computing Systems, pp. 778-785, 1982.

[7] M. H. MacDougall, "Instruction-Level Program and Processor Modeling," IEEE Computer, vol. 17, no. 7, pp. 14-24, July 1984.

[8] D. Ferrari, G. Serazzi, and A. Zeigner, Measurement and Tuning of Computer Systems. Englewood Cliffs, NJ: Prentice-Hall, 1983.

[9] D. Ferrari, "On the Foundations of Artificial Workload Design," Proc. of 1984 ACM Sigmetrics Conf. on Measurement and Modeling of Computer Systems, pp. 8-14.

[10] M. K. Malloy, "On the Integration of Delay and Throughput Measures in Distributed Processing Models," Ph.D. dissertation, Univ. of California, Los Angeles, 1981.

[11] M. A. Marsan, G. Balbo, and G. Conte, "A Class of Generalized Stochastic Petri Nets for the Performance Evaluation of Multiprocessor Systems," Proc. of 1983 ACM Sigmetrics Conf. on Measurement and Modeling of Computer Systems, pp. 198-199.

[12] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Englewood Cliffs, NJ: Prentice-Hall, 1982.

[13] L. Kleinrock, Queueing Systems Volume 1: Theory. New York: Wiley-Interscience, 1975.

[14] N. K. Jaiswal, Priority Queues. New York: Academic Press, 1968.

[15] D. R. Cox and W. L. Smith, Queues. New York: John Wiley & Sons, 1961.

[16] T. L. Saaty, Elements of Queueing Theory With Applications. New York: McGraw-Hill, 1961.

[17] T. B. Smith and J. H. Lala, "Development and Evaluation of a Fault-Tolerant Multiprocessor (FTMP) Computer Volume I: FTMP Principles of Operation," NASA Contractor Report 166071, May 1983.

[18] J. H. Lala and T. B. Smith, "Development and Evaluation of a Fault-Tolerant Multiprocessor (FTMP) Computer Volume II: FTMP Software," NASA Contractor Report 166072, May 1983.

[19] J. H. Lala and T. B. Smith, "Development and Evaluation of a Fault-Tolerant Multiprocessor (FTMP) Computer Volume III: FTMP Test and Evaluation," NASA Contractor Report 166073, May 1983.

[20] T. B. Smith and J. H. Lala, "Development and Evaluation of a Fault-Tolerant Multiprocessor (FTMP) Computer Volume IV: FTMP Executive Summary," NASA Contractor Report 172286, Feb. 1984.

[21] K. G. Shin and Y. H. Lee, "Error Detection Process - Model, Design, and Impact on Computer Performance," IEEE Trans. on Computers, vol. C-33, no. 6, June 1984, pp. 529-540.

[22] W. G. Bouricius, W. C. Carter, and P. R. Schneider, "Reliability Modeling Techniques for Self-Repairing Computer Systems," Proc. 24th Ann. ACM Nat. Conf., 1969, pp. 295-309.

[23] A. L. Hopkins, T. B. Smith, and J. H. Lala, "FTMP - A Highly Reliable Fault-Tolerant Multiprocessor for Aircraft," Proceedings of the IEEE, vol. 66, no. 10, Oct. 1978, pp. 1221-1240.

[24] K. Trivedi, R. Geist, and M. Dugan, "Modeling Imperfect Coverage in Fault-Tolerant Systems," Proc. of the 14th Annual Int'l Symp. on Fault-Tolerant Computing, 1984, pp. 77-82.

[25] S. J. Bavuso, et al., "Latent Fault Modeling and Measurement Methodology for Application to Digital Flight Control," Advanced Flight Control Symposium, USAF Academy, 1981.

[26] B. Courtois, "Some Results about the Efficiency of Simple Mechanisms for the Detection of Microcomputer Malfunction," Proc. of the 9th Annual Int'l Symp. on Fault-Tolerant Computing, 1979, pp. 71-74.

[27] B. Courtois, "A Methodology for On-line Testing of Microprocessors," Proc. of the 11th Annual Int'l Symp. on Fault-Tolerant Computing, 1981, pp. 272-274.

[28] Y. K. Malaiya and S. Y. H. Su, "Reliability Measure of Hardware Redundancy Fault-Tolerant Digital Systems with Intermittent Faults," IEEE Trans. on Computers, vol. C-30, no. 8, Aug. 1981, pp. 600-604.

[29] P. Marchal and B. Courtois, "On Detecting the Hardware Failures Disrupting Programs in Microprocessors," Proc. of the 12th Int'l Conf. on Fault-Tolerant Computing, 1982, pp. 249-256.

[30] J. G. McGough and F. L. Swern, "Measurement of Fault Latency in a Digital Avionic Mini Processor," NASA Contractor Report 3462, Oct. 1981.

[31] J. G. McGough and F. L. Swern, "Measurement of Fault Latency in a Digital Avionic Mini Processor - Part II," NASA Contractor Report 3651, Jan. 1983.

[32] V. Tasar, "Analysis of Fault-Detection Coverage of a Self-Test Software Program," Proc. of the 8th Annual Int'l Symp. on Fault-Tolerant Computing, 1978, pp. 65-74.

[33] A. L. Wimmergren, "Verification of a Fault Tolerant Multi-Processor Architecture," CSDL-T-782, The Charles Stark Draper Lab., May 1982.

[34] J. H. Lala, "Fault Detection, Isolation and Reconfiguration in FTMP: Methods and Experimental Results," Proc. 5th IEEE/AIAA Digital Avionics Systems Conf., Nov. 1983.