Research on failure free systems  Final report by Brinker, H.
RESEARCH ON
FAILURE FREE SYSTEMS
Contract Nasw-572
Reference WGD-38521
December, 1963
APPROVED:
S. A. LOMAX, Director
Advanced Development Engineering
BOX 1897
THE WESTINGHOUSE ELECTRIC CORPORATION
SURFACE DIVISION
Baltimore 3, Maryland
TPE 5716
https://ntrs.nasa.gov/search.jsp?R=19660017743 2020-03-16T19:12:32+00:00Z
TABLE OF CONTENTS
PURPOSE
SUMMARY
CONCLUSIONS AND RECOMMENDATIONS
Appendix 1 - Design and Testing of Redundant Systems
Appendix 2 - Reliability of Imperfect Redundant Systems
Appendix 3 - A Survey of Components for Adaptive Restoring Circuits
Appendix 4 - Transor Analysis
Appendix 5 - Comparison of Dynamic and Threshold Restorers
Appendix 6 - Self Repair Techniques
PURPOSE
This final report is prepared in accordance with the requirements of Contract Nasw-
572, "Research on Failure Free Systems", between the National Aeronautics and Space Ad-
ministration and the Westinghouse Electric Corporation (reference WGD-38521). The
research that is reported herein has the general objective of the advancement of the state-
of-the-art in the design of highly reliable electronic systems associated with the national
space effort. The design objectives which are studied are those which permit the proper
operation of systems to be relatively independent of the effects of individual component or
module failures within systems. The scope of this objective includes the use of the more
conventional techniques of multiple-line, majority voted redundancy, as well as the study of
self-repair and advanced voting techniques. The research has been divided into the following
major tasks:
TASK 1: IMPLEMENTATION
TASK 2: ADVANCED VOTING TECHNIQUES
TASK 3: SELF REPAIR TECHNIQUES
failures which were induced into a laboratory model of a portion of a typical redundant sys-
tem. Potentially serious detrimental failures which might occur are discussed. A major
portion of the report is concerned with the random failure simulations and their results.
Briefly, a computer program generates random failure lists using available reliability data
for each part. Each failure list includes all component failures which might have occurred
in a typical system which had been operated for the specified time interval, and therefore
simulates the actual testing of such a system. The indicated failures are induced into the
system, which is then tested to determine whether it is capable of performing all of its de-
sign functions, or if it has failed. This actual test result can be compared with the analytical
result which would have been obtained with the same group of failures, to test the validity of
the assumptions used for the analytical result. These tests showed that the most common
analytical model is excessively pessimistic for a well-designed system. For these tests it
predicted more system failures than actually occurred by a ratio of more than 2:1. The
reasons for this departure, and more accurate analytical models, are discussed. A new tech-
nique is described which permits the reliability of a redundant system to be estimated by the
product of exponentials, using the failure rates of the components or modules involved.
Finally, several circuit design considerations are discussed.
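The random failure simulation described above can be sketched roughly as follows. This is only an illustration, not the program used in the study: the part names, the failure rate of 1e-5 per hour, the 10,000 hour interval, and the rule that a triplicated stage fails when two of its three ranks contain a failed part are all assumptions made for the example.

```python
import math
import random

def failure_probability(rate_per_hour, hours):
    """Probability that a part with a constant failure rate fails within `hours`."""
    return 1.0 - math.exp(-rate_per_hour * hours)

def generate_failure_list(parts, hours, rng):
    """Return the names of parts that fail during one simulated operating interval."""
    return [name for name, rate in parts.items()
            if rng.random() < failure_probability(rate, hours)]

def system_survives(failed, stages):
    """Assumed rule: a triplicated stage survives if at most one rank has failed."""
    return all(sum(part in failed for part in stage) <= 1 for stage in stages)

# Hypothetical system: 4 stages, each with three replicated parts (ranks a, b, c).
stages = [[f"s{i}{r}" for r in "abc"] for i in range(4)]
parts = {name: 1e-5 for stage in stages for name in stage}  # assumed failures/hour

rng = random.Random(1963)
trials = 2000
survived = sum(
    system_survives(set(generate_failure_list(parts, 10_000, rng)), stages)
    for _ in range(trials)
)
print(f"estimated reliability: {survived / trials:.3f}")
```

In the study each failure list was induced into laboratory hardware and the hardware was then tested. Here the assumed survival rule stands in for that test, so the sketch reproduces only the analytical side of the comparison.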
The results of implementation studies as part of the research on failure free systems
have been previously published in special technical reports. Two major areas of interest
are discussed in Special Technical Report No. 3, "Circuits and Circuit Testing for Space-
borne Redundant Digital Systems". The entire report is reproduced as Appendix 1 of this
final report. The first portion of the report is concerned with efficient initial design and
contains a discussion of several possible circuit implementations. The latter portion is con-
cerned with the diagnostic testing of a multiple line, majority logic redundant system. Sev-
eral techniques are described for detecting and locating failures within an operating redundant
system to greatly increase reliability. The report is summarized below.
Section I contains a discussion of the general problems concerned with the design and
testing of redundant systems. These problems include the most appropriate choice of circuit
implementation, special design requirements, and the realization of high system reliability
with available circuits.
Section II contains a discussion of the possible use of magnetics to reduce the total
power consumption and provide non-volatile storage in redundant spaceborne systems. Mag-
netics appear to be most useful for applications requiring memory associated directly with
simple forms of logic, or for non-volatile data storage when the data is altered at very slow
rates, but is not recommended for general logic use.
SUMMARY
TASK 1 - IMPLEMENTATION.
This portion of the study is concerned with developing suitable circuits, systems, and
testing techniques for use with currently available redundancy techniques. The circuit and
system design is expected to be suitable for general use in spaceborne or ground support
equipment, free from extremely detrimental failure modes, and compatible with whatever
testing techniques are to be applied. The testing techniques are expected to be suitable for
a wide variety of applications. They are, therefore, similarly varied according to the pur-
pose of the testing, the system configuration involved, and the information which is available
for the test. The testing of redundant systems represents a unique problem, since individual
component or module failures do not indicate their occurrence by affecting the system per-
formance. The various purposes for testing are indicated by the following types of diagnostic
tests which have been considered:
The verification that all signal-processing elements are working properly, or
additionally that the voters are capable of transmitting a correct signal, or
further, that all signal processors and voters work properly under all possible
design conditions. This may be further extended to include the verification that
any additional hardware which is added for the testing is also capable of proper
operation. This range of test requirements is also encountered when the purpose
of the tests is not only to detect any failures, but to locate these failures to facil-
itate repair or replacement in redundant systems where repair is desired, or
systematic maintenance is used.
Another type of testing is referred to as "statistical measure of quality", which obtains
a limited amount of information concerning the failure pattern existing within the system to
estimate the reliability of the system. Many different types of tests can be used to obtain
this information, depending on the confidence required for the reliability estimate, the cost
of obtaining the information, and the type of analysis which will be applied to that information.
Much of the preliminary work necessary for the determination of suitable circuitry for
redundant systems has been described in an earlier Westinghouse report, "Failure Effects
in Redundant Systems" (1). The report describes in detail the effects of catastrophic component
1. A.R. Helland, W. C. Mann, "Failure Effects in Redundant Systems", Westinghouse Re-
port EE 3351, March 1963.
Section III contains descriptions and comparisons of types of semiconductor circuits
suitable for use in redundant systems. Since integrated circuits offer many important ad-
vantages for redundant systems, they are chosen as a basis for system design with semicon-
ductor circuitry. Since custom design of integrated circuits is not especially practical for
low volume operation, the circuit design problem consists of the choice of the most suitable
type of available circuits and the supplier for these circuits. Integrated Diode-Transistor
Logic elements were chosen as the most appropriate for general use. A majority voter re-
storing element, which is not subject to the detrimental failure modes found to be charac-
teristic of conventional elements, is designed using positive logic D-TL NAND elements.
Signetics is recommended as the most appropriate supplier of integrated D-TL elements,
based on currently available information concerning the characteristics of circuits available
from various suppliers.
The discussion of Section IV is concerned with the testing of redundant systems. Various
solutions to the problem of failure detection within a redundant system are discussed in this
section; some are more suitable for simple failure detection, others also provide information
concerning the location of any failures. The failure detection tests alone are expected to be
most suitable for initial acceptance and verification tests to indicate that all parts are work-
ing. The combined detection and location techniques are most applicable to systems where
additional information is required to facilitate repair or replacement of individual parts of
the system.
It is shown that failure location and maintenance of a redundant system does not require
the test equipment and operator skill which are usually required to maintain a conventional
non-redundant system. Techniques are described which permit a redundant system to be
systematically maintained to provide much higher operational reliability than possible with-
out maintenance. It is shown that a major portion of the maintenance may be performed dur-
ing normal system operation.
The partial testing of imperfect redundant systems to estimate future reliability is dis-
cussed in part two of Special Technical Report No. 4, "Transor Decision Functions and Sta-
tistical Measure of Quality". The second part of the report is reproduced as Appendix 2 of
this final report.
The objective of this portion of the study has been to develop a test philosophy from
which a good statistical estimate of the probability of mission success could be made from a
limited amount of test data. Several possibilities have been formulated. The failure masking
characteristics of redundant systems prohibit the use of simple test programs which merely
determine the performance capability of the system at the time of test. Such programs
cannot differentiate between systems containing many component failures with correspondingly
many stages vulnerable to succeeding failures, or few component failures with few vulnerable
stages. Because the probability of mission success after the time of test is heavily influenced
by the component failure pattern existing at the time of test, a test program must be devised
from which mission reliability can be predicted with a reasonably high degree of confidence.
The general complexity and microminiature size of modern systems generally precludes the
possibility of testing each signal processor in each stage.
In the proposed extension of this study the various philosophies will be considered in
more detail, and an effort will be made to evaluate the usefulness of each one with the pur-
pose of determining which of the candidate philosophies provides the most accurate estimate
of probability of mission success for a fixed cost of testing.
TASK 2 - ADVANCED VOTING TECHNIQUES
This study is concerned with advancing the state-of-the-art in developing new restoring
circuits for use in redundant systems. Several advanced voting techniques have been studied
as part of the research on failure free systems. The results of the Adaline-Neuron study
and the initial results of the Transor study have been previously published as special techni-
cal reports. Further study of Transor and a new dynamic restorer (the Hamming Distance
Restoring Circuit) has been conducted, but the results have not been previously published.
These results are, however, contained in Appendix 5 of this report.
The results of the study of the Adaline-Neuron adaptive voter with continuously variable
input weighting have been previously published as Special Technical Report Number 1, "A
Survey of Adaptive Components for Use in Failure Free Systems". It is reproduced as Appen-
dix 3 of this report. Briefly, it concludes that suitable analog memory devices are not cur-
rently available for use in this class of adaptive voters, although the mercury cell integrator
with photoelectric readout is apparently the most suitable technique.
Since the Adaline-Neuron adaptive voter requires an analog memory for each input,
the selection of a suitable input device is important to realize a practical adaptive voter.
Several types of analog memory devices were surveyed in order to evaluate their suitability
for use in implementing an adaptive voter for redundant systems. It is desirable that the
devices be simple, reliable, relatively linear, and store the analog variable weighting for a
relatively long time. It was found that most of the available devices which have been de-
veloped for pattern recognition or learning machines are too complex, unstable, or unreliable
for use in adaptive voters.
Devices included in the survey were the Memistor plated resistor, the
Solion iodine ion cell, the Curtis Instrument Company's Mercury cell integrator (with either
capacitive or photoconductive readout), the MAD magnetic integrator, the orthogonal core in-
tegrator, the second harmonic magnetic integrator, and the magnetostrictive integrator. The
mercury cell integrator with photoconductive readout appears to be the most suitable device
among those which were surveyed. It incorporates an electroplating technique for providing
the continuously variable input weighting for adaptive voters, with relatively good stability,
reversibility, and permanent storage. Since it is a four terminal device with electrical cur-
rent as the input and electrical resistance as the output, it is relatively simple and generally
compatible with conventional circuitry. It is, however, currently in a relatively early state
of its development as a device for general use. It appeared that any detailed circuit design
for adaptive voters should not be undertaken before the expected progress in the development
of more effective cells is accomplished.
The proposed continuation of the development of this class of adaptive voters includes
monitoring the state of the art in the development of more effective devices, followed by the
design and breadboard construction of at least one Adaline-Neuron adaptive restorer, or pre-
ferably a small redundant subsystem using these restorers, in order to demonstrate their
effectiveness in redundant systems.
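The role of the continuously variable input weighting can be illustrated with a small software sketch of a weighted threshold voter. It is not the circuit studied in Appendix 3. The adaptation rule (raise the weight of inputs that agree with the voted output, lower the weight of those that disagree), the step sizes, and the clamping limits are assumptions chosen only to show why each input needs a stored analog weight.

```python
def adaptive_vote(inputs, weights):
    """Weighted threshold vote over +1/-1 inputs; ties resolve to +1."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else -1

def adapt(inputs, weights, output, step_up=0.1, step_down=0.3, w_max=1.0):
    """Assumed rule: reward inputs that agreed with the output, penalize the rest."""
    return [min(w_max, w + step_up) if x == output else max(0.0, w - step_down)
            for w, x in zip(weights, inputs)]

weights = [1.0, 1.0, 1.0]
# Hypothetical history: line 2 is stuck at -1 while the correct signal alternates.
history = [(1, 1, -1), (-1, -1, -1)] * 10
for inputs in history:
    out = adaptive_vote(inputs, weights)
    weights = adapt(inputs, weights, out)
print([round(w, 2) for w in weights])
```

Because the stuck line disagrees with the voted output whenever the correct signal is +1, its weight decays toward zero while the weights of the working lines stay at their maximum. A hardware voter would have to hold each of these weights in an analog memory element such as those surveyed.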
The objective of the Transor study portion of the research was to evaluate the Transor
Restoring Circuit for possible use as a replacement for threshold voters in redundant systems.
In the process of performing this evaluation, another dynamic restorer, the Hamming Dis-
tance Restoring Circuit, was invented. The study was extended to include an evaluation of
both circuits.
The initial portion of this study has been reported in part one of Special Technical Re-
port No. 4, "Transor Decision Functions and Statistical Measure of Quality" which is repro-
duced as Appendix 4 of this final report. In that report, analytical reliability expressions
for systems using Transor restorers are obtained for the case when signal processors are
restrained by certain failure mode assumptions. An appendix to that report shows how the
probability of occurrence of various failure modes might be computed. The results of later
portions of this work are presented in Appendix 5 of this final report. In these results, gen-
eral reliability expressions for the Transor and the Hamming Distance Restoring Circuit are
obtained which are relatively free of restrictive assumptions. A computer simulation pro-
gram which was developed for use in the evaluation, is described and some results obtained
from the program are discussed. Finally, the conclusion is drawn that the Hamming Dis-
tance Restoring Circuit is always superior to the Transor but that it is as good as or better
than the threshold voter only in certain failure mode environments.
TASK 3 - SELF REPAIR TECHNIQUES.
This study is concerned with the development of new, more efficient means for employ-
ing redundant equipment. Using these techniques, a system may be designed to absorb more
internal failures without system failure than is possible with the same amount of fixed,
multiple-line redundancy. The results of this study have been previously published as Special
Technical Report No. 2, "Self Repair Techniques for Failure Free Systems". The report is
reproduced as Appendix 6 of this final report.
As a part of the effort to develop hyper-reliable systems, Westinghouse has devised a
class of techniques for using redundant blocks of circuitry more effectively than has been
done previously. The systems using these techniques are similar to the familiar multiple-
line, majority-voted redundant systems except that blocks of circuitry are allowed to shift around
as component failures leave certain subsystem functions more vulnerable than others to suc-
ceeding failures. The objectives of this phase of the study have been to devise several general patterns
in which systems could be organized to absorb relatively large numbers of internal failures
without system failure and to develop a means for evaluating the effectiveness of the various
patterns for performing this function.
Three broad classes of organization patterns have been developed, and several specific
patterns within each class have been examined. A versatile computer simulation program
has been written from which approximate reliability vs. time curves and a variety of other
pertinent information about each pattern can be directly obtained. Both the patterns which
have been developed and the computer program are described in detail in Appendix 6.
A three-part program has been proposed for future study in this area. In the first part,
the computer simulation program will be used as an evaluation tool for establishing a set of
rules for designing optimum or near-optimum self-repairing systems. The rules will be pri-
marily concerned with the organizational patterns to be used and with the maximum allowable
ratio of repair circuitry complexity to signal processor complexity. Secondly, an implementa-
tion study has been proposed to determine effective means for implementing the organization
patterns which have been and will be devised. Finally, an appropriate study vehicle will be
selected and designed in sufficient detail that a breadboard model could be constructed
from the specifications produced. Such a vehicle design is required in order to verify the
usefulness of both the organizational pattern theories and the implementation techniques
which are being developed.
CONCLUSIONS AND RECOMMENDATIONS
TASK 1 - IMPLEMENTATION
1. Design of Redundant Systems
Redundancy is a powerful tool for achieving extended reliability, but effective design
is required to achieve the reliability goals with a minimum of additional complexity. Although
magnetic logic is often cited as having several advantages applicable to spaceborne computers,
the use of magnetic logic is limited to special applications. Magnetic logic is not particularly
suited for general logic use in redundant systems, due to the lack of steady output signals,
low speed capability, high peak power requirements, and the complexity required for general
logic functions. It appears that no proven magnetic restoring element exists which is suitable
for general use in redundant systems. Magnetic logic does, however, offer non-volatile
storage and very low average power for slow speed operation. Magnetic devices appear to
be suited to special applications where certain logic functions, such as transfer and OR,
are intermixed with the memory function, and very low speed capability is acceptable. It is
useful for low speed shift registers, counters, and timers which consume negligible stand-
by power.
Integrated semiconductor circuitry offers many desirable characteristics for use
in redundant spaceborne systems, including small size, reduced weight and power consump-
tion and high frequency capability. A comparison of the currently available integrated logic
elements indicates that diode-transistor logic (D-TL) is the most suitable for general logic
use in redundant spaceborne systems. A majority voting restorer, designed using inter-
connected NAND elements, has been described which is not subject to the detrimental failures
of more conventional restoring elements.
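A three-input majority function can be built from NAND elements alone, which the following sketch checks exhaustively. It uses the textbook two-level form (three two-input NANDs feeding one three-input NAND). The specific interconnection described in Appendix 1, which is arranged to avoid detrimental failure modes, is not reproduced here and may differ.

```python
from itertools import product

def nand(*inputs):
    """Positive-logic NAND: the output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1

def majority_voter(a, b, c):
    """Two-level NAND realization of the three-input majority function."""
    return nand(nand(a, b), nand(a, c), nand(b, c))

for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "->", majority_voter(a, b, c))
```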
2. Testing of Redundant Systems
It is a characteristic of redundant systems that they offer a high reliability for a
period of time after the initially failure free condition, and that the system reliability decreases
rapidly when internal failures are present. It is therefore important to insure that no initial
failures exist in a redundant system to obtain maximum system reliability. Since an initially
failure free, order three system can withstand any single failure, as well as a relatively
large number of randomly scattered failures, it offers very high reliability for the period
of time when the probability of individual failures is low. Techniques are described which
permit even higher reliability by the use of systematic maintenance of a redundant system.
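The behavior described above can be illustrated for a single order three stage. Assuming a perfect voter and three independent signal processors, each with reliability r (assumptions made for this example only), the stage works when at least two of the three work, which gives R = 3r^2 - 2r^3.

```python
def order_three_reliability(r):
    """Reliability of one triplicated, majority-voted stage with a perfect voter."""
    return 3 * r**2 - 2 * r**3

for r in (0.99, 0.9, 0.7, 0.5, 0.3):
    print(f"r = {r:.2f}  non-redundant = {r:.4f}  "
          f"order three = {order_three_reliability(r):.4f}")
```

Under these assumptions the redundant stage is far better than a single processor while r is close to one, shows no gain at r = 0.5, and is worse below that, which is consistent with the emphasis on the early period when individual failures are improbable.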
It has been shown that a relatively simple technique called singular rank testing
may be used to determine that all of the replicated signal processors in a redundant system
are working properly, and that the majority voters are sufficiently failure free to insure that
the system is not vulnerable to single failures. The system is monitored to determine if each
individual rank is able to perform all system functions correctly, in a manner similar to the
verification of a non-redundant system. This testing places no restrictions on system size
or configuration. A somewhat more complicated testing procedure, referred to as interwoven
rank testing, has been described which will completely test all voters to insure that they will
make correct decisions for all possible input combinations.
Although a redundant system is more complex than its conventional counterpart,
failure location within a working system does not require the operator skill and simulation
equipment usually required to locate failures in a non-redundant system. Since a working
redundant system always has at least one correct signal available at each stage in the system,
these correct signals may be used as a basis of comparison. A difference detector on the
signal processor outputs to restorers may be used to indicate either permanent or sporadic
failures among these signal processors. The failure location techniques described may be
performed during normal operation, since they do not jeopardize system operations.
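The difference detector idea can be sketched in software as a comparison of each signal processor output against the majority of its stage. The sample data and the function names are hypothetical, and the hardware techniques of Appendix 1 are not reproduced.

```python
def majority(bits):
    """Majority value of an odd number of binary signals."""
    return 1 if sum(bits) * 2 > len(bits) else 0

def locate_disagreements(outputs):
    """Return the ranks whose signal processor output differs from the majority."""
    reference = majority(outputs)
    return [rank for rank, bit in enumerate(outputs) if bit != reference]

def monitor(samples):
    """Count disagreements per rank over a sequence of observed stage outputs."""
    counts = {}
    for outputs in samples:
        for rank in locate_disagreements(outputs):
            counts[rank] = counts.get(rank, 0) + 1
    return counts

# Hypothetical observations of one stage during operation: rank 2 is stuck at 0.
samples = [(1, 1, 0), (0, 0, 0), (1, 1, 0), (0, 0, 0), (1, 1, 0)]
print(monitor(samples))
```

A stuck output is flagged only on samples where the correct signal differs from the stuck value, so the disagreement count has to be accumulated over normal operation rather than read from a single sample.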
3. Reliability of Imperfect Redundant Systems
The mission reliability of an operating redundant system which contains internal
failures depends strongly on the number and location of initial circuit failures, as well as
the failure rates of the circuits which make up the system.
One very important task is the design of simple and efficient tests to be performed
at the beginning of a mission. These tests are required to obtain the information required for
the reliability estimates. A maximum amount of information is desired from a minimum
number of tests. The work which has been done will provide a basis for future efforts in
this area.
Several tests are proposed that may be made just before a mission is to begin to
determine, at least approximately, the mission reliability without complete information on
the state of the system. Procedures are proposed for using the results of the tests to
estimate the mission reliability with varying degrees of accuracy. A procedure for making
the decision on the usability of the system without estimating the mission reliability is also
presented.
Although a basis for future study has been provided, the details of these procedures
are still to be worked out and the accuracy of their results is still uncertain. It is recom-
mended that efforts be made to develop an appropriate measure for comparing the techniques
so that they may be evaluated relative to a common scale.
TASK 2 - ADVANCED VOTING TECHNIQUES
1. Components for Adaptive Restorers
A survey has been conducted of several devices which are potentially suitable for
use in the Adaline-Neuron adaptive voter. The survey concludes that none of the suggested
devices were sufficiently developed to justify the immediate circuit implementation of an
adaptive voter.
In general, magnetic devices do not appear to be suitable for use in adaptive voters,
due to their environmental sensitivity and complexity required for useful operation. Electro-
chemical devices such as the Memistor and Solion do not appear to have sufficient simplicity,
stability and compatibility with electronic circuitry to justify their use in adaptive voters.
The mercury cell integrator with photoelectric readout appears in principle to offer
the most attractive approach because of its simplicity, stability and general compatibility with
conventional circuitry. Since the output is essentially a variable resistance proportional to
the integral of the control input current, the device offers the possibility of providing a simple
interface with standard circuitry. It has been reported that the Department of Defense is
about to let a contract to develop and fabricate a large number of cells. The mercury cell
integrator is, however, still in a rather primitive state of development. It is recommended
that detailed circuit design not be undertaken until further device development is
completed, that present effort on the design of an adaptive voter be restricted to
monitoring the state of the art in device development, and that detailed circuit design
begin when more suitable devices become available.
2. Threshold and Dynamic Restorers
The majority voting class of threshold restorers is the class most commonly used
in present technology. Because the majority voter requires a majority of correct
inputs to provide a correct output, its error-correcting capability is limited. Since many
circuit failures result in steady-state outputs, restorers which detect only changes in input
states offer the capability of deriving a correct output with less than a majority of working
inputs. Restorers which detect changes in input states are referred to as dynamic restoring
circuits.
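The principle can be illustrated with a hypothetical change-following rule. It is neither the Transor nor the Hamming Distance Restoring Circuit, whose logic is given in Appendices 4 and 5. In this sketch the output follows whichever lines change state and holds its value when none change, and the initial output value is assumed to be known.

```python
def dynamic_restore(samples, initial_output=0):
    """Follow the lines that change state; hold the output when none change.

    `samples` is a sequence of tuples, one binary value per redundant line.
    """
    output = initial_output
    outputs = []
    previous = None
    for current in samples:
        if previous is not None:
            # New values of the lines that changed during this step.
            moved_to = [c for p, c in zip(previous, current) if p != c]
            if moved_to:
                output = 1 if sum(moved_to) * 2 >= len(moved_to) else 0
        outputs.append(output)
        previous = current
    return outputs

def majority(bits):
    """Majority value of an odd number of binary signals."""
    return 1 if sum(bits) * 2 > len(bits) else 0

# Hypothetical order three stage: lines 1 and 2 are stuck at 0, line 0 is correct.
correct = [0, 1, 1, 0, 1, 0]
samples = [(bit, 0, 0) for bit in correct]
print("dynamic :", dynamic_restore(samples))
print("majority:", [majority(s) for s in samples])
```

With two of three lines stuck, the majority vote is held at the stuck value while the change-following rule still tracks the one working line. The same rule would be misled by a line that changes state erroneously.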
The mission of this part of the Failure Free Systems Study has been to evaluate
the potential usefulness of one proposed dynamic restoring circuit implementation, the Transor.
The results of section IV have shown that there are certain environments in which Transor
can be used to advantage in improving system reliability. For example, the maximum error
restoring capability of Transor is shown to be R-1 failures of R redundant lines in an environ-
ment free from transitional failures. This is a significant improvement over the majority
threshold restoring capability under the same conditions. There is need for caution, however,
for in environments where symmetrical transitional errors are possible, error correlation
may make Transor performance inferior to threshold.
During the course of the study of Transor Restoring Circuits, a new class of
restoring circuits was conceived. This class, called "Hamming Distance Restoring Circuits"
is similar to Transor in many ways. It was compared with Transor analytically and by
simulation. From the results obtained by manipulating the analytical reliability expressions
for the Transor and Hamming Distance Restoring Circuits, it may be concluded that the
output of a Hamming Distance Circuit is more reliable than that of the Transor in order-five
redundant systems. This conclusion holds for any ratio of steady-state to transient error
probability or any asymmetry (tendency toward "ones" or "zeros") of error probabilities.
From comparison of the simulation curves, it may be concluded that the threshold
circuit is more reliable than either of the dynamic restoring circuits until the ratio of the
probability of steady-state errors to the probability of transient error exceeds approximately
seven to one. Above this ratio, the dynamic restoring circuit outputs are more reliable.
Further comparison reveals that the difference in the reliability curves tends to stabilize or
slightly decrease as the ratio becomes much larger than 7:1. The stabilizing effect is more
pronounced as the order of redundancy is increased from five to seven.
Also, it may be concluded that in the early life, high reliability region with
approximately a seven to one probability ratio, an order five system using Hamming Distance
Restorers may be as reliable as an order seven system using threshold voters.
Since the improvement available from Transor is limited, and since the Hamming
Distance Restorer is normally superior, further study of the Transor is not justified.
TASK 3 - SELF REPAIR TECHNIQUES
Before self-repairing systems can be implemented, many feasible switching strategies
must be considered in an effort to determine the most effective manner to manipulate the
redundant or "spare" blocks. The extreme complexity of the reliability expressions associated
with these strategies has resulted in the use of a computer simulation program for comparing
the effectiveness of the strategies. The present program includes subroutines for three
classes of switching strategies. Each class subroutine contains a great deal of flexibility,
thereby including many individual strategies. This method facilitates easy comparison
between members of a class. This comparison allows immediate elimination of many
possible strategies which are obviously uneconomical. For example, the flattening out of the
Percent of Systems Failed versus Spare Mobility curves indicates that none of the strategies
on the flat part of the curves can be optimum strategies.
From the results of the simulation program, curves for Percent of Systems Failed
versus Spare Mobility have been plotted for the Gamma Class Strategies. These curves have
been referenced to that of a multiple-line majority voted system because this particular
technique has been the most effective of the passive, failure masking, circuit level redundancy
techniques. In all cases these curves show not only that great gains can be realized over the
multiple-line redundant configuration, but that by far the greatest part of these gains are
realized for the first few moves allowed to the spare function blocks. Beyond the range of
relatively limited mobility, little or no gain in the average number of failures absorbed is
realized by the additional mobility allowed to the spares. This is an encouraging result
since the great majority of the gain due to self-repair can be retained without the use of an
exorbitant amount of switching circuitry.
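The effect of limited spare mobility can be explored with a toy Monte Carlo model. It does not implement the Gamma Class strategies of Appendix 6. The model assumes ten stages of three interchangeable blocks, a stage that needs two working blocks, perfectly reliable switching, and a rule that a stage reduced to one working block may borrow a block from a fully working stage at most `mobility` positions away.

```python
import random

def failures_absorbed(stages, mobility, rng):
    """Count block failures absorbed before some stage loses its voting majority."""
    working = [3] * stages
    absorbed = 0
    while True:
        # Choose a random working block; stages are weighted by working blocks.
        stage = rng.choices(range(stages), weights=working)[0]
        working[stage] -= 1
        if working[stage] < 2:
            donors = [d for d in range(stages)
                      if d != stage and abs(d - stage) <= mobility
                      and working[d] == 3]
            if not donors:
                return absorbed  # no spare within reach: the system fails
            working[rng.choice(donors)] -= 1
            working[stage] += 1
        absorbed += 1

def average_absorbed(stages, mobility, trials=1000, seed=0):
    """Average number of failures absorbed over `trials` simulated systems."""
    rng = random.Random(seed)
    total = sum(failures_absorbed(stages, mobility, rng) for _ in range(trials))
    return total / trials

for mobility in range(6):
    print(mobility, round(average_absorbed(10, mobility), 2))
```

Mobility 0 corresponds to fixed multiple-line redundancy. In this model no more than ten failures can ever be absorbed, so the averages are bounded and most of the available gain would be expected within the first few allowed moves.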
All of the computer simulation results have been based on the assumption that the
switching circuitry was perfectly reliable. There is a need to determine the range of allowable
failure rates which can be associated with each strategy for it to be of maximum effectiveness.
These ranges should be studied as a function of the failure rates of the associated signal
processor blocks. As a result, information specifying the optimum switching strategy
corresponding to a given signal processor failure rate should be available before actual system
designs are begun.
It has become obvious that many of the spare function blocks do not experience as many
switching operations as they are capable of performing. When all spares are assigned mobility,
those which use their mobility extend the life of the system substantially. However, in many
cases when system failure has occurred, there are many spares remaining which have not
been used to any great extent. In order to try to capitalize on this phenomenon, a class of
strategies should be investigated which would assign different mobilities to the spares in a
stage.
The curves show a very definite gain in reliability for the self-repair strategies over
multiple-line redundant systems. The curves for the Beta Class strategies show an increase
in reliability for each increase in "repair" capability. Strategy Beta-3 yields the highest
reliability but even strategy Beta-1 shows a significant gain over the multiple-line system.
The reliability curves for the Gamma Class show essentially the same result with respect to
Appendix 1
DESIGN AND TESTING OF REDUNDANT SYSTEMS
Circuits and Circuit Testing
For
Spaceborne Redundant Digital Systems
Contract Nasw-572
Reference WGD 38521
by
H. Brinker
A. R. Helland
September 1963
APPROVED:
Director
Development Engrg.
Westinghouse Electric Corporation
Electronics Division
Box 1897, Baltimore 3, Maryland
ABSTRACT
This report describes the results of the study on the implementation
of majority logic redundancy. Most of the work concerns
spaceborne systems, but some portions are more applicable to ground
support equipment. The report is concerned with the initial design
of the system as well as the testing of redundant systems.
The possible use of magnetic logic to reduce the total power
consumption and provide non-volatile storage is discussed. Magnetics
seems to be most useful for non-volatile memory and simple forms of
logic where the data rate is very low. Various types of semiconductor
logic are described and compared for use in redundant systems.
Integrated Diode-Transistor Logic elements are chosen as the most suitable
for general use, with Signetics the most appropriate supplier of these
elements.
Several methods of testing redundant systems are discussed and
described in the section on detection and location of failures. Various
solutions to the failure detection problem are discussed in this section.
Some are more suitable for simple failure detection; others also provide
information concerning the location of any failures. It is shown that
maintenance of a redundant system greatly increases system reliability
and reduces the test equipment and operator skill which are usually
required to maintain a conventional system. Techniques are described
which permit a major portion of the maintenance to be performed during
normal system operation.
TABLE OF CONTENTS
I. INTRODUCTION
II. MAGNETIC LOGIC
A. Introduction
B. Dynamic Storage and Sequential Logic
C. Hybrid Devices
D. All-Magnetic Logic
E. Summary and Conclusions
III. SEMICONDUCTOR LOGIC
A. Introduction
B. Classification of Basic Types of Logic
C. Comparison of Logic Types
D. Description of Logic Types
E. Logic Selection
F. Majority Voter Design
G. Comparison of Suppliers
IV. FAILURE TESTING OF REDUNDANT SYSTEMS
A. Introduction
B. Singular Rank Testing
C. Interwoven Rank Testing
D. Circuit Implementations
V. SUMMARY & CONCLUSIONS
LIST OF FIGURES
Figure  Title
1   OR Gate
2   Negation
3   Block Diagram, AND Function
4   SRI MAD Shift Register
5   AMP-MAD Flux States
6   AMP-MAD Shift Register
7   R-TL Resistor-Transistor Logic (+NOR)
8   DC-TL Direct Coupled-Transistor Logic (+NOR)
9   R-DC-TL Resistor-Direct Coupled-Transistor Logic
10  NS-DC-TL Non-Saturated-Direct Coupled-Transistor Logic
11  D-TL Diode-Transistor Logic (+NAND)
12  NS-D-TL Non-Saturated-Diode-Transistor Logic
13  T-TL Transistor-Transistor Logic
14  Speed-Power Performance
15  Majority Element with Input Isolation
16  Reliability of Conventional vs. Redundant Systems
17  Singular Rank Testing
18  Interwoven Rank Testing
19  Interwoven Rank Testing
20  Signal Processor Output Control
21  Difference Detector
I. Introduction
Past studies of redundancy techniques and consideration of the
basic characteristics of some redundancy techniques have yielded
interesting insights and problems. Many of these considerations are in
the area of engineering method. Others concern the design of redundant
systems with high reliability and other desirable characteristics. This
section is intended to review some of these considerations and to preview
some of the thoughts behind the discussion in later sections.
The report itself deals primarily with some of the problems which
are encountered in designing and testing useful redundant digital systems.
Some of these problems are at least comparable to those of non-redundant design;
others are rather unique to redundant systems. Possible solutions for
these problems, as well as more detailed problem descriptions, are
contained in appropriate sections of the report.
Circuit and system design must reflect the fact that redundancy
is only a tool to realize reliability. The proper use of redundancy is
often a more efficient and powerful technique to realize a reliability
requirement than are the more conventional techniques such as conservative
design or component selection. Redundancy is, however, most powerful when
used in conjunction with techniques that increase basic reliability.
It is important to recognize that a redundant system is expected
to operate with relatively large numbers of random failures. Since
conventional systems usually fail when any of their parts fail, it is relatively
unimportant what effects these failures have, except when repair is desired.
Circuits for redundant systems, however, must be designed so that the
effects of individual component failures are minimized, and usually limited
to the circuits in which the failure occurs. This does not imply, however,
that redundancy includes "useless" parts. Each part of the system must
contribute to the assurance that the system will perform all of its functions
properly.
The use of redundancy will alter the characteristics and performance
of the system. Redundancy will usually increase design complexity, power
requirements and dissipation, signal propagation time, size and weight,
number of interconnections, and initial cost. Redundancy, therefore,
emphasizes the need for continuing development of low-power circuitry,
micro-miniaturization, and interconnection techniques. The type of circuitry
which is used to implement a redundant system must be carefully chosen to
meet the system requirements without incurring excessive costs. Whenever
there is a need for high reliability, the circuitry should be chosen to
have a high basic reliability, low sensitivity to parameter variations, and
low power dissipation to minimize temperature stress. In addition, specific
systems have special requirements which must be considered in the system
design as well as the choice and design of the circuitry. For example, the
total available power is often severely limited for spaceborne equipment,
although the processing rate is usually quite low. It is usually desirable
to provide some means of testing to verify that all parts of the redundant
system are working, to ensure that all of the reliability initially designed
into the system is available for the duration of the mission. The system
and the circuitry therefore must be designed so that accurate and meaningful
tests may be applied to verify that the parts are working. When
extended lifetime is desired and repair is possible, a redundant system may
be systematically repaired to greatly increase the expected time between
system failures. If a system is completely repaired prior to each mission
in which it is used, it will exhibit the high mission reliability characteristic
for each mission. Such systems must be designed so that complete,
efficient tests may be periodically applied which will
verify that all the parts are working properly, or that will facilitate
maintenance procedures which will return the system to the initially perfect
condition. It is important for this type of maintenance that all failures
be detectable; otherwise undetectable failures will tend to accumulate.
These accumulated failures will eventually tend to dominate the system
behavior by causing additional system failures.
Many failures may be detected as they occur in a redundant system.
These may be repaired while the system is in operation to obtain a very low
system failure rate compared to the failure rate for the parts of the
system. Periodic maintenance must be performed in addition to the continuous
monitoring and repair described above to detect those failures which cannot be
detected during regular operation of the system.
Systems which will be maintained must therefore be designed both
with the capability for detecting all failures and facilitating the
maintenance and repair procedures. With proper design, many of these failure
detection, maintenance and repair procedures may be accomplished during
operation of the system.
The following sections of this report will discuss the problems
associated with circuit design, choice of the type of circuitry, failure
detection, and maintenance of redundant systems. This report describes the
results of the study of these problems and possible solutions. The results
are summarized in the Summary and Conclusions section of this report.
II. Magnetic Logic
A. Introduction
The past decade has witnessed the development of a variety of magnetic
devices suitable for performing storage and logic in digital computers.
Perhaps the most important application of magnetics to digital
technology has been provided by the development of large capacity, random
access memory systems composed of ferrite cores. Advances in techniques
for performing logic have received some attention, but to date magnetic
logic does not appear to be widely accepted as a superior replacement for
the conventional transistorized counterpart. This general reluctance to
utilize the special attributes of magnetic logic is often justified by
several difficulties inherent to the device characteristics and system
configuration.
Much of the magnetic logic research has been motivated by the
potential ability of magnetic devices to provide higher reliability at
lower cost while consuming negligible standby power. These attributes are
understandably important in any large electronic system, especially in space
applications where reliability must be high and available power is invariably
low. To evaluate the potential ability of magnetic logic schemes to
provide these advantages a discussion of some of the more promising approaches
appears to be in order. An all inclusive survey and treatment of the
myriad of suggested approaches could easily fill a book.* It appeared
reasonable therefore to restrict the detailed discussion to the more popular
approaches and to provide references for others. Of particular interest
are those devices which utilize magnetic components which are either
commercially available or in an advanced state of development.
* Meyerhoff, A. J., Digital Applications of Magnetic Devices,
New York; John Wiley and Sons, Inc., (1960).
B. Dynamic Storage and Sequential Logic
The state of a magnetic device is determined by the direction of
remanent flux. Information stored is not directly accessible and a clock
or read pulse must be used to determine the state. The read process in
most schemes also destroys the information which was stored. An output
signal is available only for that portion of the read cycle during which
dynamic flux change is in progress and thus level output and asynchronous
operation are not obtainable. The ripple-carry binary counter, the parallel
adder, and many familiar digital configurations are not directly amenable to
magnetic implementation. In contrast, the powerful combinational logic
approach utilized in conventional computers consists of a cascade of com-
patible logic modules which form complex functions simultaneously during
the interim between clock pulses. In a magnetic logic machine using
dynamic logic this is not possible and operations involving OR, AND,
transfer, buffering, negation and delay require several clock periods to
generate a particular function. This step by step process usually consumes
considerable time which may be further extended if the magnetic logic
modules are limited in fan-in and fan-out and thus require additional operations.
C. Hybrid Devices
The principle involved in using square loop material to store a
remanent flux has been known for some time. With the development of small
toroidal structures employing sintered ceramic ferrites and ferromagnetic
tape materials, magnetic devices began to demonstrate practical utility.
The magnetic shift register has received the most attention primarily
because of its general utility and simple configuration and has been the
subject of much of the magnetic literature. Although it plays an important
part in most digital systems, several additional devices are required in
order to provide the variety of logical operations required by a typical
computer system.
The task of performing general logic requires circuitry capable of
being arranged to perform any Boolean output function of a set of input
variables. In order to provide this operation a complex function is usually
formed by using logic modules to perform OR, AND, negation, storage, delay,
etc. If gates are to be connected in various configurations the devices
used must provide a clearly identifiable "1" and "0" state, unilateral
information transfer and the capability for fan-in and fan-out. To meet
these requirements with magnetic devices has not been an easy task.
A major difficulty which impeded rapid development of devices to
meet these requirements has been the inherent bilateral nature of simple
magnetic structures. In the early devices this was largely overcome by
combining diodes with simple toroids to achieve unilateral information
flow. Obvious limitations in impedance levels, fan-in and fan-out
drive capabilities necessitated in many cases the further inclusion of
resistors for tailoring impedance levels, capacitors for temporary storage
and transistors for power gain. Although this hybrid logic approach led
to the development of a number of clever magnetic devices, the potential
of achieving high reliability at low cost is seriously challenged by the
requirement for using non-magnetic components and the more complex wiring
and system organization which becomes necessary. An excellent survey
of a wide variety of hybrid devices has been provided by Haynes (1). One
such approach, parallel transfer core-diode logic, will be used as a vehicle
for describing the principles of dynamic logic and to indicate the operation
of a typical practical device.
Shown in figure 1 is the OR gate, the simplest of logical functions
which may be implemented with magnetic cores and diodes. The A and B
notations denote cores of the same rank, i.e. threaded by a series
connected, current driven clock line. The two phase clock system effects
readout and transfer of data by driving the core to the "0" state. If
a core was previously in the "1" state the clock, in driving the core to
the "0" state, causes the core to switch and provides an output sufficient
to drive the next core to the "1" state. If a core was previously in the
"0" state a negligibly small output occurs when the clock drive is applied.
Diodes are shown to prevent output loading when a core is being set.
Additional components such as resistors for tailoring impedance levels
and diodes to prevent reverse data transfer may be required in a practical design.
It should be noted also that the core output windings must contain more
turns than core inputs in order to allow a transmitting core to set a
receiving core, which also tends to prevent reverse data transfer.
Figure 1 OR Gate
Operation is initiated by reading inputs X and Y into the
cores. The phase A clock then transmits the state of each of the input
cores into a dual winding storage core. If the storage core was set by
any of the transmitting input cores, a readout signal is generated when the
storage core is reset by the phase B clock.
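The two phase operation described above can be illustrated at the logic level. The following is a minimal sketch in Python, assuming ideal cores (complete switching, no drive tolerances, no loading); the function name `or_gate` and the variable names are illustrative and do not come from the report.

```python
def or_gate(x: int, y: int) -> int:
    """Logic-level model of the two-phase core-diode OR gate."""
    # Inputs X and Y are read into the input cores (1 = set, 0 = cleared).
    input_cores = [x, y]
    storage_core = 0

    # Phase A: every input core is driven to "0". A core that held "1"
    # switches, and its output pulse sets the dual winding storage core.
    for i, state in enumerate(input_cores):
        if state == 1:
            storage_core = 1
        input_cores[i] = 0  # readout destroys the stored information

    # Phase B: the storage core is driven to "0". It delivers a readout
    # signal only if it was set during phase A.
    output = storage_core
    storage_core = 0
    return output


truth_table = [(x, y, or_gate(x, y)) for x in (0, 1) for y in (0, 1)]
```

Each evaluation consumes one phase A pulse and one phase B pulse, which is the two clock period cost per logic module on which the later timing argument rests.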
The AND function is not as easily implemented unless a coincident
current threshold technique is employed to set the storage core. This
technique does not appear to be sufficiently reliable however, due to the
associated threshold and drive tolerances normally encountered in a typical
system. A more conventional system employs the principle of logical
negation in combination with the OR gate to provide the AND function.
For example, consider the negation arrangement of figure 2.
Figure 2 Negation (labeled elements: dummy core ("1" generator), inhibit core, input X, clock A, clock B)
The upper core is used as a "1" generator which in the absence of an input
from the X core causes the inhibit core to be set by the phase A clock.
The phase B clock will then generate an output whenever the X signal is
absent and thus represents the negation of the input. When both the "1"
generator and X input signal appear simultaneously at the inhibit windings
they effectively cancel each other and the inhibit core remains in the "0"
state. The phase B clock in driving the inhibit core to the "0" state
will not generate an output signal for this case.
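The inhibit arrangement can be sketched in the same logic-level style. This is an illustrative model only, assuming that the "1" generator pulse and the X pulse cancel exactly at the inhibit windings; the name `negate` is hypothetical.

```python
def negate(x: int) -> int:
    """Logic-level model of the inhibit-core negation stage."""
    one_generator = 1  # dummy core that always delivers a "1" on phase A

    # Phase A: the "1" generator and the X core drive opposing windings.
    # The inhibit core is set only when no X pulse arrives to cancel
    # the generator pulse.
    inhibit_core = 1 if (one_generator == 1 and x == 0) else 0

    # Phase B: the inhibit core is driven to "0" and produces an output
    # pulse only if it was set on phase A.
    output = inhibit_core
    inhibit_core = 0
    return output


negation_table = [(x, negate(x)) for x in (0, 1)]
```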
The principle by which the AND function may be performed is based
on the well known logic relation $\overline{\overline{X} + \overline{Y}} = XY$. A block diagram of a typical
AND gate scheme is shown in figure 3.
Figure 3 Block Diagram, AND Function (blocks labeled NEG.; output labeled $\overline{\overline{X} + \overline{Y}} = X \cdot Y$)
Since each of the logic modules requires two clock periods and each operation
is performed in sequence, the output signal is seen to appear six clock
periods after the inputs were applied. If the resultant output of the
AND function is to be further combined with other AND-OR operations it
becomes evident that the total number of clock periods required may become
prohibitive.
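The six clock period figure follows from three sequential stages (negation, OR, negation) at two clock periods each. A minimal sketch, assuming the two input negations operate in parallel and count as a single stage; all names are illustrative.

```python
CLOCK_PERIODS_PER_MODULE = 2  # stated in the text for each logic module


def neg(a: int) -> int:
    return 1 - a


def or_(a: int, b: int) -> int:
    return a | b


def and_via_negation(x: int, y: int) -> tuple[int, int]:
    """Return (X AND Y, clock periods used), built from NEG and OR only."""
    stages = 0
    not_x, not_y = neg(x), neg(y)  # both input negations in the same stage
    stages += 1
    either = or_(not_x, not_y)     # OR of the negated inputs
    stages += 1
    result = neg(either)           # final negation gives X AND Y
    stages += 1
    return result, stages * CLOCK_PERIODS_PER_MODULE


and_table = [(x, y) + and_via_negation(x, y) for x in (0, 1) for y in (0, 1)]
```

Every row of `and_table` carries the same latency, which is why cascading further AND-OR operations multiplies the clock period count.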
In view of the system complexity and speed limitations suggested by
the simple example described, magnetic logic is seen to introduce problems
of system organization which are alien to conventional DC level logic.
As far as cost and reliability are concerned, the prospect of winding cores
with several turns and the large number of cores and connections required
do not appear to provide a significant cost advantage. In the hybrid
approach the use of additional components such as diodes and resistors
appears to seriously negate the basic reliability inherent to the magnetic
material. These difficulties notwithstanding, several companies are
active in the manufacture of magnetic logic modules. The major emphasis
has been placed on the usefulness of the magnetic shift register to provide
cost, size and power advantages over the conventional approach. Magnetic
shift registers employing the hybrid approach have been successfully applied
to a wide range of airborne equipment. Sequential programmers, counters
and timers operating at low clock rates represent the majority of applications.
When operating at shifting rates higher than 10 kc however, the
advantage that the magnetic shift register has in consuming negligible
standby power is obscured by a power requirement which is often greater than
that of the solid state counterpart. A leading supplier of hybrid magnetic logic
modules and shift registers is currently marketing a 10 bit shift register
which requires a maximum average power of 0.4 watts to operate at 10 kc
and 3.7 watts at 750 kc. Since it appears reasonable to assume that these
power requirements are reflected also to general logic systems, the application
of hybrid magnetic logic to power-limited environments is limited
to systems whose shift rate is very low.
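The two quoted ratings can be put on a common footing by dividing power by shift rate. The short calculation below is only a rough illustration: it treats the quoted maximum average power as if it were drawn uniformly per shift pulse, which the report does not state, and the names are hypothetical.

```python
# Maximum average power quoted in the text for the 10 bit hybrid shift
# register: {shift rate in cycles per second: watts}.
ratings = {10e3: 0.4, 750e3: 3.7}


def energy_per_shift_uj(rate_hz: float, power_w: float) -> float:
    """Average energy per shift pulse, in microjoules (upper bound)."""
    return power_w / rate_hz * 1e6


energy = {rate: energy_per_shift_uj(rate, p) for rate, p in ratings.items()}
power_ratio = ratings[750e3] / ratings[10e3]
```

Under this assumption the register draws about 40 microjoules per shift at 10 kc and about 4.9 microjoules per shift at 750 kc, while the average power rises by a factor of about 9. The power demand therefore grows with the shift rate, which is the reason the negligible standby power ceases to be an advantage at higher rates.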
D. All-Magnetic Logic
The obvious limitations of the hybrid approaches in reliability and
cost have to some extent motivated an effort to develop systems using only
magnetic material and connecting wire. Several novel approaches were
developed which made use of magnetic device geometry to achieve coupling
isolation, flux gain and unilateral information flow. Perhaps the most
popular of these devices is the Multi-Aperture Device (MAD) (2, 3), a three
aperture ferrite structure similar to the Transfluxor (4). Input-output
isolation is possible because the flux stored around the minor output aperture
may be sensed non-destructively without affecting stored flux about
the input aperture.
Shown in figure 4 is a typical MAD shift register developed at
Stanford Research Institute.
Figure 4 S.R.I. MAD Shift Register (elements labeled O, E, O)
An advance current is applied to the parallel connection of output and
input aperture windings in order to effect information transfer from the
transmitting core to the receiving core. In accordance with the state of
the flux stored around the transmitting aperture and the resultant magnetic
threshold thereby established, the advance current will divide between the
input and output windings. If the transmitting aperture is in the "0"
or cleared state the advance current will divide equally thus not exceeding
the magnetic threshold of either aperture. If a "1" were stored the output
aperture with its lower threshold is swamped by the advance current and the
transmitter switches flux locally about its output aperture with low values
of current. By voltage or impedance steering the majority of advance current
will flow through the receiver input aperture causing it to exceed its
setting threshold and be set. In time as the flux switching is completed,
both currents will return to their nominally equal values.
Since the read-out and transfer process is nondestructive to the
state of the core, a clear line threading the major aperture is required
to return the core to the reset condition. In order to provide information
flow from left to right a basic four clock cycle is required with the
following sequence: ..., ADV. O→E, CL. O, ADV. E→O, CL. E, ... The
ADV. O→E pulse switches flux locally about the output aperture of the O
element and causes the E element to be set. The CL. O pulse then clears
the O element and in so doing switches flux through the output winding.
This results in a loop current flow that negatively sets the E element
receiver without affecting the flux state about the output aperture of the
E element. Note that neither the ADV. O→E nor CL. O pulse causes any
flux to be switched in the output leg of the E element thus eliminating
the need for a diode to prevent backward data transfer. In this manner
unilateral data transfer is possible using only MAD devices and conducting
wire.
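The four clock sequence can be followed with a logic-level model of a register built from alternating O and E elements. This is a sketch under simplifying assumptions: each element is reduced to a single stored bit, flux thresholds and loop currents are ignored, and a serial input is assumed to be written into the first O element during ADV. E→O. The function name `four_clock_cycle` is illustrative.

```python
def four_clock_cycle(o_elems, e_elems, serial_in=0):
    """Apply ADV O->E, CL O, ADV E->O, CL E once to a two-core-per-bit register.

    o_elems[i] feeds e_elems[i], which feeds o_elems[i + 1].
    Returns the new O states, the new E states and the serial output bit.
    """
    # ADV O->E: each E element is set if the O element to its left holds
    # a "1". The read is nondestructive, so the O elements are unchanged.
    e_elems = [e | o for o, e in zip(o_elems, e_elems)]

    # CL O: the clear line returns every O element to the reset state.
    o_elems = [0] * len(o_elems)

    # ADV E->O: each E element sets the O element to its right; the last
    # E element drives the register output.
    serial_out = e_elems[-1]
    o_elems = [serial_in] + e_elems[:-1]

    # CL E: every E element is cleared, completing the cycle.
    e_elems = [0] * len(e_elems)
    return o_elems, e_elems, serial_out


# Shift a single "1" through a three-bit register.
o_state, e_state = [1, 0, 0], [0, 0, 0]
outputs = []
for _ in range(3):
    o_state, e_state, out = four_clock_cycle(o_state, e_state)
    outputs.append(out)
```

In this model the stored "1" advances one bit position per four clock cycle and appears at the output on the third cycle, after which the register is empty.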
Thus far our discussion has been devoted to techniques for achieving
unilateral data transfer with the S.R.I.-MAD approach. The problem of
achieving reasonable flux gain and fan-out is one which could not be solved
in a practical sense with the simple transfer scheme previously discussed.
H. D. Crane has done much of the work in arousing interest in the
all-magnetic MAD approach. In a paper (5) describing the design of a moderate
sized computing system using S.R.I.-MAD devices however, the basic transfer
gate had to be seriously modified in order to operate in the system.
Problems inherent to the flux threshold relationship between receiving
and transmitting apertures, flux gain, fan-out as well as flux decay and
build-up in circulating loops made such modifications necessary. As a
consequence the revised gate module required flux doubling and clipping
operations in addition to the previously described clear and advance cycles.
The complexity involved in the resultant device implementation appears to
be a serious encumbrance. The system chosen to demonstrate the ability
of all-magnetic devices took the form of a decimal arithmetic unit with
the ability of performing addition, subtraction, and multiplication. The
system was made exclusively of modules which perform either the two input
OR function or the two input OR with negation (NOR).
Rather than describe the complex details of the S.R.I.-MAD logic
gates it appears more reasonable to present the simpler, more practical
approach to the design of MAD devices developed by AMP, Inc. In this
approach a priming operation is performed to reverse the flux stored about
the transmitting aperture prior to readout. The readout process in this
case is destructive and resets the core. The priming operation provides
an adequate flux level which, when reversed by the clear or transfer
operation, delivers an output pulse to set the next core through its
major aperture. Since data flow is from minor aperture to major aperture
and since the state of a core is not disturbed by reverse currents
flowing through a minor aperture, the possibility of reverse data flow
is prevented.
The flux conditions present for the various states of a typical
AMP-MAD element are shown in figure 5.
Figure 5 AMP-MAD Flux States: a) reset or cleared state; b) set state; c) set core after priming; d) reset core after priming (winding labels: ADV. (CLEAR), INPUT, PRIME, OUTPUT)
In the cleared state (figure 5a) the core is saturated in the clockwise
direction by a previously generated advance current which threads the major
aperture. Upon application of an input signal threading the inner portion
of the major aperture, the flux nearest the major aperture is reversed thus
providing the set condition shown in figure 5b. This read-in operation does
not affect the flux linking the output aperture and thus a diode is not
required to block data transfer to receiving cores. In order to obtain
an output from a properly set core it is necessary to provide a prime
current as shown in figure 5c to reverse the flux stored about the output
aperture. Priming current is of a lower magnitude than the advance current
and because of its slow rate of change is not sufficient to cause the core
linked by the output winding to be disturbed. Once a core has been set and
primed, the application of an advance current causes a flux reversal about
the output aperture. This in turn provides an induced voltage of sufficient
magnitude to drive the next core to the set condition. If the core
was initially in the reset condition it will remain in this condition after
priming (figure 5d). For this case, the application of the advance current
does not provide a flux reversal and thus no output occurs.
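The four flux conditions of figure 5 amount to a small state machine. The sketch below is an illustrative model, not the manufacturer's specification: the class and state names are hypothetical, and it assumes that an advance pulse applied to a set but unprimed core clears it without a usable output, a case the text does not describe.

```python
from enum import Enum


class Flux(Enum):
    CLEARED = "a"     # figure 5a; priming leaves it unchanged (figure 5d)
    SET = "b"         # figure 5b
    SET_PRIMED = "c"  # figure 5c


class MadElement:
    """Logic-level model of one AMP-MAD element."""

    def __init__(self) -> None:
        self.state = Flux.CLEARED

    def write(self, bit: int) -> None:
        # An input pulse through the major aperture sets a cleared core.
        if bit and self.state is Flux.CLEARED:
            self.state = Flux.SET

    def prime(self) -> None:
        # Priming reverses flux about the output aperture of a set core
        # only; a cleared core remains cleared.
        if self.state is Flux.SET:
            self.state = Flux.SET_PRIMED

    def advance(self) -> int:
        # Readout is destructive: the core is cleared, and an output
        # pulse appears only if the core had been set and primed.
        output = 1 if self.state is Flux.SET_PRIMED else 0
        self.state = Flux.CLEARED
        return output


core = MadElement()
core.write(1)
core.prime()
set_output = core.advance()      # set and primed core
core.prime()
cleared_output = core.advance()  # cleared core after priming
```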
AMP-MAD elements may be connected in a variety of shift register
configurations including parallel input-parallel output, parallel input-serial
output, serial input-serial output, etc. Such shift registers take the form
of 2 core-per-bit arrays and require a two clock system in combination with
a priming source. A typical serial input-serial output shift register
section is shown in figure 6.
Figure 6 AMP-MAD Shift Register (advance line labeled ADV. O→E)
The propagation of a "1" from left to right proceeds by activating clock
and prime signals in the following sequence: ... PRIME, ADV. O→E, PRIME,
ADV. E→O, PRIME, ADV. O→E, .... AMP-MAD shift registers require relatively
high values of pulse current for performing advance, prime and set operations.
Nominal operating level for the advance current is 2 to 3 amperes
in a typical design. Prime and set pulse currents are lower, being 100 ma
and 250 ma respectively. Because of the requirement for slow priming and
in order to keep average power dissipation at reasonable levels, AMP-MAD
shift registers are limited to repetition rates of 10 kc. A typical driver,
which utilizes a capacitive storage-discharge scheme and dual Shockley
diodes for triggering the advance currents, requires an average power of
5.3 watts to drive a 10 bit shift register at 10 kc. A 10 bit shift register
with its associated driver requires a package occupying approximately 9
cubic inches.
The implementation of general logic operations using MAD devices is
not easily accomplished, due to the difficulty of achieving logical inversion
and reasonable fan-out without an imposing complexity. The treatment of
much of the general logic capabilities of MAD devices is reported in rather
implicit terms by the current literature. The OR function may be provided
relatively simply by threading additional windings about the input aperture
if care is taken in preventing reverse information transfer. The negation
operation may be achieved by extending the current inhibiting and "one"
generator technique described in the hybrid approach to the MAD topology.
Perhaps the most difficult problem which faces the all-magnetic logic designer
is that of providing fan-out. This arises from the fact that all
the power which is used to provide inputs to receiving cores comes from
the clock source. Power gain in the ordinary sense is not available except
in those hybrid schemes which use transistors to provide regeneration.
A MAD device with a reliable fan-out of two is sufficient, however, to
allow the performance of general logical operations requiring much greater
fan-out. This may be accomplished by utilizing additional clock pulses to
sequentially transfer data in a "tree" wiring arrangement until the original
single core data is available simultaneously in several cores. As
far as fan-out is concerned, it appears that the hybrid approach using
transistors provides an important advantage over the all-magnetic techniques
which necessarily require considerable device and system complexity
to achieve the same result.
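The cost of the "tree" arrangement can be estimated by counting how many sequential transfer levels are needed before enough copies of the data exist. The sketch below assumes every core in a level drives the full device fan-out; the function name and parameters are illustrative, and the number of clock pulses per level is not stated in the report.

```python
def tree_levels(required_fan_out: int, device_fan_out: int = 2) -> int:
    """Sequential transfer levels needed to reach the required fan-out."""
    levels = 0
    copies = 1  # the original single core
    while copies < required_fan_out:
        copies *= device_fan_out  # every core in this level drives the next
        levels += 1
    return levels


levels_needed = {n: tree_levels(n) for n in (1, 2, 4, 8, 9, 16)}
```

With a device fan-out of two, a fan-out of 8 takes three levels and a fan-out of 9 takes four, so the delay grows with the logarithm of the required fan-out while the number of intermediate cores grows roughly in proportion to it.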
E. Summary and Conclusions
The foregoing description of magnetic logic has not attempted to
describe the variety of possible approaches. The techniques for accomplishing
general logical operations have been implicit, reflecting the treatment
of the current literature. Examples from two general classes of
magnetic devices have been described to provide a basic understanding of
the techniques involved. If the approaches described may be regarded as
typical, then some conclusions about their utility may reasonably be expected
to apply in a general sense.
Information regarding transfer and shifting operations is covered in
considerable detail by current literature, but the treatment of general
magnetic logic schemes has been seriously neglected. This suggests the
degree of difficulty which has been encountered in the design of practical
devices. Complex clock programming and device configurations are necessary to
achieve operations which conventional designers have come to consider as
trivial. In general, magnetic devices do not display a natural ability
for performing logic. The primary attribute of magnetic devices is that
of non-volatile storage, the ability of a core to remain in a particular
state indefinitely without further application of energy. This feature is
an important consideration in power limited environments such as space
vehicles where the standby power between clock pulses may be made to approach
negligible values. If the clock processing rate exceeds approximately 10 kc
however, the average power required often exceeds that of a conventional
transistorized counterpart. This limits the application of magnetic shift
registers, timers, etc. to equipment with low clock rates.
Recent advances in low power microminiaturized devices are seriously
challenging the magnetic attribute of zero standby power while providing
higher speed, smaller size and the greater utility of combinational DC
logic. NASA's Lewis Research Center is sponsoring much of the work in this
important area. Operating speeds of several newly developed circuits are
approaching 100 kc at power levels in the microwatt range. A complete
logic system with a power consumption of 10 microwatts per stage is anticipated
for space application using micropower logic circuits. With the
basic reliability of microminiaturized devices constantly improving by
virtue of an industry wide effort, the role of magnetic logic appears to
be fading.
Another advantage claimed for magnetic devices is the reliability in-
herent in the use of magnetic material and connecting wire. It is assumed
here that magnetic parameters affected by temperature have been compensated
for by proper design and that clock current amplitude and rise time are
within the limits of proper operation. Under these conditions the basic
mechanism of magnetic storage and switching appears devoid of any known
failure mode. This reliability is, however, obscured by the large number
of connections required by the device configuration and the complexity
inherent to the system organization. The reliability of a magnetic system
depends upon the connective paths and the clock pulse drivers.
Simplicity and low cost are often claimed as virtues of magnetic
devices because of the simplicity and low cost of the basic cores utilized.
It should be noted, however, that the task of providing several turns about
the various apertures and connecting cores in a configuration to perform
the basic logical operations of AND, OR and negation is not generally
amenable to automated assembly. The extensive amount of hand wiring and
soldering appears to represent an item of considerable cost.
The physical size of magnetic devices is generally one or two
orders of magnitude larger than that of their microminiaturized counterparts.
Advances in thin film magnetic logic hold some promise for a significant
size reduction, but developments in this area have not been extensively
reported to date.
The flexibility of magnetic devices is seen to be severely limited
by the dynamic logic approach and the difficulty of achieving reliable
fan-out in the absence of active devices. The flexibility of conventional
DC logic systems is evidently superior because of the power gain and the
inherent signal level standardization.
After considering the attributes of magnetic devices for performing
general logic, the popular core techniques do not appear to provide an
evident superiority in power consumption, reliability, simplicity, cost,
size and flexibility over the conventional solid state circuit approach.
Indeed, the requirements of performing the logical operations characteristic
of digital computers appear to be at variance with the capabilities of
magnetic logic. The applications which are best suited to magnetic imple-
mentation are those in which the operations to be performed are not clearly
separated into "logic" and "memory". A strong case can be made for mag-
netic circuits applied to the performance of integrated storage and transfer
operations required by a variety of digital processing functions. Most
appropriate are the low speed operations inherent in input-output, inter-
face and peripheral equipment. Typical applications include shift registers,
programmers, timers, sequencers, etc. where the magnetic modules perform
entire functions rather than discrete operations of storage and logic.
In these special applications where speed is low, the advantages in simpli-
city, reliability, cost and power to be gained through the use of magnetic
circuits should not be neglected. In general applications, however, the
presently developed magnetic circuits do not appear satisfactory due to the
several problems inherent in their use.
III. Semiconductor Logic
A. Introduction
In contrast with the numerous disadvantages and the general
unavailability of magnetic logic devices, conventional semiconductor logic
has been used widely. Logic modules are commercially available for
construction of general logic systems. Integrated semiconductor circuits
offer an order of magnitude reduction in size compared to magnetic logic
modules; they do not require high voltage or high peak power pulses.
They operate at frequencies many times greater than comparable magnetic
logic requiring the same average power, and provide the convenience of
steady voltage outputs.
Integrated semiconductor circuits offer a significant size and
power reduction compared to discrete component semiconductor circuits.
The rapid acceptance of integrated semiconductor logic elements attests
to the advantages of their use. Therefore, integrated circuits have been
chosen as more suitable for spaceborne digital applications than
discrete component circuitry. The circuit design problem is then translated
to the problem of the choice of suitable types of circuitry and logic.
A variety of such elements is available with predictable characteristics
for a wide range of operating environments. The selection by the Air
Force of integrated circuitry for use in the improved Minuteman is a
significant factor in the availability of reliable integrated circuits and
appropriate reliability data. There is also a large amount of government
and industry effort devoted to research and development of new and improved
integrated circuits.
The low weight and power consumption of integrated circuits offer
an important compensation for the increase in the number of circuits required
for redundant design of spaceborne equipment. It is expected that advances
in integrated circuit technology will allow more complex circuits to be
included within a single package to further decrease size and weight.
Integrated circuits also offer significantly improved reliability performance;
it is expected that the reliability of a single chip containing an entire
function can be shown to approach that of a single discrete transistor.
The low power consumption characteristic also tends to increase reliability
by reducing temperature stress. The significant reduction in the number of
interconnections is also an important factor in reliability improvement.
Most integrated logic modules are available in the form of a universal
gate function (NAND or NOR). These logic elements are quite appropriate
for the construction of the restoring function required for a multiple line
majority voted redundant system. Several types of logic available for the
universal gate function have been studied. Each basic type is described
below; those commonly available are compared for suitability for use in
spaceborne redundant systems. One of these is chosen as particularly suit-
able.
B. Classification of Basic Types of Logic
It appears that most of the common types of transistor logic (TL)
may be classified according to three basic coupling schemes used for the
universal gate function. They are described below.
I. Linear impedance coupling to an input transistor may be used
to form R-TL, as shown in figure 7. This type of logic is generally not
available in integrated circuit form.
Figure 7 R-TL Resistor-Transistor Logic (+NOR)
II. Direct coupling to a multiple output transistor array (DC-TL)
may be used as shown in figure 8. It is commonly used in the more practical
modified forms, such as R-DC-TL (type II-A) shown in figure 9. An impedance
is inserted in each input line to improve operational characteristics.
Although this type of logic is sometimes referred to as resistor coupled-
transistor logic, its operation is not the same as R-TL, described above.
Figure 8 DC-TL Direct Coupled-Transistor Logic (+NOR)
Figure 9 R-DC-TL Resistor-Direct Coupled-Transistor Logic
Type II-B coupling involves current switching and output buffering
to prevent saturation of the input transistors. This type of logic is
sometimes referred to as emitter coupled-transistor logic (EC-TL) or current
mode-transistor logic (CM-TL). One type of non-saturated-direct coupled-
transistor logic (NS-DC-TL), which uses an emitter follower output buffer,
is shown in figure 10.
Figure 10 NS-DC-TL Non-Saturated-Direct Coupled-Transistor Logic
III. Diode coupling uses non-linear input summing to form the
logical AND or OR function. The most common form of D-TL is shown in
figure 11, which performs the positive logic NAND (AND-NOT) function.
Saturation of the output transistor may be prevented by limiting the
minimum saturation voltage, as shown in figure 12. This results in a more
constant "zero" output voltage, and diverts excess base current to improve
transient response.
Figure 11 D-TL Diode-Transistor Logic (+NAND)
Figure 12 NS-D-TL Non-Saturated-Diode-Transistor Logic
Type III-A coupling, shown in figure 13, is a variation referred to
as T-TL which uses transistor coupling to obtain improved response.
Logic operation is equivalent to D-TL when inverse transistor gain (_i)
is low; coupling transistor action removes stored charge during turn-off,
and generally permits the elimination of the output transistor base bias
resistor.
Figure 13 T-TL Transistor-Transistor Logic
C. Comparison of Logic Types
A comparison of the types of circuits described above is shown in
the table below for five types which are commercially available. They are
arranged in the table in increasing order of the number of equivalent
components required for a 3-input universal gate function. A larger number
of components generally increases fabrication complexity and increases
power dissipation. The general characteristics of these logic
configurations are discussed and compared in the paragraphs following the
table.
The isolation and speed-power rankings for the three saturated
logic types were obtained from "The Changing Prospective in Microcircuits",
Electronic Design, February 15, 1963, p. 56. This article describes the
result of a study of different types of logic for single substances
conducted by PSI. The authors observe that no one logic type is superior to
all others for every application, but rather that the characteristics of
each type must be considered according to the particular over-all system
requirements.
The isolation ranking is a qualitative measure of the
input loading, the isolation between inputs, noise immunity, and
variation of input loading with parameter changes, internal failures, and
output loading. Logic types with the highest isolation are ranked first;
those with lower isolation are ranked in increasing order. The
non-saturated logic types are inserted into the original ranking by a
comparison of their general characteristics with those of the three saturated
logic types.
The speed-power ranking is a quantitative measure of the product
of propagation delay and power dissipation of the different logic types
when similar components and techniques are used in fabrication. This
characteristic varies considerably according to the design and technology
used for the construction of actual circuits. Logic types with the lowest
power-speed product are ranked first; those with higher power-speed
products are ranked in increasing order. The non-saturating logic types
are inserted into the ranking order indicated according to available data.
TABLE I COMPARATIVE RANKING OF AVAILABLE LOGIC TYPES
Name       Function for   Type of    Number of    Speed-Power   Isolation
           + Logic        Coupling   Components   Ranking       Ranking
T-TL       NAND           III-A      3            1             4
D-TL       NAND           III        5            3             2
NS-D-TL    NAND           III        6            2             3
R-DC-TL    NOR            II-A       7            5             5
NS-DC-TL   NOR            II-B       9            4             1
D. Description of Logic Types
Resistor-transistor logic (R-TL) is a basic scheme for providing
the NOR function for NPN positive logic. The resistors are used for linear
input summing into the output transistor, which is normally biased off
unless at least one input is present. The bias may be increased to provide
either the inverse majority or the NAND output. The addition of speed-up
capacitors to the input resistors, although significantly improving transient
response, is not sufficient to reduce the power-speed product to that
available with other types of logic. The bilateral interconnection may create
interaction problems between inputs; performance of the device is sensitive
to variations of the input resistors, biasing, and transistor gain. The
difficulty of fabricating an integrated resistor-capacitor combination for
each input further decreases the suitability of this type of logic.
Direct coupled-transistor logic (DC-TL) is a theoretically simple
method of performing the NOR function for NPN positive logic. Inputs are
applied directly to transistor bases; the common collector is the output.
Actual operation, however, is limited by the high sensitivity to parameter
variations, input current "hogging" and low input impedance which limit
fan-in and fan-out, and the low noise margin. These severe limitations
have resulted in the actual use of a modified version (R-DC-TL) which includes
a low impedance resistor-capacitor combination on each input to reduce the
sensitivity to noise, parameter variations, and current "hogging". This
modification increases power dissipation, propagation delay, and fabrication
complexity. Since the fan-out capability of most NPN positive logic NOR
schemes is derived from the output collector resistor, the power
dissipation must be increased to allow fan-out capability regardless of
whether the fan-out is used or not.
The basic DC-TL scheme may be modified to provide non-saturated
_"÷ _^ t_S • _..... _ *+_- res_s_or reduces the _roblems
of input current "hogging", and increases input impedance so that this
type of logic offers high input isolation. Various methods may be used
to provide outputs; both the OR and NOR may be provided conveniently.
Good matching of components and close tolerance on a special reference
voltage supply are required. The clocking function may be obtained by
controlling the negative voltage supply by gating or a sinusoidal voltage.
A two phase clock is required for flip-flop functions more complex than
simple storage. An additional transistor, which shares a common collector
with other input transistors, is required for each input. The voltage
difference between the "1" and "0" level is usually very small, resulting
in reduced DC stability and noise margin. NS-DC-TL offers high speed
operation at the expense of high power dissipation.
Diode-transistor logic (D-TL) is probably the most popular type of
integrated circuit logic, due to its similarity to discrete component
circuitry and the excellent operating characteristics. D-TL circuitry
operates with wide parameter variations to minimize the possibility of
malfunction due to drift failure. Actual failure testing has shown that
redundant D-TL is not sensitive to most catastrophic failures. D-TL is
most commonly available as NPN positive logic NAND integrated circuits.
The newer versions of commercially available D-TL circuits offer about the
lowest power-speed product available for circuits operating at moderate
speeds and with good noise margins. Consideration of integrated circuit
characteristics has significantly reduced the number of individual
isolated components compared to the number of discrete components required
for an equivalent circuit. The entire input diode array, as well as one
level-shifting diode, may be constructed as one multiple-emitter transistor.
Each additional input merely requires an additional emitter connection.
Transistor-transistor logic (T-TL) is a simplified variation of
D-TL employing transistor coupling directly to the base of the output
transistor. The elimination of one coupling diode reduces the noise margin
and voltage swing to about the equivalent of DC-TL. Input isolation is
similar to D-TL, except that inverse gain of the coupling transistor allows
some "hogging" of input current. The inverse gain cannot be reduced without
increasing the offset voltage of the coupling transistor; increased
offset voltage, in turn, decreases DC stability and noise margin. Increased
speed at low power levels is possible because the coupling transistor
removes stored charge from the output transistor to reduce turn-off time.
The output inverter of D-TL may be designed to prevent saturation
to reduce excess drive and stored-charge effects. This may be accomplished
by limiting the minimum "0" output voltage by a base to collector clamp
to prevent saturation of the output transistor, as shown above for
non-saturated diode-transistor logic (NS-D-TL). The increased "0" output
voltage will, however, be more constant with increases in output loading,
if sufficient gain is available. Logic operation is equivalent to D-TL
with increased speed and lower power dissipation under comparable
conditions. Additional gain may be easily obtained for D-TL by
substituting an emitter follower for the final level shifting diode.
The speed-power performance of some of the currently available
logic elements is shown in figure 14. This figure
shows the advertised performance characteristics of different logic types
available from different suppliers.
Figure 14 Speed-Power Performance
(Plotted against average power dissipation, P, mW. Labeled elements include
R-DC-TL NOR (Texas Inst.), T-TL NAND, D-TL NAND (Westinghouse), and
NS-DC-TL NOR/OR (Motorola).)
The wide variation of performance characteristics for
different suppliers of the same logic types is due to several causes:
differences of circuit parameter design, lack of standard test conditions
(temperature, fan-out, voltages, etc.), as well as the rapidly improving
technology in this field. Two recently announced improved versions of
previous elements (Westinghouse D-TL and Fairchild R-DC-TL) are indicated
in the figure. The rapid rate at which improvements have been made in
the field of integrated circuits makes it impractical to make an arbitrary
decision to use only one logic element for all future spaceborne redundant
systems. General characteristics, as well as the specific requirements
of redundant systems, may be used to make recommendations, however,
based on available information. The general characteristics discussed
below may be used as a guide to the choice of circuits, even though
exact requirements may vary.
Since systematic redundancy is most efficient and powerful when
the basic elements are highly reliable, the realization of high system
reliability with minimum weight and power penalties requires circuitry with
high basic reliability. High circuit reliability, especially for extended
periods of time, is usually realized when the circuit configuration is such
that proper operation is not excessively sensitive to parameter variation
or environmental extremes. High speed performance does not appear to be
a particular requirement for most spaceborne systems; low power dissipation
is a much more desirable characteristic. Available power (and total
energy) is often limited on space missions; the additional circuitry
required to reduce the probability of system failure will further emphasize
this problem. The power required by individual circuits must be held to
a minimum to keep total power within available limits. The reliability
performance of most integrated circuits depends on the temperature stress.
The use of low power circuitry is an important factor in reducing the
temperature stress, which, in turn, improves the basic reliability and
performance characteristics of the individual elements.
Although T-TL offers high speed at low power levels, its
sensitivity to parameter variation, noise, and input current "hogging"
has reduced the general suitability of T-TL. This sensitivity appears to
be a major disadvantage because the individual circuits in a redundant
spaceborne system are required to operate reliably despite severe
environmental variations and the occurrence of failures within the system. Since
inverse transistor action can limit the input voltage signal, failures
within the circuit or on the output may affect the inputs. This transfer
of failure effects to inputs would be a serious disadvantage in redundant
systems, where the effect of failures must be minimized.
DC-TL appears to be even more sensitive to parameter variations
and failure effects, except for the various modifications which are used
to reduce this problem. Positive NOR logic appears to be particularly
vulnerable to output failures resulting in failure of input signals. This
occurs because the transistor turn-on current is obtained from inputs; any
input must be able to provide sufficient drive to cause the output to be
"0" for proper operation. Fan-out capability is obtained by providing
each output with the ability to drive several inputs. If actual failures
cause all of the inputs to a circuit to be overloaded, then any other
circuit receiving any of these inputs is also effectively failed.
Additional fan-out capability is usually reflected in increased power
consumption, which, in turn, increases reliability problems.
In contrast, the turn-on current for positive NAND logic is obtained
within each logic element. This drive current is diverted to a low
impedance input whenever any input is "0". Fan-out capability is provided
by the output transistor gain, and may be increased without significantly
increased power requirements. Since drive current is provided by each
circuit, rather than by inputs, failures within a NAND circuit usually
do not affect proper operation of inputs. The back-to-back diode
coupling also offers good isolation characteristics. Actual failure testing
has verified that failure effects in D-TL are usually limited to the
circuit in which the failure occurs.
Limited testing for both the transient effect of
high gamma radiation and the permanent effect of integrated neutron flux
has shown that D-TL integrated circuits are more resistant to radiation
than forms of DC-TL (reference 6). The transient effects of high gamma
radiation appear to be primarily due to the leakage of the collector
isolation diode. DC-TL is more susceptible because the larger number of
common-collector transistors used creates a larger junction area. DC-TL
was seriously affected at
gamma levels of 10^6 to 10^7 R/sec., while Signetics D-TL withstood an
order of magnitude increase. Signetics D-TL also showed more resistance
to integrated neutron flux, but no microcircuits showed damage at ordinarily
expected dosages. At a flux dose of 2.8 x 10^14 neutrons/cm^2 (equivalent
to about 100 years of continuous exposure in the Van Allen belts), Texas
Instruments elements failed; Fairchild elements showed some waveshape
deterioration; Signetics and discrete component D-TL showed no noticeable
effects.
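The equivalence quoted above between an integrated dose and an exposure time implies an average flux. The short sketch below only restates that arithmetic; the function name and the 365.25-day year are assumptions made for illustration, and the result is no more precise than the "about 100 years" figure in the text.

```python
# Illustrative arithmetic only: average neutron flux implied by the statement
# that 2.8 x 10^14 neutrons/cm^2 corresponds to about 100 years of exposure.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length


def implied_average_flux(total_dose_per_cm2, years):
    """Return the average flux (neutrons/cm^2/sec) that accumulates the dose."""
    return total_dose_per_cm2 / (years * SECONDS_PER_YEAR)


flux = implied_average_flux(2.8e14, 100.0)
print(f"implied average flux: {flux:.2e} neutrons/cm^2/sec")
```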
E. Logic Selection
Integrated D-TL circuitry appears to be the most appropriate type
of logic for general use in redundant logic systems for spacecraft missions.
It has been chosen for the general advantages of features described above,
and particularly for its suitability for use in redundant spaceborne
equipment, which requires both high immunity to noise and parameter variation,
as well as reasonably low power dissipation. These characteristics are
generally not available in the various forms of DC-TL. Although T-TL logic
is equivalent to D-TL, currently available elements are too sensitive to
input current "hogging" to be suitable for use in redundant systems.
D-TL is known to have high noise immunity, good input-to-output
isolation, good compatibility with other circuitry and relatively low power
consumption. D-TL is particularly insensitive to drift failures; failure
testing has shown that the effect of most catastrophic failures is not
especially harmful in redundant logic networks. The speed capability of
available integrated D-TL circuits appears to exceed the requirements of
most spaceborne systems. Some of this excess speed capability may be
traded for lower power requirements by reducing the power supply voltages.
Power dissipation could be further reduced by a redesign of present D-TL
circuits to use higher resistance values. High resistance is a
difficult problem in present circuits, since the characteristically low
resistivity of diffused resistors requires a large area for high resistance
values. The use of thin film resistors and capacitors on the silicon block
in which the semiconductors are diffused, as planned by Westinghouse for
the near future, would permit circuit design for significantly lower power
dissipation without the large areas and narrow strip layout required for
totally diffused circuitry. Such single-chip hybrid circuits are not
presently available for general logic use.
It is expected that the positive logic NAND function will be
used, since this permits logic design of functions as the sum of products,
which is convenient for reduction and simplification by familiar methods.
The NAND circuits shown are particularly versatile, since the collector
outputs may be connected together to form AND-OR-NOT logic functions
directly. R-S flip-flops may be formed by interconnected NAND elements;
formation of more complex functions such as a compatible counter element
requires a large number of NAND elements and a two-phase clock. The majority
voter is not a commercially available element, but it is easily constructed
from NAND elements.
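A minimal sketch of the two constructions mentioned above, modeled at the truth-table level only. It shows a conventional two-level NAND-NAND majority function (the sum of products AB + BC + AC) and the AND-OR-NOT function obtained when two NAND collector outputs are tied together. The function names are hypothetical, and this is the conventional voter, not the input-isolated element of figure 15.

```python
from itertools import product


def nand(*inputs):
    """Positive-logic NAND: the output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1


def majority_nand(a, b, c):
    """Conventional three-input majority as a NAND-NAND network (AB + BC + AC)."""
    return nand(nand(a, b), nand(b, c), nand(a, c))


def and_or_not(a, b, c, d):
    """Two NAND outputs tied together: the common output is 1 only when both
    gates are 1, which realizes NOT(AB + CD)."""
    return nand(a, b) & nand(c, d)


# Check both functions against their defining expressions for every input.
for a, b, c in product((0, 1), repeat=3):
    assert majority_nand(a, b, c) == int(a + b + c >= 2)
for a, b, c, d in product((0, 1), repeat=4):
    assert and_or_not(a, b, c, d) == int(not ((a and b) or (c and d)))
```

The sum-of-products form maps directly onto two levels of NAND gates, which is why the familiar reduction methods apply without modification.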
F. Majority Voter Design
Failure testing has shown that particular care must be used for
the design of restoring elements so that failures on one input to the
restorer do not cause failures on other inputs, and that failures in the
restoring elements do not cause failure of a majority of inputs. This
testing has shown that a conventional majority element (whether constructed
as the minimum discrete component circuit, or of interconnected NOR or NAND
elements) may experience failures which either cause immediate failure of
the entire set of restorers, or which would cause the same result if a
single input error occurs (reference 7). If such effects are overlooked, the
system reliability may be seriously degraded. Shown in figure 15 is a three
input majority element using NAND elements which cannot cause an entire
set of restorers to fail due to any single failure.
Figure 15 Majority Element with Input Isolation
(inputs A, B, C; output MAJ (A, B, C))
The NAND implementation shown utilizes common output logic so that
the voter requires only two more gates than conventional majority voters,
and retains a two element input to output propagation delay. NOR
implementation, however, would require a total of eight gates and a four element
input to output propagation delay to obtain input isolation for NPN positive
logic. It is expected that the isolated input majority element shown will
be more reliable in normal operation (all inputs alike) than a more
conventional configuration, since very few single failure modes can cause the
output to disagree with the inputs when all inputs are identical.
If higher orders of redundancy are used, then each input is
provided with isolation gates. Since component redundancy is not used to
protect against single failures, a simple test consisting of monitoring
the logic output while applying all combinations of logic inputs will
completely test the operation of the circuit. A custom-packaged majority
voter would significantly reduce the size and weight of a redundant system
when compared to one using individual packages. The packaging of this
majority voter is of particular importance because it is used repetitively
in a redundant system.
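The exhaustive test described above can be sketched as follows. This is an illustration under stated assumptions: the voter is modeled as a conventional NAND-NAND majority network rather than the circuit of figure 15, failures are modeled as a first-level gate output stuck at a fixed value, and the names `voter`, `stuck`, and `exhaustive_test` are hypothetical.

```python
from itertools import product


def nand(*inputs):
    """Positive-logic NAND: the output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1


def voter(a, b, c, stuck=None):
    """NAND-NAND majority voter. `stuck` is an optional (gate_index, value)
    pair that forces one first-level gate output, modeling a single failure."""
    first_level = [nand(a, b), nand(b, c), nand(a, c)]
    if stuck is not None:
        index, value = stuck
        first_level[index] = value
    return nand(*first_level)


def exhaustive_test(circuit):
    """Apply all input combinations and return those giving a wrong output."""
    return [bits for bits in product((0, 1), repeat=3)
            if circuit(*bits) != int(sum(bits) >= 2)]


# A fault-free voter passes every combination.
assert exhaustive_test(voter) == []

# Every modeled single failure is exposed by at least one input combination,
# because no component redundancy masks it.
for gate in range(3):
    for value in (0, 1):
        failing = exhaustive_test(
            lambda a, b, c, g=gate, v=value: voter(a, b, c, stuck=(g, v)))
        assert failing, "single failure was not detected"
```

With only three inputs the full test is eight input patterns, so applying all combinations is practical for every voter in a system.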
G. Comparison of Suppliers
Integrated, single-chip D-TL NAND elements are available from
Sylvania, Siliconix, Westinghouse, and Signetics, among others. Advertised
power-speed performance and power dissipation at comparable voltages are
shown below in Table II. It is noted that Siliconix offers the best
power-speed performance; the Signetics gate with low power connection offers the
lowest total power dissipation at the same power supply voltages.
TABLE II COMPARISON OF D-TL SUPPLIERS
Supplier                Supply      Power          Power-Speed
                        Voltage     Dissipation    Product @ 25°C
                                    (mW)           (x 10^-9 watt-sec)
Siliconix               +4V          5.0            60
Siliconix               +3V          2.0            38
Signetics               +4V, -2V     6.0           180
Signetics (low power)   +4V          2.8           168
Westinghouse            +_V          3.7           190
Westinghouse            +6V          8.5           255
Sylvania                            15.0           195
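The two observations drawn from Table II can be checked mechanically. The sketch below copies the tabulated values as printed (one Westinghouse supply voltage is illegible in the source and the Sylvania voltage is not listed, so both are recorded as None); the variable names are hypothetical and the values are kept in the units of the table.

```python
# Rows: (supplier, supply voltage, power dissipation, power-speed product),
# copied from Table II in the table's own units.
TABLE_II = [
    ("Siliconix", "+4V", 5.0, 60),
    ("Siliconix", "+3V", 2.0, 38),
    ("Signetics", "+4V,-2V", 6.0, 180),
    ("Signetics (low power)", "+4V", 2.8, 168),
    ("Westinghouse", None, 3.7, 190),   # supply voltage illegible in source
    ("Westinghouse", "+6V", 8.5, 255),
    ("Sylvania", None, 15.0, 195),      # supply voltage not listed
]

# Best (lowest) power-speed product over all entries.
best_product = min(TABLE_II, key=lambda row: row[3])

# Lowest power dissipation among entries operated from the same +4V supply.
at_4v = [row for row in TABLE_II if row[1] == "+4V"]
lowest_power_at_4v = min(at_4v, key=lambda row: row[2])

print("best power-speed product:", best_product[0], best_product[1])
print("lowest power at +4V:", lowest_power_at_4v[0])
```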
Delay and binary counter elements are available from Westinghouse
and Signetics. The current Westinghouse binary element requires
considerably more power than the Signetics because the Westinghouse element (which
dissipates 75 mw.) consists of interconnected NAND functions on the silicon
chip. The Signetics counter requires 16 milliwatts, and uses capacitive
coupling and steering. Westinghouse plans to have a capacitor-steered
binary counter available soon. The Westinghouse direct-coupled elements
would not be as sensitive to input rise and fall time as the capacitor
coupled elements, although either type will count at frequencies in excess
of 1 megacycle.
The use of low power circuitry is considered to be important,
since it will allow greater flexibility in the use of
redundancy when power is limited, and will increase basic reliability by
reducing temperature stress. Due to the high power requirement, the
Sylvania NAND element is not recommended for general use in spaceborne
redundant systems. The Sylvania NAND is most useful as a high speed
element with high fan-out capability. Although Siliconix offers superior
performance characteristics for the NAND function, when compared to
Westinghouse and Signetics, the advantage is primarily that of increased speed,
which is not necessarily required for most space missions. Accurate
reliability data appears to be lacking, due to the limited production of
Siliconix elements. The Siliconix NS-D-TL circuit merits further study
into other potential advantages, such as operation with parameter change,
greater fan-out capability, and compatibility with redundancy testing
techniques.
The operational characteristics of the Signetics and Westinghouse
NAND gates are quite similar; the Signetics gate can operate at somewhat
lower power dissipation when this mode of operation is chosen. Although
reliability data is available for both suppliers, Westinghouse has the more
extensive reliability testing program for their integrated circuits. The
availability of accurate reliability data is an important requirement for
the efficient design of high reliability redundant systems. Westinghouse
operating life tests of early models at 25°C have indicated a failure rate
better than .053% per 10^3 hours per element at 50% confidence; it is
expected that continued improvements and increased sample size will verify
a failure rate of better than .001% per 10^3 hours per element with a high
confidence, as required by the Air Force improved Minuteman program.
Westinghouse is a major supplier of integrated circuits for the Minuteman
program (Texas Instruments is the only other major supplier), which is the
first high volume integrated circuit contract. Circuits supplied by
Westinghouse include drivers, sense amplifiers, several types of switches, and
various general amplifiers, as well as common logic elements. Westinghouse
is currently manufacturing R-DC-TL and T-TL logic elements in addition to
D-TL, and has extensive capability for custom circuits and variations of
current elements. A 50 NAND gate element on a single silicon chip has been
developed for JPL. Combining the functions per package would be a
significant factor in the reduction of the size and weight of redundant equipment
when compared to individual package designs.
Signetics offers a variety of integrated D-TL circuits and integrated
components for laboratory evaluation. They have conducted noise sensitivity
and life tests. The operating life tests of the NAND element at 25°C have
indicated a failure rate as low as .12% per 10^3 hours per element at 50%
confidence. The circuits appear to be compatible with most input-output
equipment, as well as the redundancy testing techniques described in the
next section of this report. Performance testing and evaluation of most
of the Signetics circuits have been performed by the U.S. Naval Air
Development Center. Their tests indicate that Signetics circuits generally meet
advertised specifications and seem quite suitable for building logic systems.
The standard circuit and lead arrangement of the Signetics gate allow a
considerable degree of flexibility in the choice of the particular
characteristics. A change in the connections to the gate alters the speed-power
characteristic. This is expected to be used to reduce power requirements.
In addition, the base bias resistor is brought to a separate lead so that
a negative voltage may be used to improve transient response. Access to
this point is particularly important for the testing procedures described
in the next section of this report.
It appears from currently available catalog information that
Signetics is presently the most suitable single supplier for integrated
circuit elements for the construction of redundant spaceborne equipment.
Signetics offers a relatively complete catalog line of elements required
for digital system design, and generally offers significantly lower power
requirements. The use of a separate connection for the transistor base
return is particularly suited for the application of the testing procedures
to be described later. Independent circuit testing has generally observed
that the Signetics circuits are quite suitable for general use, and are not
particularly sensitive to parameter variation, noise, temperature, or the
effects of radiation.
The choice of Signetics as the most suitable supplier is not
based on a single parameter, but is based on the several characteristics
described above. The more important characteristics applicable to Signetics'
circuits which are expected to be important for redundant systems include:
low power dissipation, single power supply operation, complete line of D-TL
logic modules available, compatibility with testing techniques for redundant
systems, and availability of reliability testing data.
IV. Failure Testing of Redundant Systems
A. Introduction
1. Characteristics of Redundant Systems
The outstanding attribute of a redundant system is that of
providing high reliability for a longer period of time than the
non-redundant counterpart. Typical reliability curves depicting this
relationship for a simple system are shown in figure 16. It is assumed here that both
systems begin operation with all circuits, subsystems, wiring, etc. in a
failure free condition.
Figure 16. Reliability of Conventional vs. Redundant Systems
(reliability plotted against operating time for the redundant system and
the conventional system; the MTBF of the conventional system is marked on
the operating time axis)
The statistical relationship between reliability and operating
time is derived by assuming that failures occur at a constant rate and are
inherently random and independent. After some period of operation without
maintenance, the reliability of a typical multiple-line, majority-voted
redundant system falls off and drops below that of the non-redundant
version. This behavior is normal since the greater number of components
subject to statistical failure eventually causes the majority voters to have
incorrect outputs. The initially flat portion of the redundant system
reliability curve is the characteristic which is exploited to provide high
mission reliability.
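
The crossover described above can be illustrated numerically. The sketch
below is not taken from the report: it assumes an order-three system with
perfect voters, so that the redundant reliability is 3R^2 - 2R^3, where
R = exp(-t/MTBF) is the reliability of one non-redundant replica. The
function names are illustrative only.

```python
import math


def simplex_reliability(t, mtbf=1.0):
    """Reliability of the non-redundant system under a constant failure rate."""
    return math.exp(-t / mtbf)


def tmr_reliability(t, mtbf=1.0):
    """Order-three majority-voted reliability, ASSUMING perfect voters."""
    r = simplex_reliability(t, mtbf)
    # The system works if all three replicas work or exactly two work.
    return 3 * r**2 - 2 * r**3


def crossover_time(mtbf=1.0):
    """Time at which the two curves meet under the assumptions above.

    3R^2 - 2R^3 = R has the non-trivial root R = 1/2, i.e. t = MTBF * ln 2.
    """
    return mtbf * math.log(2)


if __name__ == "__main__":
    for t in (0.1, 0.3, 0.5, 0.7, 1.0, 1.5):
        print(f"t/MTBF={t:4.1f}  simplex={simplex_reliability(t):.4f}  "
              f"redundant={tmr_reliability(t):.4f}")
    print(f"curves cross at t/MTBF = {crossover_time():.3f}")
```

Under these assumptions the redundant curve stays above the conventional one
only for operating times shorter than about 0.69 of the conventional MTBF,
which is the flat region the text refers to.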
Since current spaceborne equipment is unattended after mission
commencement, it is important to assure that the equipment is in perfect
working order "before launch". It may not always be practical to completely
test each part of a redundant system after final assembly and installation
into a space vehicle, and thus the term "before launch" includes diagnostic
testing before final assembly. It will be shown that a redundant system
may be conveniently diagnosed for the presence of failures after final
assembly and installation in a space vehicle. This may be accomplished
during the pre-launch test period when the vehicle is about to begin its
mission. Essentially the technique employed is that of removing the failure
masking effects of redundancy and testing the replicated systems separately.
The function of these tests is first to detect the occurrence
of a failure and second to determine its location. The tests would be
useful in deciding whether the equipment should be finally assembled and
installed into the space vehicle, or whether the equipment is free of
failures and ready for launch. The goal here is to assure that all of the
initial failure protection which has been designed into the system is
available.
In a non-redundant system the best one can do is to test the system
and then hope that no failures occur. The statistical nature of failure
occurrence, however, offers little assurance that a failure will not occur
just after mission commencement. This occurrence often precipitates total
mission failure in a non-redundant system. The redundant counterpart is
obviously better suited to tolerate random failures. Further, a typical
order-three redundant system which has been diagnosed to be free of failures
prior to mission commencement is not vulnerable to single failures and thus
offers a high degree of assurance of mission success.
Further tests would be utilized to isolate and locate the failure.
The goal here is to effect repair and thus return the system to perfect
working order. Since this may consume considerable time and involve special
repair or replacement facilities, a duplicate system, which has been found
free from failure, may be required to expedite scheduled installation into
the space vehicle.
For redundant systems which receive maintenance the purpose of
diagnostic testing is again to detect and locate failures. The goal,
however, is to return the system to perfect working order and thus assure
the highest possible reliability during the entire operational life of the
equipment. In order for periodic maintenance to be effective it follows that
the period between maintenance checks should be sufficiently short so that
the reliability for the maintenance period is high. The probability of
operation then repeatedly traverses the initially flat portion of the
redundant reliability curve.
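
The effect of the maintenance interval can be sketched in the same way. The
model below is an illustration only and rests on assumptions not stated in
the report: an order-three system with perfect voters, a constant failure
rate, and maintenance checks that always restore perfect working order, so
that successive intervals are independent.

```python
import math


def tmr_reliability(t, mtbf=1.0):
    """Order-three majority-voted reliability, ASSUMING perfect voters."""
    r = math.exp(-t / mtbf)
    return 3 * r**2 - 2 * r**3


def maintained_reliability(interval, n_intervals, mtbf=1.0):
    """Probability of surviving n_intervals maintenance periods in a row.

    ASSUMPTION: each maintenance check finds and repairs every failure, so
    every interval starts again on the flat part of the reliability curve.
    """
    return tmr_reliability(interval, mtbf) ** n_intervals


if __name__ == "__main__":
    # Same total operating time (one conventional MTBF), different intervals.
    print("no maintenance :", round(tmr_reliability(1.0), 4))
    print("10 intervals   :", round(maintained_reliability(0.1, 10), 4))
    print("100 intervals  :", round(maintained_reliability(0.01, 100), 4))
```

Shortening the interval keeps each period on the flat portion of the curve,
which is why the overall probability of operation rises as checks become
more frequent.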
The general problem of diagnostic testing is to provide suitable
test facilities and methods which are effective in determining whether a
failure has occurred, and to determine its location. In a redundant system
the implementation of test facilities entails many considerations, ranging
from basic system configuration to the details of circuit design. In a
conventional non-redundant system, test provisions are all too often given
only token consideration. Although the test features provided may be
ineffective or inconvenient, the diagnosis, failure location and repair of the
equipment is often made possible through the ingenuity of an experienced
technician. A redundant system similarly encumbered imposes a much more
difficult task. Thus the need for integrating system configuration and test
facilities in the initial design stages becomes extremely important.
2. Testing of Conventional Systems
The detection of a failure in a redundant system
presents a problem which is alien to the test philosophy of conventional
systems. In a non-redundant system the effect of a failure is rather
dramatic and is usually evidenced by either partial or total system failure,
or obvious changes in operational behavior. This simplifies the problem of
detecting an error, but is small consolation to the user who loses the
service of a system without warning, perhaps at some crucial moment. Total
system failure usually indicates the failure of a major function, such as
a power supply or clock generator. Changes in operational behavior and
partial failures normally provide symptoms which, when analyzed, are valuable
in converging on the failure location. In a redundant system the effect of
a non-critical failure is not evidenced by any change in system behavior.
This means that the effect of a failure does not provide gross symptoms
which may be used to indicate its occurrence or determine its location.
The solution to this unique problem is suggested through several avenues of
approach which represent diagnostic routines and implementation schemes
unique to redundant systems.
Before considering the unique demands which a redundant system
imposes on the required test facilities, it is useful to consider some
approaches which are applicable to digital systems in general. These
general approaches include waveshape monitoring and the application of
various stresses to enhance the chance of detecting present or potential
failures. The combination of general approaches with the specific
approaches to be suggested appears to offer a more inclusive repertoire of
techniques from which to choose.
In a conventional system a failure of some circuit or sub-system
normally provides an indication of its occurrence by the resultant changes
in operational behavior. These are usually designated as catastrophic
failures. Degraded components which are not sufficiently marginal to cause
circuit failure are more difficult to detect because there is no indication
of a change in system behavior. Often, however, a degraded component may
be detected at the circuit test point level by changes in normal wave shape.
At the component level the degradation may be considered as a failure. At
the circuit level this condition represents an impending failure.
Understandably it is important to detect and repair impending failures since
it is very likely that the circuit will soon fail. This is one of the more
important aspects of periodic maintenance of non-redundant systems. Often
the system may be operated normally and the various test points monitored
to detect marginal voltages, wave shapes or rise times. This represents
a very time consuming procedure and is severely limited in effectiveness
by the number of test points which are provided. Many marginal components
are then essentially undetectable.
Another problem often arises when a failure in circuit
operation becomes sporadic. In this case the system may operate normally
for most of the time, making the location of the fault a difficult task.
As so often happens, just as maintenance personnel are in the process of
converging on the fault location, the fault disappears and the system
operates normally. The problem here is that the fault is not present long
enough to allow an adequate diagnosis of the difficulty.
A more powerful approach for locating impending and sporadic
failures involves the application of stress to the system. This will often
precipitate a circuit failure by subjecting components to a condition which
magnifies any degradation. Consider now the two general classes of approaches
for imposing system stress: environmental and electrical. Environmental
stress may be typically sub-divided into temperature, humidity, pressure,
vibration, shock, radiation, etc. The application of one or a combination
of these environmental stresses is seen to present three main problems:
1) the size, complexity and cost of the facilities required, 2) the
difficulty of performing measurements in an alien and often dangerous
environment, and 3) the possibility of subjecting components to unnecessary
stresses and thus causing unwarranted damage or destruction.
Temperature stress is perhaps the most popular approach because of
its utility in causing parameter changes in resistance, capacitance, leakage,
gain, threshold, etc. A second advantage is the small amount of additional
facilities which are required. Often, temperature stress may be
conveniently applied by controlling the system cooling to increase or decrease
operational temperature. Component variations caused by temperature stress
often make circuit operation marginal when such changes are beyond the
normal specified design limits. Thus the degradation of a component which
has become only slightly marginal at normal operating temperature, and is
indicative of impending failure, may be magnified by temperature stress to
precipitate circuit failure. This method is often used, for example, in
testing transistors for leakage current degradation at elevated temperatures.
In a system test the increased leakage current of degraded transistors causes
circuits to become sufficiently marginal to effect circuit failure.
The remaining types of environmental stress are difficult to impose
on a system without test facilities of vast complexity. For this reason
they are not readily amenable to system testing but find greater utility
at the component or sub-system level. A case in point is the development
of highly reliable components, i.e., by carefully controlled production
followed by extensive testing under a variety of environmental and
electrical conditions.
Electrical stress is a more convenient method for detecting
marginal components and impending failures. A convenient method for
stressing an entire system simultaneously is that of marginal voltage testing.
In this approach the system power supply voltages are varied to combinations
of maximum and minimum levels for which the circuits were designed. When
all defective components, modules or sub-systems have been detected and
replaced the system power supplies are returned to their nominal values.
Marginal voltage testing is often combined with simulation routines and
static and dynamic measuring techniques to provide an inclusive test program.
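
As an illustration of what "combinations of maximum and minimum levels"
amounts to in practice, the sketch below enumerates the supply-voltage
corners for a marginal voltage test. The supply names and limits are
hypothetical placeholders chosen for the example; they are not values from
this report.

```python
from itertools import product


def marginal_voltage_corners(supply_limits):
    """Yield every combination of minimum and maximum supply levels.

    supply_limits maps a supply name to its (minimum, maximum) design limits.
    """
    names = list(supply_limits)
    for levels in product(*(supply_limits[name] for name in names)):
        yield dict(zip(names, levels))


if __name__ == "__main__":
    # HYPOTHETICAL supplies and limits, for illustration only.
    limits = {"V_pos": (5.4, 6.6), "V_neg": (-6.6, -5.4)}
    for corner in marginal_voltage_corners(limits):
        print(corner)
```

With n independent supplies there are 2^n corners, so the number of test
conditions grows quickly as supplies are added.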
Simulation programs provide a form of electrical stress which is
seen to exercise the variety of operational functions which a system may be
required to perform under actual operating conditions. Often, however, a
simulation technique may subject the system to operational speeds which are
not encountered in normal system operation. This might be accomplished by
varying the frequency of system clock generators to either increase or
decrease the speed of operation. In a spaceborne sequencer, for example,
it may be necessary to speed up the occurrence of time events by several
orders of magnitude in order to test all functions in some reasonable test
period. In other applications increasing the speed of operations to the
maximum design limit is often useful for magnifying the effect of marginal
components. For example, this technique is seen to be useful in determining
degradation in capacitive coupling circuits.
A reduction in operating speed does not usually subject the system
to stress but is useful in ascertaining that some normally fast sequence
of operations is being performed correctly. Here, the reduction of clock
rate is utilized to allow the operation sequence to be conveniently monitored.
The general approaches discussed are primarily useful in precipitating
static failures which are impending or sporadic. DC failures and
catastrophic failures are usually immediately apparent from the manner in
which the system behaves. When only a portion of the system fails in the
static state it often provides symptoms which may be used in diagnosing the
location of the failure. If a failure occurs near the "front end" of a
system, the majority of outputs will usually become static. In this case
the symptoms are not sufficiently explicit to allow an adequate diagnosis.
Simulation equipment then becomes useful in determining the failure location.
This is accomplished by applying suitable signals at the various subsystem
inputs and monitoring outputs for the presence of the correct response.
3. Failure Detection in Redundant Systems
The problem of detecting a failure in a redundant system is
usually more difficult than in the conventional counterpart, because the
effects of non-critical failures do not provide gross symptoms of their
occurrence. This difficulty in diagnosing a failure is amply compensated
by the vast improvement in reliability which a redundant system provides.
Since a conventional system normally provides little indication
of an impending failure, the only available resort by which the system
quality may be diagnosed is by the application of stress. It is, however, an
inconclusive test of the system's ability to perform reliably. In a
redundant system the application of stress to components and circuits for the
purpose of detecting impending failures is not of significant value because
the effects of individual failures are masked by the system configuration.
Although redundant systems are able to tolerate failures without causing
total system failure, it is often desirable to diagnose the system to detect
any internal failures. It will be shown that the application of conditions
which reduce the ability of a redundant system to withstand internal
failure acts like stress by modifying the configuration so that the failure
masking effects are removed. In this manner, failures which are present
will be indicated by the behavior of the system. The following paragraphs
will describe techniques for detecting and locating failures in redundant
systems.
An order-three, multiple-line, majority-voted redundant shift
register system will be used to demonstrate basic approaches. This is done
for ease of explanation and is not intended to suggest that the approaches
may not be extended directly to more general system configurations, or to
higher-order redundant systems. It may be noted that the testing of
redundant systems will involve a hierarchy of tests: first testing
the signal processing parts, then testing the restoring elements,
and finally testing the hardware added for the initial testing function
itself. The extent and complexity of this hierarchy will depend on the
confidence which is required of the tests and the degree of automation
desired. It appears impossible, however, that perfectly reliable
operation can ever be expected from any hierarchy of imperfect equipment
monitoring other equipment. Although these testing methods are intended
to make a significant contribution to the techniques available for testing
redundant equipment, it is expected that further work in this area will
result in further improvements. The accuracy and complexity of the tests
should be balanced to obtain efficient system operation.
Often, the problem of failure detection is directly connected
with the requirement for determining the location to facilitate
maintenance repairs. Therefore, some of the more complete testing methods will
include combined detection and location. Although failure location
techniques are usually more complex than the basic failure detection techniques,
they often include complete failure detection capability in order to locate
all failures which might exist in a redundant system. Failure location
techniques also provide effective methods to detect and locate failures
in the failure detection and location circuitry itself.
Basic failure detection will probably be most useful as a
verification technique to indicate that at least a major portion of a
redundant system is failure free. This will assure that the failure
protection which has been designed into a redundant system is available to
prevent system failure. Simple failure detection techniques are also
expected to be a preliminary technique which will indicate if any failures are
present in a maintained redundant system, so that further corrective
action may be undertaken. It is important that all failures be detectable
in a maintained redundant system, so that failures are not allowed to
accumulate and degrade system reliability.
4. Failure Location in Redundant Systems
If a failure is known to exist in a redundant system, it is
often desirable to obtain further information concerning the location of
the failure. This is generally required so that the module containing the
failure may be repaired or replaced. Although it is very desirable to be
able to detect any failure to permit maintenance, it is only necessary to
locate failures to within the smallest replaceable module. Therefore, the
requirements of failure detection depend strongly on the contents of the
smallest replaceable module. If entire subsystems are contained in a module,
then each subsystem could be provided with independent failure detection
hardware. This would be sufficient to locate failures within the
replaceable module. It is possible that the requirement for test points at each
replaceable module to permit failure location may in turn determine the
practical size and contents of the module. If the test points and
connections occupy a large space compared to the basic module, then the volume
efficiency is rather poor, and a larger replaceable module might be more
practical.
If repairs are expected to be made while the system remains in
operation, then the module which contains the failure must not include the
remaining replications of that function. This is necessary to permit the
system to operate while the module containing the failure is removed.
If the entire module is to be replaced if it contains a failure, then the
failure location technique must be sufficiently accurate to determine which
module contains the failure. This module may then be replaced without
interruption of normal system operation. Maintained redundant systems
which are continuously monitored and repaired require a combined failure
detection and location technique which may be applied without altering the
operational characteristics of the system. It will be shown that relatively
complete testing may be accomplished during system operation. This is
possible because the most frequent and harmful failures usually cause signal
disagreements at the inputs to the voters. These signals may then be
compared, either automatically or with the use of test points, to detect
and locate these failures. Certain system configurations are amenable to
controls which allow complete failure detection and location with access only
to the signals at the inputs to the voters. More generally applicable
techniques require access both to the voter inputs and outputs. These
techniques, as well as the implementation circuitry required, are described in
the following paragraphs.
5. Signal Comparison in Maintained Systems
The location of a failure in a conventional system requires
that a handbook be provided to indicate the correct wave shape and binary
sequence to be expected at each location. This is in addition to
simulation equipment which may be required to place portions of the system
into dynamic operation. The redundant system masks the effect of
individual failures and thereby makes the task of detecting their occurrence more
difficult. It will be shown, however, that the masking effects of
a redundant configuration may be conveniently removed by controlling the
outputs of the signal processors. This is essentially a gross system
approach whereby the occurrence of a failure is indicated by forcing the
system to assume various vulnerable configurations. If the system is
allowed to either operate normally, or in some configuration for which
all operations are performed correctly, the detection and location of
failures may be conveniently accomplished by examining replicated elements
for signal disagreement.
In many respects, the location of failures in a redundant
system is a much easier task than in the conventional system counterpart.
This is because an improper signal may be determined by comparison with
its replicated versions. If a redundant system is operating correctly
in an overall system sense, then the correct signal of each monitored
element is available at least at a majority of associated test points.
This is seen to eliminate the tedious task of monitoring elaborate wave
shapes and sequences. Maintenance personnel are then presented with a
system which, in principle, contains an integral handbook of normal
signals to be expected at the various locations. The system may be permitted
to operate normally, without simulation equipment, performing operations
whose binary sequence at any single location is so complex that one could
not hope to describe them adequately in any handbook. This suggests the
possibility that maintenance personnel need not be completely familiar
with the detailed operation of the system.
The determination of an error could be provided by a difference
detector in combination with a suitable indicator. A technician would be
required only to monitor the various test points in some prescribed sequence
until arriving at the location of a signal disagreement. He would not be
required to possess any special knowledge of what constitutes a correct or
incorrect wave shape, binary sequence or repetition rate. Also, most
difference detector devices which might be employed will signal any large
departure from normal signals, and may include memory to indicate the location
of transient or sporadic failures. From this we may conclude that the
training requirements for maintenance personnel may be appreciably reduced,
thus providing redundant systems with a distinct maintenance cost advantage
over the more conventional counterpart. This attribute alone might become
a significant factor in evaluating the total utility of a redundant system
which is periodically maintained.
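
The signal comparison procedure can be sketched in a few lines. The code
below is an illustration, not the report's circuitry: it assumes an
order-three system, binary samples taken at the three replicated test
points of each file, and at most one incorrect replica per file, so that
the majority value may serve as the "integral handbook" reference. The
function names are hypothetical.

```python
def majority(a, b, c):
    """Three-input majority function on binary values."""
    return (a & b) | (a & c) | (b & c)


def locate_disagreements(test_points):
    """Return [(file_number, [ranks in disagreement]), ...].

    test_points is a list of (file_number, (a, b, c)) samples, one per file,
    taken in the prescribed monitoring sequence.
    ASSUMPTION: at most one replica per file is wrong, so the majority
    value is the correct reference signal.
    """
    findings = []
    for file_number, signals in test_points:
        reference = majority(*signals)
        bad = [rank for rank, s in zip("ABC", signals) if s != reference]
        if bad:
            findings.append((file_number, bad))
    return findings


if __name__ == "__main__":
    samples = [(1, (1, 1, 1)), (2, (0, 1, 0)), (3, (1, 1, 0))]
    print(locate_disagreements(samples))
```

The technician in this picture needs no knowledge of what the correct
signal should be; the comparison against the replicated versions supplies it.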
In order to reduce the total system failure rate, periodic
maintenance must be conducted at a sufficiently short interval so that
individual failures are not so probable that system reliability is appreciably
degraded. In addition, if system failure occurs it might be necessary to
employ simulation equipment to place portions of the system back into
operation. The advantage of not requiring simulation equipment to locate
individual failures is an important feature of a maintained redundant system.
Thus the function of periodic maintenance is not only to assure high system
reliability during the life of the equipment, but also to eliminate the
requirement for simulation equipment to locate failures.
Thus far in our discussion of maintained redundant systems, it has
been implied that the signal comparison equipment is usually externally
applied to the appropriate test points in much the same manner as an
oscilloscope or voltmeter is used in a conventional system. As indicated
previously, it may be undesirable to provide these test points at every
signal processor and voter output in the system. This may be due to the
lack of access to the signals, the physical size of the test points in
comparison to the circuitry being monitored, or the signal loading caused
by test point leads. In some applications it may therefore be desirable
to provide error detection and display as an integral part of the system.
Integral signal comparators may be desirable, for example, in a maintained
redundant system which is continuously monitored during operation and in
which each failure is repaired as soon as it is detected. This maintenance
philosophy allows a much higher system reliability than is available with
periodic maintenance. With proper design it appears feasible to remove and
replace defective modules without disturbing the operation of the system.
Since signal comparators will indicate only when signal
disagreement occurs during the normal system operation, more extensive tests
are required to detect and locate such failures as might occur in signal
processors which are not used for some modes of system operation, some
of the failures in voters, and failures that might occur in the control and
signal comparison circuitry. This suggests a maintenance philosophy of
continuous monitoring combined with periodic complete testing as follows: Signal
processor outputs are continuously monitored during the operation of the
system for the indication of the more frequent and harmful failures which
cause incorrect signals. These failures are located and may be repaired
without interrupting normal system operation. Periodically, the normal
operation of the system is shut down to allow the system to be completely
exercised and the otherwise undetectable failures to be located and repaired.
In contrast, the periodically maintained system is allowed to accumulate
failures, even though they may be easily detectable, until the end of a
scheduled maintenance period. Continuous monitoring and repairing is
therefore a very powerful technique for detecting and repairing most failures
as they occur, without seriously impairing the ability of the system to
operate continuously while individual failures are repaired.
B. Singular Rank Testing
1. Detection of Signal Processor Failures
An obvious method for detecting failures in a typical redundant
system is to separate and reconnect the replicated parts to create
individual, independent systems. Each system may then be separately diagnosed
for the presence of failures in the conventional manner. This would require
that the basic system be provided with a large number of special switching
circuits which accomplish a separation. Such an approach is somewhat
impractical because of the expense, complexity and reliability degradation
which the additional circuitry and wiring would impose. As will be shown,
a much simpler means is available to provide a pseudo-separation of
replicated systems without requiring an elaborate switching mechanization.
As an example, consider the simple redundant configuration shown
in figure 17. Each of the complete replications of the non-redundant system
is hereafter referred to as a rank of the system. Each rank normally
consists of the components of the non-redundant equivalent system, separated
by the majority-voting restorers. Each of the signal processing elements
(indicated by blocks) within the same rank is designated with the same
capital letter; each of the majority voting restorers (indicated by circles)
within the same rank is designated with the same lower case letter.
Figure 17 Singular Rank Testing
The corresponding replications of the same signal processors are
hereafter referred to as being on the same file of the system. Each element
in the file normally performs the same function, and is designated with the
same number. Each signal processor file corresponds to an individual function
of the non-redundant system. If a signal processor file has a restoring file
associated with it, the restoring file may be assigned the same number.
It will be assumed that the order of redundancy is uniform
throughout the portion of the system which is being tested and that the
only interconnections between ranks occur at the inputs to restorers.
Singular rank testing will assume that there are no restrictions on system
size, configuration, or uniformity of direction of signal flow. These
characteristics are chosen to be compatible with current redundancy synthesis
techniques.
Suppose that the control lines shown in figure 17 provide a
means of causing each output of the rank signal processors to assume
either the "1" state, the "0" state or "N" (normal operation). In effect,
the outputs of the A and B rank blocks may be forced to assume definite
DC failure states. The mechanization to accomplish this is described in
part D of this section, and will be shown to entail only slight modification
to the normal circuitry. Consider the effect of causing all the A and B
rank signal processors to assume static complementary states, allowing
the C rank signal processors to operate normally, and allowing the system
to operate with its normal inputs. Under the condition that
all A and B blocks are in complementary states, the input to each voter
consists of "1", "0" and the output of the preceding C rank signal
processor. This means that the dynamic signal predominates and
appears at the output of the voters. If all voters operate
correctly, the system is equivalent to a non-redundant system, and may be
completely exercised in the same manner as the non-redundant system
to verify that all signal processing blocks in rank C are functioning
correctly. This test should also yield identical results if the
complementary states of the A and B rank blocks are reversed. If an
incorrect final output results for both tests it indicates that at least
one failure is present in the C signal processors, the c voters or
combinations of both. If only one test is successful, then a failure is
evidently present in one or more of the c voters.
Success of either of the above tests is sufficient to verify that
all C rank signal processors are failure free. It should be noted that the
presence of a correct output for both complementary test conditions does
not verify with certainty that the c voters are failure free. This is
because each voter was subjected to less than the maximum possible number of
input signal combinations. Consider the various combinations of input signals
and the correct response of a three input majority voter in the table
below. States 1 and 2 represent the case when A="1", B="0", and C="N"; states
3 and 4 represent the case when the static signals on A and B are reversed.
All signals are the same for states 5 and 6. States 7 and 8 occur when
C disagrees with the other two inputs.
State No.    A    B    C    Output
   1)        1    0    1      1
   2)        1    0    0      0
   3)        0    1    1      1
   4)        0    1    0      0
   5)        0    0    0      0
   6)        1    1    1      1
   7)        1    1    0      1
   8)        0    0    1      0
Only the first four of the eight combinations were verified by the test
conditions described. States 5 and 6 are trivial, however, since they
contain the combinational states of 2, 4 and 1, 3 respectively. If a
majority voter makes a "1" output decision for inputs consisting of two
"1"s and a "0", it will make the same decision for an input of three "1"s.
Similarly, if a majority voter makes a "0" output decision for inputs
consisting of two "0"s and a "1", it will make the same decision for an input
of three "0"s. From this it appears reasonable to assume that if the
majority voter operates correctly for the first four states it will operate
correctly for states 5 and 6. Thus the combinations which have not been
tested and hence explicitly verified are states 7 and 8.
The tests conducted thus far have verified that all C rank blocks
operate correctly and that the voters operate correctly for six of the eight
possible input signal conditions. The A and B ranks may be similarly tested
with the result that the correct operation of all signal processing blocks
may be verified. This test philosophy is seen to be an approach for
isolating each rank of a multiple line configuration and thus determining the
presence of any failures which would jeopardize the ability of the system
to mask out future failures. Each rank is not operated simultaneously and
independently, but rather one rank at a time is effectively removed from
the multiple line configuration and separately diagnosed for the presence
of failures.
The success of all of these tests has verified the proper operation
of all signal processors. These tests have not completely verified the
condition of the voters as was described by the example of the C rank tests.
However, the following voter input-output operation has been verified with
certainty: All voters will make correct decisions if the input from the
rank in which the voter is located agrees with at least one of the other
inputs.
The condition which has not been verified is whether
a voter will make a correct decision when the input from the rank in which
the voter is located is in disagreement with the majority of the remaining
inputs (both remaining inputs for order three redundancy). It should be
noted, however, that the complete set of singular rank tests will result in
the application of all possible combinations of inputs to the voters. These
tests are therefore sufficient to verify that any undetectable voter failures
cannot combine with further single failures to cause an order three system
to fail.
There are, however, a very limited number of component failures which
can occur in the majority voter which cannot be detected with singular rank
testing. These involve the failure of two of the input diodes for the three
input D-TL voter. If the voter has a conventional minimum design, singular
rank testing will indicate if either of these diodes is shorted. Due to
the additional input isolation, the occurrence of these input diode shorts
cannot be detected in the isolated input voter which has been shown in figure
15. If either of these undetectable diode shorts has occurred in the isolated
input voter, the result is that the voter output is a "1" whenever the input
from the rank in which the voter is located is a "1". The majority function
is performed for all other inputs. The occurrence of either one of these
diodes being open cannot be detected for either the minimal design or the
isolated input voters. The result of this condition is that the output
of the isolated input voter is "0" whenever the input from the rank in
which the voter is located is a "0"; if the input to a minimal design voter
is a "1", the voter output is a "1". If one of the diodes shorts and the
other opens, then the voter output is controlled by the input from the rank
in which the voter is located, although the diode short could be detected if
the minimal design voter is used. Therefore the existence of undetectable
failures cannot introduce additional errors, but may cause signal processor
errors to propagate through the restorers.
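The behavior of the isolated input voter under these diode failures can be sketched behaviorally. The model below is a simplified assumption drawn only from the input-output behavior stated in the preceding paragraphs (it is not a circuit simulation, and the fault labels are illustrative). It shows that each faulty voter differs from an ideal majority voter only in states where the voter's own rank disagrees with both other inputs, which are exactly the states not exercised by the basic singular rank test of that rank.

```python
def majority(a, b, c):
    """Ideal three-input majority voter."""
    return 1 if a + b + c >= 2 else 0

def voter_c(a, b, c, fault=None):
    """Behavioral model of a rank C isolated-input voter.

    fault is an assumed label: None, "short", "open", or "both".
    """
    m = majority(a, b, c)
    if fault == "short":   # output "1" whenever the own-rank input is "1"
        return m | c
    if fault == "open":    # output "0" whenever the own-rank input is "0"
        return m & c
    if fault == "both":    # output is controlled by the own-rank input
        return c
    return m

ALL_STATES = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

# Inputs applied to a c voter during the rank C singular rank test.
RANK_C_TEST = [(1, 0, 1), (1, 0, 0), (0, 1, 1), (0, 1, 0)]

def escapes(fault):
    """Input states on which the faulty voter differs from an ideal voter."""
    return [s for s in ALL_STATES if voter_c(*s, fault=fault) != majority(*s)]

for fault in ("short", "open", "both"):
    print(fault, escapes(fault))
```

None of the differing states appears in RANK_C_TEST, and in every differing state one input is already in error, consistent with the statement that these failures cannot introduce additional errors but may let a signal processor error pass through the restorer.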
The above analysis has shown that the occurrence of undetectable
failures tends to cause the output of the voter to be dominated by the
signal from the rank in which it is located. In the worst possible case
(complete dominance caused by one diode open and the other diode short
in every voter in a restoring file when these failures are undetectable),
the restorers have been effectively replaced by conductive paths from the
output of the signal processor in the previous file to the input of each
following signal processor in the same rank. The result is equivalent to
eliminating the restoring file completely (except that the reliability of the
signal processors is reduced by the additional voter circuitry). Although
it is extremely improbable that such conditions would predominate in a
system recently constructed from completely tested parts, the system becomes
more vulnerable to further failures if they are allowed to accumulate.
2. Detection and Location of Voter Failures
It may be desirable to have some means for detecting the
presence of any failures within the system. One such example in which some
method of complete testing is desirable is a maintained system which is
expected to operate reliably for extended periods of time. If such a method
is convenient, signal comparison may be combined with singular rank testing
to detect and locate all voter failures. Since the combined singular rank
tests result in the application of all possible inputs to the voter, the
outputs of all voters in a restoring file may be compared for agreement while
the inputs are applied. All voters are failure free if no output
disagreements occur while all combinations of input signals are applied.
Since the only purpose of reversing the complementary states of the
two ranks not being tested in an order three system was to gain additional
information concerning the voters, voter comparison testing eliminates the
need for interchanging the complementary states associated with each rank
test. This requires, however, that a systematic method be used to assure
that the complete set of tests results in the application of all possible
combinations of inputs to the voters, except the trivial cases when all
inputs are the same. This condition will be met if the following rule is
followed during singular rank testing: As each of the ranks is completely
exercised as an individual non-redundant system, the particular pair of
complementary DC states of the remaining two signal processors is chosen so
that the state of either rank does not duplicate its DC state during any
previous testing of the other ranks. Since the choice of which pair of
complementary DC states for the testing of the first rank is arbitrary,
either of two alternate sequences may be used for the complementary DC
states; these states will be complements of those in the alternate sequence.
Thus it may be shown that only three tests (one for each rank) are required
for complete singular rank testing with signal comparison. If each test is
successful in demonstrating that the system will perform the entire set of
functions for which it was designed, all signal processors are verified to
be failure free and the voters are capable of transmitting a correct dynamic
signal for some of the possible input states. If, in addition, all voters
make the same decision while the proper sequence of controls is applied
during the above tests, the voters are verified to be failure free.
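The selection rule for the complementary DC states can be illustrated by constructing one such sequence and enumerating the voter inputs it applies. This is a sketch under stated assumptions: the rotation used to pick the static states is one convenient way of satisfying the rule, not necessarily the procedure the authors used, and all names are illustrative.

```python
RANKS = ("A", "B", "C")

def singular_rank_sequence(s=1):
    """Three tests, one per rank. In test i the i-th rank runs normally
    ("N"); the next rank (cyclically) is held at s and the one after at
    1 - s, so no rank repeats a DC state it held in an earlier test."""
    tests = []
    for i, normal in enumerate(RANKS):
        states = {normal: "N"}
        states[RANKS[(i + 1) % 3]] = s
        states[RANKS[(i + 2) % 3]] = 1 - s
        tests.append(states)
    return tests

def applied_inputs(tests):
    """All (A, B, C) voter input triples applied over the test sequence."""
    applied = set()
    for states in tests:
        for dynamic in (0, 1):
            applied.add(tuple(dynamic if states[r] == "N" else states[r]
                              for r in RANKS))
    return applied

# Every combination except the trivial all-"0" and all-"1" cases.
NON_TRIVIAL = {(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)}
NON_TRIVIAL -= {(0, 0, 0), (1, 1, 1)}

for s in (1, 0):  # the two alternate sequences are complements of each other
    tests = singular_rank_sequence(s)
    print(tests, applied_inputs(tests) == NON_TRIVIAL)
```

Either value of s yields three tests that together apply all six non-trivial input combinations, in agreement with the claim that three tests suffice when voter outputs are compared.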
3. Detection and Location of Control and Comparator Failures
The basic concepts of singular rank testing may be extended
to verifying that the controls used for singular rank testing are operating
correctly. Rather than allowing each rank to operate individually, each
rank is individually controlled by the singular rank testing controls. If
the controls are working properly, a signal comparison on the output of
each signal processing file should indicate a disagreement whenever the
dynamic signal on the remaining ranks is in disagreement with the DC state
of the rank being controlled. In the case where difference detectors are
used on the output of all signal processor files, this testing will also test
these difference detectors. The detectors should indicate a difference at
each signal processor file whenever the signal on the controlled rank
disagrees with the dynamic signals. If the signal comparison of the signal
processors is accomplished while complementary DC states are applied to
each pair of ranks, as described above, all possible input combinations
involving disagreements are applied, and the difference detectors should
give a continuous indication. If signal disagreements are noted for each
signal processing file while all of the ranks are being controlled (either
individually, in pairs, or for all possible input combinations involving
disagreements, but not when the entire system is allowed to operate without
signal processor failures) then the associated singular rank control
circuitry is verified to be failure free.
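The expected detector indications described above can be tabulated with a short sketch. The helper names are hypothetical, and the difference detector is modeled as ideal (indicating whenever the three signals of a file are not all equal).

```python
def difference(a, b, c):
    """Ideal difference detector: 1 if the three signals are not all equal."""
    return 0 if a == b == c else 1

def expected_indication(controlled, dynamic):
    """Expected detector output for one signal processor file.

    controlled maps a rank name to its forced DC state; every rank not
    listed carries the dynamic signal value `dynamic`.
    """
    signals = [controlled.get(rank, dynamic) for rank in "ABC"]
    return difference(*signals)

# One rank controlled: indication only when the dynamic signal disagrees.
print(expected_indication({"A": 1}, dynamic=1))   # agreement
print(expected_indication({"A": 1}, dynamic=0))   # disagreement

# Two ranks in complementary states: indication for either dynamic value.
print([expected_indication({"A": 1, "B": 0}, d) for d in (0, 1)])
```

A detector or control that fails to produce these indications during the controlled tests would be suspect, which is the basis of the verification described in this subsection.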
4. Summary
It may be concluded that singular rank testing techniques are
a very powerful tool for verifying that a redundant system does not contain
internal failures. This testing would be valuable for use in acceptance
tests which verify that all the reliability designed into a redundant system
is available, or as the failure testing for continuously monitored and
repaired systems with periodic complete verification, or in a system which
is only periodically diagnosed to determine if any repairs are needed. The
basic singular rank testing is a simple and effective method to allow a
redundant system to be tested as if it were a non-redundant system to verify
that all signal processors are operating correctly, and that the restorers
will introduce no additional errors. This is equivalent to verifying that
an order three system is not vulnerable to single failures. Basic singular
rank testing techniques may be combined with signal comparison to detect and
locate failures which may exist in the signal processors, the restorers, the
control equipment, and any signal processor difference detectors.
Failure detection and location are often directly associated
problems; failure location techniques are also effective failure detection
techniques when they are available. It is expected that basic singular
rank testing will be used as an effective and efficient technique for
verifying that a redundant system is nearly failure free for regularly
scheduled maintenance, or for relatively simple acceptance tests. The more
complete detection and location techniques are expected to be used for the
more thorough maintenance checks where any failures would be repaired, or for
complete final tests after assembly. Signal comparison on all signal
processor outputs may be used to continuously monitor and locate most failures
in a continuously maintained system. These tests can be designed as part of
almost any majority voted, multiple line system with a uniform order of
redundancy throughout the portion being tested. No special signal
simulation equipment is required, except the normally required inputs. The
equipment required for the tests is described in more detail in part D of
this section.
C. Interwoven Rank Testing
1. Complete Failure Detection
In some systems it may be desirable to completely diagnose a
redundant system without the use of the signal comparison and failure
location technique described above. In some cases, it is possible to
perform this diagnosis without the requirement for any of the test points
necessary for signal comparison. One such technique, which will be described
in the following paragraphs, is referred to as interwoven rank testing.
It represents an extension of the singular rank testing, since the signal
paths are interwoven between the ranks to form an equivalent non-redundant
system in which the signal is switched from one rank to another at the
restoring files. This is possible only if the system configuration has a
sufficient degree of regularity. The example will assume that the system has
restorers on the output of every signal processing file, and that these files
may be assigned odd and even numbers in such a manner that odd files receive
inputs only from even files, and likewise that even files receive inputs
only from odd files. These restrictions are in addition to the assumptions
on which singular rank testing is based. It will also be shown that the
controls used for failure detection may be used to locate voter failures
without requiring test points or difference detectors on the output of the
voters. Comparison of signal processor outputs is sufficient to continually
monitor signal processors and locate all voter failures.
Shown in figures 18 and 19 are six replications of the previously
discussed configuration, with the exception that the two control lines for
each rank individually determine the state of the odd and even numbered
signal processors. If the two control lines for each rank were connected,
the system would be identical to the one used in describing singular
rank testing. Consider that the control lines and associated signal
processors are placed in the following states: AO="0", AE="1", BO="N", BE="0",
CO="1", CE="N", as shown in figure 18a. If an input signal is applied to
the first file of signal processors, the signal flow will take the path
shown by the arrows. This is because the two remaining signal processors
in each file have been placed in complementary static states. If all signal
Figure 18a (control states: AO=0,1; AE=1,0; BO=N,N; BE=0,1; CO=1,0; CE=N,N)
Figure 18b
Figure 18c
Figure 19a
Figure 19b
Figure 19c
Figure 19 Interwoven Rank Testing
processors and voters in the path operate correctly, the final output of the
Nth processor (NC) will be the correct output signal. Reversing the states
of control lines AO, AE, BE, CO should also provide the same result since
this causes the pairs of signal processors in each file to assume the
opposite complementary condition. The system may be completely exercised
as a non-redundant system for either of the above DC states.
Consider now the various combinations of input signals which the
1c voter was subjected to as a result of the above tests. An examination
of figure 18a reveals that these combinations are as follows:
State No.    A    B    C    Output
   3)        0    1    1      1
   8)        0    0    1      0
   7)        1    1    0      1
   2)        1    0    0      0
Note that the tests have verified that the voter operated correctly for the
two signal states which could not be confirmed by the basic singular rank
tests. This was the uncertain condition that a voter will make a correct
decision when the signal processor preceding it in the same rank is in
disagreement with the other two signal processors. Thus far our tests have
verified the above uncertain condition for all odd numbered c rank voters,
as well as all even numbered b rank voters. A total of four different input
states have been verified for each of these voters. The remaining voters
in these ranks may be similarly verified by the test conditions shown in
figure 18b. The a rank voters are verified by the arrangement shown in
figure 18c and figure 19a. This is seen to be a mirror image extension of
the B-C rank tests.
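The voter input states produced by the figure 18a control settings, and by their reversal, can be derived mechanically. The sketch below rests on two assumptions that are read from the text rather than stated there as a procedure: the voters on the output of a file receive the three processor outputs of that file, and the only voter whose output is observed is the one in the rank left normal in the following file. All names are illustrative.

```python
RANKS = "ABC"

# Control states quoted in the text for figure 18a, per rank:
# (state of odd-numbered files, state of even-numbered files).
FIG_18A = {"A": (0, 1), "B": ("N", 0), "C": (1, "N")}

def reverse(config):
    """Complement every static state; "N" (normal) entries are unchanged."""
    def flip(state):
        return state if state == "N" else 1 - state
    return {rank: (flip(odd), flip(even))
            for rank, (odd, even) in config.items()}

def voter_inputs(config, parity):
    """Input triples (A, B, C) seen by voters on the output of a file.
    parity 0 selects odd-numbered files, parity 1 even-numbered files."""
    triples = set()
    for dynamic in (0, 1):
        triples.add(tuple(dynamic if config[r][parity] == "N"
                          else config[r][parity] for r in RANKS))
    return triples

def observed_rank(config, parity):
    """Rank whose voter output is used: the rank left normal in the
    next file (files of opposite parity)."""
    return next(r for r in RANKS if config[r][1 - parity] == "N")

odd_c = voter_inputs(FIG_18A, 0) | voter_inputs(reverse(FIG_18A), 0)
even_b = voter_inputs(FIG_18A, 1) | voter_inputs(reverse(FIG_18A), 1)

print(observed_rank(FIG_18A, 0), sorted(odd_c))    # odd-numbered c voters
print(observed_rank(FIG_18A, 1), sorted(even_b))   # even-numbered b voters
```

Under these assumptions the odd numbered c voters receive the four states tabulated above (states 2, 3, 7 and 8), and the even numbered b voters receive four states that include the two in which the B input disagrees with both other inputs.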
At this point in the tests, the correct operation of all signal
processors has been verified. The various input signal combinations
to which the voters were subjected are tabulated as follows:
Rank a voters     Rank b voters     Rank c voters
  A  B  C           A  B  C           A  B  C
  0  1  1           0  1  1           0  1  1
  0  0  1           0  0  1           0  0  1
  1  1  0           1  1  0           1  1  0
  1  0  0           1  0  0           1  0  0
                    1  0  1
                    0  1  0
Note that the b rank voters have been verified for six of the eight possible
signal combinations while the a and c ranks were examined for only four.
Since the signal condition of all "1"s or all "0"s was previously shown to
be trivial, it is evident that the b rank voters have been completely tested
for proper operation under all combinations of input signals. The reason
that only the b rank voters have been completely verified and not the a or
c rank voters is that the b rank voters provided a common
signal path in the tests involving the c rank voters and the a rank voters.
The a and c rank voters may be completely verified by the tests shown in
figures 19b and 19c. This is seen to cause the dynamic signal path to be
interwoven between the a and c ranks.
Interwoven rank testing may therefore be used as an all inclusive
procedure for detecting any failures of voters or signal processors without
requiring access to any test points within the system. The system is reduced
to sets of equivalent non-redundant systems by appropriate controls. It is
then completely exercised and tested to determine if all functions are
performed correctly. The success of all tests verifies that all signal
processors and voters are failure free. If any of the tests result in an
incorrect output, then some failure is present in the system. The detection
of a failure gives very little information concerning its location within
the system.
Although interwoven rank testing does not require access to
test points within the system, it is a more elaborate approach which requires
a degree of regularity in the system configuration as well as the
establishment of twelve separate test conditions for an order three system,
instead of the three required for singular rank testing and voter signal
comparison. The system should be completely exercised for each of these tests
to verify that the system is failure free if all tests are successful.
2. Failure Detection and Location for Maintenance
The alternate file controls described above may be used to
detect and locate failures during normal system operation. Signal com-
parators are required only on the output of every signal processing file.
If a difference detector is integrally connected with each
processor file, then the correct operation of the signal processors may be
continuously monitored for maintenance purposes. If only test points are
available, they may be periodically tested for signal disagreement. Any
disagreement on the output of a signal processor will indicate that there
is a failure in that signal processor or the voter which precedes it. This
failure may be repaired during system operation if the other replicated
signal processors and voters in that file continue to operate correctly. If
a module consists of one signal processor and the voter which provides its
input, then repair is accomplished by replacing that module. This procedure
is useful for detecting and locating failures which cause errors, but is
not sufficient for determining the location of some failures within the
voters. If all signal processors are failure free, the voter portion of
the modules may be completely tested by imposing various combinations of
signals at the voter inputs and examining the associated signal processor
outputs for signal disagreement. To locate all possible voter failures, it
is necessary to provide a means of examining signal processor outputs while
subjecting the associated voters to the various combinations of input signals.
This may be accomplished by controlling separately the odd and even files of
the system or sub-system under test, as described in the previous paragraphs
and illustrated in figure 18. For example, suppose that the odd files are
allowed to operate normally and that each one of the three signal processors
in the even files is in turn placed in each of the static DC states. The
outputs of the odd files are monitored for signal disagreement during each
of the successive tests. Any disagreement on the output of an odd file
signal processor will indicate that there is a failure in the voter which
provides the input to that processor. Similarly, the outputs of the even
files are monitored for each of the successive tests. Signal disagreement
should be indicated whenever the control signal disagrees with the correct
signal on the other processors in that file. If this indication does not
occur, then either the control to that file is not effective, or there is a
failure in the difference detector. The above testing is then repeated with
the role of the odd and even files interchanged, each successive test
examining the signal processors for disagreement. With proper design, any
failures in the voters, the difference detectors, or the control hardware
may be repaired while the system is in operation. Removal or disablement
of one replicated voter or processor will not seriously jeopardize system
reliability if the remaining replications of voters and processors continue
to operate correctly.
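The location step described above, in which a disagreeing signal processor output points to the module made up of that processor and the voter feeding it, can be sketched as a comparison against the majority value of the file. The function name and data layout are illustrative assumptions.

```python
from collections import Counter

def suspect_ranks(outputs):
    """Return the ranks whose signal processor output disagrees with the
    majority value in one file.

    outputs maps a rank name to that rank's binary processor output.
    A returned rank identifies the module (signal processor plus the
    voter providing its input) to be replaced.
    """
    majority_value, _ = Counter(outputs.values()).most_common(1)[0]
    return sorted(rank for rank, value in outputs.items()
                  if value != majority_value)

print(suspect_ranks({"A": 1, "B": 0, "C": 1}))   # rank B module is suspect
print(suspect_ranks({"A": 0, "B": 0, "C": 0}))   # no disagreement
```

With three binary outputs the majority value is always unique, so at most one rank is reported per file; this relies on the remaining two replications operating correctly, as the text requires.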
D. Circuit Implementations
1. Control Circuitry
Consider now the mechanization for controlling the output
of several signal processors with a single control line. A typical signal
processor output is shown in figure 20. The circuitry shown is seen to be
in the usual form of D-TL NAND gates. The base return resistor RB may be
connected to the emitter ground return if the associated transistor is
representative of the low leakage silicon devices found in integrated
circuitry. Since this resistor is normally connected to ground by a discrete
connective path, it is a relatively simple matter to provide RB with a
separate external connection.
Figure 20 Signal Processor Output Control
Suppose further that RB is chosen to be equal to or less than RA. If RB
is connected to ground potential, the circuitry will operate normally. If RB
is connected to the +E supply, QO will conduct and saturate regardless of
the signals present on the inputs 1, 2, - - - N. This is seen to be the
condition where the control line potential forces the signal processor
output to assume the "0" state. If the control line is connected to an equal
potential of opposite polarity (-E), transistor QO will be cut off, thus
causing the output to assume the "1" state regardless of the signals present
on inputs 1, 2, - - - N. The method described to implement the required
control function is one of several possible approaches. It is an approach
which represents a simple modification to existing circuitry and requires
only a single control line which is grounded in normal operation.
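The three conditions of the single control line can be summarized in a logic-level sketch. This models only the input-output behavior stated above, not the circuit itself; the string labels for the control line potential are illustrative assumptions.

```python
def nand(*inputs):
    """Logic-level NAND: output "0" only when every input is "1"."""
    return 0 if all(inputs) else 1

def processor_output(inputs, control="GND"):
    """Output of a D-TL NAND signal processor with a control line on
    the base return resistor.

    control: "GND" for normal operation ("N"), "+E" to force the output
    to "0" (output transistor saturated), "-E" to force the output to
    "1" (output transistor cut off).
    """
    if control == "+E":
        return 0
    if control == "-E":
        return 1
    if control != "GND":
        raise ValueError("control must be 'GND', '+E' or '-E'")
    return nand(*inputs)

for control in ("GND", "+E", "-E"):
    print(control, [processor_output((a, b), control)
                    for a in (0, 1) for b in (0, 1)])
```

With the control line grounded the output follows the NAND function; with either test potential the output is a fixed DC state regardless of the inputs, which is what singular rank testing requires of the ranks not under test.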
Another alternative requires control of both the base return line
and the emitter ground line, but does not restrict the value of the base
return resistor, RB, and does not require a negative voltage supply. The
same method described above is used to cause the "0" output, i.e., to
connect the control line to a voltage which is sufficiently positive to cause
the output to saturate. For most circuits, +E will be of sufficient
magnitude for this purpose. To effect a "1" output, the emitter ground line
may be removed, so that the output cannot be a low impedance to ground,
regardless of input signals. This approach may be particularly useful when
it would be undesirable to reduce RB to less than RA, or in circuits where the
base input diode, DB, is replaced by an emitter follower to increase base
current drive. This approach places little restriction on circuit
configuration or values and the test power supplies, but requires two
separate control lines, both of which are grounded in normal operation.
2. Difference Detector Circuit
Shown in figure 21 is a typical discrete component difference
detector which may be utilized in the foregoing tests. The output level
is a logical "0" only if all inputs are identical. Any disagreement of
input signals will cause the first transistor to conduct and thus cause
the second transistor to assume the "1" state (cut off). The circuit is
seen to perform the functional operation of "exclusive OR" for two inputs.
Figure 21 Difference Detector
The output of the difference detector may be used to trigger a
flip-flop in order that any momentary disagreement of input signals may be
displayed. This would be useful in detecting any sporadic errors which might
otherwise remain unnoticed. As previously mentioned, the difference
detectors might be combined with the signal processors [illegible] and made an
integral part of the system circuitry. This would eliminate any loading
effects due to the use of test leads and external test equipment in
monitoring test points. In addition this would provide maintenance personnel
with a simultaneous display of the condition of the system and the location of
faulty modules.
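The combination of a difference detector with a flip-flop that holds any momentary disagreement can be sketched as follows. The class and method names are hypothetical, and the detector is modeled at the logic level only.

```python
class LatchedDifferenceDetector:
    """Difference detector whose output sets a flip-flop, so that a
    momentary disagreement remains displayed until it is reset."""

    def __init__(self):
        self.latched = 0   # flip-flop state: 1 once any disagreement is seen

    def sample(self, *signals):
        """Return the instantaneous detector output ("0" only if all
        inputs are identical) and set the flip-flop on a disagreement."""
        diff = 0 if len(set(signals)) == 1 else 1
        self.latched |= diff
        return diff

    def reset(self):
        """Clear the flip-flop, for example after the module is replaced."""
        self.latched = 0

detector = LatchedDifferenceDetector()
for triple in [(1, 1, 1), (1, 0, 1), (0, 0, 0)]:
    print(triple, detector.sample(*triple), detector.latched)
```

The second sample is a single sporadic disagreement; the instantaneous output returns to "0" on the third sample, but the latched indication remains set for display.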
V. Summary and Conclusions
1. General
It has been shown that the special features of a redundant
configuration impose unique requirements on the design of functional
circuitry and the facilities required for test. Redundancy is a powerful
tool for achieving extended reliability, but it should not be encumbered
with circuitry which is inherently unreliable or contains particular failure
modes which prevent the associated system configuration from operating
independently. An appreciation of this philosophy allows the achievement
of reliability goals with a minimum of additional complexity. Effective
circuit design is required to obtain the desired balance between complexity
and reliability in redundant systems.
2. Magnetic Logic
Although magnetic logic is often cited as having several
features particularly applicable to spaceborne computers, the
disadvantages of magnetic logic strictly limit its usefulness in general logic
systems, and particularly for redundant spaceborne systems. Some basic
disadvantages are listed below:
1) Lack of compatible steady output signals.
2) Excessive power consumption for speeds
comparable to low-power microcircuitry.
3) Extensive peripheral equipment, including
high current drivers.
4) Limited fan-out and gain characteristics.
5) High peak power requirements.
6) Indeterminate reliability performance due to
extensive hand wiring with fine wire and numerous
connections, as well as unavailability of accurate
reliability data.
7) Complexity required for general logic functions.
8) Lack of suitable restoring element for use in
redundant systems.
Magnetic logic does, however, offer non-volatile storage and
reduced average power for low computing speeds. Magnetic devices appear
to be suited to special applications where certain logic functions, such
as transfer and OR, are intermixed with the memory function, and very low
speed capability is acceptable.
3. Integrated Semiconductor Logic
Integrated semiconductor circuitry offers many
characteristics which are desirable for circuits to be used in redundant
spaceborne systems. Some general features of integrated semiconductor logic
when compared to other commonly available logic systems are:
1. Significantly reduced size, weight, and power consumption.
2. Availability of general logic elements, as well as
special purpose circuits.
3. Predictable operating characteristics over wide
environmental variations.
4. Availability of accurate reliability data.
5. Extensive research and development for new integrated
circuits.
6. High frequency capability.
7. Compatibility with synthesis and testing techniques
for redundant systems.
A comparison of the currently available integrated logic elements
indicates that diode-transistor logic (D-TL) is the most suitable for use
in redundant spaceborne systems. D-TL offers excellent operating
characteristics, such as easily distinguished "1" and "0" states resulting in
high DC stability and compatible output signals, high noise immunity,
self contained drive current, allowable parameter tolerances, input
isolation, and other characteristics which permit efficient redundant design.
D-TL frequency capability exceeds the requirements of most spaceborne
systems, and requires relatively low power, so that total power dissipation
and temperature stress are minimized.
A majority voting restorer, designed using interconnected NAND
elements, has been described which is not subject to the detrimental
failures of conventional majority voters. Signetics is chosen as the
most suitable supplier for commercially available D-TL integrated
semiconductor logic elements. Characteristics of the Signetics circuits
include: low power dissipation, single power supply operation, complete
general logic line, compatibility with testing techniques for redundant
systems, and availability of reliability data.
4. Failure Testing
It is a characteristic of redundant systems that they offer a
high reliability for a period of time after the initially failure free
condition, and that the system reliability decreases rapidly when internal
failures are present. It is therefore important to insure that no initial
failures exist in a redundant system to obtain maximum system reliability.
This reliability may be required for a single time interval without further
maintenance, such as for spaceborne systems, or it may be required for
repeated time intervals, where the system is restored to the initially
perfect condition prior to each interval. The latter method may be used
to obtain high mission reliability by maintaining a redundant system
which is used repetitively, such as the ground support and launch
equipment used prior to and during each mission. Since an initially failure
free order three system can withstand any single failure, as well as a
relatively large number of randomly scattered failures, it offers high
reliability for the period of time when the probability of individual
failures is low. Techniques are described which permit even higher
reliability by combining periodic maintenance with continuous maintenance of
a redundant system.
It has been shown that a relatively simple test referred to as
singular rank testing may be used to determine that all of the replicated
signal processors are working properly. If the signal processor fails
whenever any of its parts fail, success of the singular rank tests will
verify that all signal processors are failure free. Success of singular
rank testing will also verify that the majority voters are sufficiently
failure free to insure that the system is not vulnerable to single failures.
Singular rank testing effectively isolates each rank of the replicated
non-redundant system by forcing each remaining pair of replicated ranks to
have static complementary binary outputs. System output is monitored to
determine if each individual rank is able to perform all system functions
correctly, in a manner similar to the verification of a non-redundant
system. Singular rank testing is expected to be the most efficient and
effective method for diagnosing equipment which has been recently assembled
from completely tested modules, since the probability that the few
undetectable failures might have occurred since complete testing is very low.
A somewhat more complicated testing procedure, referred to as
interwoven rank testing, has been described which will completely test all voters
to insure that they will make correct decisions for all possible input
combinations. It has been shown that the failure detection procedures may
be accomplished by controlling one or more normally grounded common lines
for each of the replicated ranks of the system, without altering the logic
design nor including any additional hardware except to provide access to
these lines. Singular rank testing places no restrictions on system size
or configuration.
The characteristics of redundant systems have been shown to introduce
unique properties to the problem of failure location and faulty module
replacement. Although a redundant system is more complex than its conventional
counterpart, failure location within an operating system does not
require the operator skill and simulation equipment usually required to
locate failures in a non-redundant system. Since an operating redundant
system always has at least one correct signal available at every point in
the system, these correct signals may be used as a basis of comparison to
other versions of the nominally identical signal. A difference detector
on the signal processor outputs to restorers may be used to indicate
failures among these signal processors. If the detector includes memory,
it will also detect and locate transient or sporadic failures. These same
difference detectors may be used for the somewhat more difficult task of
locating those failures in the voters which do not cause errors when all
voter inputs are identical, as well as verification that the test controls
are actually capable of proper operation. The method which has been
described uses the same types of control as singular and interwoven rank
testing, and does not jeopardize system operation if all signal processors
are operating correctly.
BIBLIOGRAPHY
1. J. L. Haynes, "Logic Circuits Using Square-Loop Magnetic Devices:
A Survey," IRE Trans. on Elec. Computers, Vol. EC-10, No. 2 (June 1961).
2. H. D. Crane, "A High Speed Logic System Using Magnetic Elements and
Connecting Wire Only," Proc. IRE, Vol. 47, pp. 63-73 (Jan. 1959).
3. D. R. Bennion and H. D. Crane, "Design and Analysis of MAD Transfer
Circuitry," Proc. 1959 Western Joint Computer Conf., San Francisco,
Calif., pp. 21-36 (March 1959).
4. J. A. Rajchman, "The Transfluxor," Proc. IRE, Vol. 44, pp. 321-332
(March 1956).
5. H. D. Crane, "Design of an All-Magnetic Computing System," IRE Trans.
on Elec. Computers, Vol. EC-10, No. 2 (June 1961).
6. Aviation Week and Space Technology, Aug. 19, 1963, pp. 93-103.
7. A. R. Hellmud and W. C. Mann, "Failure Effects in Redundant Systems,"
Westinghouse Report EE-3351 (March 1963).
8. Report No. NADC-EL-6319, Micro-Notes No. 3, "Information on Micro
Electronics for Navy Avionics Equipment" (June 1963).
Appendix 2
RELIABILITY OF IMPERFECT REDUNDANT SYSTEMS
Transor Decision Functions
And
Statistical Measurement of Quality
Contract Nasw-572
Reference WGD-38521
by
I_S. Bray
P.A. Jensen
C.G. Masters
September 1963
Westinghouse Electric Corporation
Electronics Division
Box 1897, Baltimore 3, Maryland
APPROVED:
S. A. Lomax, Director
Advanced Development Engrg.
TABLE OF CONTENTS
I. INTRODUCTION
II. MISSION RELIABILITY
III. PROCEDURES FOR ESTIMATING THE SYSTEM RELIABILITY
A. Estimation of the Expected Value of Mission Reliability with only
the Information that the System is Operating at $t_1$
B. Estimation of the Expected Value of Mission Reliability with
Tests at $t_1$ Helping to Establish the Circuit Failure Rates
C. Improvement of the Estimate Through Failure State Tests
D. Determining the Mission Reliability of Large Systems
E. Using Tests to Determine Both the Failure States of the System
and Failure Rates of the Circuits at $t_1$
IV. TEST OF THE HYPOTHESIS THAT MISSION RELIABILITY IS GREATER
THAN A REQUIRED VALUE
V. CONCLUSIONS AND RECOMMENDATIONS
I. INTRODUCTION
The problem of the pre-launch testing of spaceborne electronic systems is becoming
more severe as the systems increase in complexity while decreasing in physical size. The
testing problem will soon become much worse as systems are made redundant and in-flight
tests are used to determine the successive actions of deep space probes. Tests can no
longer be made adequately on the basis of a strict "working" or "failed" criterion because a
redundant system may contain internal failures and still be working at the time of
test. Such a system might easily have a much lower probability of successfully completing
a mission than a functionally identical non-redundant system.
In addition, the large number of subsystems in a complex redundant network will make
complete check-out (i.e. tests of each subsystem) virtually impossible. Consequently, a
new method must be devised which will permit a statistical estimate to be made of the
probability of mission success (reliability). This estimate must be based on the results of a
limited amount of testing and should be as accurate as possible.
II. MISSION RELIABILITY
The problem may be stated more specifically as follows. A test of a redundant machine
will be made at some time $t_1$. (It is expected that some failures will be found in the equipment,
and the object of the test is merely to determine the number and pattern of the failures in the
system.) From the test data, the probability that the redundant system under test will
operate successfully throughout a mission which begins at time $t_1$ and ends at time $t_2$, given
that the system is operating at $t_1$, is estimated. This probability is defined as the mission
reliability (R) and is a function of the system organization, the state of the system at $t_1$, the
failure rates of the parts of the system, the starting time ($t_1$) of the mission, and the mission's
duration, $t_2 - t_1$. At some time $t_0$, which is earlier than both $t_1$ and $t_2$, all circuits in the
system are assumed perfect. As time progresses they are assumed to fail in a random manner with a
constant failure rate. At $t_1$, when the system is ready to begin the mission, the system must
be in one of a finite number of possible failure states. The failure states are determined by
the number and location of failed circuits in the system. For example, consider the
multiple-line redundant network of figure Q-1. A restoring circuit, indicated by a circle, will make a
correct decision if at least two of its inputs are correct.
Figure Q-1. A Two Stage Example of a Redundant System (stages labeled STAGE A and STAGE B)
Assume, for simplicity of explanation, that the restoring circuits of this system are
perfectly reliable and that only signal processing circuits, indicated by rectangles, can fail.
The possible failure states of this system are listed in columns 2 and 3 of Table 1.
TABLE 1

(1)      (2)          (3)          (4)                                  (5)
Failure  Failures in  Failures in  $R_i(t_2)$ *, **                     $P_i(t_1)$ ***
State    Stage A      Stage B
1        0            0            $[p_m^3 + 3p_m^2(1-p_m)]^2$          $[p^3][p^3]$
2        0            1            $[p_m^3 + 3p_m^2(1-p_m)]\,p_m^2$     $[p^3][3p^2(1-p)]$
3        0            2            0                                    $[p^3][3p(1-p)^2]$
4        0            3            0                                    $[p^3][(1-p)^3]$
5        1            0            $p_m^2\,[p_m^3 + 3p_m^2(1-p_m)]$     $[3p^2(1-p)][p^3]$
6        1            1            $p_m^4$                              $[3p^2(1-p)][3p^2(1-p)]$
7        1            2            0                                    $[3p^2(1-p)][3p(1-p)^2]$
8        1            3            0                                    $[3p^2(1-p)][(1-p)^3]$
9        2            0            0                                    $[3p(1-p)^2][p^3]$
10       2            1            0                                    $[3p(1-p)^2][3p^2(1-p)]$
11       2            2            0                                    $[3p(1-p)^2][3p(1-p)^2]$
12       2            3            0                                    $[3p(1-p)^2][(1-p)^3]$
13       3            0            0                                    $[(1-p)^3][p^3]$
14       3            1            0                                    $[(1-p)^3][3p^2(1-p)]$
15       3            2            0                                    $[(1-p)^3][3p(1-p)^2]$
16       3            3            0                                    $[(1-p)^3][(1-p)^3]$

* $R_i(t_2)$ is the probability of correct system operation at time $t_2$ given the ith failure
state exists at $t_1$.
** All the $p_m$'s in this column are probabilities that a circuit is successful at $t_2$, given
it was successful at $t_1$.
*** All the $p$'s in this column are probabilities that a circuit is successful at $t_1$, given
it was successful at $t_0$.
For each of the failure states of Table 1, the reliability of the system can be calculated
at $t_2$. This is done as follows: If the failure rate, $\lambda$, of a circuit is constant and known, the
probability that a circuit is successful at $t_2$, given it is successful at $t_1$, is the exponential

$$p_m = e^{-\lambda (t_2 - t_1)} \tag{1}$$

For the system to be successful at the end of the mission, two or three circuits in each
stage must be successful. The probability that the system meets this requirement depends
on the failure state of the system at $t_1$ and the value of $p_m$. For instance, for failure states
3, 4, 7, 8 and 9-16, the probability of correct system operation must be zero because there
are too many failures at $t_1$. Because $R_i$ is defined as this probability, given the system is
in the ith state at $t_1$:

$$R_i = 0 \quad \text{for } i = 3, 4, 7, 8, 9\text{-}16$$

For failure state 1, the reliability is the probability that two or three circuits in each
stage are successful at $t_2$. Thus:

$$R_1 = \left[ p_m^3 + 3 p_m^2 (1 - p_m) \right]^2$$

The reliability of the system for other failure states is shown in column 4 of Table 1.
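The calculation of $p_m$ and of the per-state reliabilities $R_i$ can be sketched in a few lines of Python. This is only an illustration under the assumptions already stated in the text (perfect restoring circuits, three circuits per stage, two stages); the function names and the numerical failure rate and mission times are hypothetical and do not come from the report.

```python
import math


def survival_probability(failure_rate, t1, t2):
    """Equation (1): chance that a circuit good at t1 is still good at t2."""
    return math.exp(-failure_rate * (t2 - t1))


def stage_reliability(failed_at_t1, p_m):
    """Chance that a three-circuit stage still has two good circuits at t2."""
    if failed_at_t1 == 0:
        # all three circuits survive, or exactly two of the three survive
        return p_m**3 + 3 * p_m**2 * (1 - p_m)
    if failed_at_t1 == 1:
        # both remaining circuits must survive the mission
        return p_m**2
    return 0.0  # two or three failures at t1: the stage is already failed


def state_reliability(failed_a, failed_b, p_m):
    """R_i for the failure state (failures in Stage A, failures in Stage B)."""
    return stage_reliability(failed_a, p_m) * stage_reliability(failed_b, p_m)


# Hypothetical numbers: 1e-3 failures per hour, mission from hour 100 to hour 200.
p_m = survival_probability(1e-3, 100.0, 200.0)
r_state_1 = state_reliability(0, 0, p_m)  # no failures at t1
r_state_6 = state_reliability(1, 1, p_m)  # one failure in each stage
r_state_3 = state_reliability(0, 2, p_m)  # two failures in Stage B
```

Because the stages are independent, the system value is simply the product of the two stage values, which is why states with two or more failures in either stage drop to zero.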
Column 5 of Table 1 lists the probabilities that the particular failure states will be
present at $t_1$. The factor $p$ in this column is the probability of success of a circuit at $t_1$
given the circuit was successful at $t_0$. These probabilities will find use in later discussions.
Two things must be known if the mission reliability of the system is to be determined
with 100% confidence: the failure state of the system and the failure rates of the circuits
(needed to calculate $p_m$). For large systems both these factors may be very difficult or
impossible to determine exactly. To find the failure state of a system, the failure state of
each stage must be known. This may require a considerable amount of testing, probably a
test of all circuits in the system. The failure rates of the circuits can only be determined
exactly with a test of an infinite number of circuits all operating under the same environments
as the circuits in the system. Of course, with limited testing allowed at $t_1$ it is improbable
that the exact failure state of the system can be found. Estimates and their accuracy are the
subject of the remainder of this report.
III. PROCEDURES FOR ESTIMATING THE SYSTEM RELIABILITY
In the study of this problem, several ways have been proposed to estimate a system's
mission reliability with varying degrees of accuracy and varying levels of confidence. Four
of these are described below.
A. ESTIMATION OF THE EXPECTED VALUE OF MISSION RELIABILITY WITH ONLY
THE INFORMATION THAT THE SYSTEM IS OPERATING AT $t_1$
Using the design failure rates* one can estimate the mission reliability with only the
information that the system is operating successfully at $t_1$. This is done using the equations
representing the reliability of the system at time $t$ given only that all circuits are operating
successfully at time 0. The system reliability $R(t)$ can be written as the probability of
successful operation from time 0 to time $t$. The reliability of the system of figure Q-1 is:

$$R(t) = \left\{ [p(t)]^3 + 3 [p(t)]^2 [1 - p(t)] \right\}^2 \tag{2}$$

where $p(t) = e^{-\lambda t}$.
A plot of $R(t)$ for the redundant system of figure Q-1 is shown in figure Q-2a.
* The design failure rates are those assigned to the circuits during the design of the system.
They are generally derived from controlled life testing of components similar to those
used in the circuits or from field tests of similar components.
Figure Q-2. Reliability vs Time For a Redundant System.
A) With No Test at $t_1$.
B) With a Test Determining the Success of the System at $t_1$.
(Both panels plot reliability, from 0 to 1.00, against time in hundreds of hours, from 1 to 10.)
If one tests the system at a time $t_1$ and finds it to be working successfully, this
information can be used to adjust the system reliability for time greater than $t_1$ to take account of
the condition of success at $t_1$. A curve must now be determined which gives the reliability
of the system given successful operation at $t_1$. This is expressed as $R[t \mid R(t_1)]$.
For $t < t_1$, the reliability must be unity, because it is assumed that once a system fails
it stays failed. Then:

$$R[t \mid R(t_1)] = 1, \qquad t < t_1 \tag{3}$$

For $t > t_1$, the reliability is:

$$R[t \mid R(t_1)] = \frac{R(t)}{R(t_1)}, \qquad t > t_1 \tag{4}$$

This is derived from the definition of conditional probabilities.

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$$

A plot of equations (3) and (4) is shown in figure Q-2b for a particular $t_1$ and the system
shown in figure Q-1.
Using equation (4) the mission reliability can be written:

$$R(t_2, t_1) = \frac{R(t_2)}{R(t_1)} \tag{5}$$

Thus, the mission reliability can be determined simply by using the reliability equations of
the system and the design failure rates of the circuits of the system.
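Equations (2) and (5) are simple enough to evaluate directly. The following Python sketch does so for a chain of identical order-three stages with perfect restoring circuits; the function names, the `stages` parameter, and the example failure rate (chosen so that a circuit has a 0.9 chance of surviving 100 hours) are illustrative assumptions rather than values given in the report.

```python
import math


def circuit_reliability(failure_rate, t):
    """p(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate * t)


def system_reliability(failure_rate, t, stages=2):
    """Equation (2): every stage needs at least two of its three circuits good."""
    p = circuit_reliability(failure_rate, t)
    stage = p**3 + 3 * p**2 * (1 - p)
    return stage**stages


def mission_reliability(failure_rate, t1, t2, stages=2):
    """Equation (5): reliability at t2 given only that the system works at t1."""
    return (system_reliability(failure_rate, t2, stages)
            / system_reliability(failure_rate, t1, stages))


# Hypothetical design failure rate: p(100 h) = 0.9 for each circuit.
design_rate = -math.log(0.9) / 100.0
r_at_t1 = system_reliability(design_rate, 100.0)
r_mission = mission_reliability(design_rate, 100.0, 200.0)
```

With these assumed numbers the result is about 0.868, which is close to the second entry of Table 2.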
The question now arises, of what value is this result? First, assuming the failure
rates used in the calculation of R are perfect, if a large number of systems were constructed
and run until $t_1$, approximately $R(t_1) \times 100\%$ of them would be working. Throwing away all
systems that were failed at $t_1$ and continuing the test until $t_2$, $R(t_2, t_1) \times 100\%$ of the
population of all systems working at $t_1$ will be working at $t_2$.
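The population argument above can be checked with a small Monte Carlo experiment. The sketch below is not part of the original study: it assumes the two-stage system of figure Q-1 with perfect restoring circuits, draws exponential circuit lifetimes from Python's standard `random` module, and uses a hypothetical failure rate and population size.

```python
import math
import random


def simulate_population(failure_rate, t1, t2, n_systems=20000, seed=0):
    """Return (fraction working at t1, fraction of those still working at t2)."""
    rng = random.Random(seed)
    working_t1 = 0
    working_t2 = 0
    for _ in range(n_systems):
        ok_t1 = True
        ok_t2 = True
        for _stage in range(2):
            # one exponential lifetime per circuit, three circuits per stage
            lifetimes = [rng.expovariate(failure_rate) for _ in range(3)]
            if sum(life > t1 for life in lifetimes) < 2:
                ok_t1 = False
            if sum(life > t2 for life in lifetimes) < 2:
                ok_t2 = False
        if ok_t1:
            working_t1 += 1
            if ok_t2:
                working_t2 += 1
    return working_t1 / n_systems, working_t2 / working_t1


# Hypothetical failure rate giving p(100 h) = 0.9 per circuit.
rate = -math.log(0.9) / 100.0
frac_t1, frac_t2_given_t1 = simulate_population(rate, 100.0, 200.0)
```

Under these assumptions the two fractions should land near $R(t_1)$ and $R(t_2, t_1)$ from equations (2) and (5), up to sampling noise.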
No information was given for this estimate about the failure state of the system at $t_1$,
except that the system was in one of the failure states for which the system is successful.
For the example, these are states 1, 2, 5 and 6. This limited information about the failure
state makes it necessary to approximate the mission reliability by an expected value given
that the system is in one of the four successful failure states. The approximation has a
considerable effect on the accuracy of the estimate, which is described in detail in Section IIIC
of this report.
B. ESTIMATION OF THE EXPECTED VALUE OF MISSION RELIABILITY WITH TESTS
AT $t_1$ HELPING TO ESTABLISH THE CIRCUIT FAILURE RATES
Another problem which threatens the validity of the R calculated by this method is the
uncertainty of the failure rates of the components of the system. The failure rates used in
design are derived from a variety of sources and are almost surely not exactly accurate for
any operational system. A realistic way to use design failure rates is to assign confidence
limits to their values. With these one can say with a certain confidence that the failure
rates of his parts are within a region determined by his confidence limits. This data is often
available with design failure rates. Using the two extremes of failure rates, upper and lower
confidence limits can be calculated for the mission reliability. The statement can then be
made with a certain confidence that the mission reliability is within the interval of its
confidence limits. It is instructive to point out that if the failure rates of all parts are perfectly
known, there is 100% confidence in the calculated value of mission reliability. If, however,
the failure rates are uncertain, as is always the case, confidence limits should be indicated
for the mission reliability which reflect the uncertainty of the failure rates.
Estimation of the mission reliability of the system using the failure rates used in design
has one serious failing. These failure rates often do not accurately describe the actual
components. The design failure rates may have been determined under different environmental
conditions than those of the system in use, or components in the system may have been subjected
to different manufacturing conditions than those used to derive the design failure rates.
These and other factors might cause the circuits in the system to have different failure
rates than those predicted in original design. Tests performed at $t_1$ can be used to determine
if the actual failure rates are indeed different from design failure rates. If they are
different the tests will be used to estimate the actual failure rate.
The first task is to test the null hypothesis that the actual average failure rates are
the same as those used in design. To do this, the system must be split into groups of
circuits with each group comprised of circuits of identical design. Using the design failure
rates, the number of failures that can be expected in each group at $t_1$ is calculated.
Of the $n$ circuits in a group, $p_j n$ are expected to be successful at $t_1$, where
$p_j = e^{-\lambda_j t_1}$ *, so the expected number of failures is $(1 - p_j)\,n$.
About this expected value one can construct an interval specifying the number of failures he
is willing to observe at $t_1$ and still accept the hypothesis that the actual failure rate is that
used in design.
The next step in the procedure is to test the circuits. If possible, all circuits are
tested ** and the numbers of failures recorded. If the number of failures at $t_1$ in $n$ samples
is within this interval the design failure rate is used to calculate the mission reliability. If
the number of failures is not within the interval a new failure rate is calculated using the
observed data at $t_1$. The mean of this new failure rate is $\lambda_o$ and is determined from the
equation

$$\lambda_o = -\frac{\ln (x/n)}{t_1}$$

where $x$ is the number of the $n$ tested circuits found successful at $t_1$.
Confidence limits are placed on this calculated rate and the extremes of the confidence
interval are used to calculate confidence limits on the estimates of the mission reliability of
the system.
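The accept-or-re-estimate procedure can be sketched as follows. The report does not say how the acceptance interval is constructed, so the normal approximation to the binomial distribution with a width of `z` standard deviations is purely an assumption of this sketch, as are the function names and the example numbers. The re-estimated rate takes $x$ to be the number of circuits found successful, which is the reading under which $-\ln(x/n)/t_1$ is a failure rate.

```python
import math


def acceptance_interval(design_rate, t1, n, z=2.0):
    """Range of failure counts still consistent with the design rate.

    ASSUMPTION: normal approximation to the binomial, +/- z standard deviations.
    """
    q = 1.0 - math.exp(-design_rate * t1)  # chance a circuit has failed by t1
    mean = n * q                           # expected number of failures
    sd = math.sqrt(n * q * (1.0 - q))
    return max(0.0, mean - z * sd), min(float(n), mean + z * sd)


def estimated_rate(successes, n, t1):
    """lambda_o = -ln(x / n) / t1, with x the number of circuits still good."""
    if successes <= 0:
        raise ValueError("no surviving circuits: the rate cannot be estimated")
    return -math.log(successes / n) / t1


def rate_for_mission(design_rate, t1, n, failures_observed, z=2.0):
    """Keep the design rate if the test agrees with it, otherwise re-estimate."""
    low, high = acceptance_interval(design_rate, t1, n, z)
    if low <= failures_observed <= high:
        return design_rate
    return estimated_rate(n - failures_observed, n, t1)


# Hypothetical group: 100 identical circuits, design rate giving p(100 h) = 0.9.
design = -math.log(0.9) / 100.0
kept = rate_for_mission(design, 100.0, 100, failures_observed=12)
revised = rate_for_mission(design, 100.0, 100, failures_observed=25)
```

In this example 12 failures fall inside the assumed interval, so the design rate is kept, while 25 failures fall outside it and the rate is re-estimated from the data.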
The question immediately arises, "Why test the null hypothesis at all if test data is to
be accepted in preference to the design failure rates?" This is done because under the
condition that the null hypothesis is met, the correspondence of the two sources of failure rate
estimates would result in a higher confidence in the final estimate than either source alone
can provide. When the null hypothesis is rejected and the test data alone is used, the confidence
in the estimate is reduced.
C. IMPROVEMENT OF THE ESTIMATE THROUGH FAILURE STATE TESTS
In this reliability estimation procedure a more accurate estimate is obtained by testing
at $t_1$ to determine the failure state of the system. If the failure state were known exactly and
the failure rates of the circuits were accurate, the mission reliability of the system could be
calculated with no equivocation. Thorough testing at $t_1$ could determine exactly the failure
state of the system, but since thorough testing is not of interest in this study the failure state
will be known imperfectly. One will have a number of alternatives each with a certain
probability given the results of the tests.
* $\lambda_j$ = design failure rate of the jth type circuit.
** Note: if the system is too large to permit complete testing, a random sample of each
type of circuit is taken and the number of failures observed in the sample is used to
estimate the actual failure rates.
Consider again the example of figure Q-1. Each stage of the system has four failure
states: zero, one, two, or three failed circuits. If no information is available at $t_1$, not
even that the system is operating, every stage may be in any one of these states. Thus there
are $4^2 = 16$ possible failure states of the system. They have been listed in column 1 of Table 1.
Associated with the ith failure state is a probability $P_i$ which is the probability that the
system is in this state at $t_1$ given that all circuits were successful at $t_0$. Thus, with no
information at $t_1$ on the condition of the system, the probability that the system is in the
state in which no circuits have failed is

$$P_1 = p^6$$

The factor $p$ is the probability of success of a circuit at $t_1$. The probability of the
failure state in which one circuit is failed in Stage B is

$$P_2 = 3 p^5 (1 - p)$$

The probabilities of occurrence of the states given no information on the condition of the
system at $t_1$ are listed in column 5 of Table 1.
Associated with each of the failure states is a reliability of the system at $t_2$ given that
the system is in the failure state at $t_1$. This is written as $R_i(t_2)$ and is shown for each state
in column 4 of Table 1.
The reliability of the system is written as the sum over all i of the product of the
probability of the ith failure state and the mission reliability given that the system is in the
ith state at $t_1$. Thus:

$$R(t_2) = \sum_{\text{all } i} P_i R_i$$
If tests are made at $t_1$ that give some information on the condition of the system, the
number of failure states possible is markedly reduced, and the reliability estimate available
at $t_1$ is much more accurate. For instance, if one tests the system of figure Q-1 and finds it
functioning correctly at $t_1$, each stage must have no more than one circuit failure. Thus,
only four states are possible after this test. These are states 1, 2, 5 and 6. The probability
that the system is in a particular state must be adjusted to account for the known condition
that the system functions at $t_1$. Thus, for the example the probability of being in state 1 with
no failures is:

$$P'_1 = \frac{P_1}{\sum_{i = 1, 2, 5, 6} P_i} \tag{6}$$
The denominator in equation (6) is the probability that the system is in one of the four
possible states.
In general, a test to establish the failure state will leave only a set of possible failure
states. Assume the test determines the state of the system to such an extent that the only
possible failure states are included in the set $I$. If $P'_i$ is the probability of being in the ith
failure state given the results of the tests, then:

$$P'_i = 0 \qquad \text{for } i \notin I$$

Or, if a state is not in the set $I$ its probability is zero.
If a state is possible then:

$$P'_i = \frac{P_i}{\sum_{\text{all } j \in I} P_j} \qquad \text{for } i \in I \tag{7}$$

The mission reliability for a particular failure state, $R_i$, does not change, hence the
mission reliability given the results of the test can be written in general as:

$$R_M = \sum_{\text{all } i \in I} \left[ \frac{P_i}{\sum_{\text{all } j \in I} P_j} \right] R_i \tag{8}$$

For the example

$$R_M = \frac{1}{P_1 + P_2 + P_5 + P_6} \left[ P_1 R_1 + P_2 R_2 + P_5 R_5 + P_6 R_6 \right] \tag{9}$$
More extensive tests at $t_1$ will further reduce the number of failure states which can
exist. For instance, if a test reveals that at least one circuit in the network is failed, the
failure state which has no errors is eliminated, changing considerably the expected mission
reliability. For this example $P'_1 = 0$, and states 2, 5 and 6 are the only members of the set $I$.
To illustrate the value of testing to determine the failure state at $t_1$, consider the
example. The probability that a circuit operates until $t_1$ is $p(t_1) = 0.9$ and the probability
it lasts until $t_2$, given it was successful at $t_1$, is $p_m = 0.9$. The system is that shown
in figure Q-1 and the restoring circuits are assumed perfectly reliable. Say that in reality
one circuit is failed in one stage and the circuits in the other stage are all successful, but
this information is unknown to the tester. This is the information to be gained at $t_1$ through
the tests. Table 2 lists the reliability one would predict with different amounts of
information about the condition of the system at $t_1$. The wide variation in the result indicates the
importance of testing at $t_1$.
This section does not propose the detailed procedures for testing a system at $t_1$. It
should, however, indicate the importance of making these tests and the calculations required
to utilize the information gained from the test to estimate the system reliability.
TABLE 2

Test Results at the                       Predicted System      Corresponding
Mission's Start ($t_1$)                   Mission Reliability   Risk of Failure

1. No information at $t_1$, not even      0.821                 0.179
   that the system is working.
2. Tests show that the system is          0.867                 0.133
   working at $t_1$.
3. Tests show that the system is          0.770                 0.230
   working but that at least one
   circuit is failed.
4. Tests show that exactly one            0.788                 0.212
   circuit in the system is failed
   at $t_1$.
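The entries of Table 2 can be approximated by enumerating the sixteen failure states and applying equation (8) with a different set of admissible states for each test result. The Python sketch below is an illustration only; the helper names and the way each test result is encoded as a predicate are assumptions, and the computed values agree with the tabulated ones only to within about 0.002.

```python
from itertools import product
from math import comb


def stage_state_probability(failed, p):
    """Chance that exactly `failed` of a stage's three circuits are failed at t1."""
    return comb(3, failed) * p**(3 - failed) * (1 - p)**failed


def stage_reliability(failed, p_m):
    """Chance the stage still has two good circuits at t2 (column 4 of Table 1)."""
    if failed == 0:
        return p_m**3 + 3 * p_m**2 * (1 - p_m)
    if failed == 1:
        return p_m**2
    return 0.0


def predicted_reliability(p, p_m, admissible):
    """Equation (8): weighted mean of R_i over the states the test leaves possible."""
    numerator = 0.0
    denominator = 0.0
    for failed_a, failed_b in product(range(4), repeat=2):
        if not admissible(failed_a, failed_b):
            continue
        prob = stage_state_probability(failed_a, p) * stage_state_probability(failed_b, p)
        numerator += prob * stage_reliability(failed_a, p_m) * stage_reliability(failed_b, p_m)
        denominator += prob
    return numerator / denominator


def system_works(a, b):
    return a <= 1 and b <= 1


p, p_m = 0.9, 0.9
table_2 = [
    predicted_reliability(p, p_m, lambda a, b: True),                         # no information
    predicted_reliability(p, p_m, system_works),                              # system working
    predicted_reliability(p, p_m, lambda a, b: system_works(a, b) and a + b >= 1),
    predicted_reliability(p, p_m, lambda a, b: a + b == 1),                   # exactly one failed
]
```

Encoding each test result as a predicate keeps the same routine usable for any other partial test, as long as the set of admissible states can be described.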
D. DETERMINING THE MISSION RELIABILITY OF LARGE SYSTEMS
The example of the last section is a small two stage system. One might well ask if it
is feasible to enumerate all of the possible failure states of a large system for the
determination of the mission reliability. Indeed, with no information at $t_1$ on whether or not an n stage
system is operating correctly, there are $4^n$ possible failure states of the system. As n
increases, the number of possible failure states increases exponentially.
The purpose of the tests at $t_1$ is to eliminate large numbers of these states in the manner
shown for the example and hence obtain a better estimate of the mission reliability. The use
of equation (8) provides this estimate but it requires, in its present form, separate
consideration of each failure state. This is impractical for all but the smallest systems.
This problem is circumvented by first putting the mission reliability equation in a
more general form. The mission reliability of the system given the results of the test at $t_1$
is a conditional probability which can be written:

$$R_M = \frac{\text{Prob. (Test results at } t_1 \text{ and successful system operation at } t_2)}{\text{Prob. (Test results at } t_1)} \tag{10}$$

Equation (8) is a representation of this equation for small systems.
The form equation (10) takes depends on the characteristics of the system under study
and the type of test to which it is subject at $t_1$. For example, consider an n stage
order-three multiple-line system which has perfect voters. For simplicity assume all the stages
are identical with equally reliable circuits. For illustrative purposes assume the stages are
arranged in a chain as in figure Q-3.
Figure Q-3. Chain of n-Multiple-Line Stages
The first type of test to which the system of figure Q-3 is subjected is a simple test to
determine its operability. Is the system failed or successful at $t_1$? Given the system is
successful at $t_1$ the mission reliability will now be determined.
Because the system is working at $t_1$, each stage must be in one of two states, either
three circuits successful or two circuits successful and one failed. Then the system may be
in any one of $2^n$ possible states. Using equation (8) to evaluate the mission reliability would
be a rather tedious and time consuming process if n were a sufficiently large value, since both
the numerator and denominator of this equation have $2^n$ terms. However, because of the
independence of the stages of the multiple line system, it isn't necessary to carry out this
operation. The probability that each stage is successful at $t_1$ is independent of the condition
of all other stages and can be written:

$$[p^3 + 3p^2 (1-p)] \tag{11}$$
Since they are all identical the probability that all the stages are successful at $t_1$ is:

$$[p^3 + 3p^2 (1-p)]^n \tag{12}$$

This term is the probability that the system is in a successful failure state at $t_1$ and is
the denominator for equation (10) when the test consists only of determining the operability of
the system.
The probability that a single stage is operating at $t_2$ can be written:

$$\left\{ p^3 \left[ p_m^3 + 3p_m^2 (1-p_m) \right] + 3p^2 (1-p) \left[ p_m^2 \right] \right\} \tag{13}$$

Since the stages are independent the probability that the system is operating at $t_2$ is:

$$\left\{ p^3 \left[ p_m^3 + 3p_m^2 (1-p_m) \right] + 3p^2 (1-p) \left[ p_m^2 \right] \right\}^n \tag{14}$$

This term is equivalent to the numerator of equation (10). Using the terms (12) and (14)
the mission reliability can be determined for this system. Given that the system is successful
at $t_1$ the probability that the system is successful at $t_2$ is:

$$R_M = \frac{\left\{ p^3 \left[ p_m^3 + 3p_m^2 (1-p_m) \right] + 3p^2 (1-p) \left[ p_m^2 \right] \right\}^n}{[p^3 + 3p^2 (1-p)]^n} \tag{15}$$

Note that for this determination of the mission reliability the separate failure states have not
been enumerated. The calculation of mission reliability for this system has been a relatively
simple procedure.
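The saving that equation (15) offers over state-by-state enumeration can be seen by computing the same quantity both ways. In the Python sketch below the closed form follows terms (11) to (15), while the enumerated version walks through all $2^n$ states that remain once the system is known to be working; the function names and the example values of $p$, $p_m$ and $n$ are illustrative assumptions.

```python
from itertools import product


def stage_terms(p, p_m):
    """Per-stage probabilities, keyed by the number of failed circuits (0 or 1)."""
    state_at_t1 = {0: p**3, 1: 3 * p**2 * (1 - p)}
    survives_mission = {0: p_m**3 + 3 * p_m**2 * (1 - p_m), 1: p_m**2}
    return state_at_t1, survives_mission


def mission_reliability_closed_form(p, p_m, n):
    """Equation (15): ratio of term (14) to term (12)."""
    state_at_t1, survives = stage_terms(p, p_m)
    numerator = sum(state_at_t1[k] * survives[k] for k in (0, 1))  # term (13)
    denominator = sum(state_at_t1.values())                        # term (11)
    return (numerator / denominator) ** n


def mission_reliability_enumerated(p, p_m, n):
    """Equation (8) applied to every one of the 2**n states of a working system."""
    state_at_t1, survives = stage_terms(p, p_m)
    numerator = 0.0
    denominator = 0.0
    for state in product((0, 1), repeat=n):
        prob = 1.0
        rel = 1.0
        for failed in state:
            prob *= state_at_t1[failed]
            rel *= survives[failed]
        numerator += prob * rel
        denominator += prob
    return numerator / denominator


closed = mission_reliability_closed_form(0.9, 0.9, 8)
enumerated = mission_reliability_enumerated(0.9, 0.9, 8)
```

The two results agree to rounding error, but the enumeration cost doubles with every added stage whereas the closed form does not grow at all.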
Other tests at $t_1$ will result in different forms for the mission reliability equation (10).
For instance, assume the system of figure Q-3 is subjected to a different test. This test
subdivides the system into three nonredundant ranks as shown in figure Q-4.
Each rank will be tested individually. If a rank fails it can be inferred that one or more
circuits in the rank are failed. If a rank is successful it can be inferred that all circuits in
the rank are successful.
At $t_1$ the information is given that the system is operating correctly and that 0, 1, 2 or
3 of the ranks have failed. Now equations must be developed that determine the mission
reliability of the system given the results of the test at $t_1$.
Figure Q-4. System Divided Into Three Nonredundant Ranks
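The numerators and denominators of the mission reliability equation for the various
test results are shown in Table 3.

TABLE 3

Test Result (Ranks Failed): 0
  Prob. (Test Result at $t_1$):
    $Y_0 = [p^3]^n$
  Prob. (Test Result at $t_1$ and Successful System Operation at $t_2$):
    $Q_0 = \left[ p^3 \left( p_m^3 + 3p_m^2 (1-p_m) \right) \right]^n$
  Mission Reliability: $Q_0 / Y_0$

Test Result (Ranks Failed): 1
  Prob. (Test Result at $t_1$):
    $Y_1 = \left[ p^2 (1-p) + p^3 \right]^n - Y_0$
  Prob. (Test Result at $t_1$ and Successful System Operation at $t_2$):
    $Q_1 = \left[ p^2 (1-p)\, p_m^2 + p^3 \left( p_m^3 + 3p_m^2 (1-p_m) \right) \right]^n - Q_0$
  Mission Reliability: $Q_1 / Y_1$

Test Result (Ranks Failed): 2
  Prob. (Test Result at $t_1$):
    $Y_2 = \left[ 2p^2 (1-p) + p^3 \right]^n - Y_0 - 2Y_1$
  Prob. (Test Result at $t_1$ and Successful System Operation at $t_2$):
    $Q_2 = \left[ 2p^2 (1-p)\, p_m^2 + p^3 \left( p_m^3 + 3p_m^2 (1-p_m) \right) \right]^n - Q_0 - 2Q_1$
  Mission Reliability: $Q_2 / Y_2$

Test Result (Ranks Failed): 3
  Prob. (Test Result at $t_1$):
    $Y_3 = \left[ 3p^2 (1-p) + p^3 \right]^n - Y_0 - 3Y_1 - 3Y_2$
  Prob. (Test Result at $t_1$ and Successful System Operation at $t_2$):
    $Q_3 = \left[ 3p^2 (1-p)\, p_m^2 + p^3 \left( p_m^3 + 3p_m^2 (1-p_m) \right) \right]^n - Q_0 - 3Q_1 - 3Q_2$
  Mission Reliability: $Q_3 / Y_3$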
Compared to enumerating all the failed states possible with the particular results of a
test, these equations are relatively simple. If the assumption that all circuits are equally
reliable is removed, the equations for mission reliability are very similar to these except that
instead of raising a single term to the power n as in these equations, a product of n factors
will be taken. This should be a simple matter on a computer.
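The terms of Table 3 follow one pattern: a bracketed per-stage term raised to the power n, from which the terms for fewer failed ranks are subtracted with binomial weights (1; 1, 2; 1, 3, 3). A short Python sketch of that pattern is given below. It assumes identical stages, perfect voters and $0 < p < 1$, reads $Y_k$ and $Q_k$ as referring to one particular set of k failed ranks, and uses hypothetical function names.

```python
from math import comb


def exclude_smaller_results(brackets):
    """X_k = B_k - sum over j < k of C(k, j) * X_j (the subtraction pattern of Table 3)."""
    terms = []
    for k, bracket in enumerate(brackets):
        terms.append(bracket - sum(comb(k, j) * terms[j] for j in range(k)))
    return terms


def rank_test_table(p, p_m, n):
    """Return (Y, Q, mission reliability) for 0, 1, 2 and 3 failed ranks."""
    all_good = p**3                 # stage with no failed circuit
    one_failed = p**2 * (1 - p)     # stage with one particular circuit failed
    s0 = p_m**3 + 3 * p_m**2 * (1 - p_m)  # mission survival from zero failures
    s1 = p_m**2                            # mission survival from one failure
    y = exclude_smaller_results(
        [(k * one_failed + all_good) ** n for k in range(4)])
    q = exclude_smaller_results(
        [(k * one_failed * s1 + all_good * s0) ** n for k in range(4)])
    # k failed ranks need at least k stages, otherwise the result cannot occur
    reliability = [q[k] / y[k] if n >= k else None for k in range(4)]
    return y, q, reliability


y, q, reliability = rank_test_table(0.9, 0.9, 3)
```

As a consistency check, $Y_0 + 3Y_1 + 3Y_2 + Y_3$ equals term (12), the probability that the system is working at $t_1$.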
If the restriction that the restoring circuits be perfectly reliable is removed, the
mission reliability equation will not be changed significantly unless the stages are
interconnected in such a manner that they are no longer independent. The techniques used to calculate
system reliability in this section are invalid if the stages are not independent. Techniques
have been developed to determine the reliability of such systems* and these must be used in
determining the mission reliability.
The equation describing the mission reliability for a system will depend on both the
tests performed at $t_1$ and the characteristics of the system. These factors will surely be
known prior to the test, so equations can be developed to evaluate the mission reliability
which take into account the possible failure states of the system without exhaustive enumeration.
E. USING TESTS TO DETERMINE BOTH THE FAILURE STATE OF THE SYSTEM AND
FAILURE RATES OF THE CIRCUITS AT t 1
In technique C, tests were made at t 1 to determine the possible failure states of the
system. In technique B tests were made to establish the actual failure rate of the circuits of
the system. It should be possible to design tests which give information regarding both these
parameters.
The tests will establish the failure rate of the system at t 1 and use these in carrying out
the reliability calculations described for Technique C. It takes little imagination to see that
in the course of tests to determine the failure rate a great deal will be learned about the
failure state of the system. For instance as soon as one failure is found the possibility that
the system is in the no circuit failure state is decreased to zero, probably decreasing the
mission reliability appreciably.
The details of this technique have not been developed, but generally it proposes to use
the tests of t 1 to indicate both these parameters and thereby increase markedly the accuracy
of the mission reliability estimate.
* Jensen, P. A., W.C. Mann and M. R. Cosgrove, "The Synthesis of Redundant Multiple-
Line Networks", First Annual Report Contract NONR 3842 (00), May 1, 1963.
IV. TEST OF THE HYPOTHESIS THAT THE MISSION RELIABILITY IS
GREATER THAN A REQUIRED VALUE
This method is separated from the others because it does not explicitly estimate the
reliability of a system. Instead it finds, through measurements at the beginning of the
mission, the probability that the system will not meet a given mission reliability specification.
The user of the system must specify the minimum mission reliability. He must also
specify the maximum chance he is willing to take that the system does not meet this goal when
his tests indicate that it will. It is assumed that the system is not acceptable if the probability
that it does not meet the reliability specification is above the given value, and is acceptable
otherwise.
The first step in this procedure is to determine the failure rates that the circuits of
the system must have to just meet the mission reliability goal. These failure rates are
called the maximum failure rates, λ_m. For a system in which many circuits have the same
failure rate this does not seem to be too imposing a problem. For example, consider a system
where all circuits have the same failure rate. If the starting time and duration of the mission
are known, the mission reliability can be expressed as a function of the failure rate, λ, only.
Equation (5) can then be set equal to the required mission reliability and solved for the failure
rate. A cut and try method may be required for the solution.
The maximum failure rate is a function of both the starting time, t 1, and the duration,
t 2 - t 1, of the mission. However, if the duration of the mission is known, it is possible to
plot a curve of mission starting time against the maximum failure rate.
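The cut and try solution for the maximum failure rate can be sketched in a few lines. Equation (5) is not reproduced in this section, so the sketch below substitutes a purely hypothetical mission reliability: a single two-out-of-three majority-voted stage with permanent failures, conditioned on the stage operating at t 1. The function names, the bisection procedure and the numerical values are illustrative assumptions, not part of the report.

```python
import math

def stage_up(lam, t):
    """P(a 2-of-3 majority stage still works at time t); each circuit survives with e^(-lam*t).
    Hypothetical stand-in for the system model behind Equation (5)."""
    p = math.exp(-lam * t)
    return 3 * p**2 - 2 * p**3

def mission_reliability(lam, t1, t2):
    """P(stage works at t2 | stage works at t1), assuming failures are permanent."""
    return stage_up(lam, t2) / stage_up(lam, t1)

def max_failure_rate(r_required, t1, t2, iterations=200):
    """Cut and try (bisection) for the failure rate that just meets r_required, 0 < r_required < 1."""
    lo, hi = 0.0, 1e-6 / t2
    # Widen the bracket until the reliability at hi falls below the requirement.
    while mission_reliability(hi, t1, t2) > r_required:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mission_reliability(mid, t1, t2) >= r_required:
            lo = mid   # mid still meets the goal, so the maximum rate is at least mid
        else:
            hi = mid
    return lo

# Maximum failure rate against starting time for a fixed 100-hour mission (assumed values).
for start in (0.0, 100.0, 200.0, 400.0):
    print(start, max_failure_rate(0.99, start, start + 100.0))
```

For this assumed model the maximum failure rate falls as the starting time is delayed, since an older stage is more likely to have already lost one of its circuits.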
Once the maximum failure rate is known it only remains to determine if the actual
failure rate of the circuits of the system is less than or equal to this value. This will be de-
termined by testing n of the circuits at t 1 and counting the number of failed circuits. Call the
number of failed circuits X 1. With this data and by using the maximum failure rate, an upper
bound on the probability that the true failure rate is greater than the maximum failure rate can
be determined.
If the fact that a majority of the circuits in a stage must be operative at t 1 is neglected,
the success of a circuit in the system may be considered a Bernoulli trial with probability of
success e^(-λt). The probability distribution of the total number of circuit failures in M
circuits is then binomial. This distribution or the associated density function can be plotted
for any number of samples. One such plot appears in figure Q-5.
The probability distribution of the number of failures at time t 1 can be plotted using the
calculated maximum failure rate.
Figure Q-5. Sample Distribution
Some maximum number of failures Y will be chosen such that there is probability β
that the number of failed circuits observed at t 1, X 1, will be less than Y if the failure rate of
the circuits is λ_m. The quantity β is determined from the binomial:
$$\beta = \sum_{h=0}^{Y-1} \binom{n}{h} \left(e^{-\lambda_m t_1}\right)^{n-h} \left(1 - e^{-\lambda_m t_1}\right)^{h} \qquad (16)$$
For failure rates greater than λ_m the probability that less than Y failures occur must be
less than β. So if X 1 is less than Y, with confidence 1 - β the statement can be made that
the actual failure rate must be less than the maximum failure rate. Now the statement can
be made with confidence 1 - β that the reliability of the system is greater than the minimum
reliability specified by the user.
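The acceptance test built on Equation (16) can be illustrated numerically. The following is a minimal sketch, assuming the user supplies the largest β he will tolerate; the helper names `beta` and `largest_threshold`, and the sample values of n, λ_m and t 1, are hypothetical and chosen only for illustration.

```python
import math

def beta(n, y, lam_m, t1):
    """Equation (16): P(fewer than y failures among n circuits tested at t1)
    when every circuit has the maximum failure rate lam_m."""
    p_ok = math.exp(-lam_m * t1)      # probability a circuit is still good at t1
    p_fail = 1.0 - p_ok
    return sum(math.comb(n, h) * p_ok ** (n - h) * p_fail ** h for h in range(y))

def largest_threshold(n, lam_m, t1, beta_max):
    """Largest Y whose beta does not exceed beta_max (0 means no acceptance is possible)."""
    y = 0
    while y < n + 1 and beta(n, y + 1, lam_m, t1) <= beta_max:
        y += 1
    return y

# Assumed example: 100 circuits tested, each with a 20 percent chance of having failed by t1.
n, t1 = 100, 100.0
lam_m = -math.log(0.8) / t1
y = largest_threshold(n, lam_m, t1, beta_max=0.05)
print("accept if observed failures X1 <", y, "; beta =", beta(n, y, lam_m, t1))
```

The system would be accepted when the observed count X 1 falls below the printed threshold; any failure rate above λ_m makes such a low count even less probable, which is the basis of the confidence statement.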
This method leads to a statement of the form: with a confidence (1 - β), it can be said that
the probability that the system will succeed is R. The information used to compute R might be
used to compute the expected time to system failure instead. The object of the test would
then be to confirm or reject the hypothesis that the expected life would exceed the mission
time with a confidence (1 - β). This modification has not been carefully examined but it
appears to reduce the number of probabilistic statements from two to one.
This procedure again uses no information on the failure state of the system except that
the system is successful at the beginning of the mission. The effect of this on the accuracy
of the results has already been discussed in Section IIIC.
V. CONCLUSIONS AND RECOMMENDATIONS
It is the nature of a redundant system to withstand a number of internal failures and
still perform its function successfully. This is an extremely desirable property for increasing
life or providing high reliability, but it makes it unreasonable to base the decision -
whether or not to carry out a mission with the system - only on the fact that the system is
operating at the beginning of the mission.
This decision should be based on the probability that the system will complete the
mission successfully. There are two major factors affecting the probability which are
imperfectly known at the beginning of the mission. First, the number and location of initial
circuit failures has a very significant effect on the probability that the system will operate
throughout the mission. Second, the mission reliability depends heavily on the failure rates
of the circuits which make up the system. There is little accurate information concerning
either of these factors when it is time to make the decision.
The report proposes that certain tests be made just before the mission is to begin to
determine, at least approximately, these unknowns. It proposes some procedures for using
the results of the tests to estimate the mission reliability with varying degrees of accuracy.
A procedure for making the decision on the usability of the system without estimating the
mission reliability is also presented.
It should be noted that the details of these procedures are still to be worked out and
the accuracy of their results is still uncertain. The work here reported will provide the
basis for future studies on the subject.
No attempt has been made to evaluate the relative usefulness of these procedures. It
is recommended that efforts be made to develop an appropriate measure for comparing the
techniques so that they may be evaluated relative to a common scale.
One very important area of study neglected by this report is the design of simple and
efficient tests to be performed at the beginning of the mission to obtain the information
required for the reliability estimates. As much information as possible must be gained from
a minimum number of tests. A small amount of basic work has been done in this area, and
it will be the subject of future efforts.
Appendix 3
A SURVEY OF COMPONENTS FOR ADAPTIVE RESTORING CIRCUITS
A Survey of Adaptive Components
For Use in Failure-Free Systems
Contract Nasw-572
Reference WGD-38521
H. Brinker
August 1963
APPROVED:
The Westinghouse Electric Corporation
Electronics Division
Box 1897, Baltimore 3, Maryland
Advanced Development Engrg.
TABLE OF CONTENTS
Introduction
1. Electrochemical Devices
   a. The Memistor
   b. Solion
   c. Mercury Cell
2. Magnetic Devices
   a. MAD Integrator
   b. Orthogonal Core Integrator
   c. Second Harmonic Integrator
   d. Magnetostrictive Integrator
3. Conclusion
References
LIST OF FIGURES
Figure 1    Comparison of Adaptive and Majority Voting Techniques
Figure 2    Adaptive Voter
Figure 3    Memistor Cell
Figure 4    Memistor Integrator
Figure 5a   Solion Tetrode and Output Characteristics
Figure 5b   Solion Tetrode Connected as an Integrator
Figure 6    Mercury Cell Integrator (Capacitive Readout)
Figure 7    Multiple Aperture Device (MAD)
Figure 8    MAD Integrator
Figure 9    Orthogonal Core
Figure 10   Second Harmonic Integrator
Figure 11   Magnetostrictive Integrator
Introduction
The Adaline Neuron (1) is an adaptive logic device which may be trained
to recognize certain classes of input patterns. The device output is a
binary signal which classifies particular combinations of input signals
into two categories. An output decision is determined by a threshold
element whose input is the linear sum of the products of each input and
its associated variable weight. During adaption the weights are
appropriately changed in order to make the output decision agree with the
desired response. By following a simple set of rules after each application
of input signal combinations the device is caused to converge to an optimum
state for properly categorizing the set of input patterns.
Although training rules for a single layer system have been formulated
by Widrow (1), a new adaptive theory is required if systems of two or more
cascaded layers are to be properly trained to perform complex functions of
adaptive behavior and pattern recognition. The question of whether such
devices may be connected in complex arrays and demonstrate brain-like
behavior has generated considerable interest. Such applications appear to
be philosophical and subject to considerable controversy. Of primary
concern in the present study is the usefulness of the Adaline neuron
approach in implementing the adaptive voting elements of a redundant
system.
The chart of Figure 1 shows how adaptive voters may extend the reliability
of a conventional redundant system, allowing a system using 9 replicas
to outperform a conventional system using 35 replicas of each function.
The Adaline neuron has received considerable quantitative study in
application to pattern recognition. When modified as shown in Figure 2,
and applied as an adaptive voter, the training rules become quite simple
since the desired output is determined by a voting of the weighted inputs.
Initially, all weights (gains) are made equal. The decision element will
then provide an output in accordance with the states of the majority of
binary, replicated input signals. If input errors are independent and
random the adaptive voter, by progressively adjusting its weights to assign
high weights to reliable inputs and low weights to failed or unreliable in-
puts, may derive correct information from a small minority of correct inputs.
Figure 1 Comparison of Adaptive and Majority Voting Techniques
(legible labels: adaptive voter; 35 input majority; working inputs required)
Figure 2 Adaptive Voter
In this manner the effect of errors caused by input failures may be negated,
allowing a correct decision to be made under a high probability of input
signal failure. The simple, fixed majority voter will make output decision
errors when more than half of the inputs fail or are in error. The adaptive
voter, by masking out input errors as they occur, may tolerate failures until
only two correct inputs out of the original group are present.
In order to provide automatic adaption it is necessary to continuously
compare the output decision with each binary input and to incrementally
increase or decrease each input weight according to whether agreement or
disagreement exists. Assuming that input errors or failures occur randomly
and that the automatic adaptive process can negate an unreliable input
before other failures occur, the adaptive voter offers the possibility of
realizing system reliability of unprecedented excellence.
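The adaption rule described above can be sketched as a short simulation. Several choices here are assumptions made only for illustration and are not taken from the report: inputs are coded as +1 and -1, failed inputs are modeled as stuck at +1, weights are bounded between 0 and 1, and the decrement on disagreement is made larger than the increment on agreement (a stuck input agrees with an alternating signal half the time, so equal steps would produce no net drift).

```python
def vote(inputs, weights):
    """Threshold decision on the weighted sum of +1/-1 inputs."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else -1

def adapt(inputs, weights, step_up=0.02, step_down=0.2):
    """One adaption cycle: vote, then raise weights that agreed and lower those that disagreed.
    The unequal step sizes and the [0, 1] weight bounds are assumptions of this sketch."""
    out = vote(inputs, weights)
    new_weights = [
        min(1.0, w + step_up) if x == out else max(0.0, w - step_down)
        for w, x in zip(weights, inputs)
    ]
    return out, new_weights

def simulate(steps=120, n=5, fail_at=(0, 40, 80)):
    """Inputs 0, 1, 2 fail (stuck at +1) at the listed steps; the true signal alternates."""
    weights = [1.0] * n
    adaptive_errors = majority_errors = 0
    for t in range(steps):
        truth = 1 if t % 2 == 0 else -1
        inputs = [1 if (i < len(fail_at) and t >= fail_at[i]) else truth for i in range(n)]
        out, weights = adapt(inputs, weights)
        adaptive_errors += out != truth
        majority_errors += vote(inputs, [1.0] * n) != truth
    return adaptive_errors, majority_errors, weights

print(simulate())
```

With failures spaced far enough apart for the weights to settle, the adaptive voter in this sketch continues to follow the two remaining good inputs, while an equal-weight majority vote over the same five inputs begins to err once the third input has failed.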
Inherent in the basic design of an adaptive voter is the requirement for
a variable weighted device which performs integration and displays relatively
permanent memory. These special characteristics have stimulated considerable
effort toward the development of suitable adaptive components. Devices which
display variable weight with memory generally utilize phenomena involving atomic
translation or rotation. The following represents a survey of the more
promising techniques which have been suggested by researchers. The first three
devices described exploit electrochemical effects while the remaining devices
utilize magnetic domain phenomena.
1. Electro-Chemical Devices
a. The Memistor
The Memistor (3), an electrolytic device developed at Stanford University
by Widrow, is an electronically adjustable resistor with a rate-of-change of
resistance controlled by application of d-c current in a third electrode.
It consists of a sealed plating cell containing an electrolytic bath, a
resistive substrate upon which metal is deposited and a metal source
electrode. A typical configuration indicating the placement of electrodes and
electrolyte in a small plastic enclosure is shown in Figure 3. Two leads
are attached to the substrate and resistance between these leads can be
reversibly controlled by passing plating current into a third electrode.
The conductance of the device is changed and stored by plating metal onto or
stripping metal from the substrate, as determined by the integral of the
plating current. Conductance is sensed nondestructively by applying a low
voltage a-c signal and measuring the resultant current flow.
Normal d-c drop between source and substrate is typically 0.2
volts at a plating current of 0.2 ma. The substrate resistance changes
from 30 ohms to 2 ohms in 10 seconds with this magnitude of plating current.
The a-c sensing voltage applied is usually 0.1 volts rms. A typical
implementation of the Memistor with associated transformer coupled sensing and
d-c plating circuitry is shown in Figure 4.
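The integrating action of the Memistor can be illustrated with a simple numerical model. The sketch below assumes that conductance changes linearly with plated charge, which is an idealization adopted here and not a claim of the report; the gain is calibrated from the quoted change of 30 ohms to 2 ohms in 10 seconds, with the plating current read as 0.2 ma. The function names are hypothetical.

```python
def calibrate_gain(r_start, r_end, current_a, seconds):
    """Conductance change per coulomb of plating charge (siemens per coulomb),
    assuming conductance is linear in charge."""
    return (1.0 / r_end - 1.0 / r_start) / (current_a * seconds)

def integrate_conductance(g0, gain, currents_a, dt, g_min, g_max):
    """Accumulate conductance from a sequence of plating-current samples.
    Positive current plates metal (conductance rises); negative current strips it."""
    g = g0
    for i in currents_a:
        g = min(g_max, max(g_min, g + gain * i * dt))
    return g

gain = calibrate_gain(30.0, 2.0, 0.2e-3, 10.0)
g_min, g_max = 1.0 / 30.0, 1.0 / 2.0

# Plate at +0.2 ma for 10 seconds, then strip at -0.2 ma for 10 seconds.
g_plated = integrate_conductance(g_min, gain, [0.2e-3] * 1000, 0.01, g_min, g_max)
g_stripped = integrate_conductance(g_plated, gain, [-0.2e-3] * 1000, 0.01, g_min, g_max)
print(1.0 / g_plated, 1.0 / g_stripped)   # resistance in ohms after each phase
```

The stored weight in an adaptive voter would correspond to the conductance value, read out separately by the a-c sensing signal.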
Figure 3 Memistor Cell
(legible labels: container filled with plating solution; rhodium coated plating surface)
Figure 4 Memistor Integrator
Although Memistors are commercially available at a cost of approximately
$50 per cell, their application in a practical system is somewhat cumbersome.
Transformer coupled circuits are usually required in order to
present a balanced load to the plating current source, and to provide the
low voltage drop across the substrate. The substrate resistance is usually
less than 100 ohms and the a-c voltage drop must be kept below 3/4 volt in
order to prevent the formation of gas in the cell. Some difficulty has been
reported in keeping the substrate material free of dimensional imperfections
which in turn cause non-linear plating effects to take place. Long term
stability is apparently affected by chemical reactions taking place between
plating material and electrolyte. To date Memistors are available in sample
quantities and it is difficult to predict ultimate large scale production
costs, repeatability and reliability.
b. Solion
The solion is a fluid-state device which functions by controlling
and monitoring a reversible electrochemical "redox" reaction. The term
redox refers to a chemical reaction in which oxidation and reduction occur
simultaneously. The redox system used in solions consists of two electrodes
immersed in an electrolyte containing both the oxidized and reduced species
of an ion. The system is completely reversible in that oxidation can occur
at either electrode while an equivalent amount of the same element is reduced
at the opposite electrode. Iodine is the reacting element most commonly used.
A simplified drawing of a solion tetrode and its output characteristics
is shown in Figure 5a. The tetrode has a platinum electrode at each end of a
glass tube and two perforated platinum electrodes separating the tube into
three compartments. The reservoir, containing the input electrode, is the
largest compartment. The integral compartment, containing the common elec-
trode, is made very small so an equilibrium distribution of the iodine may
be quickly reached. The compartment between the shield and readout
electrodes serves to separate the two electrodes. The output characteristics of
a solion tetrode are similar to those of a vacuum tube pentode, and show a
transconductance of 40,000 micromhos at an output current of 500 microamperes.
A Solion Tetrode connected as an integrator is shown in Figure 5b.
By controlling the charge transferred between the two input electrodes,
a change in conductivity proportional to the integral of the input current
may be obtained between the output electrodes. In this manner the device
may be utilized as an integrator, providing an output current proportional
to the integral of the input current. Because of the concentration
potential, the input impedance of the solion tetrode is in the order of 1000
ohms and therefore a relatively high impedance signal source is required
in order to avoid integration errors. At constant temperature, solions
are reported to be stable to within 1% over a period of several
days.
Figure 5a Solion Tetrode and Output Characteristics
(circuit symbol legend: I Input, S Shield, R Readout, C Common)
Figure 5b Solion Tetrode Connected as an Integrator
A practical problem in the use of solion tetrodes arises from the
requirement of providing an isolated battery potential between input and
shield electrodes to prevent iodine diffusion between the reservoir and
integral compartments. The primary application demonstrated for the solion
tetrode to date has been as a low level d-c amplifier with a time constant of
20 seconds. Because of the inherent practical problems of precision
design, isolated supply voltages and discharging effects of parallel outputs,
the solion appears to offer little promise as a practical adaptive component.
c. Mercury Cell
Another novel approach for variable gain with memory is achieved by
use of a Mercury Cell Integrator (6), an electrochemical device which provides
visual and electrical readout of the integral of an applied current. The
integrating element consists of a capillary tube filled with two columns
(electrodes) of mercury separated by a gap of aqueous electrolyte of
metallic salt. Two different methods have been used to provide electrical
readout. The first method, called capacitive readout, is shown functionally in
Figure 6. The d-c input signal electroplates mercury across the gap at a
rate which is a direct function of the input signal amplitude, thus causing
the gap or bubble of electrolyte to move. The outside of the capillary is
covered by a vapor-deposited conductive sheath. The mercury electrodes and
sheath, separated by a thin glass wall, provide a capacitance of approximately
20 pF. In application, an a-c signal is connected across the electrodes and
superimposed on the d-c input signal. The a-c signal will divide in
accordance with the capacitance existing between the upper mercury column and
sheath, and the capacitance between sheath and lower grounded column of
mercury. The excitation signal provides a signal at the sheath which is
a direct function of the length of the ungrounded electrode. An auxiliary
amplifier and detector in turn provide a proportional d-c signal of proper
level to operate other related devices.
Figure 6 Mercury Cell Integrator (Capacitive Readout)
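The division of the a-c signal in the capacitive readout can be sketched as an ideal capacitive divider. The model below assumes an unloaded sheath, a capacitance proportional to the length of mercury column under the sheath, and a negligible gap; these simplifications and the function name are illustrative assumptions.

```python
def sheath_voltage(v_ac, upper_len, lower_len, c_per_len=1.0):
    """A-c voltage appearing at the sheath of an ideal capacitive divider.
    upper_len: length of the ungrounded (upper) mercury column under the sheath
    lower_len: length of the grounded (lower) column under the sheath
    c_per_len: capacitance per unit length (cancels out in the ratio)."""
    c_upper = c_per_len * upper_len
    c_lower = c_per_len * lower_len
    return v_ac * c_upper / (c_upper + c_lower)

# As the bubble moves, the upper column lengthens and the lower one shortens.
total = 10.0
for upper in (2.0, 4.0, 6.0, 8.0):
    print(upper, sheath_voltage(1.0, upper, total - upper))
```

Under these assumptions the sheath signal varies linearly with the length of the ungrounded column, which is consistent with the behavior described in the text.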
The device provides reversible integration, relatively stable
memory, direct visual readout and a linearity better than 0.1 percent.
Input control current is limited to ±5 ma d-c. The integration time from
minimum to maximum output signal is approximately 100 minutes at maximum
control current. This time is ultimately limited by the maximum voltage
which may be dropped across the electrolyte without causing the formation
of gas.
A typical capacitive readout integrator now commercially available
is approximately 0.5 cu. in. but prices range around $130 per unit. Although
displaying excellent stability and predictable operation such devices will
require considerable price reduction before application becomes practical.
The integration time although relatively long may not present a serious
limitation for systems which display slow adaptive behavior as would be the
case in adaptive voting elements.
Another technique for sensing the position of the bubble utilizes
a light source and a photo-conductor whose resistance is inversely propor-
tional to the amount of light passed by the transparent electrolyte. As
the bubble moves out of line with the light source and photo-conductor
target area the light becomes progressively blocked by the mercury columns,
causing the photo-conductor resistance to increase. This technique allows
faster integration because the bubble need only be displaced by its own
height to effect a change from maximum to minimum light intensity at the
photo-conductor. A typical photoelectric integrator commercially available
occupies 1 cu. inch and requires 300 milliwatts to power an integral
incandescent lamp. Output resistance varies over the range from 25K ohms to
350K ohms. Quantity prices are expected to fall below $15 per unit thus
providing a reasonably inexpensive adaptive component. The use of an in-
candescent lamp for the light source imposes a serious life and reliability
problem. The use of a more reliable light source and a substantial size
reduction will be necessary before application becomes practical.
2. Magnetic Devices
Various techniques have been suggested for providing variable gain and
non-destructive readout with magnetic devices. The phenomenon utilized in
such devices is the ability of magnetic materials to store a
remanent flux which is sensed in a non-destructive manner. Suggested
devices provide the capability for a partial switching of magnetic domains
under a volt-second impulse as the basic incrementing source. Suitable
magnetic materials include ferrites and tape wound cores which are
characterized by a square hysteresis curve. Most of the devices to be described
utilize the same basic type of incrementing technique and differ primarily
in the manner by which the stored flux is sensed.
a. MAD Integrator
A diagram of a typical multi-aperture device (7) is shown in Figure 7.
In this device flux can be switched around the minor aperture by means of an
a-c drive winding without disturbing the flux linking and stored around the
main aperture. Initially the flux around the main aperture is set to cause
saturation in either a clockwise or counterclockwise direction. A momentary
reversal of the magnetizing force driving the main aperture will cause a
partial reversal of the flux. The amount of flux reversal is determined by
the magnitude and duration of the drive and the value of the hold current.
The purpose of the hold winding is to retain a portion of the core saturated
in the original direction of magnetization and thereby assure partial
switching of the flux. The amount of flux alternately switched around
the small aperture is then proportional to the flux which has been switched
around the main aperture. The output voltage will consist of a signal
whose voltage integral is proportional to the amount of flux trapped in
the common area between the two flux paths. Several cycles of carrier
drive may be required before this condition stabilizes. Care must be
taken to limit the carrier drive to values less than the magnetizing force
required to disturb the remanent flux around the main aperture.
Control of the extent to which the remanent flux is incremented is usually
implemented by means of a smaller core of like magnetic material. The
smaller core provides the appropriate amount of volt-second drive to
increment the storage core in equal steps at various settings of remanent
flux. Brain (8) has indicated that it is essential that incrementing should
always occur at a constant reference phase with respect to the carrier
drive unless carrier drive is removed. If this is not done the size of
the incremental flux change will be dependent on the vector sum of the
switching and carrier signals. A typical scheme for realizing integrator
operation is shown in Figure 8.
Figure 7 Multiple Aperture Device (MAD)
The physical requirement of providing a number of hand wound turns
about the various apertures dictates to a large extent the cost of the
device. Large driving currents, a moderate amount of timing during
incrementing and relatively low output signal amplitude necessitate peripheral
circuitry of considerable complexity. The resultant degradation in the
basic reliability of the approach then becomes an imposing problem.
Figure 8 MAD Integrator
(legible labels: set level; saturation; bias hold; read out; 75 kc/s; 0.2 amp)
b. Orthogonal Core Integrator
The magnitude and direction of a stored flux may be sensed by applying
a magnetic field orthogonally to the direction of stored flux (9). This
causes the remanent flux vector to rotate, generating a voltage proportional
to its rate of change and hence its magnitude. The application of a read
or sensing field at right angles to the stored or written flux minimizes the
interaction of the sense drive on the stored flux magnetic path. At the
termination of the read drive the flux vector returns to its original
preferred orientation by virtue of domain elasticity. A typical orthogonal
core configuration is shown in Figure 9. The flux level stored in the core
is altered by pulsing the output winding in a manner similar to the
incrementing techniques previously discussed. The output signal consists of either
positive or negative pulses depending upon the direction of the stored
flux, with an amplitude proportional to the magnitude of the remanent flux.
Practical problems similar to those associated with the multiaperture
device previously discussed again make physical implementation cumbersome.
Figure 9 Orthogonal Core
(legible labels: sense and adapt winding; drive winding; flux return; ferrite core)
c. Second Harmonic Integrator (10)
Nondestructive readout of remanent flux may be obtained by reducing
the sensing drive to a value insufficient to cause irreversible switching.
Since magnetic cores are generally non-linear the output voltage will
contain harmonics of the drive current. In particular, the even harmonic
voltage for certain core materials is found to be proportional to the net
remanent flux level. The second-harmonic generator shown in Figure 10
consists of a pair of tape wound cores driven from an r-f sinusoidal
power source. The output winding is arranged so that the fundamental
component of drive voltage cancels out, leaving a second harmonic distortion
voltage proportional to the remanent flux in the cores.
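The appearance of a second harmonic proportional to the stored flux can be demonstrated with a toy numerical model. In the sketch below each core is represented by a tanh saturation curve, the remanent flux by a small bias added to the drive, and the two cores are driven in opposite senses with their outputs summed so that the fundamental cancels. The tanh characteristic, the drive amplitude and the function names are assumptions chosen for illustration; they are not the core model of reference 10.

```python
import cmath
import math

def core_pair_output(bias, drive=1.5, n=1024):
    """Summed flux of two saturating cores driven in opposite senses over one drive cycle.
    bias stands in for the net remanent flux level (an assumption of this sketch)."""
    samples = []
    for i in range(n):
        h = drive * math.sin(2.0 * math.pi * i / n)
        samples.append(math.tanh(h + bias) + math.tanh(-h + bias))
    return samples

def harmonic_amplitude(samples, k):
    """Amplitude of the k-th harmonic from a direct discrete Fourier sum."""
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
    return 2.0 * abs(acc) / n

for bias in (0.0, 0.01, 0.02):
    out = core_pair_output(bias)
    print(bias, harmonic_amplitude(out, 1), harmonic_amplitude(out, 2))
```

In this model the fundamental cancels for every bias, and the second harmonic amplitude grows in direct proportion to the bias for small values, in line with the behavior described above.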
By passing a direct current through the output winding the remanent
flux level may be altered. Due to an interaction between the d-c adapt
current and the RF drive the rate of change of the remanent flux with
respect to the adapt current is constant and reversible. Tape-wound cores
have been found to provide the best performance and because of their higher
permeability require fewer turns. Typical associated driving, sensing and
timing circuitry tends to be rather elaborate, however, because cancellation of
the fundamental driving frequency is difficult to achieve in practice, thus
making the desired output signal appear against a background of noise. This
low level signal must in turn be amplified in order to provide a signal com-
patible with the associated solid state circuitry which it must ultimately
control. Clearly a separately switched driving source for each pair of
cores is required in order to provide the individual binary signal inputs
whose weights are to be altered. Since the sinusoidal drive currents tend
to be in the order of 10 to 100 or more milliamperes the driving and peripheral
circuitry is necessarily elaborate.
Figure 10 Second Harmonic Integrator
(legible labels: RF drive; 100-kc source; output voltage; adapt current)
d. Magnetostrictive Integrator
The direction and magnitude of the net remanent flux in a magnetostrictive
core may be sensed if the core is excited mechanically (11). Figure
11 shows a simplified scheme for implementing a magnetostrictive storage
system using an ultrasonic delay line to excite several magnetostrictive
toroids. The driving source for the sonic delay line is a piezoelectric
transducer. Input to each of the toroids is provided by means of narrow width
pulses through a separate write coil wound concentrically with the read
coil. If the frequency and rms amplitude of the stress wave are maintained
at constant value, the open circuit output of the read coil is
approximately proportional to the flux stored in the individual toroids. Although
this effect has been demonstrated experimentally by Nagy (11) and others the
basic peculiarities of magnetic domain behavior, especially under the
influence of mechanical excitation, are only crudely understood.
The experimental systems fabricated to date are rather large owing
to the structural requirements of acoustical devices and the associated
electronic circuitry necessary to provide proper timing, current driving
and voltage amplification. At best considerable experimental work is
necessary to show that magnetostrictive storage offers any real advantage
over more conventional electro-magnetic approaches. Indeed, the sensing
of remanent flux by acoustical means rather than by non-destructive, elec-
trical drive appears to inject an unwarranted interface complexity.
Figure 11 Magnetostrictive Integrator
3. Conclusion
As a result of the foregoing survey it became apparent that none of the
suggested adaptive devices were sufficiently developed to justify the
selection of a practical approach for immediate circuit implementation of an
adaptive voter. An explicit evaluation was not attempted owing to the
superficial treatment of the various devices by academic researchers.
The magnetic devices with their known sensitivity to temperature stress
appear to offer the least hope for providing analog memory with long term
stability. The requirement for providing carefully controlled incrementing
with relatively large drive currents coupled with the small output signals
and associated amplification appears to dictate an imposing amount of
peripheral circuitry. The degradation in reliability as a result of this
complexity represents a liability which makes practical application doubtful
for redundant systems.
The electro-chemical devices, especially the memistor and solion in
their present state of development, appear to be plagued by a number of
stability problems. The memistor with its dependence upon an electroplating
process which is not widely understood, chemical impurities and dimensional
imperfections will require considerable refinement before application
becomes practical. In addition the requirement for sensing the state of the
device with an ac signal makes circuit implementation rather awkward.
Solions appear to be somewhat more practical if size is not an important
consideration. It has been reported that the Rome Air Development Center is
constructing an adaptive learning machine (CHILD) which uses 1080 solions.
With its dependence on the chemical equilibrium of a redox system and the
precise construction required to achieve stability the solion presents
several challenging design difficulties. The requirement for providing an
isolated battery cell between the input and shield electrodes imposes a
practical encumbrance on a system design which requires a large number of
solions.
The mercury cell integrator with photoelectric readout appears in
principle to offer the most attractive approach because of its simplicity,
stability and general compatibility with conventional circuitry.
Since the output is essentially a variable resistance proportional to
the integral of the control input current the device offers the possibility
of providing a simple interface with standard circuitry. The mercury cell
integrator is still in a rather primitive state of development and it is
felt that any detailed circuit design undertaken at present would be
premature. It has been reported that the Department of Defense is about to
let a contract to develop and fabricate a large number of cells.
It appears reasonable then to restrict our efforts on the design of
an adaptive voter to that of monitoring the state of the art in device
development and to begin detailed circuit design when suitable cells become
available.
References
1) B. Widrow and M. E. Hoff, "Adaptive Switching Circuits," Technical
Report No. 1553-1, Stanford Electronics Laboratories, June 1960.
2) B. Widrow, "Adaptive Sampled-Data Systems - A Statistical Theory
of Adaption," 1959 WESCON Convention Record, part 4.
3) B. Widrow, "An Adaptive 'Adaline' Neuron Using Chemical Memistors,"
Technical Report No. 1553-2, Stanford Electronics Laboratories,
October 1960.
4) "An Introduction to Solions," Texas Research and Electronic Corp.,
Dallas, June 1961.
5) "D-C Amplifier Uses Fluid-State Tetrode," Electronic Products
Magazine, October 1962.
6) "Capacitive Readout Integrator," Technical Brochure, Curtis
Instruments, Inc., Mount Kisco, New York.
7) J. A. Rajchman and A. W. Lo, "The Transfluxor," Proceedings of the
I.R.E., March 1956.
8) A. E. Brain, "The Simulation of Neural Elements by Electrical Networks
based on Multi-Aperture Magnetic Cores," Proceedings of the
I.R.E., January 1961.
9) J. K. Hawkins and C. J. Munsey, "A Magnetic Integrator for the Perceptron
Program," Annual Summary Report, Publication No. U-603, Aeronutronics,
Newport Beach, Calif., July 30, 1960.
10) H. S. Crafts, "A Magnetic Variable Gain Component for Adaptive Networks,"
SEL-62-147, Technical Report 1851-2, Stanford Electronics
Laboratories, December 1962.
11) G. Nagy, "Analogue Memory Mechanisms for Neural Nets," Cognitive
Systems Research Program, Contract No. Nonr 401(40), Report No. 3,
Cornell University, Ithaca, New York, August 31, 1962.
Appendix 4
TRANSOR ANALYSIS
Transor Decision Functions
And
Statistical Measurement of Quality
Contract Nasw-572
Reference WGD-38521
by
1t. S. Bray
P.A. Jensen
C.G. Masters
September 1963
APPROVED:
Director
The Westinghouse Electric Corporation
Electronics Division
Box 1897, Baltimore 3, Maryland
38521-5050
TABLE OF CONTENTS
I. INTRODUCTION
II. RESTORING CIRCUIT MODELS
A. The Transor Decision Function
B. The Threshold Decision Function
III. FAILURE MODES
A. Transor Restoring Circuit Vulnerability
B. Threshold Restoring Circuit Vulnerability
IV. RELIABILITY ANALYSIS
A. Transor Reliability Defined
B. Output Modes Defined
C. Upper Bound on Transor Reliability
D. Transor Reliability for Strictly Asymmetric Failure Modes
E. Transor Reliability for Mutually Exclusive Output Failure Modes
F. Transor Reliability for Symmetrical Environment
V. CONCLUSION
BIBLIOGRAPHY
I. INTRODUCTION
In recent years many novel schemes have been proposed to improve digital system
reliability through the use of "redundant" equipment. Several of these, patterned after
a concept of Von Neumann, 1 require a "restoring organ," "restorer" or "voter" to be placed
after each set of redundant signal processors which perform a particular subsystem
function. A restoring organ receives an input from each member of the associated set of
processors. From these nominally identical input signals, the restoring organ produces
an estimate of the correct subsystem output based on one or more specified decision
criteria. It should be noted that the restorer does not perform any data processor function
but acts as an error correcting transmission channel connecting two signal processors.
It has been shown in the literature 2 that the theoretically most efficient restoring
organ is one that is capable of adapting itself to changes in the reliability of inputs.
Specifically, for threshold type organs it has been shown that the optimum use of n unreliable
versions of the same signal could be achieved by dynamically weighting each input in accor-
dance with its relative reliability. Inputs which have a past history of being more reliable
are given the heavier vote weights, and the unreliable inputs the lighter vote weights.
The ideal restoring organ would sense the unreliable inputs and decide on the optimal vote
weights. By efficiently tailoring the restoring organ to its ever-changing environment,
significant improvement could be achieved over the presently popular majority restoring
circuits.
In studying adaptive restoring organs, Westinghouse has shown 3 that circuit imple-
mentation of adaptive restoring organs for the specific requirements of redundant space-
borne systems is not yet practical. The complex circuitry required under the present
"state of the art" to perform the adaptive function results in machines too cumbersome and
unreliable to compete with less sophisticated redundant systems. This does not mean
though that the present restoring organs used in redundancy techniques are adequate and
cannot be improved upon.
The purpose of this study is to investigate a new restoring organ proposed by Westing-
house, called the Transor 4. A characteristic of many failed subsystems is their tendency
to have steady-state outputs as their dominant failure mode. In Transor, steady-state
outputs are automatically deweighted by detecting only changes in states rather than the
absolute states themselves. In an environment where the probability of steady-state
failure is relatively high, a restoring organ which ignores its steady-state inputs can derive
a correct output with less than a majority of working inputs.
The salient characteristics of the Transor restoring organ are best shown by contrasting
them to the corresponding characteristics of a majority restoring organ. The majority
organ was chosen as a reference base because of its similarity in function to the Transor
and because it is presently the most widely used restoring organ.
1, 2, 3, 4: See Bibliography.
II. RESTORING CIRCUIT MODELS
A. THE TRANSOR DECISION FUNCTION
To be consistent with the terminology adopted by the other investigators of the
Westinghouse Electronics Division, the term "restoring circuit" will be used to denote one
functional unit of a restoring organ or restorer. A very general block diagram of a
Transor restoring circuit having binary inputs $(x_1, x_2, \ldots, x_R)$ and an output z is shown in
figure T-1.
Figure T-1. Transor Restoring Circuit (block diagram: SUM, CHANGE DETECTOR, OUTPUT MEMORY, output z)
Some of the salient characteristics of a Transor restoring circuit are noted below:
1) It has memory.
2) It operates only on the number of changes in the states of
individual inputs between two adjacent bit times, (t - 1) and
(t).
3) It is a binary voting element with a binary output.
4) It has two thresholds, not necessarily of the same magnitude,
which combine with the states of the input at (t - 1) and (t)
to determine the element output.
The functional relationship describing the Transor decision function can be stated as
follows:
$$z(t) = f\left[\, z(t-1);\ (x_1, x_2, \ldots, x_R)(t);\ (x_1, x_2, \ldots, x_R)(t-1);\ T_0;\ T_1 \,\right] \tag{1}$$
The number of binary Ones appearing on its inputs during each bit time are summed and
compared with the number present during the previous time period. If the change is positive
and equal to or greater than a given threshold $T_1$, then the output z is forced to a binary One.
If the change is negative and equal to or greater in magnitude than a second threshold, $T_0$,
the output is forced to a binary Zero. If neither threshold is reached, the output does not change
from its previous state. This operation may be summarized by the following decision rule statements.
$$\sum_{i=1}^{R} x_i(t) - \sum_{i=1}^{R} x_i(t-1) \ge T_1 \;\rightarrow\; z(t) = 1 \tag{2}$$
$$\sum_{i=1}^{R} x_i(t-1) - \sum_{i=1}^{R} x_i(t) \ge T_0 \;\rightarrow\; z(t) = 0 \tag{3}$$
$$\text{otherwise} \;\rightarrow\; z(t) = z(t-1) \tag{4}$$
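The decision rules (2) through (4) can be sketched in a few lines of Python. This is only an illustrative model, not the circuit itself: the function names, the list representation of the input lines, and the example in which four of five inputs are stuck are all assumptions introduced here, and the comparisons are taken as "equal to or greater than" the thresholds.

```python
from typing import List, Sequence


def transor_step(prev_inputs: Sequence[int], curr_inputs: Sequence[int],
                 prev_output: int, t1: int, t0: int) -> int:
    """Apply decision rules (2)-(4) for one bit time.

    t1 and t0 are assumed to be positive integers, so at most one rule fires.
    """
    change = sum(curr_inputs) - sum(prev_inputs)  # net change in the number of Ones
    if change >= t1:       # rule (2): enough net Zero-to-One transitions
        return 1
    if -change >= t0:      # rule (3): enough net One-to-Zero transitions
        return 0
    return prev_output     # rule (4): hold the remembered output


def run_transor(frames: Sequence[Sequence[int]], initial_output: int,
                t1: int, t0: int) -> List[int]:
    """Return the outputs for frames[1:], starting from initial_output."""
    outputs = []
    z = initial_output
    for prev, curr in zip(frames, frames[1:]):
        z = transor_step(prev, curr, z, t1, t0)
        outputs.append(z)
    return outputs


# Hypothetical five-input example: x1 is stuck at One, x2..x4 are stuck at Zero,
# and only x5 follows the correct signal 0, 1, 1, 0.
frames = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
outputs = run_transor(frames, initial_output=0, t1=1, t0=1)
```

With both thresholds set to one, the single working line drives the output through 1, 1, 0, since the stuck lines contribute no change to the sums.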
B. THE THRESHOLD DECISION FUNCTION
The threshold model* consists of a black box having a certain number of binary inputs
$(x_1, x_2, \ldots, x_R)$ and an output z. At any bit time (t) the state of the output line z is a
function of the state of the input lines and the threshold T. A general relationship similar
to equation (1), but describing the threshold decision function, may be delineated by the
following expression.
$$z(t) = f\left[\, (x_1, x_2, \ldots, x_R)(t);\ T \,\right] \tag{5}$$
If the output, z, can assume either a Zero or One state, the threshold restoring circuit
makes a decision to force its output to the One state under the following decision rule:
$$\sum_{i=1}^{R} x_i(t) \ge T \;\rightarrow\; z(t) = 1 \tag{6}$$
and to the Zero state when
$$\sum_{i=1}^{R} x_i(t) < T \;\rightarrow\; z(t) = 0 \tag{7}$$
* The majority gate is a threshold model with $T = \frac{R+1}{2}$, where R is the number of inputs.
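For comparison with the Transor rules, decision rules (6) and (7) and the majority threshold of the footnote can be sketched as follows. The function names are illustrative assumptions, and the majority threshold is computed only for an odd number of inputs, which is what the expression (R + 1)/2 presupposes.

```python
from typing import Sequence


def threshold_vote(inputs: Sequence[int], threshold: int) -> int:
    """Rules (6) and (7): output One when the count of Ones reaches the threshold."""
    return 1 if sum(inputs) >= threshold else 0


def majority_threshold(r: int) -> int:
    """Threshold T = (R + 1) / 2 of a majority gate with an odd number of inputs R."""
    if r % 2 == 0:
        raise ValueError("the majority threshold (R + 1) / 2 assumes an odd R")
    return (r + 1) // 2


# Three-input majority gate: two Ones carry the vote, a single One does not.
t_majority = majority_threshold(3)
vote_two_ones = threshold_vote([1, 1, 0], t_majority)
vote_one_one = threshold_vote([1, 0, 0], t_majority)
```

Unlike the Transor sketch, this function takes no previous state: the decision depends only on the inputs present at bit time (t).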
III. FAILURE MODES
A. TRANSOR RESTORING CIRCUIT VULNERABILITY
Before the reliability of any Transor network can be expressed in a meaningful
mathematical form, the failure modes of the individual subsystems appearing at the
Transor's inputs must be explicitly stated.
A characteristic of Transor is its ability to differentiate between transitional and
steady-state failures. This property creates failure modes different from those of
threshold decision. Specifically, a signal processor is assumed either to be working
correctly or failed into one of the following modes:
1. The transitional mode, in which extra Ones and/or extra Zeros
appear at the output, and
2. The steady-state mode, in which the output permanently remains
in a single state.
A transition (figure T-2) is defined as the rise or fall of a pulse during its switching
time. The restoring circuit executes a decision by vector summing the change
in input pulses on the R redundant lines during the vote interval, and a decision is made
according to the decision rules (2) through (4).
Figure T-2. Transition Intervals
The term "extra One" implies a One has
appeared on a signal processor's output when it should have been a Zero. By going to the
wrong state a signal processor creates a wrong transition which is voted by the Transor.
Wrong transitions can occur through diode failures in the gating section of diode-transistor
type signal processors. These failures sporadically generate "extra" Ones or "extra"
Zeros as a function of the information at the gate's inputs. To illustrate, consider a
three input Transor voting on the output of a network of redundant AND gates. The
state of the binary inputs may be represented by the state vector $S_i(t)$ below.
$$S_i(t) = \begin{bmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \end{bmatrix}$$
In figure T-3 a diode is assumed to have opened in branch (1) of two of the gates causing
those branches to appear as Ones. An erroneous One will appear at the gate's output
whenever a correct Zero appears on those inputs and correct Ones appear on the remainder
of the inputs. However, if all the input diodes open or an output element fails, the gating
function will be destroyed, and the output will assume a steady-state. A method for
determining the probability that a signal processor will fail into either of these two modes is
discussed in Appendix I.
Figure T-3. Generation of Wrong Transitions in Redundant AND Gates
Because transitions are vector quantities their occurrence in the wrong direction
may threaten Transor performance in three ways:
1. Wrong transitions cancelling correct transitions.
2. Wrong transitions occurring while the correct inputs remain in
the same state (a series of Ones or Zeros). During this time
the correct inputs have lost their voting power.
3. Wrong transitions temporarily simulating steady-state failures.
Wrong transitions produced by "extra Ones" and/or "extra Zeros" over a sequence of bit
times can result in "error correlation" and create a variety of failure modes, subject to
the nominally correct input states to the Transor for the considered sequence.
Figure T-4 shows this more clearly when state vectors are used to represent the inputs
to a five input Transor. Inputs $x_1$ and $x_2$ are assumed to have failed and to be capable of
randomly producing wrong transitions in either direction, i.e., extra Ones or Zeros.
No inputs are assumed failed to a steady-state. For definiteness all inputs at time (t)
may be assumed correct. In the following bit times (proceeding to the right) several
failure patterns are possible for each nominally correct input state. At (t+1) the states
(2), (3), (4), and (5) are considered among the possible states (four other possible states
including (1) have been omitted as repetitious). Observe that sequence (1) → (2)
is the most damaging because only the wrong transitions have any voting power. For a
threshold set as low as two this would result in a wrong decision. The sequence (1) → (5)
represents a possibility in which both erroneous inputs have temporarily "stuck" in one
state, simulating a temporary steady-state. The sequences (1) → (3) and (1) → (4) are
the most likely possibilities, in which one of the failed inputs is temporarily correct. In
the next bit time (t + 2) transitions to the possible states (3), (4), (5), (6) and (7) are
considered (again repetitions are omitted). Shown here are the cancellation effects
caused by the introduction of errors on the previous bit time, demonstrating the "error
correlation" inherent in Transor. The sequence (2) → (5) is the most damaging because
any threshold greater than one would have resulted in a wrong decision. Observe the
tradeoff conflict created by the necessity for setting the threshold at a value greater than
two in the sequence (1) → (2) and the same threshold at a value less than two in the
sequence (2) → (5) in the following bit time. Clearly there must exist an optimum
threshold. Inclusion in figure T-4 of transitions from states (4) and (5) would have
produced no new failure modes since they are but the duals of (2) and (3).
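The threshold conflict described above can be reproduced with a small sketch. The three state vectors below are hypothetical and are not taken from figure T-4; they assume that the two failed inputs both make a wrong transition while the correct inputs hold, and then cancel part of the correct transition in the following bit time. A single threshold T is assumed for both directions.

```python
def transor_step(prev_inputs, curr_inputs, prev_output, threshold):
    """Transor decision for one bit time with equal thresholds T0 = T1 = threshold."""
    change = sum(curr_inputs) - sum(prev_inputs)
    if change >= threshold:
        return 1
    if -change >= threshold:
        return 0
    return prev_output


# Hypothetical five-input sequence; x1 and x2 are the failed inputs.
STATE_T = [0, 0, 0, 0, 0]    # time t: all inputs correct, correct value is Zero
STATE_T1 = [1, 1, 0, 0, 0]   # time t+1: two extra Ones, correct inputs unchanged
STATE_T2 = [0, 0, 1, 1, 1]   # time t+2: correct inputs go to One, failed inputs fall back
CORRECT_OUTPUTS = [0, 1]     # correct output at t+1 and at t+2


def outputs_for(threshold):
    """Outputs at t+1 and t+2 for a given threshold, starting from output Zero."""
    z1 = transor_step(STATE_T, STATE_T1, 0, threshold)
    z2 = transor_step(STATE_T1, STATE_T2, z1, threshold)
    return [z1, z2]


results = {t: outputs_for(t) for t in range(1, 6)}
```

In this sketch a threshold of two or less lets the wrong transitions at (t+1) force the output to One, while a threshold of three or more masks the net change of +1 at (t+2), so no single threshold yields the correct pair of outputs for this particular sequence.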
Figure T-4. Possible Sequences of Input States for a Five Input Transor Over Two Bit Times
B. THRESHOLD RESTORING CIRCUIT VULNERABILITY
A threshold restoring circuit makes a decision at time (t) by summing the number of
binary Ones appearing momentarily at its inputs. The decision is independent of the input
state at time (t - 1). By virtue of decision rule (6), if the number of errors appearing on
the restorer's inputs reaches the threshold T the restorer makes the wrong output
decision. As opposed to Transor, the threshold device cannot differentiate between pure
wrong transitions and steady-state failures, so that both failure modes may be lumped
together. To illustrate, consider a three-input threshold restoring circuit whose threshold
is set at two (T = 2). For definiteness assume that $x_1$ and $x_2$ at time (t) are in error and in
the same state and $x_3$ is correct, as indicated below, where x denotes the correct state and
$\bar{x}$ its complement.
$$\begin{bmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \end{bmatrix} = \begin{bmatrix} \bar{x} \\ \bar{x} \\ x \end{bmatrix} \;\rightarrow\; z(t) = \bar{x}$$
Under this condition a wrong decision will be made. This may be considered a "worst case"
failure mode because the alternate situation is possible where $x_1$ and $x_2$ have failed into
opposite steady states.
$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ x_3 \end{bmatrix} \;\rightarrow\; z = x_3$$
In this case the errors nullify each other and the restoring circuit's output will follow
the single correct input ($x_3$). In most reliability analyses the "worst case" is assumed, and
any two failures in a set of restoring circuit inputs are assumed to cause system failure.
IV. RELIABILITY ANALYSIS
A. TRANSOR RELIABILITY DEFINED
In keeping with the usual concept of reliability, the reliability of a Transor restoring
circuit will be defined as the probability that it never makes a wrong decision during its
mission time. For analysis purposes the Transor itself is assumed perfectly reliable,
i.e., a wrong decision is never made through component failure within the Transor itself.
In part III it was shown that errors appearing on the Transor inputs in a particular bit
time could be correlated with errors that appeared on adjacent bit times to produce unique
failure modes. Two of these were:
(1) Cancellation effects
(2) Simulated steady-state
In the following discussion it will be shown how these failure modes may be "built in"
to reliability models by using multinomial expansions. Analytical models formulated in this
manner may be easily compared with models for threshold reliability.
B. OUTPUT MODES DEFINED
Any output of a binary signal processor can be classified into one of six mutually
exclusive classes over the element's mission time. These are:
1) Correct
2) Continuous Zero state
3) Continuous One state
4) Extra Ones but no extra Zeros
5) Extra Zeros but no extra Ones
6) Both extra Ones and Zeros.
Moreover the output of a system composed of binary signal processors may be defined
by the six mutually exclusive classes above. Each of these classes will be assigned the
following probability measures in conformance with the Transor decision rules.
1) $p$; the probability that the output is correct.
2 & 3) $q_s$; the probability that the output is either a continuous Zero or a
continuous One.
4) $q_1$; the probability that the output generates extra Ones, but not extra Zeros.
5) $q_0$; the probability that the output generates extra Zeros, but not extra Ones.
6) $q_{10}$; the probability that the output generates both extra Ones and Zeros randomly.
Note that the measure $q_s$ is the result of the union of classes (2) and (3). The transitional
probabilities $q_1$, $q_0$ and $q_{10}$ are defined to represent only the probabilities that a particular
set of components, whose failure will cause wrong transitions to be generated randomly,
will fail.
C. UPPER BOUND ON TRANSOR RELIABILITY
An upper bound on reliability is easily obtained by excluding all but steady-state failures
from the environment. If $\beta$ is a random variable denoting the number of correct transitions
(or working inputs) and $\gamma$ the number of inputs failed to a steady-state, a probability
density function may be defined over the sample space as
$$\phi = \binom{R}{\beta,\ \gamma}\, p^{\beta}\, q_s^{\gamma}$$
Since Transor ignores steady-state failures the only criterion for a correct decision
is that
$$\beta \ge T_0, \qquad \beta \ge T_1$$
The corresponding limits on $\gamma$ are
$$\gamma \le R - T \tag{8}$$
where $T = T_0 = T_1$. The reliability is
$$\text{Reliability} = \sum_{\beta = T}^{R} \binom{R}{\beta}\, p^{\beta}\, q_s^{R-\beta} \tag{9}$$
In an environment capable of producing only steady-state failures, the maximum
reliability and error correction capability is obtained by setting T = 1. This is the optimum
threshold. From equation (8) we see that Transor can correct at best R - 1 failures in
an order R redundant system.
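The upper bound can be evaluated numerically with a short sketch. It assumes, as in this section, that each input is either working (probability p) or failed to a steady state (probability 1 - p), and it compares the Transor with T = 1 against a "worst case" majority restorer that needs a majority of working inputs. The function names and the numerical values are illustrative only.

```python
from math import comb


def transor_upper_bound(r: int, t: int, p: float) -> float:
    """Sum over beta = T..R of C(R, beta) p^beta q_s^(R - beta), with q_s = 1 - p."""
    q_s = 1.0 - p
    return sum(comb(r, b) * p**b * q_s**(r - b) for b in range(t, r + 1))


def majority_worst_case(r: int, p: float) -> float:
    """Probability that at least (R + 1) / 2 of R inputs are working (odd R assumed)."""
    return transor_upper_bound(r, (r + 1) // 2, p)


# Illustrative numbers: three redundant lines, each working with probability 0.9.
bound_t1 = transor_upper_bound(3, 1, 0.9)   # fails only if all three lines are stuck
majority = majority_worst_case(3, 0.9)      # needs at least two working lines
```

With T = 1 the sum collapses to one minus the probability that every input has failed to a steady state, which is the R - 1 failure correction capability noted above.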
D. TRANSOR RELIABILITY FOR STRICTLY ASYMMETRIC FAILURE MODES
Excluding from the mutually exclusive ways an environment can fail class (6) and
either class (4) or (5) limits transitional failure modes to states (2), (3), (4) and (5) in
figure T-4. Of these the sequence (1) → (2) is the "worst case". For definiteness let it be
assumed that Transor inputs may produce only extra Zeros and steady-state failures. Let
$\alpha_0$ be a random variable denoting the number of wrong transitions to the Zero state.
The density function on this sample space is
$$\phi = \binom{R}{\beta,\ \gamma,\ \alpha_0}\, p^{\beta}\, q_s^{\gamma}\, q_0^{\alpha_0}$$
A wrong decision will be made unless
$$\alpha_0 \le T_0 - 1$$
Since it is necessary that
$$\beta \ge T_0$$
the limits on $\gamma$ must be
$$\gamma \le R - T_0 - \alpha_0$$
The reliability is
$$\text{Reliability} = \sum_{\alpha_0 = 0}^{T_0 - 1}\ \sum_{\gamma = 0}^{R - T_0 - \alpha_0} \binom{R}{R - \alpha_0 - \gamma,\ \alpha_0,\ \gamma}\, p^{R - \alpha_0 - \gamma}\, q_s^{\gamma}\, q_0^{\alpha_0} \tag{10}$$
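Equation (10) is a double sum over a trinomial density and can be checked numerically. The sketch below is one possible evaluation; the helper names and the probability values are assumptions made for illustration, with p + q_s + q_0 = 1.

```python
from math import factorial


def multinomial(n: int, *parts: int) -> int:
    """Multinomial coefficient n! / (k1! k2! ...); zero if the parts do not sum to n."""
    if min(parts) < 0 or sum(parts) != n:
        return 0
    coeff = factorial(n)
    for k in parts:
        coeff //= factorial(k)
    return coeff


def transor_reliability_asymmetric(r: int, t0: int, p: float,
                                   q_s: float, q_0: float) -> float:
    """Evaluate equation (10): extra Zeros and steady-state failures only."""
    total = 0.0
    for a0 in range(0, t0):                      # alpha_0 = 0 .. T0 - 1
        for g in range(0, r - t0 - a0 + 1):      # gamma = 0 .. R - T0 - alpha_0
            b = r - a0 - g                       # remaining inputs are working
            total += multinomial(r, b, a0, g) * p**b * q_s**g * q_0**a0
    return total


# Illustrative values for a three-input Transor.
rel_t1 = transor_reliability_asymmetric(3, 1, 0.8, 0.1, 0.1)
rel_t2 = transor_reliability_asymmetric(3, 2, 0.8, 0.1, 0.1)
```

For these assumed probabilities the threshold of two gives the higher reliability, which illustrates that T = 1 is no longer optimum once transitional failures are admitted.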
E. TRANSOR RELIABILITY FOR MUTUALLY EXCLUSIVE OUTPUT FAILURE MODES
The scope of the environment considered in part D can be broadened to include both
the mutually exclusive classes (4) and (5). Each input may be failed to either steady-state,
extra Ones or extra Zeros (but not both). The failure modes (figure T-5) may be represented
in a manner similar to figure T-4; inputs $x_1$ and $x_2$ are assumed failed in one of the
mutually exclusive ways listed above.
The sample space may be described by the density function
$$\phi = \binom{R}{\beta,\ \alpha_0,\ \alpha_1,\ \gamma}\, p^{\beta}\, q_0^{\alpha_0}\, q_1^{\alpha_1}\, q_s^{\gamma}$$
Figure T-5. Possible Sequences for a Five-Input Transor with Mutually Exclusive Output Failure Modes
The sequence (1) → (2) in figure T-5 implies that a Transor will make a wrong
decision unless
$$\alpha_0 \le T_0 - 1 \tag{11}$$
and its dual
$$\alpha_1 \le T_1 - 1 \tag{12}$$
From the sequences (1) → (3) and (1) → (4) respectively
$$\beta + \alpha_1 \ge T_1 \tag{13}$$
$$\beta + \alpha_0 \ge T_0 \tag{14}$$
for a correct decision. However, examination of the sequences (3) → (4) and (4) → (3)
shows that inequalities (13) and (14) do not represent "worst cases". "Error correlation"
between the bit times (t + 1) and (t + 2) has produced a temporary steady-state. A correct
decision will be made only if
$$\beta \ge T_0 \tag{15}$$
$$\beta \ge T_1 \tag{16}$$
From (15) and (16)
$$\gamma \le (R - T_0) - \alpha_1 - \alpha_0 \tag{17}$$
$$\gamma \le (R - T_1) - \alpha_1 - \alpha_0 \tag{18}$$
Of these last two inequalities the number of allowable steady-state failures, $\gamma$, will be
governed by the highest threshold, $T_0$ or $T_1$.
The reliability will take the form
$$\text{Reliability} = \sum_{\alpha_0 = 0}^{T_0 - 1}\ \sum_{\alpha_1 = 0}^{T_1 - 1}\ \sum_{\gamma = 0}^{R - T_0 - \alpha_0 - \alpha_1} \binom{R}{R - \alpha_0 - \alpha_1 - \gamma,\ \alpha_0,\ \alpha_1,\ \gamma}\, p^{R - \alpha_0 - \alpha_1 - \gamma}\, q_0^{\alpha_0}\, q_1^{\alpha_1}\, q_s^{\gamma} \tag{19}$$
where $T_0$ is assumed $\ge T_1$.
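F. TRANSOR RELIABILITY FOR A SYMMETRICAL ENVIRONMENT
A symmetrical environment utilizing Transor decision will be defined as the mutually
exclusive classes (1), (2), (3) and (6). Wrong transitions may occur in both directions and at
random. Therefore $\alpha_0 = \alpha_1 = \alpha$ and $T_0 = T_1 = T$. The density function on this sample
space may be written as
$$\phi = \binom{R}{\beta,\ \alpha,\ \gamma}\, p^{\beta}\, q_{10}^{\alpha}\, q_s^{\gamma}$$
From figure T-4 it can be seen that a wrong decision will be made unless
$$\alpha \le T - 1 \tag{20}$$
and
$$\beta - \alpha \ge T \tag{21}$$
From (21), with $\beta = R - \alpha - \gamma$,
$$\gamma \le R - T - 2\alpha \tag{22}$$
Therefore the reliability for the symmetrical environment is
$$\text{Reliability} = \sum_{\alpha = 0}^{T - 1}\ \sum_{\gamma = 0}^{R - T - 2\alpha} \binom{R}{R - \alpha - \gamma,\ \alpha,\ \gamma}\, p^{R - \alpha - \gamma}\, q_{10}^{\alpha}\, q_s^{\gamma} \tag{23}$$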
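Equation (23) lends itself to a simple numerical search for the threshold that maximizes reliability. The sketch below is offered as an illustration under stated assumptions: the function names are invented here, the error probabilities are arbitrary example values with p + q_10 + q_s = 1, and the search simply tries every threshold from 1 to R.

```python
from math import factorial


def multinomial(n: int, *parts: int) -> int:
    """Multinomial coefficient n! / (k1! k2! ...); zero if the parts do not sum to n."""
    if min(parts) < 0 or sum(parts) != n:
        return 0
    coeff = factorial(n)
    for k in parts:
        coeff //= factorial(k)
    return coeff


def transor_reliability_symmetric(r: int, t: int, p: float,
                                  q_10: float, q_s: float) -> float:
    """Evaluate equation (23) for a symmetrical environment."""
    total = 0.0
    for a in range(0, t):                         # alpha = 0 .. T - 1
        for g in range(0, r - t - 2 * a + 1):     # gamma = 0 .. R - T - 2 alpha
            b = r - a - g                         # working inputs
            total += multinomial(r, b, a, g) * p**b * q_10**a * q_s**g
    return total


def best_threshold(r: int, p: float, q_10: float, q_s: float) -> int:
    """Threshold in 1..R that maximizes equation (23) for the given probabilities."""
    return max(range(1, r + 1),
               key=lambda t: transor_reliability_symmetric(r, t, p, q_10, q_s))


# Illustrative five-input case.
rel_by_t = {t: transor_reliability_symmetric(5, t, 0.9, 0.05, 0.05) for t in range(1, 6)}
t_opt = best_threshold(5, 0.9, 0.05, 0.05)
```

For these example probabilities the maximum falls at T = 2: a threshold of one admits no transitional failures at all, while larger thresholds give up too many allowable steady-state failures.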
V. CONCLUSION
The dynamic characteristics of the Transor decision function make this type restoring
circuit unique to the present art. The mission of this part of the Failure Free Systems
Study has been to evaluate the potential usefulness of the Transor as a restoring circuit.
Primarily because it is most commonly used in present redundant equipment, the threshold
type restoring circuit has been chosen as the reference point for the evaluation. It has
been hypothesized that, if it can be shown that the Transor failure masking capability
compares favorably to that of the threshold restoring circuit, further development, including the
construction of a breadboard model, should be justified.
The results of section IV have shown that there are certain environments in which
Transor can be used to advantage in improving system reliability. For example, the
maximum error restoring capability of Transor is shown to be R-1 failures of R redundant
lines in an environment free from transitional failures. This is a significant improvement
over the majority threshold restoring capability under the same conditions. There is need
for caution, however, for in environments where symmetrical transitional errors are
possible, error correlation may make Transor performance inferior to threshold. From the
reliability models, a tradeoff may be determined in terms of the output error probabilities
of the environment.
The work done up to this point represents only a first step in Transor decision study.
Work yet to be done includes: (1) a general Transor reliability model incorporating all the
possible failure modes and (2) a decision rule for determining an optimum threshold.
In addition to continuing the analytical effort described in this report, a computer sim-
ulation program is being written to aid in the task (1) effort. This will be a relatively simple
but versatile program designed to accommodate any set of restricting assumptions including
those made in the four models derived in this report. The results of this report have shown
a solution to task (2) would be desirable because of the tradeoffs between different failure
modes. If the error probabilities of the signal processor outputs are known in the design
stage maximum reliability can be bought for zero additional cost by a judicious choice of
the thresholds.
VI. APPENDIX
Determination of the Reliability Parameters $p$, $q_s$, $q_0$, $q_1$, $q_{10}$ in a Signal
Processor.
In section IV it was shown that reliability models could be formulated in terms of the
output error probabilities of a set of redundant signal processors. This section describes
a method for determining these probabilities.
Consider a set X* which has for its members the n components of a signal processor.
Each member (component) has two possible states:
$x_i$; the i-th member is working.
$\bar{x}_i$; the i-th member has failed.
Let each component have a reliability
$$P(x_i) = e^{-\lambda_i t}$$
and a probability of failure
$$P(\bar{x}_i) = 1 - e^{-\lambda_i t}$$
The probability measure on the sample space of X may be partitioned into the canonical
form
$$1 = P(x_1 \cap x_2 \cap \cdots \cap x_n) + P(\bar{x}_1 \cap x_2 \cap \cdots \cap x_n) + P(x_1 \cap \bar{x}_2 \cap x_3 \cap \cdots \cap x_n) + \cdots + P(\bar{x}_1 \cap \bar{x}_2 \cap \cdots \cap \bar{x}_n) \tag{24}$$
Briefly, the method requires determining the correspondence between groups of the terms
in (24) and the individual terms in
$$1 = p + q_s + q_0 + q_1 + q_{10} \tag{25}$$
Obviously the parameter p, the probability that the signal processor output is correct, is
$$p = P(x_1 \cap x_2 \cap \cdots \cap x_n)$$
The remaining $2^n - 1$ terms in (24) are mapped into the four remaining parameters in (25) by
partitioning the set X into subsets whose members are defined by those components whose
failure will result in one of the four mutually exclusive events described in part IV. Specifically let
$X_s$ be the set whose failure results in either a steady-state Zero or One.
$X_1$ be the set whose failure results in extra Ones.
$X_0$ be the set whose failure results in extra Zeros.
$X_{10}$ be the set whose failure results in extra Ones and Zeros.
* A summary of all the notation used is included on the last page of this appendix.
Since each component may fail by shorting or opening, these two modes will determine
membership in one or more of the above sets. If the probability of a component shorting
given that it has failed, $P(x_i^s \mid \bar{x}_i)$, is $\rho_i$, then the joint probability of $x_i$ failing and shorting
is
$$P(\bar{x}_i \cap x_i^s) = P(x_i^s) = \rho_i \left(1 - e^{-\lambda_i t}\right)$$
Let the probability of an $x_i$ opening given that it has failed be $P(x_i^o \mid \bar{x}_i)$.
Then
$$P(x_i^s \mid \bar{x}_i) + P(x_i^o \mid \bar{x}_i) = 1$$
and
$$P(x_i^o \mid \bar{x}_i) = 1 - \rho_i$$
Also, since for each $x_i$ the events working, shorted or opened are mutually exclusive, the
probability of a component not shorting is
$$P(\overline{x_i^s}) = P(x_i \cup x_i^o) = 1 - \rho_i \left(1 - e^{-\lambda_i t}\right)$$
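The three mutually exclusive component states can be tabulated with a short sketch. The function name and the numerical failure rate, short fraction and mission time are hypothetical values chosen only to exercise the formulas above.

```python
from math import exp


def component_state_probabilities(lam: float, rho: float, t: float) -> dict:
    """Probabilities of the working, shorted and open states of one component.

    lam is the failure rate, rho the probability of a short given a failure,
    and t the mission time (all hypothetical inputs in this sketch).
    """
    p_working = exp(-lam * t)
    p_failed = 1.0 - p_working
    return {
        "working": p_working,
        "shorted": rho * p_failed,          # P(failed and shorted)
        "open": (1.0 - rho) * p_failed,     # P(failed and open)
    }


# Hypothetical component: failure rate 1e-6 per hour, 30 percent of failures are shorts.
probs = component_state_probabilities(lam=1e-6, rho=0.3, t=1000.0)
p_not_shorted = probs["working"] + probs["open"]
```

The sum of the three entries is one, and the probability of not shorting is the sum of the working and open entries, in agreement with the last expression above.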
To illustrate the technique a NAND gate will be analyzed using the test results contained in
an earlier Westinghouse report. 5
Figure AT-1. NAND Gate
The pertinent results are included below.
1. AND gate input diodes; CR1, CR2, CR3
A. OPEN - Any open circuit input is equivalent to a logical "one" on that input; it
cannot inhibit the AND gate.
B. SHORT - A shorted diode will not affect the ability to perform the AND function if
that input has low impedance to ground in the "zero" state and high impedance to a
positive voltage in the "one" state. The line with a shorted diode is no longer
isolated from other inputs; that line is shorted to the AND gate output and may,
therefore, be an incorrect "zero".
2. AND gate resistor; R4
A. OPEN - The AND gate has no voltage available to drive current into the transistor
base, so the NAND gate output remains a "one".
B. SHORT- This will cause a low impedance path from the +12 volt power supply
through the input diodes to all of the inputs to the gate. If any of these inputs
are from NAND gate transistors which are conducting, that input will also be a
low impedance to ground. A low impedance path then exists from the power
supply to ground, and a high current will flow through the diode and transistor
according to the magnitude of the impedance of the power supply and components
involved. In the tests observed, this current was not sufficient to damage the
transistor or diode and did not blow the fuse on the power supply. However, if
any inputs are from flip-flops, the clamp diode will turn on when the voltage
exceeds the clamp voltage. A low impedance path then exists from the +12 volt
power supply through the shorted AND gate resistor, the input diode, and
the clamp diode, and may seriously overload the clamp voltage supply, depending
on how the clamp voltage is derived. In the tests observed, this current was sufficient to cause
both the input diode and clamp diode to short and the clamp voltage to rise
toward +12 volts.
3. Input resistor - capacitor; R5, C9
A. Resistor SHORT- The transistor base voltage will be the AND gate output.
This will normally cause the transistor to conduct, so that the output will
be "zero" for any logic input.
B. Resistor OPEN- This will cause the transistor to be off, so that the output will
be a "one" for all logic inputs.
C. OPEN C9 - This does not adversely affect operation, unless the switching time is
critical, in which case NAND gate turn-on time was increased from 65 nanoseconds
with C9 to 80 nanoseconds without C9; turn-off time was increased from 25 to 45
nanoseconds in one approximate measurement with a constant load on the output
of the circuit. The turn-on time was measured as the time from the input going
positive above +1.6 v. until the output goes to +1.6 v. from the "one" state. The
turn-off time was measured as the time from the input going negative below +2.4 v.
until the output goes to +2.4 v. from the "zero" state.
4. Base bias resistor, R6
A. OPEN - This will normally cause the transistor to conduct, so that the output will
be "zero" for any logic input, except that when the AND gate voltage is going
negative from the "one" state, this voltage change is coupled across C9 and will
turn the transistor off until the transient effect has ended.
B. SHORT- The short of the base resistor may cause damage to the output transistor,
since -12 volts on the base exceeds the maximum rating of 5 volts for VEB O. The
output voltage will depend on the failure mode, if any, of the transistor. In three
multiple failure tests that included short of the base bias resistor in a NAND gate,
two transistors shorted base to collector, which resulted in a -12 volt output;
one transistor shorted collector to emitter, which resulted in a "zero" output.
The -12 volt output did not differ significantly from a normal "zero"
output as far as the following circuitry was concerned.
5. Collector (output) resistor, R8
A. OPEN- The removal of the output resistor does not affect the logical operation
of the circuit, since any loads are also to positive voltage sources. The output
rise time will be somewhat slower but the output will turn off faster because the
output voltage in the "one" state is lower and the load current is less.
B. SHORT- The output voltage will be +6 volts; the current in the transistor will be
high if the transistor is conducting. This current was not sufficient to cause
permanent damage to the transistor in the observed tests.
6. Transistor, T7
The transistor may fail into any of several possible modes, but the circuit output
will usually be a "one" unless a low impedance path exists from the output to ground,
such as when the collector is shorted to emitter, or if the transistor is otherwise
forced to remain conducting from collector to emitter.
From the test results the component failures may be categorized (below) into their
effects on the NAND gate's output.
I Components Causing Failure into Steady State "1"
1) R4 open
2) R5 open
3) T7 (most modes result in a "1")
II Components Causing Failures into Steady State "0"
1) R5 short
2) R6 short
3) R6 open
4) CR1 and CR2 and CR3 open (together)
III Component Failures that will Produce Transitional Extra "Ones"
1) CR1 or CR2 or CR3 open
2) CR1 and CR2 open
3) CR1 and CR3 open
4) CR2 and CR3 open
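From the three categories above may be formed the mutually exclusive sets

Set $X_s$ and probability of $X_s(i) = P[X_s(i)]$:
$X_s(1):\ x_4^{o}$, with $P[X_s(1)] = (1-p_4)(1-e^{-\lambda_4 t})$
$X_s(2):\ \bar{x}_5$, with $P[X_s(2)] = 1-e^{-\lambda_5 t}$
$X_s(3):\ \bar{x}_6$, with $P[X_s(3)] = 1-e^{-\lambda_6 t}$
$X_s(4):\ \bar{x}_7$, with $P[X_s(4)] = 1-e^{-\lambda_7 t}$
$X_s(5):\ x_1^{o} \cap x_2^{o} \cap x_3^{o}$, with $P[X_s(5)] = \left[(1-p)(1-e^{-\lambda t})\right]^3$

The probability of a steady-state failure is
$$q_s = \sum_{i=1}^{5} P[X_s(i)] - \sum_{i \ne j} P[X_s(i,j)] + \sum_{i \ne j \ne k} P[X_s(i,j,k)] - \sum_{i \ne j \ne k \ne l} P[X_s(i,j,k,l)] + P\left[\bigcap_{i=1}^{5} X_s(i)\right]$$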
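The alternating sum for the steady-state failure probability can be evaluated mechanically. The following is a minimal sketch, assuming the five events are statistically independent (so that each joint probability is a product) and using purely hypothetical failure rates, short-mode probabilities, and mission time; none of these numbers come from the tests reported here.

```python
from itertools import combinations
from math import exp, isclose, prod

def inclusion_exclusion(event_probs):
    """Probability that at least one event occurs, assuming independent events."""
    total = 0.0
    for size in range(1, len(event_probs) + 1):
        sign = 1.0 if size % 2 else -1.0  # +, -, +, ... for 1, 2, 3, ... events
        total += sign * sum(prod(c) for c in combinations(event_probs, size))
    return total

# Hypothetical values (per-hour failure rates, hours, short-mode probabilities).
lam, lam4, lam5, lam6, lam7 = 1e-6, 2e-6, 2e-6, 2e-6, 5e-6
p, p4 = 0.5, 0.5
t = 1000.0

events = [
    (1 - p4) * (1 - exp(-lam4 * t)),        # X_s(1): R4 open
    1 - exp(-lam5 * t),                     # X_s(2): R5 failed
    1 - exp(-lam6 * t),                     # X_s(3): R6 failed
    1 - exp(-lam7 * t),                     # X_s(4): T7 failed
    ((1 - p) * (1 - exp(-lam * t))) ** 3,   # X_s(5): all three diodes open
]
q_s = inclusion_exclusion(events)
```

Under the independence assumption the result must agree with one minus the probability that none of the five events occurs, which gives a convenient check on the bookkeeping.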
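Set $X_o$:
$X_o(1):\ (x_1^{o} \oplus x_2^{o} \oplus x_3^{o}) \cap \bar{x}_4^{o} \cap x_5 \cap x_6 \cap x_7$
$X_o(2):\ (x_1^{o} \cap x_2^{o} \;\oplus\; x_2^{o} \cap x_3^{o} \;\oplus\; x_1^{o} \cap x_3^{o}) \cap \bar{x}_4^{o} \cap x_5 \cap x_6 \cap x_7$

The probability of an extra zero is
$$q_o = \sum_{i=1}^{2} P[X_o(i)]$$

Probability of $X_o(i) = P[X_o(i)]$:
$$P[X_o(1)] = 3\,(1-p)(1-e^{-\lambda t})\; e^{-2\lambda t}\,\left[1-(1-p_4)(1-e^{-\lambda_4 t})\right] e^{-(\lambda_5+\lambda_6+\lambda_7)t}$$
$$P[X_o(2)] = 3\left[(1-p)(1-e^{-\lambda t})\right]^2 e^{-\lambda t}\,\left[1-(1-p_4)(1-e^{-\lambda_4 t})\right] e^{-(\lambda_5+\lambda_6+\lambda_7)t}$$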
Observe from the set $X_o$ that transitional errors will be caused by fewer than three of
the input diodes failing through opening. In actuality the probability of a wrong transition
for the member $X_o(1)$ in the set $X_o$ is the joint probability:

P (i-th diode open ∩ "0" on the i-th input ∩ n-1 diodes working ∩ "1's" on the n-1 diodes ∩ no steady-state failures)
= P (i-th diode open) · P (n-1 diodes working) · P ("0" on i-th input ∩ "1's" on n-1 inputs | i-th diode open ∩ n-1 working) · P (no steady-state failure)
The third term in the joint probability expression is the conditional probability express-
ing the fact that a wrong transition is a function of the information appearing at the gate inputs
in any bit time. For all practical purposes this term may be set equal to unity because of the
tremendous speed at which information is processed and the resulting short time between
occurrences of all possible input states. The same reasoning may be applied to the other
member, $X_o(2)$.
Note that a NAND gate possesses an asymmetric environment because there are no
failure modes that can result in the exclusive classes $X_1$ or $X_{10}$.
Thus the reliability of a Transor voting on the output of a network of redundant NAND gates
can be defined by equation (10) in part IV.
The following notation was used in this appendix.
1) $x_i$; the event that the i-th component is working correctly.
2) $\bar{x}_i$; the event that the i-th component has failed.
3) $P(x_i)$; the probability of the defined event (1)
4) $P(\bar{x}_i) = 1 - P(x_i)$
5) $x_i^{s}$; the event that the i-th component has shorted
6) $x_i^{o}$; the event that the i-th component has opened (the probability space of each component is the logical union $x_i \cup (\bar{x}_i \cap x_i^{s}) \cup (\bar{x}_i \cap x_i^{o})$)
7) $P(x_i^{s})$; the probability of (5)
8) $P(x_i^{o})$; the probability of (6) $= 1 - P(x_i) - P(x_i^{s})$
9) $\bar{x}_i^{s}$; the event that the i-th component has not shorted
10) $\bar{x}_i^{o}$; the event that the i-th component has not opened
11) $P(\bar{x}_i^{s})$; the probability of (9)
12) $P(\bar{x}_i^{o})$; the probability of (10)
13) $P(x_i^{s} \mid \bar{x}_i)$; the probability of the i-th component shorting given that it has failed $= p$
14) $P(x_i^{o} \mid \bar{x}_i)$; the probability of the i-th component opening given that it has failed $= 1 - p$
BIBLIOGRAPHY
1) J. von Neumann, "Probabilistic Logics and the Synthesis of Reliable
Organisms from Unreliable Components," in Automata Studies, Ed.
C. E. Shannon and J. McCarthy, Princeton University Press, 1956.
2) W. H. Pierce, "Adaptive Vote-Takers Improve the Use of Redundancy,"
Redundancy Techniques for Computing Systems, Ed. R. H. Wilcox and
W. C. Mann, Spartan Books, 1962. July 17, 1961
3) "A Survey of Adaptive Components for Use in Failure Free Systems",
Special Technical Report No. 1, Nasw-572, Aug. 1963.
4) W. C. Mann, "Restorative Processes for Redundant Computing Systems,"
Redundancy Techniques for Computing Systems, Ed. R. H. Wilcox and
W. C. Mann, Spartan Books, 1962.
5) A. R. Heiland, W. C. Mann, "Failure Effects in Redundant Systems,"
Report No. EE-3351, Westinghouse Electronics Division, 1963.
Appendix 5
COMPARISON OF DYNAMIC AND THRESHOLD RESTORERS
by
C. G. Masters
tt S. Bray
December 1963
TABLE OF CONTENTS
Section
I. INTRODUCTION
II. DESCRIPTION OF DYNAMIC RESTORING CIRCUITS
A. Review of the Transor Decision Function
B. Description of the Hamming Distance Restoring Function
C. Comparison of Transor and the Hamming Distance Restoring Circuit
III. REVIEW OF THE ANALYTICAL EFFORTS
A. Signal Processor Assumptions
B. Classification of Failure Effects
C. Class Probability Measure
D. Analytical Models
1. Multinomial Model for a Dynamic Restoring Circuit
2. The Transor Model
3. The Hamming Distance Restoring Circuit Model
4. The Threshold Restoring Circuit Model
E. Threshold Parameters as a Bound on Dynamic Parameters
F. A Comparison of Transor and the Hamming Distance Restoring Circuit
IV. SIMULATION PROGRAM
V. DISCUSSION OF RESULTS
A. Simulation Results
B. Curves Discussion
VI. CONCLUSIONS
LIST OF FIGURES
Figure
1. Block Diagram of the Transor
2. Block Diagram of the Hamming Distance Restoring Circuit
3. Possible Five-Input Sequences for Two Failures
4. Typical Histogram
5. Approximation to Reliability Curve
6. Transor Order 5 Redundancy
7. Comparison of Transor and Hamming Distance
8. Comparison of Threshold and Hamming Distance
9. Comparison of Order 7 Threshold and Order 5 Hamming Distance
I. INTRODUCTION
The basic function of a restoring circuit is discussed in Part One of Special Technical
Report No. 4 which is contained in Appendix 4 of this report. The Transor is described in
that report as a device which is potentially useful for performing the restoring function. Be-
cause it is sensitive only to changes in the states of its inputs, a restoring circuit of this
type appears to have advantages over the common threshold voter in environments where
most failures result in steady state inputs to the restorers. Of course, such a circuit should
be inferior to the threshold voter when failures result in transient errors.
The original goal of this study was the determination of the ratio of probability of steady-
state errors to probability of transient errors for which any decrease in the ratio will make
the use of the threshold voter advantageous compared to the Transor. In the process of perform-
ing the study, a new dynamic restoring circuit has been developed which has obvious advan-
tages over the Transor for certain input failure pattern conditions. The invention of the
Hamming Distance Restoring Circuit caused a shift in the primary goal to include evaluation
of both it and the Transor relative to each other, as well as to the threshold voter.
Section II of this report includes a brief review of the Transor and describes the Hamming
Distance Restoring Circuit. Section III reviews the analytical techniques which have
been used in searching for tools to evaluate the two restorers. Section IV describes the com-
puter simulation program which was used in the evaluation. Sections V and VI contain the
results which have been obtained and the conclusions which can be drawn from these results.
II. DESCRIPTION OF DYNAMIC RESTORING CIRCUITS
A. REVIEW OF THE TRANSOR FUNCTION
The Transor is described in detail in Appendix 4. A brief review of the Transor func-
tion is given here to ease the discussion of the Hamming Distance function and to facilitate a
rough comparison of the salient features of each.
A block diagram of the Transor Restoring Circuit with binary inputs $(x_1, x_2, \ldots, x_R)$
is shown in figure 1. The functional relationship between the output Z, the inputs, and the
thresholds $T_0$ and $T_1$ is expressed in general as
$$Z(t) = f\left[Z(t-1);\ (x_1, x_2, \ldots, x_R)^{(t)};\ (x_1, x_2, \ldots, x_R)^{(t-1)};\ T_0;\ T_1\right] \quad (1)$$
The specific function summarized by this relationship may be described as follows.
The number of binary "ones" appearing on the Transor inputs during each bit time (t) is
summed and compared with the number present during the previous bit time (t-1). If
the change is positive and greater than a given threshold $T_1$, then the output Z is forced to a
binary "one". If the change is negative and greater in magnitude than a second threshold,
$T_0$, the output is forced to a binary "zero". If neither threshold is exceeded, the output does
not change from its previous state. This operation may be completely specified by the follow-
ing decision rule statements:
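$$\sum_{i=1}^{R} x_i(t) - \sum_{i=1}^{R} x_i(t-1) \ge T_1 \;\longrightarrow\; Z(t) = 1$$
$$\sum_{i=1}^{R} x_i(t) - \sum_{i=1}^{R} x_i(t-1) \le -T_0 \;\longrightarrow\; Z(t) = 0$$
$$-T_0 < \sum_{i=1}^{R} x_i(t) - \sum_{i=1}^{R} x_i(t-1) < T_1 \;\longrightarrow\; Z(t) = Z(t-1)$$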
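The Transor rule can be stated compactly in code. The sketch below is illustrative only: it assumes that a net change equal to a threshold is sufficient to force the output (change >= T1 forces "one", change <= -T0 forces "zero"), and the function name and argument order are hypothetical.

```python
def transor_step(z_prev, x_prev, x_now, t0, t1):
    """One bit time of the Transor decision rule.

    z_prev : previous output Z(t-1), 0 or 1
    x_prev : input bits at time t-1
    x_now  : input bits at time t
    t0, t1 : thresholds for forcing a "zero" and a "one" (assumed inclusive)
    """
    change = sum(x_now) - sum(x_prev)  # net change in the number of "ones"
    if change >= t1:
        return 1
    if change <= -t0:
        return 0
    return z_prev  # neither threshold reached: hold the previous output
```

For example, with five inputs and thresholds T0 = 2, T1 = 3, three inputs rising together force the output to "one", while offsetting rises and falls leave it unchanged.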
[Figure 1. Block Diagram of the Transor. Blocks shown: Sum Change Detector with thresholds T1 and T0, Memory, Output Z.]
B. DESCRIPTION OF THE HAMMING DISTANCE RESTORING CIRCUIT DECISION
FUNCTION
A block diagram for a Hamming Distance Restoring Circuit with binary inputs $(x_1, x_2, \ldots, x_R)$
is shown in figure 2. The functional relationship between the output Z, the in-
puts, and the threshold T can be expressed in a form similar to that of Transor:
$$Z(t) = f\left[Z(t-1);\ x_1(t) - x_1(t-1);\ x_2(t) - x_2(t-1);\ \ldots;\ x_R(t) - x_R(t-1);\ T\right]$$
[Figure 2. Block Diagram of the Hamming Distance Restoring Circuit. Blocks shown: one State Change Detector per input, threshold T, Memory, Output.]
Again, this relationship summarizes a rather complicated function. In the same manner
as the Transor, the output of the Hamming Distance Restoring Circuit tends to remain in
the Z(t-1) state unless the number of state changes on its inputs exceeds some threshold. In
the latter case, however, the direction of state changes is not considered and output state
change decisions are made without any consideration of the absolute states of the inputs. Thus, the
output at time t, Z(t), is always dependent upon Z(t-1) and the Hamming distance between the
two input vectors $(x_1, x_2, \ldots, x_R)^{(t)}$ and $(x_1, x_2, \ldots, x_R)^{(t-1)}$. This relationship is completely
specified by the following rule statements*:
$$T > \sum_{i=1}^{R} \left| x_i(t) - x_i(t-1) \right| \;\longrightarrow\; Z(t) = Z(t-1)$$
$$T \le \sum_{i=1}^{R} \left| x_i(t) - x_i(t-1) \right| \;\longrightarrow\; Z(t) = \overline{Z(t-1)}$$
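A minimal sketch of the Hamming distance rule follows, assuming that the output is complemented whenever the number of input state changes is at least T and held otherwise. The function name is hypothetical.

```python
def hamming_step(z_prev, x_prev, x_now, threshold):
    """One bit time of the Hamming Distance Restoring Circuit rule.

    The output toggles when the Hamming distance between successive
    input vectors reaches the threshold; otherwise it is held.
    """
    distance = sum(a != b for a, b in zip(x_prev, x_now))  # count state changes
    return 1 - z_prev if distance >= threshold else z_prev
```

Because only the count of changes is used, the function never looks at whether an input went from "zero" to "one" or the reverse, which is the property discussed in the comparison that follows.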
C. COMPARISON OF TRANSOR AND THE HAMMING DISTANCE RESTORING CIRCUIT
The outstanding characteristic of the Hamming Distance Restoring Circuit which dif-
ferentiates it from the Transor is that it ignores information about the absolute state of its
inputs. This characteristic can be used to advantage because the input from a signal pro-
cessor producing both erroneous "ones" and "zeros" cannot cancel the influence of a working
processor input as it can in the Transor case. This may be illustrated by considering the
following input pattern for two bit times. Suppose that input 3 is failed to a steady state
"zero", that inputs 1 and 2 represent the correct information, and that inputs 4 and 5 are
producing both extra "ones" and "zeros" at these bit times.
INPUTS                          x_i(t-1)    x_i(t)
1 (correct)                        0           1
2 (correct)                        0           1
3 (failed)                         0           0
4 (incorrect)                      1           0
5 (incorrect)                      1           0

OUTPUTS                         Z(t-1)      Z(t)
Threshold (majority, T = 3)        0           0
Transor                            0           0
Hamming Distance Restorer          0           1

* The function $\sum_{i=1}^{R} \left| x_i(t) - x_i(t-1) \right|$ is a measure of the difference between vectors x(t)
and x(t-1) which is applied frequently in information theory. The conception of this measure
is credited to R. W. Hamming of Bell Telephone Laboratories.
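The two-bit-time example can be reproduced with a short, self-contained sketch. The table specifies only the majority threshold T = 3; the dynamic restorer thresholds used below (T0 = T1 = T = 3) and the inclusive comparisons are assumptions made for illustration, and all function names are hypothetical.

```python
def threshold_vote(x, t):
    """Static threshold voter: "one" when at least t inputs are "one"."""
    return 1 if sum(x) >= t else 0

def transor_step(z_prev, x_prev, x_now, t0, t1):
    """Transor: acts on the net change in the number of "ones"."""
    change = sum(x_now) - sum(x_prev)
    if change >= t1:
        return 1
    if change <= -t0:
        return 0
    return z_prev

def hamming_step(z_prev, x_prev, x_now, t):
    """Hamming distance restorer: toggles when enough inputs change state."""
    distance = sum(a != b for a, b in zip(x_prev, x_now))
    return 1 - z_prev if distance >= t else z_prev

# Inputs 1-2 correct, input 3 stuck at "zero", inputs 4-5 out of phase.
x_prev = [0, 0, 0, 1, 1]
x_now = [1, 1, 0, 0, 0]

outputs = {
    "threshold": (threshold_vote(x_prev, 3), threshold_vote(x_now, 3)),
    "transor": (0, transor_step(0, x_prev, x_now, 3, 3)),
    "hamming": (0, hamming_step(0, x_prev, x_now, 3)),
}
```

Only the Hamming distance rule follows the correct inputs to "one" at time t: the net change in the number of "ones" is zero, but four of the five inputs change state.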
Actually, the states indicated by inputs 4 and 5 need not necessarily occur as a result
of component failures. For example, if no provision is made for synchronization, corre-
sponding elements of a redundant binary counter may become permanently out of phase as the
result of either noise, or the initially random states due to application of power. For this
example, the net change in the number of "ones" is zero, but the total number of state changes
is four. It cannot be said from this one example that the Hamming Distance Restorer can al-
ways withstand more input failures, but grounds for further consideration have certainly been
established.
It should be noted at this point that ignoring the absolute state of the inputs provides the
major advantage of the Hamming Distance Restorer, but it is also a disadvantage. Because the
output Z is not directly related to the absolute states of the inputs, the output state must be
set to the correct initial state before operation is begun or it has only a chance, perhaps 50%,
of being correct. If it is not initially correct, Z(t) will always be in the state opposite to the
correct one. Transor, on the other hand, will converge to the correct value after a small
number of bit times because of its dependence on the direction of state changes.
The remaining sections of this report will describe the efforts which have been made
to evaluate both Transor and the Hamming Distance Restoring Circuits. These evaluations
are referenced to the commonly used threshold voter. The results of the evaluations are
discussed in Section V. The conditions under which one of the dynamic restoring circuits
might be more powerful than the threshold voter are established.
III. REVIEW OF THE ANALYTICAL EFFORTS
A. SIGNAL PROCESSOR ASSUMPTIONS
To clarify the description of the analysis of the various restoring circuits, it seems
advisable to summarize the assumptions which have been made concerning the signal pro-
cessors which provide inputs to the restoring circuits. Each processor is assumed to be
composed of a set of components, all of which must work properly in order for the proces-
sor output to be correct. It is assumed that the i-th component of the set has a probability
of failure during the differential interval $\Delta t$ which is proportional to the interval length.
This probability can be expressed as $\lambda_i \Delta t$. This implies that the reliability (the proba-
bility that the i-th component does not fail during a time interval, t) is given by the expression
$$R_i(t) = e^{-\lambda_i t} \quad (3)$$
Because correct operation of all components is required for correct processor opera-
tion and assuming independence of failures between signal processors, the reliability of a
processor composed of N components is equal to the product of the component reliabilities.
Therefore:
$$R_s = \prod_{i=1}^{N} R_i = \prod_{i=1}^{N} e^{-\lambda_i t} = e^{-\left(\sum_{i=1}^{N} \lambda_i\right) t} \quad (4)$$
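Equations (3) and (4) can be checked numerically. The sketch below uses hypothetical failure rates and mission time; it shows that the product of the component reliabilities equals a single exponential in the summed failure rate.

```python
from math import exp, isclose, prod

def component_reliability(failure_rate, t):
    """Equation (3): probability that one component survives the interval t."""
    return exp(-failure_rate * t)

def processor_reliability(failure_rates, t):
    """Equation (4): all components must survive, so the reliabilities multiply."""
    return prod(component_reliability(rate, t) for rate in failure_rates)

# Hypothetical failure rates (per hour) and mission time (hours).
rates = [1.0e-6, 2.0e-6, 5.0e-7]
t = 1000.0

r_product = processor_reliability(rates, t)
r_exponential = exp(-sum(rates) * t)  # closed form on the right of equation (4)
```

The same function applies unchanged to any subset of the components, which is how the subset reliabilities introduced next can be obtained.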
Similarly, if the set of components is partitioned into M subsets and a reliability com-
puted for the j-th subset, the processor reliability would be the product of subset reliabilities.
Mathematically, this is expressed as
$$R_s = \prod_{j=1}^{M} R_j \quad (5)$$
and
$$R_j = \prod_{i=1}^{n_j} R_i = e^{-\left(\sum_{i=1}^{n_j} \lambda_i\right) t} \quad (6)$$
where $n_j$ is the number of components in the j-th subset and $\lambda_i$ is the failure rate of the i-th
component of the subset. The subsets into which the components are partitioned correspond
to the classes of processor output errors which failure of the component will cause. The
classification of errors is discussed in this section.
If all failure modes of a component caused only errors of one class, the assumption
could be made that each component was completely associated with one of the class subsets.
In general, this is not true. For example, if the output transistor of a binary signal proces-
sor is shorted (emitter to collector), the output would probably become permanently fixed at
the "zero" level. If, however, the transistor is open circuited, the output of the processor
would probably become permanently fixed at the "one" level. Because subsets are established
by classification of output error types, the above transistor cannot be uniquely associated with
any subset. To make an association, some artificial method must be used to assign to each
subset only that "portion" of a component which will cause that particular class of output er-
ror. Although the components cannot be physically divided in the required manner, they can
be analytically split by multiplying the total failure rate of the component by the conditional
probability of the occurrence of each possible failure mode. This procedure produces a num-
ber which can be considered the failure rate of a smaller component or subcomponent whose
failure results in only one of the possible classes of output errors.
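The analytical splitting of a component can be illustrated with a short sketch. The failure rates, conditional mode probabilities, and class labels below are hypothetical and serve only to show the arithmetic: each mode receives the total failure rate multiplied by the conditional probability of that mode, and the resulting subcomponent rates are then accumulated by output error class.

```python
from math import isclose

def split_failure_rate(total_rate, mode_probabilities):
    """Split one component's failure rate into per-mode subcomponent rates."""
    if not isclose(sum(mode_probabilities.values()), 1.0):
        raise ValueError("conditional mode probabilities must sum to 1")
    return {mode: total_rate * prob for mode, prob in mode_probabilities.items()}

def accumulate_class_rates(subcomponent_rates):
    """Sum subcomponent rates that lead to the same class of output error."""
    totals = {}
    for rates in subcomponent_rates:
        for error_class, rate in rates.items():
            totals[error_class] = totals.get(error_class, 0.0) + rate
    return totals

# Hypothetical output transistor: a short fixes the output at "zero",
# an open fixes it at "one".
transistor = split_failure_rate(2.0e-6, {"steady_zero": 0.6, "steady_one": 0.4})
# Hypothetical resistor whose only failure mode fixes the output at "one".
resistor = split_failure_rate(0.5e-6, {"steady_one": 1.0})

class_rates = accumulate_class_rates([transistor, resistor])
```

The class totals sum to the total failure rate of the physical components, so nothing is lost or double counted by the split.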
It should be noted at this point that the failure probabilities of the smaller subcomponents
described above are not independent of the operational state of all other similar components,
as are the original circuit components. This may be illustrated by referring to the previous
example. If the transistor in the example were split into two subcomponents representing
the short and open failure modes, and one of the subcomponents had failed, the other subcompo-
nent could not also fail. The occurrence of a double failure of subcomponents associated with
a single physical component, however, is normally a relatively improbable event in compari-
son to the other system-failure producing events in associated circuits. For this reason,
this dependence effect has been ignored in all the models developed during this study.
B. CLASSIFICATION OF FAILURE EFFECTS
In the initial phase of this study, which is reported in Appendix 4, it was shown that
the ability of dynamic restorers to differentiate between inputs working correctly and those
failed to a steady state could generate failure modes different from those of threshold deci-
sion. There are, specifically, four modes which threaten the operation of dynamic restoring
circuits.
1) Wrong transitions cancelling correct transitions. (A sufficient number leave
a net number of correct signals insufficient to span the set threshold.)
2) Wrong transitions occurring while the correct inputs remain in the same state
(a series of extra "ones" or "zeros"). During this time the nominally correct
inputs have lost their voting power so that, if enough wrong transitions occur
at one time, they will span the threshold and result in a wrong decision.
3) Wrong transitions temporarily simulating steady state failures. Wrong tran-
sitions can combine on adjacent bit times in a manner to produce a steady
state effect.
4) Steady-state failures. Enough steady-state failures would leave insufficient
correct signals to span the threshold.
To illustrate, consider figure 3 where state vectors are used to represent the five in-
puts to Transor. Inputs $x_1$ and $x_2$ are assumed to have failed and to be capable of error. For
definiteness all inputs at time (t) may be assumed correct. In the following bit times (pro-
ceeding to the right) several failure patterns are possible for each nominally correct input
state. The cancellation mode (1) is clearly shown in the sequence (2) → (5) where extra
"zeros" have appeared at time (t+1). By virtue of the Transor decision rules, an error
will be made at (t+2) unless $T_0 = 1$, since the net result of the summation over (t+1) and (t+2)
is minus one. Of course, it is also possible for errors to cancel each other as in sequences
(3) → (4) and (3) → (7).
[Figure 3. Possible Five-Input Sequences for Two Failures. State vectors shown at bit times T, T+1, and T+2.]
The second failure mode (2) is shown in sequences (1) → (2) and (1) → (3) and the third
mode (3) by sequence (3) → (6). The result is the same in the third mode whether the errors
are caused by wrong transitions or steady-state errors.
Any output of a binary signal processor can be classified into one of six mutually ex-
clusive classes over an arbitrary time interval. These are:
1) Correct
2) Continuous Zero-State
3) Continuous One-State
4) Extra "ones" but no "zeros"
5) Extra "zeros" but no "ones"
6) Both extra "ones" and "zeros"
This classification is necessary because the failure modes caused by wrong transitions
have no parallel in the threshold voter. A realistic comparison cannot be made on the basis of
each output simply failing or working. For example, the sixth output mode listed above re-
sults in the cancellation effect (1) mentioned earlier. Likewise, output modes (4), (5), and
(6) result in the second and third failure modes listed earlier in this section.
C. CLASS PROBABILITY MEASURE
Each of the six mutually exclusive classes must be assigned a separate probability
measure. Let these be:
1) $p$; the probability that the output is correct
2) $q_{\gamma_0}$; the probability that the output is a continuous "zero"
3) $q_{\gamma_1}$; the probability that the output is a continuous "one"
4) $q_{\alpha_0}$; the probability that the output generates extra "zeros"
5) $q_{\alpha_1}$; the probability that the output generates extra "ones"
6) $q_{\alpha_{10}}$; the probability that the output generates both extra "ones" and "zeros"
randomly.
The q's above are related to the reliability of the component subsets
through the simple relationship
$$r_j = 1 - q_j, \quad \text{where } j = \gamma_0,\ \gamma_1,\ \alpha_0,\ \alpha_1,\ \alpha_{10} \quad (7)$$
and
$$r = r_{\gamma_0} \cdot r_{\gamma_1} \cdot r_{\alpha_1} \cdot r_{\alpha_0} \cdot r_{\alpha_{10}} \quad (8)$$
Thus, the q's refer to the probabilities that one or more failures will occur within a particu-
lar set of subcomponents and cause the related output error.
D. ANALYTICAL MODELS
In a multiple-line redundant system, it is assumed that each input to a restoring circuit
is derived independently, and each input, over an arbitrary time interval, can be defined by
one of six mutually exclusive operational classes. A physical system, defined in this man-
ner, suggests a multinomial distribution as its possible analytical model because the R re-
dundant lines can be considered analogous to R repeated trials of an event with more than
two possible outcomes.
1. The Multinomial Model for a Dynamic Restoring Circuit
Let the number of outputs failed to a particular mode be represented by a random
variable. Specifically, let
$\gamma$ = the number of outputs failed to the steady state
$\alpha_0$ = the number of outputs generating extra "zeros"
$\alpha_1$ = the number of outputs generating extra "ones"
$\alpha_{10}$ = the number of outputs generating both extra "ones" and "zeros" randomly
Hence, the number of outputs that are continuously correct is
$$R - \alpha_{10} - \alpha_1 - \alpha_0 - \gamma$$
We see that the analytical model for a dynamic voter may be delineated by a subset of points
in a four dimensional sample space. These points correspond to possible operating states
of the system. Associated with each sample point is a probability defined by the density
function
$$f(\alpha_{10}, \alpha_1, \alpha_0, \gamma) = \binom{R}{R-\alpha_{10}-\alpha_1-\alpha_0-\gamma,\ \alpha_{10},\ \alpha_1,\ \alpha_0,\ \gamma}^{*} (p)^{R-\alpha_{10}-\alpha_1-\alpha_0-\gamma}\,(q_{\alpha_{10}})^{\alpha_{10}}\,(q_{\alpha_1})^{\alpha_1}\,(q_{\alpha_0})^{\alpha_0}\,(q_{\gamma})^{\gamma} \quad (9)$$
* The symbol $\binom{N}{x_1, \ldots, x_i, \ldots, x_m}$, where $\sum_{i=1}^{m} x_i = N$, represents the mathematical function
$$\frac{N!}{\prod_{i=1}^{m} x_i!}$$
where
$$p + q_{\alpha_{10}} + q_{\alpha_1} + q_{\alpha_0} + q_{\gamma} = 1 \quad (10)$$
Thus, the reliability of a dynamic restoring circuit will be
$$R(t) = \sum_{\text{all points in } \Pi} f(\alpha_{10}, \alpha_1, \alpha_0, \gamma) \quad (11)$$
where $\Pi$ is the subset of sample points whose outcomes result in a continuously correct de-
cision by the circuit.
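For the Transor, membership in the subset $\Pi$ may be determined by the intersec-
tion of the following set of linear inequalities derived from the Transor decision rules.
$$\alpha_{10} + \alpha_1 \le T_1 - 1$$
$$\alpha_{10} + \alpha_0 \le T_0 - 1$$
$$2\alpha_{10} + \alpha_1 + \alpha_0 + \gamma \le T'$$
where $\gamma = \gamma_1 + \gamma_0$ and $T' = R - T_0$ or $R - T_1$, whichever is smaller. Thus
$$R_T(t) = \sum_{\substack{\text{all points satisfying}\\ \text{the decision rules}}} \binom{R}{R-\alpha_{10}-\alpha_1-\alpha_0-\gamma,\ \alpha_{10},\ \alpha_1,\ \alpha_0,\ \gamma} (p)^{R-\alpha_{10}-\alpha_1-\alpha_0-\gamma}\,(q_{\alpha_{10}})^{\alpha_{10}}\,(q_{\alpha_1})^{\alpha_1}\,(q_{\alpha_0})^{\alpha_0}\,(q_{\gamma})^{\gamma} \quad (12)$$
For example, if R = 5, $T_0 = 2$ and $T_1 = 3$, then $T' = 2$ and
$$R_T(t) = p^5 + 5p^4(1-p) + 10p^3 q_{\gamma}^2 + 10p^3 q_{\alpha_1}^2 + 20p^3 q_{\alpha_0} q_{\gamma} + 20p^3 q_{\alpha_1} q_{\alpha_0} + 20p^3 q_{\alpha_1} q_{\gamma} \quad (13)$$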
3. The Hamming Distance Restoring Circuit Model
The decision rules for the Hamming Distance Restoring Circuit described earlier
in the report determine the following set of linear inequalities:
$$\alpha_{10} + \alpha_1 \le T - 1$$
$$\alpha_{10} + \alpha_0 \le T - 1$$
$$\alpha_{10} + \alpha_1 + \alpha_0 + \gamma \le R - T$$
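Removal of the cancellation effect accounts for the absence of the factor of two (2) in the last
inequality, thus making the Hamming circuit less sensitive to failures causing both extra "one"
and "zero" transitions. From these decision rules, the reliability of the circuit can be
written as
$$R_H(t) = \sum_{\alpha_{10}=0}^{T-1}\ \sum_{\alpha_1=0}^{T-1}\ \sum_{\alpha_0=0}^{T-1}\ \sum_{\gamma=0}^{R-T} \binom{R}{R-\alpha_{10}-\alpha_1-\alpha_0-\gamma,\ \alpha_{10},\ \alpha_1,\ \alpha_0,\ \gamma} (p)^{R-\alpha_{10}-\alpha_1-\alpha_0-\gamma}\,(q_{\alpha_{10}})^{\alpha_{10}}\,(q_{\alpha_1})^{\alpha_1}\,(q_{\alpha_0})^{\alpha_0}\,(q_{\gamma})^{\gamma} \quad (14)$$
with the sums restricted to points satisfying the inequalities above.
For R = 5 and T = 2
$$R_H(t) = p^5 + 5p^4(1-p) + 10p^3 q_{\gamma}^2 + 20p^3 q_{\alpha_1} q_{\gamma} + 20p^3 q_{\alpha_0} q_{\gamma} + 20p^3 q_{\alpha_1} q_{\alpha_0} + 10p^2 q_{\gamma}^3 + 30p^2 q_{\alpha_{10}} q_{\gamma}^2 + 30p^2 q_{\alpha_1} q_{\gamma}^2 + 20p^3 q_{\alpha_{10}} q_{\gamma} + 30p^2 q_{\alpha_0} q_{\gamma}^2 + 60p^2 q_{\alpha_1} q_{\alpha_0} q_{\gamma} \quad (15)$$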
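The enumeration behind equation (15) can be checked numerically. The sketch below, assuming that the class probabilities sum to unity as in equation (10) and using purely hypothetical values for them, sums the multinomial terms over the points admitted by the three Hamming inequalities and compares the total with a closed form for R = 5, T = 2 written out from the same inequalities. All names are illustrative.

```python
from itertools import product
from math import factorial, isclose

def point_probability(R, counts, p, qs):
    """Multinomial probability of one sample point (failed lines per class)."""
    working = R - sum(counts)
    coeff = factorial(R) // factorial(working)
    prob = p ** working
    for count, q in zip(counts, qs):
        coeff //= factorial(count)  # stays an exact integer at every step
        prob *= q ** count
    return coeff * prob

def hamming_reliability(R, T, p, q10, q1, q0, qg):
    """Sum the density over points satisfying the Hamming inequalities."""
    total = 0.0
    for a10, a1, a0, g in product(range(R + 1), repeat=4):
        if a10 + a1 + a0 + g > R:
            continue  # more failed lines than inputs: not a sample point
        if a10 + a1 <= T - 1 and a10 + a0 <= T - 1 and a10 + a1 + a0 + g <= R - T:
            total += point_probability(R, (a10, a1, a0, g), p, (q10, q1, q0, qg))
    return total

# Hypothetical class probabilities; p follows from equation (10).
q10, q1, q0, qg = 0.01, 0.02, 0.03, 0.04
p = 1.0 - (q10 + q1 + q0 + qg)

enumerated = hamming_reliability(5, 2, p, q10, q1, q0, qg)
closed_form = (
    p**5 + 5 * p**4 * (1 - p) + 10 * p**3 * qg**2
    + 20 * p**3 * (q1 * qg + q0 * qg + q1 * q0 + q10 * qg)
    + 10 * p**2 * qg**3
    + 30 * p**2 * qg**2 * (q10 + q1 + q0)
    + 60 * p**2 * q1 * q0 * qg
)
```

Replacing the third inequality by the Transor form, with the factor of two on the first count and the bound T', gives the corresponding Transor sum with no other change to the enumeration.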
4. The Threshold Restoring Circuit Model
In system reliability analysis using majority threshold voters, it is customary to
assume that the failure of a majority of inputs, regardless of their mode, will result in a
wrong decision. Although this common assumption was used in Special Technical Report No.
4, it is not strictly correct because a threshold voter may tolerate as many as R-1 failed
inputs and still function correctly. A more rigorous approach, using the results of section
IIB, can be found by letting:
1) $\theta_{10}$ be a random variable denoting the number of wrong "ones" and "zeros"
2) $\psi_1$ be a random variable denoting the number of wrong "ones" only
3) $\psi_0$ be a random variable denoting the number of wrong "zeros" only
Thus, we see that the parameters defined for the threshold voter are related to
the dynamic restorer by:
$$\psi_1 = \alpha_1 + \gamma_1$$
$$\psi_0 = \alpha_0 + \gamma_0$$
$$\theta_{10} = \alpha_{10} + X$$
where X is a dummy variable which accounts for the case in which a signal processor has
experienced two failures causing opposite steady-state errors. Because it is impossible to
say which of the two failures will control the output for a general case, the worst case condi-
tion is assumed and in the models both are assumed to exist simultaneously. By virtue of
threshold decision rules the subset $\Pi$ may be defined by
$$\theta_{10} + \psi_1 \le T - 1$$
$$\theta_{10} + \psi_0 \le R - T$$
The reliability of the threshold voter is, then
$$R_{Th}(t) = \sum_{\theta_{10}=0}^{T''}\ \sum_{\psi_1=0}^{T-1-\theta_{10}}\ \sum_{\psi_0=0}^{R-T-\theta_{10}} \binom{R}{R-\theta_{10}-\psi_1-\psi_0,\ \theta_{10},\ \psi_1,\ \psi_0} (p)^{R-\theta_{10}-\psi_1-\psi_0}\,(q_{\theta_{10}})^{\theta_{10}}\,(q_{\psi_1})^{\psi_1}\,(q_{\psi_0})^{\psi_0} \quad (16)$$
where $T'' = T - 1$ or $R - T$, whichever is smaller.
For example, if R = 5 and T = 3 we have
$$R_{Th}(t) = p^5 + 5p^4(1-p) + 10p^3(1-p)^2 + 30p^2 (q_{\psi_1})(q_{\psi_0})^2 + 60p^2 (q_{\theta_{10}})(q_{\psi_1})(q_{\psi_0}) + 30p\,(q_{\psi_1})^2 (q_{\psi_0})^2 + 30p^2 (q_{\psi_1})^2 (q_{\psi_0}) \quad (17)$$
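The threshold voter example can be verified by direct enumeration. The sketch below assumes that the three error-class probabilities and p sum to unity, uses hypothetical numerical values, and compares the sum over the admitted points for R = 5, T = 3 with the closed form of equation (17). It also records the largest number of failed inputs among the admitted points. All names are illustrative.

```python
from itertools import product
from math import factorial, isclose

def point_probability(R, counts, p, qs):
    """Multinomial probability of one sample point (failed lines per class)."""
    working = R - sum(counts)
    coeff = factorial(R) // factorial(working)
    prob = p ** working
    for count, q in zip(counts, qs):
        coeff //= factorial(count)
        prob *= q ** count
    return coeff * prob

def threshold_points(R, T):
    """Points (theta10, psi1, psi0) satisfying the threshold inequalities."""
    return [
        (th, s1, s0)
        for th, s1, s0 in product(range(R + 1), repeat=3)
        if th + s1 + s0 <= R and th + s1 <= T - 1 and th + s0 <= R - T
    ]

def threshold_reliability(R, T, p, q_th, q_s1, q_s0):
    return sum(point_probability(R, pt, p, (q_th, q_s1, q_s0))
               for pt in threshold_points(R, T))

# Hypothetical class probabilities; p is whatever remains.
q_th, q_s1, q_s0 = 0.01, 0.04, 0.05
p = 1.0 - (q_th + q_s1 + q_s0)

enumerated = threshold_reliability(5, 3, p, q_th, q_s1, q_s0)
closed_form = (
    p**5 + 5 * p**4 * (1 - p) + 10 * p**3 * (1 - p) ** 2
    + 30 * p**2 * q_s1 * q_s0**2 + 60 * p**2 * q_th * q_s1 * q_s0
    + 30 * p * q_s1**2 * q_s0**2 + 30 * p**2 * q_s1**2 * q_s0
)
most_failures = max(sum(pt) for pt in threshold_points(5, 3))
```

The point with two wrong "ones" and two wrong "zeros" is admitted, which is the case of R-1 failed inputs that the voter still tolerates.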
E. THRESHOLD PARAMETERS AS A BOUND ON DYNAMIC PARAMETERS
It was shown that the terms in the analytical models corresponded to probability mea-
sures associated with specific members of the subset $\Pi$ within the sample space. Criteria
for membership in $\Pi$ were determined by the intersection of a set of linear inequalities de-
termined from a decision rule.
It will now be shown that a dynamic restoring circuit cannot be as effective as the
threshold voter when the optimum threshold T for the threshold voter is (R + 1)/2 and the
optimum threshold for a dynamic voter is $\ge$ (R + 1)/2. It has been shown that when
$q_{\psi_1} \approx q_{\psi_0}$ (defined earlier) within a certain range, the optimum threshold for a threshold
voter is (R + 1)/2. The decision rule for the threshold voter now becomes, using the relations
previously described in the threshold restoring circuit model:
$$\theta_{10} + \alpha_1 + \gamma_1 \le \frac{R-1}{2} \quad (18)$$
$$\theta_{10} + \alpha_0 + \gamma_0 \le \frac{R-1}{2} \quad (19)$$
Assume also that the ratio $q_{\gamma}/(q_{\alpha_{10}} + q_{\alpha_1} + q_{\alpha_0})$ is such that the optimum dyn-
amic restoring circuit threshold is also (R + 1)/2; hence, the decision rules for the dynamic
circuit become
$$\alpha_{10} + \alpha_1 \le \frac{R-1}{2} \quad (20)$$
$$\alpha_{10} + \alpha_0 \le \frac{R-1}{2} \quad (21)$$
$$\alpha_{10} + \alpha_1 + \alpha_0 + \gamma_1 + \gamma_0 \le \frac{R-1}{2} \quad (22)$$
where $\gamma = \gamma_1 + \gamma_0$ and $\alpha_{10} \subset \theta_{10}$. Let all the terms generated by inequalities (18)
and (19) form the set $\Pi_{Th}$ and those by (20), (21), and (22) the set $\Pi_H$. The proof consists
of simply showing that $\Pi_H \subset \Pi_{Th}$. Clearly each random variable considered one at a time
will form the non-empty subsets of the form:
$$\sum_{i=1}^{(R-1)/2} \binom{R}{R-i,\ i} (p)^{R-i} (q_k)^i$$
where $k = \theta_{10}, \alpha_{10}, \alpha_1, \alpha_0, \gamma_1, \gamma_0$. $\Pi_H \subset \Pi_{Th}$ by virtue of the fact that $\alpha_{10} \subset \theta_{10}$.
The proof becomes even more obvious when we consider the non-empty subsets
formed by combinations of random variables taken two at a time. Choosing one variable
from inequality (18) and one from (19) will generate non-empty subsets of the form
$$\sum_{i=1}^{(R-1)/2}\ \sum_{j=1}^{(R-1)/2} \binom{R}{R-i-j,\ i,\ j} (p)^{R-i-j} (q_k)^i (q_l)^j \quad \text{for } k \ne l \quad (23)$$
where $k, l = \theta_{10}, \alpha_{10}, \alpha_1, \alpha_0, \gamma_1, \gamma_0$. Choosing two terms from (6) will form non-
empty subsets of the form
$$\sum_{i+j=2}^{(R-1)/2} \binom{R}{R-i-j,\ i,\ j} (p)^{R-i-j} (q_k)^i (q_l)^j \quad (24)$$
Now $\Pi_H \subset \Pi_{Th}$ because the number of terms generated by (23) is
$$\left(\frac{R-1}{2}\right)^2 \quad (25)$$
and the number in (24) is, for all $R \ge 5$,
$$\sum_{M=1}^{(R-3)/2} M \quad (26)$$
Likewise, the same reasoning may be applied to combinations of random variables taken
three at a time. Thus, it has been shown that if the dynamic restorer is to show superior
performance it can only do so when its optimum threshold is reached at values less than
(R + 1)/2.
F. A COMPARISON OF TRANSOR AND THE HAMMING DISTANCE RESTORING CIRCUIT
In previous discussions, it has been noted that the Transor is controlled by two thres-
holds as opposed to the single threshold of the Hamming Distance Restoring Circuit. It
might be argued that the utility of two thresholds, not necessarily set at the same level,
would present an added advantage in a highly asymmetrical environment, i.e., one in which
either "one" or "zero" errors are more likely. That this is not the case will be shown in
the following discussion.
In an earlier Westinghouse report¹ it was shown that in an asymmetrical environment,
a great increase in threshold voter performance could be had by using thresholds less than
or greater than (R + 1)/2 according to a criterion developed in that report. Since dynamic
restoring circuits cannot distinguish between outputs failed to a continuous "one" and those failed
to a continuous "zero", they cannot take advantage of the asymmetry in steady state errors.
This leaves for consideration only asymmetrical transitional errors.
The results of the previous section have shown that for a dynamic restoring circuit to
show improvement over a threshold voter, the optimum dynamic restoring circuit threshold
must be reached at a value less than (R + 1)/2.
1. P. A. Jensen, "Decision Making in Redundant Systems", Report No. EE-2599,
December 1961.
If it is assumed that the optimum value of threshold for the Hamming Distance Restoring
Circuit is reached at a value $T_{opt}$ where $T_{opt}$ is less than (R + 1)/2 and R = 5, the following
possibilities exist for the Transor thresholds.
1) $T_0 = T_1 = T_{opt}$
2) $T_0 \ne T_1 = T_{opt}$
3) $T_1 \ne T_0 = T_{opt}$
The first case is trivial. If all thresholds are equal, then the $\Pi_T$ formed from the
Transor criteria is clearly a subset of $\Pi_H$, i.e., $\Pi_T \subset \Pi_H$, by virtue of the factor
$2\alpha_{10}$ in the Transor inequality.
In case (2) $T_0$ can be either greater or less than $T_1$. If $T_0 < T_1$ then $(R - T_1) < (R - T_0)$
and is the controlling factor. But since $(R - T_1) = (R - T_{opt})$, $\Pi_T \subset \Pi_H$. If $T_0 > T_1 = T_{opt}$,
for example $T_0 = T_1 + 1 = T_{opt} + 1$, then $(R - T_0)$ is the controlling factor. But $(R - T_0)
= (R - T_{opt} - 1)$ so that, effectively, while the number of terms containing transient proba-
bilities has been increased, the number of terms containing steady-state probabilities has
been decreased by the same number, and since $q_{\gamma} \gg q_{\alpha_0}$ the reliability of the Transor
will never be as good as that of the Hamming Distance restoring circuit. The same reason-
ing may be applied to case (3).
IV. SIMULATION PROGRAM
The success of the computer simulation program in evaluating self-repairing systems
encouraged the use of a similar program for use as an analytical tool in this phase of the
failure free systems study. Such a computer program has been written and has provided a
variety of interesting results. Insights into the Transor circuit's most vulnerable areas
were gained through this program. One of the results was the development of the Hamming
Distance Restoring Circuit. The development of the system failure criteria statements for
the program contributed to the development of the general decision rules which have been de-
fined for Transor, Hamming Distance, and Threshold restorers. The program was used to
find the ratio of steady-state to transient error probabilities for which the dynamic restoring
circuits were at least as effective as the Threshold voter in deriving correct system outputs.
Finally, the program provided a check for the analytical models when numerical examples
were considered.
Computer simulation programs are commonly used to analyze the performance of de-
terministic systems which are so large and complex that a mathematical model would be
unwieldy or of probabilistic systems which are difficult to model, or when specialized infor-
mation is desired. The Dynamic Restoring Circuit Evaluator (DRCE) program fell into this
last category.
The computer program which has been written for this study retains all of the basic
philosophy of the program previously developed for the evaluation of self-repairing systems.*
Some portions of the self-repair program were used directly in the DRCE program, but the
sections of this latter program which concern system operational state (i.e., working or
failed) are much simpler than those of the self-repair program. These simplifications were
possible because of the reduced size and the non-adaptive nature of this simulation problem.
*This program is described in Appendix 6.

In this simulation program, the range of numbers between zero and unity is divided into
intervals, and each interval is assigned to one of the subcomponents of the system. In a
system containing (s) subcomponents, the range is divided into (s) intervals, each assigned to
a different subcomponent. This procedure guarantees that every number in the range is
associated with exactly one component and that every component is assigned an interval in
the range. By judiciously specifying the lengths of the intervals, random numbers from a
population uniformly distributed between zero and unity can be used to simulate naturally
occurring random subcomponent failures within the system. To do this, the length of the
component interval is made equal to the conditional probability of failure of the subcomponent
given that a failure exists somewhere within the system. This probability is given by the
expression

$$ P_i = \frac{\lambda_i}{R \sum_{i=1}^{M} \lambda_i} $$

where M is the number of subcomponents in a single processor and R is the order of redundancy
(i.e., the number of signal processors in a stage). A component failure is simulated
by determining a time to failure* and then locating the subcomponent to be designated failed
by associating a random number with a particular interval of numbers. Having done this,
the type of signal processor output error is automatically specified, and the effect of this
error on system operation can be found.
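The interval assignment described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the DRCE program itself: the function names `build_intervals` and `pick_failed`, the ordering of the intervals (processor by processor), and the example rates in the usage note are all introduced here for illustration.

```python
import bisect
from itertools import accumulate


def build_intervals(rates, redundancy):
    """Return the cumulative upper bounds of the partition of [0, 1).

    rates      -- failure rates of the M subcomponents of one processor
    redundancy -- order of redundancy R (identical processors per stage)
    """
    total = redundancy * sum(rates)
    # One interval per (processor, subcomponent) pair, of length
    # lambda_i / (R * sum of lambda), laid out processor by processor (assumed order).
    lengths = [lam / total for _ in range(redundancy) for lam in rates]
    bounds = list(accumulate(lengths))
    bounds[-1] = 1.0  # guard against floating-point round-off
    return bounds


def pick_failed(bounds, n_sub, u):
    """Map a uniform number u in [0, 1) to (processor index, subcomponent index)."""
    k = bisect.bisect_right(bounds, u)
    return divmod(k, n_sub)
```

For example, with two subcomponents of (hypothetical) rates 1.0 and 3.0 and R = 2, `build_intervals([1.0, 3.0], 2)` gives the bounds 0.125, 0.5, 0.625, 1.0, so a random number of 0.3 falls in the second interval and designates the second subcomponent of the first processor as failed.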
As the first step, a system is set up with no initial failures. The above process is begun
and repeated until the system under consideration no longer meets one or more operational
criteria. At this point, the total system operating time is computed as the
sum of the times between component failures. This entire procedure is then repeated many
times (usually 100), and data concerning the number of failures withstood and system operating
times are recorded. From these data various curves are plotted, and system response to
various failure patterns is observed.
* The method used to determine the time between each succeeding failure is identical to that
used in the self-repairing systems simulation. That method is described on pages 10 and
11 of Appendix 6.
V. DISCUSSION OF RESULTS
A. SIMULATION RESULTS
Before proceeding with a discussion of the results, a brief description of how comparative
reliability versus time curves were obtained is required. For each system simulation,
the computer print-out includes a number which indicates the total operating time of the
system before the occurrence of a critical failure pattern caused loss of system function. These
numbers are ordered and split into groups so that a histogram of percent of systems failed
versus time can be formed. A typical histogram is shown in figure 4. From this histogram
an approximate reliability vs. time curve can be easily constructed by starting a line at
unity (100%) on the ordinate and zero (0.0) on the abscissa, or time axis, and proceeding
horizontally to the right until the time corresponding to the first spike on the histogram is
reached. At this point the line is dropped vertically by the arithmetic magnitude of the spike,
then continued to the right again until the next spike is reached. Continued repetition of this
procedure produces a curve such as that shown in figure 5.
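The construction just described can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the procedure used to draw figure 5: the function name `reliability_steps` is hypothetical, and the curve is dropped at each individual system failure time rather than at grouped histogram spikes (which corresponds to a histogram whose groups each hold one distinct failure time).

```python
def reliability_steps(operating_times):
    """Return [(t, fraction_operating)] for an empirical step curve.

    The curve starts at 1.0 at t = 0 and drops by 1/n at each recorded
    system failure time, where n is the number of simulated systems.
    """
    n = len(operating_times)
    points = [(0.0, 1.0)]
    for failed, t in enumerate(sorted(operating_times), start=1):
        points.append((t, (n - failed) / n))
    return points
```

Between successive points the curve is held constant, which reproduces the horizontal runs and vertical drops of the approximate reliability curve.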
The question that immediately arises is "How many system simulations must be run in
order for a curve constructed in this manner to be smooth enough to provide a meaningful
approximation to the true system reliability curve?" Because the question of "what is smooth
enough?" cannot be precisely stated without a series of opinionated assumptions, a simpler,
much less rigorous method of evaluation was used. The number of runs was arbitrarily set
at 100 and a curve was plotted for a particular Transor voted system. This was compared to
a series of points computed from the analytical reliability expression for the same system
subject to the same failure rates. The curve and points are shown in figure 6. The
correspondence of the curve and the set of points was close enough that no increase in the
number of simulated runs was considered necessary. This relatively low number of runs
had the distinct advantage of requiring a computer running time of only about 30 seconds,
including compilation time, while producing acceptable results.
One more detail must be pointed out before the curves can be completely understood.
The primary interest in the study was the effect of changes in the ratio of the probabilities
of steady-state to transient errors. For this reason, the total failure rate of the signal
processors was held constant for all simulations. This means that not only the general shape
of the reliability curves can be meaningfully compared, but also their locations relative to
the time axis. Holding the total failure rate constant in no way restricts the generality of
the results because a change in this rate would simply cause a linear shift of the curves along
the time axis.
Figure 4. Typical Histogram
(ordinate: percent of system failures during each interval; abscissa: time intervals)
Figure 5. Approximation to Reliability Curve
(ordinate: percent of systems operating; abscissa: time)
Figure 6. Transor Order 5 Redundancy
(ordinate: fraction of systems operating at time (t))
B. CURVES DISCUSSION
The first Transor simulations showed that in the region where Transor was competitive
with the threshold voter, the optimum T_0 and T_1 were both equal to two for an order five
system. The discovery that this relationship held even under highly asymmetric failure probability
conditions stimulated the development of the Hamming Distance Restoring Circuit. It has
since been shown analytically (see Section III) that the Hamming Distance Circuit always
dominates the Transor for order five redundancy applications. This result correlates with the
simulation comparison for the same configuration, subject to the same failure mode
conditions. An example of the simulation results is shown in figure 7.
In comparing the curves for the Hamming Distance Restoring Circuit and those for the
threshold voter, it has been found that the latter tends to produce a more reliable output for
steady-state to transient error probability ratios below approximately seven to one (7:1), and
the Hamming Distance Restoring Circuit is slightly more reliable above that ratio. This ratio
cannot be exactly determined because certain worst case assumptions have been made in
establishing system operational rules for both circuits. These assumptions are slightly more
detrimental to one than the other and may not be precisely realistic in either case. This is
demonstrated by the combination of points and curves shown in figure 8. In this figure, the
Hamming Distance curve appears to be slightly better than the threshold simulation curve in
the high reliability region of the curves and worse in the long life region. For this plot of
the threshold curve, the assumption was made that the first steady-state error to occur in any
processor assumed permanent control of the output of the processor and any future transient
or steady-state errors in that processor were ignored. The points in that same figure were
plotted from a theoretical analysis in which it was assumed that the most detrimental
steady-state error which had occurred always controlled the outputs. This worst case assumption
does not affect the Hamming Distance curve but it heavily influences the threshold curve.
Under this assumption, the Hamming Distance Restoring Circuit clearly dominates over a
large section of the curve.
It is interesting to observe the changes which occur in the reliability curves of the
restoring circuits as the ratio of steady-state to transient error probabilities is increased.
The fact that the Hamming Distance curve and the threshold curve get closer together until
they cross as this ratio is increased indicates that one or both of the curves are shifting in
response to the change. The first possibility seems to be the case. The points on the
threshold curve tend to remain fixed. (NOTE: a slight shift to the right may be observed. This
is caused by a reduction in P_a10 as the ratio increases.) The Hamming Distance curve
is sensitive to changes in the ratio and shifts rapidly enough to the right to overtake the
threshold curve. At approximately the ratio when this occurs, the Hamming Distance curve rapidly
becomes less sensitive to changes in the ratio. As the ratio continues to be increased, the
curve stabilizes and finally begins to fall back slowly to the left, thus indicating that an
optimum ratio exists in the region near (7:1). This phenomenon appears to be caused by the discrete
nature of the threshold which controls the Hamming Distance decision rules. As the seven to
one (7:1) ratio is greatly exceeded, the threshold of the Hamming Distance should be reduced
to (1) if additional improvement in the reliability curve is to be expected. This threshold
reduction, however, would make the circuit vulnerable to single transient errors. Despite
the probable improvement in the overall reliability curve, this sensitivity to single failures is
generally considered undesirable. For this reason, no effort was made to simulate systems
with this threshold.

Figure 7. Comparison of Transor and Hamming Distance
(ordinate: fraction of systems operating at time (t))
Figure 8. Comparison of Threshold and Hamming Distance
In figure 9, a comparison is made between an order five Hamming Distance curve and
an order seven threshold curve at a ratio of seven to one (7:1). It can be observed that in
the high reliability region, the curves are almost indistinguishable. This implies that under
these ratio conditions, an order five Hamming Distance restorer system might be as useful
as an order seven threshold voter system. This would allow an obvious saving in redundant
equipment.
Figure 9. Comparison of Order 7 Threshold and Order 5 Hamming Distance
(ordinate: fraction of systems operating at time (t))
VI. CONCLUSIONS
From the results obtained by manipulating the analytical reliability expressions for
the Transor and Hamming Distance Restoring Circuits, it may be concluded that the output of
the Hamming Distance Circuit is more reliable than that of the Transor in order five redundant
systems. This conclusion holds for any ratio of steady-state to transient error probability
or any asymmetry (tendency toward "ones" or "zeros") of error probabilities.
From comparison of the simulation curves, it may be concluded that the threshold circuit
is more reliable than either of the dynamic restoring circuits until the ratio of the
probability of steady-state errors to the probability of transient errors exceeds approximately
seven to one. Above this ratio, the dynamic restoring circuit outputs are more reliable.
Further comparison reveals that the difference in the reliability curves tends to stabilize or
slightly decrease as the ratio becomes much larger than 7:1. The stabilizing effect is more
pronounced as the order of redundancy is increased from five to seven.
Finally, it may be concluded that in the short life, high reliability region with
approximately a seven to one probability ratio, an order five system using Hamming Distance
Restorers may be as reliable as an order seven system using threshold voters.
Appendix 6
SELF REPAIR TECHNIQUES
Self-Repair Techniques
For Failure-Free Systems
Contract Nasw - 572
Reference WGD-38521
by
M. R. Cosgrove
C. G. Masters
September 1963
The Westinghouse Electric Corporation
Electronics Division
BOX 1897, Baltimore 3, Maryland
_VE. Lomax, Director
Advanced Development Engrg.
ABSTRACT
This report describes the initial step in the design of an optimal self-repairing
system. The report contains a description of the several classes of "repair" strategies
under consideration and the computer simulation program which is used to determine the
performance of the systems for each strategy.
The computer simulation program determines the performance of a particular strategy
by injecting random failures throughout the system and simulating system reaction according
to the "repair" pattern of the strategy in question. The program prints out system performance
in terms of:
1. total time to failure
2. average time to failure
3. number of failures to system failure
4. number of switches affected.
The results for the two classes of strategies for which curves were drawn show
that with the addition of a minimal amount of self-repair capability, the reliability of the
system can be substantially increased over that of a comparable system using fixed
redundancy alone for failure protection.
TABLE OF CONTENTS
ABSTRACT
I. INTRODUCTION
II. STRATEGY DESCRIPTION
   A. Basic Assumptions
   B. Basic Strategy Classes Considered to Date
III. THE COMPUTER SIMULATION PROGRAM
   A. The Reason a Simulation Program was Used
   B. How the Program Works
   C. Sample Format
   D. Production Format
IV. RESULTS
   A. Failures Withstood (as percent of system) vs. Spare Mobility
   B. Reliability vs. Time Curves
V. SUMMARY AND CONCLUSIONS
VI. FUTURE STUDIES
VII. APPENDIX
LIST OF ILLUSTRATIONS
Figure 1. Multiple-line Redundant System
Figure 2. Multiple-line Redundant System with Self-Repair Capability
Figure 3. Probability Distribution of a Component Failure
Figure 4. Simulation Matrix
Figure 5. Average Number of Failures Withstood (as Percent of Gamma 1 Systems) Versus Number of Moves per Spare
Figure 6. Average Number of Failures Withstood (as Percent of Beta Systems) Versus Number of Spares per Block
Figure 7. Minimum Number of Failures (as Percent of Gamma 1 Systems) Versus Number of Moves per Spare
Figure 8. Minimum Number of Failures (as Percent of Beta Systems) Versus Number of Spares per Block
Figure 9. Percent of Systems Operating (Beta Class) Versus Time
Figure 10. Percent of Systems Operating (Gamma Class 1) Versus Time
I - INTRODUCTION
In an effort to increase the reliability of complex electronic systems, several methods
have been proposed for using "redundant" equipment to provide failure protection within these
systems. Two of the most useful types of redundancy techniques are multiple-line, majority
voted logic and multiple component grouping schemes. Although both techniques are very
effective, a large percentage of the "redundant" equipment is not efficiently used, i.e., the
system fails with much of the "redundant" equipment still functioning. This undesirable
feature is inherent in systems of this type because random failures do not tend to distribute
evenly throughout the system; instead, they almost invariably tend to group and cause a
critical failure pattern to occur in one subsystem area before many failures have occurred
in the remainder of the system. The most drastic example of this is the failure of an order
three, multiple-line, majority voted system upon the occurrence of two successive failures
in the same stage with no other failures in the remaining stages.
Westinghouse has devised a new solution to the failure protection problem which exploits
most of the desirable features of the multiple-line, majority-voted schemes, but is not as
sensitive to critical failure patterns as the more standard techniques. This solution is in the
form of a set of strategies for allowing the reorganization of the systems in response to
failure patterns which may develop. The systems which employ these strategies are called
self-repairing systems.
The general approach of the self-repair strategies can be described through the use
of an example. Figure 1 shows a block diagram of an order three, multiple-line system.
Figure 2 shows the same system after some self-repair capability has been added. It is
assumed that all blocks in the system are functionally identical, such as the multivibrators
in a shift register, and are interconnected by switching and voting circuits. If two blocks in
the same column fail and the blocks on either side of this column are still operating, the
self-repair switching mechanism senses this condition and shifts the required additional
working blocks to the failed column. The failed block can now be eliminated or "voted out."
This procedure decreases the remaining protection provided to the adjacent columns, but it
prevents system failure at a critical point and thus extends the life of the system. As
additional blocks fail, other blocks are switched into the failed columns. The choice of
which block shall be brought in to aid the vulnerable column is determined by the particular
strategy in use.
Figure 1. Multiple-line Redundant System
Figure 2. Multiple-line Redundant System with Self-repair Capability
The unique feature of these strategies is that the switching circuitry can be completely
distributed rather than "lumped" into a central controller. As a result, most failures in
the switching circuitry are equivalent to signal processor (block) failures and are elimi-
nated in the normal manner. This means that individual failures in the switching circuitry
do not cause the loss of the entire self-repair capability.
Before a "hardware" design of self-repairing systems can begin, the full range of
feasible switching strategies must be examined, and from these an optimum strategy or set
of near optimum strategies must be selected. The majority of this report is concerned with
a description of some of the more promising strategies and with the computer program
which is being used to simulate the failure response of systems which employ these
strategies.
There are a great number of possible strategies which may be investigated, many of
which are quite similar to one another. The strategies being considered are arranged in
groups called classes, the individual members of which are special cases of the general class.
This allows the investigation and programming of a few classes of strategies rather than
many individual strategies. This facilitates comparison of strategies within a class as well
as adding a certain degree of generality to the analysis.
Before proceeding to the description of specific strategies or classes of strategies,
the properties a self-repairing system should have must be noted and the basic assumptions
stated. A short list of the general desirable properties is compiled below.
a. Self-repairing systems should be more reliable than ordinary
redundant systems of identical function capability and cost.
b. The switching strategy used should make optimum use of the
redundant function blocks for a fixed amount of switching
complexity.
c. Instantaneous failure masking must be provided for system
applications which cannot withstand a temporary loss of data.
An example of this is the key-stream generator used in secure
communication channels.
d. The strategy must be suitable for implementation by a distributed
(non-centralized) switching network.
II - STRATEGY DESCRIPTION
A. BASIC ASSUMPTIONS
Almost all large computing and control systems are formed by interconnecting a
relatively small number of different types of basic circuit blocks. As a result, the com-
ponents of these systems can be split up into homogeneous groups of functionally similar or
identical blocks. It is assumed, therefore, that such groups can be formed and that
self-repair strategies can be applied within each group. Note: The members of any group are not
required to be physically or functionally adjacent but may be located in scattered sections of
the overall system.
It is also assumed that at least two blocks must be performing the same nominal function
before a failure can be detected, and at least two correctly operating blocks must be
performing the same function before a third (failed) block can be eliminated from this function.
If at least three blocks are performing a function and one of them fails, the elimination
process is assumed to be instantaneous, and the failure is assumed to be completely masked.
If, however, only two blocks are performing the function and one fails, a third block must be
switched to that location to eliminate the failure. This process is not assumed to be in-
stantaneous and errors appear in the system temporarily. As a result, systems using the
basic order-three redundancy with self-repair (as will be described in the Beta and Gamma
Class strategies of this report) must be capable of withstanding temporary data loss without
mission failure. If this assumption is not true, a higher order of redundancy must be used
as in the Alpha class strategies or higher-order versions of the Beta and Gamma classes.
If, because of particular failure and response patterns, single blocks are left to
perform particular functions, it is assumed that the system continues to operate with one or
more stages existing in the non-redundant state either until one of these blocks fails or until
another critical failure pattern occurs elsewhere in the system.
Finally, it is assumed that a stage shown pictorially at one end of a system is, in
reality, adjacent to the opposite end and enjoys the same repair facilities as stages shown in
the center of the system.
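The basic assumptions above can be restated compactly in Python. The sketch below is illustrative only: the function names `response_to_failure` and `adjacent_stages` and the returned labels are assumptions introduced here, not names taken from the simulation program.

```python
def response_to_failure(working_before):
    """Classify one block failure in a stage that had `working_before` good blocks."""
    if working_before >= 3:
        # Two good blocks remain, so the failure is voted out at once and masked.
        return "masked"
    if working_before == 2:
        # A third block must be switched in; errors appear temporarily.
        return "switch_in_block"
    # The stage was non-redundant and its single block has now failed.
    return "system_failed"


def adjacent_stages(stage, n_stages):
    """Neighbours of a stage, with the two ends of the system treated as adjacent."""
    return ((stage - 1) % n_stages, (stage + 1) % n_stages)
```

For a five-stage system, `adjacent_stages(0, 5)` returns `(4, 1)`, reflecting the assumption that an end stage enjoys the same repair facilities as a center stage.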
B. BASIC STRATEGY CLASSES CONSIDERED TO DATE
The following few paragraphs will indicate the general principles of each of the three
strategy classes which have been simulated thus far. Detailed examples of each class are
shown in the Appendix, and the reader will probably need to refer to these for detailed
consideration of the following descriptions.
1. Alpha (α) Class
Systems employing the α class strategies are basically multiple-line redundant
(usually order three) systems which are equipped with sets of spares. These spares are
additional function blocks which can be automatically used to replace failed blocks. In
general, spares cannot economically be given enough mobility to allow a single spare to be
capable of replacing each operational block in the entire system. Instead, individual spares
are usually given restricted capability and may replace only blocks in a single row* or
portion of a row. A large number of strategies, each belonging to the (α) class, can be
generated by varying (a) the total number of spares available for a fixed system size, (b)
the mobility of each spare, and (c) the pattern in which the spares' repair capabilities overlap.
If it is assumed that spares will immediately replace failed blocks regardless of
whether it is the first failure in a function column or not, complete failure masking is
achieved. The threshold vote technique will continue to absorb failures after the spares
complement is exhausted until a majority of unrepairable failures have occurred at a
particular function. At this point the system will fail since both the self repair capability
and the network redundancy have been exhausted.
2. Beta (β) Class
Beta Class strategies do not utilize inactive spare blocks as does Class α.
With no failures, the system operates as an ordinary multiple-line redundant system. When
a critical failure, i.e., one which would cause failure of a multiple-line redundant system,
occurs, the failed block is removed from the system and replaced by a properly functioning
block from an immediately adjacent function. The individual strategies in this class differ
from one another primarily in the number of spares which they can draw from the rest of
the system.
Because failures are replaced by function blocks only from the adjacent functions,
there is a smaller amount of switching circuitry involved with Class β than with other classes
of self-repair strategies. This advantage is partially offset, however, by the one drawback
inherent in this class of strategies. That is, these systems are more vulnerable to failures
which are grouped in one area of the system than are the more flexible strategies.
The three strategies of this type which have been simulated are described in the
Appendix. These particular strategies do not usually allow blocks to move a second time
after an initial repair has been made. This restriction has been made for a variety of
reasons, but other strategies are being considered which will release this restriction. In
addition, strategies having increased spare mobility will be considered in future studies.
* For example, the top line or row of signal processors in Figure 1.
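A Beta Class repair step of the kind described above might be sketched as follows. This is a hypothetical illustration, not one of the three simulated strategies, whose details are given only in the Appendix. The function name `beta_repair`, the bookkeeping by counts, the left-before-right search order, and the donor rule (a donor gives up only a never-moved block and must keep two working blocks) are all assumptions made here. The sketch also assumes at least three stages.

```python
def beta_repair(native, guests, c):
    """Try to aid stage c by borrowing one block from an adjacent stage.

    native[k] -- working blocks in stage k that have never been moved
    guests[k] -- working blocks in stage k that were borrowed earlier
    Both lists are updated in place. Returns True if a block was moved,
    False if neither adjacent stage has an eligible block.
    """
    n = len(native)
    for d in ((c - 1) % n, (c + 1) % n):  # end stages wrap around
        # Assumed donor rule: move only a never-moved block, and leave the
        # donor stage with at least two working blocks.
        if native[d] >= 1 and native[d] + guests[d] >= 3:
            native[d] -= 1
            guests[c] += 1
            return True
    return False
```

Because a borrowed block is counted under `guests` and is never offered to another stage, the sketch reflects the restriction that blocks do not move a second time after an initial repair.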
3. Gamma (γ) Class
The Gamma (γ) Class of self-repair strategies contains much more variety
than either Class α or Class β. The class is characterized by a shifting of the spare
blocks in one direction to alleviate the critical condition caused by the failed function
blocks. Unlike the strategies of Class β, it is possible for a spare to move several times
in response to failures. When a critical failure occurs, one of the function blocks adjacent
to the failure will replace it, leaving a void. This void, if it creates a vulnerable situation,
i.e., one block per function stage, will be filled by the function block immediately adjacent
to it in the opposite direction from the original failure. The next failure to occur in the same
stage as the original failure causes another shift of the function block now adjacent to the
failure. This may be a function block which has already shifted in response to a failure. As long
as spares are available, they will continue to shift laterally to replace failed blocks or to
fill voids.
Since the spare function blocks are allowed much more mobility in this class of
strategies, more failures can be corrected. However, the amount of switching circuitry
necessary to implement the strategies is a monotonically non-decreasing function of the
mobility of the spares. This creates problems of implementation which limit the usefulness
of high spares mobility.
The individual members of Class γ strategies differ primarily in the amount of
mobility allowed to the function blocks. This, in turn, affects the failure absorption
capabilities of the strategies. Again, the individual strategies are described in more detail in
the Appendix.
III. THE COMPUTER SIMULATION PROGRAM
A. THE REASON A SIMULATION PROGRAM WAS USED
Although the reorganization features of self-repairing systems improve the failure
absorption capability of redundant networks, these features drastically affect the analytical
reliability expressions developed for multiple-line, majority-voted systems. Not only does
a slight amount of reorganization capability greatly complicate the expressions, but each
modification of each strategy class appears to require a different solution. Extensive efforts
to model some of the simpler self-repairing systems have been unsuccessful. Because of
this, efforts to write exact reliability expressions have been dropped, and a general computer
simulation program has been written to facilitate a Monte Carlo approach to the reliability
analysis. This program can be used to simulate a broad range of strategies, and it provides
data about the actual switching patterns which tend to occur in a system. This latter
information could not be easily determined from reliability expressions even if they were
available. A plot of reliability versus time can be obtained directly from the program results
with no more input information than would be required by calculations made using
analytical expressions.
B. HOW THE PROGRAM WORKS
1. The General Program Philosophy
A redundant system of the desired order of redundancy and number of functions
is set up in matrix form. The strategy class is then selected from a group of sub-programs
and input data which specifies the particular strategy to be tested is read in. Through the
use of a series of random numbers, individual blocks are designated as failed, and the
switching strategy responds to each failure until the system fails to pass the operational
criteria. A second series of exponentially distributed random numbers determines the time
between each simulated failure, and the sum of these is the time to system failure. Once
the system fails, the pertinent data is recorded, and the computer resets and begins to
generate two new sets of random numbers. Continued repetition of this process provides
the compilation of data mentioned in part A of this section. The following paragraphs indi-
cate specifically how the various portions of the program work and the form of the print
out.
2. The Failure Selection Program
A simple procedure for randomly selecting the failed function blocks has been
set up. Each block is assumed to have an exponentially decaying reliability R(t) = e^{-λt}, where
λ is a constant failure rate. It has been shown that the conditional probability that a failure
has occurred in the i-th block given that a failure has occurred in the system is equal to the
constant

$$ \frac{\lambda_i}{\sum_{i=1}^{N} \lambda_i} $$

If the interval between zero and one is split into N subintervals, each proportional
to the associated conditional probability, a set of random numbers uniformly distributed
between zero and one can be used to determine which blocks fail, with the correct conditional
probability of picking any one block. In this particular computer program, the random number
specifies the block to be failed. The system then responds to eliminate the failed block. If
the response is possible, i.e., a spare block is available to make the repair, a new random
number is chosen and the procedure repeats. If no spare is available, the system is judged
as failed.
3. Time Determination
For each of the simulated failed blocks selected above, a time to failure for the
block is also determined. A.M. Mood 1 has shown that random numbers taken from the
uniform distribution can be transformed into any desired continuous distribution by letting
$$f(y) = 1, \quad 0 < y < 1$$
$$y = G(x)$$
where $G(x)$ is the cumulative distribution of x.
This relationship is shown graphically in figure 3.
Figure 3. Probability Distribution of a Component Failure (ordinate: y, uniformly distributed random numbers from 0.0 to 1.0; abscissa: x, random numbers distributed as G(x))
1 Mood, A.M., Introduction to the Theory of Statistics, McGraw-Hill Book Co., Inc., 1950.
Y is a single valued function of x and vice versa. For each Y chosen from a uni-
form distribution, a unique value of x is determined.
The G(x) function which is of particular interest here is $G(t) = 1 - R(t) = 1 - e^{-\lambda t}$.
This is the distribution function associated with the probability that the first failure has
occurred within a system. This curve is shown in figure 3.
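Solving $y = 1 - e^{-\lambda t}$ for t gives $t = -\ln(1 - y)/\lambda$, which is the transformation applied to each uniform random number. A minimal sketch follows; the function name is an assumption made for illustration.

```python
import math


def exponential_from_uniform(y, rate):
    """Invert G(t) = 1 - exp(-rate * t) for a uniform number y in [0, 1).

    Returns the exponentially distributed time that corresponds to y.
    """
    return -math.log(1.0 - y) / rate
```

A value of y near zero maps to a very short time to failure, and values approaching one map to increasingly long times.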
For the first function block failure, a random number is chosen from a uniform
population and transformed to a corresponding number from the exponential distribution.
This latter number is the time from system start to the first failure. To calculate the time
to the second failure, the $\lambda_i$ associated with the first failed block should be subtracted from
the $\sum \lambda_i$ and the procedure repeated. The new number thus obtained would be the time from
the occurrence of the first failure to the occurrence of the second failure. When the system
fails, the sum of these individual failure times will determine the total system operating
time.
In the present program, the above procedure is slightly modified to make computations
easier. Instead of decreasing the $\sum \lambda_i$ after each failure, this sum is left the
same and blocks are allowed to fail more than once. When a block fails for the second time,
no action is taken other than to add the time to this failure to the system operating time.
This modified procedure would not be acceptable if the times between subsystem failures
were of interest, but since total system operating time is the only factor to be considered,
the results are almost identical to those which would be obtained in the more straightforward
approach.
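The two timing procedures can be compared with a small sketch. It assumes a deliberately simplified stopping rule (the run ends once k distinct blocks have failed) in place of the real switching strategies, and all names are hypothetical.

```python
import math
import random


def draw_interval(rng, rate):
    """Exponentially distributed interval obtained from one uniform number."""
    return -math.log(1.0 - rng.random()) / rate


def pick(rng, rates):
    """Pick an index with probability proportional to its rate."""
    u = rng.random() * sum(rates)
    acc = 0.0
    for i, r in enumerate(rates):
        acc += r
        if u < acc:
            return i
    return len(rates) - 1  # round-off fallback


def time_to_k_failures_direct(rng, rates, k):
    """Straightforward procedure: a failed block's rate leaves the sum.

    Requires k <= number of blocks.
    """
    live = list(rates)
    t = 0.0
    for _ in range(k):
        t += draw_interval(rng, sum(live))
        live[pick(rng, live)] = 0.0
    return t


def time_to_k_failures_modified(rng, rates, k):
    """Modified procedure: the sum stays fixed and blocks may fail again."""
    total = sum(rates)
    failed = set()
    t = 0.0
    while len(failed) < k:
        t += draw_interval(rng, total)
        failed.add(pick(rng, rates))  # a repeat failure only adds time
    return t
```

In the modified version the intervals are shorter on average, but some of them end in repeat failures that change nothing, so for this simplified stopping rule the accumulated operating time should follow the same distribution as in the direct version.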
4. The System Reactions
It is obvious that many specific reactions are different for different strategies,
but the general manner in which the program performs the various shifts and the type
of "bookkeeping" involved can be briefly described. Figure 4 schematically illustrates the
form in which the computer "views" the system to be simulated. The height of the "basic array"
is set by the original order of redundancy, the width by the number of stages, and the depth
by the number of data words associated with each block. The "failed block array" is a
two-dimensional array into which the data words for failed blocks are shifted as the failures
occur. The only indication to the computer that a block has failed is the shifting of these
data words into this latter array.
When a set of data words is moved into this array, the computer examines the
remainder of the system and makes any necessary response. This is done by shifting the
data words associated with the appropriate spare blocks from their original locations into
the locations specified by the particular switching strategy being considered.
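The bookkeeping described above might be represented as in the following sketch. The nested-list layout, the use of None for an empty location, and the function names are assumptions made for illustration and are not taken from the original program.

```python
def make_basic_array(m, n, k):
    """Basic array: m rows (order of redundancy) by n stages.

    Each location holds the k data words of one block, or None when empty.
    """
    return [[[0] * k for _ in range(n)] for _ in range(m)]


def fail_block(basic, failed, row, stage):
    """Shift a block's data words into the failed block array."""
    words = basic[row][stage]
    if words is None:
        raise ValueError("location is already empty")
    failed.append(words)  # failed is a list of data-word lists (two-dimensional)
    basic[row][stage] = None


def move_spare(basic, src, dst):
    """Shift a spare's data words from src=(row, stage) into an empty dst."""
    (sr, ss), (dr, ds) = src, dst
    if basic[sr][ss] is None or basic[dr][ds] is not None:
        raise ValueError("invalid move")
    basic[dr][ds] = basic[sr][ss]
    basic[sr][ss] = None
```

A failure is recorded with fail_block, after which the strategy under test decides which move_spare calls, if any, make up the response.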
Figure 4. Simulation Matrix (basic array with height M = order of redundancy, width N = number of stages, depth K = data words per block; a failure response shifts data words into the failed block array; empty locations are marked)
C. SAMPLE FORMAT
A check must be made to determine whether the computer simulation program is
operating correctly, i.e., selecting the correct function block for failure according to the
random number set, responding properly to failures according to the particular strategy,
and failing at the proper time and under the proper conditions. In order to accomplish this,
a sample format has been developed. This sample format prints out the following information:
1. * The function block designations and the random number range
which describes failure of the block.
2. * A list of failures which occur with all the information associated
with the failure such as:
a. The random number which was selected
b. The location of the failed block
c. The amount of time from the previous failure to the time
of failure of the block in question
d. The cumulative time from the beginning of system operation.
3. The average time between failures.
This information is printed out for each failure until the system fails.
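A per-failure trace of this kind could be produced as in the sketch below. The line layout and field names are assumptions for illustration; the original printout format is not reproduced here.

```python
def trace_lines(failures):
    """Format one line per failure plus a closing average.

    failures: list of (random_number, location, interval) tuples, where
    interval is the time elapsed since the previous failure.
    """
    lines = []
    elapsed = 0.0
    for u, loc, dt in failures:
        elapsed += dt  # cumulative time from the start of system operation
        lines.append(f"u={u:.4f} block={loc} dt={dt:.2f} t={elapsed:.2f}")
    if failures:
        average = elapsed / len(failures)
        lines.append(f"average time between failures={average:.2f}")
    return lines
```

Printing the returned lines for each run gives a record that can be checked by hand against the rules of the strategy being simulated.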
When a critical failure of a function block occurs, an operating spare is switched into
the vacant position by assigning random number limits of the spare block to the failure
location. This permits checking of the switching pattern to determine if the simulation
program is working, since an incorrect switching operation will place the random number
limit designation in the wrong position. This event can be detected when the incorrectly
switched function block fails and the position specified by the random number does not
correspond to that printed out in the sample format.
To check a strategy, several runs are made using different random number sequences.
The sample format prints out the above information for each case. From this information
a determination can be made as to whether the simulation is following the rules for the
particular strategy.
In addition to performing the function of checking the simulation program, the sample
format provides another valuable service. By observing the vicissitudes of the system with
respect to the switching patterns which develop, information can be gained about changes in
the strategy which might profitably be used to implement more efficient system operation or
more economical switching circuitry implementation. This is the manner in which Class γ2
was derived from Class γ1.
D. PRODUCTION FORMAT
A typical production run of the computer program simulates system operation for one
hundred randomly selected failure patterns. Up to the present time, all runs have included
one hundred patterns simply because relatively good estimates of the average system para-
meters such as total time to fail, number of failures withstood, etc. are obtained without
requiring excessive amounts of computer time.
The production format directly provides the following information for each of the one
hundred cases:
1. Average time between function block failures
2. Total time to system failure
3. Total number of function block failures before each system failure
(including multiple failures of the same block)
4. Net number of failed function blocks at time of system failure
5. Total number of switching moves experienced by each system
6. Total number of moves made by each spare function block.
In addition to printing out columns of numbers covering the first five items on the
list above, most of the data is compiled into bar graphs. Each of these graphs reflects the
performance of the set of one hundred runs with respect to a particular parameter. On the
graphs, either discrete points (e.g., net number of failures) or interval terminal points
(for continuous parameters such as time) are plotted on the abscissa. The height of the bar
above each point or interval shows the number of spares or system simulations which are
described by these positions on the abscissa. The program includes a normalization routine
for each graph which is used to compute the average, the variance and the standard deviation
associated with each graph.
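The statistics computed for each bar graph can be sketched as follows. The sketch assumes that each bar is represented by a single abscissa position and that the population form of the variance is wanted; neither detail is stated in the report.

```python
import math


def histogram_stats(positions, counts):
    """Mean, variance and standard deviation of a bar graph.

    positions: abscissa value of each bar (e.g. number of failures withstood)
    counts: height of each bar (number of runs falling at that position)
    """
    n = sum(counts)
    mean = sum(x * c for x, c in zip(positions, counts)) / n
    variance = sum(c * (x - mean) ** 2 for x, c in zip(positions, counts)) / n
    return mean, variance, math.sqrt(variance)
```

For a continuous parameter such as time, the interval midpoints could serve as the positions.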
IV. RESULTS
The strategies discussed here (and any new ones which may be invented) must be com-
pared and contrasted to determine their usefulness in increasing the reliability of electronic
systems. The primary goal of this comparison is the determination of which strategy pro-
vides the greatest net increase in system reliability. Because it appears that the switching
circuitry associated with spare blocks increases as the mobility of these blocks increases
and because the failure protection effectiveness of added flexibility is non-linear, it cannot
be simply assumed that the best strategy is the one with the greatest spare block mobility.
The best way to compare these strategies would be to completely design functionally
identical systems using each strategy; get the best available estimates of the failure rates
of all the parts; feed this into the computer program and, in the manner described below,
plot the reliability versus time curves. The comparison would merely require that one
directly observe which strategy has the highest reliability curve. This approach would re-
quire a detailed system design for all strategies. To avoid wasting time on strategies which
can be shown to be inferior to others with much less detailed input data, several less exact
comparisons can be made. These comparisons, which are described below, are the ones
which are being made at this point in the study.
A. FAILURES WITHSTOOD (AS PERCENT OF SYSTEM) vs. SPARE MOBILITY
An important consideration in the comparison of systems is the number of failures
which can be withstood without system failure. In order to compare strategies with one
another where the variable is the number of moves allowed per spare, the number of
failures withstood is an important and meaningful criterion. To further compare systems
of different sizes on a common base the curves plotted for these systems are expressed in
terms of average percent of total system failed versus spare mobility. In figure 5 curves
are plotted for three systems of different sizes (24, 48 and 96 stages) employing strategy γ1.
They are plots of average percent of failures versus number of moves per spare.
These curves provide very useful and interesting results. They are characterized by
a sharp rise, a knee and a rapid leveling off. The knee occurs at a small number of moves
per spare compared to complete (total system) spare mobility. According to this graph, a
great increase in number of failures withstood by a system is effected by increasing spares'
mobility up to a point. The increase, then, is diminished and a point is reached beyond
which little or no increase in number of failures withstood accompanies an increase in
mobility. The characteristic exhibited by these curves illustrates that great increases can
be attained in system performance by the introduction of self-repair Class γ1 with
relatively little mobility. The addition of more mobility adds little to the effectiveness of
the technique. This indicates that the most gain is attained with a small degree of mobility;
therefore, the most efficient operation of the technique can probably be accomplished with
relatively little switching circuitry.
Figure 5. Average Number of Failures Withstood (as Percent of Gamma 1 Systems)
Versus Number of Moves Per Spare (curves for 24, 48 and 96 functions per system, with a blow-up of one curve)
Plots have also been made for the percent of system failed vs. number of spares per
function block for the β class strategies. These plots are illustrated in figure 6. The
curves in figure 6 are plots of the Average Number of Failures Sustained versus Number of
Spares per Function Block. The results show substantial gains over the multiple-line case
for each increase in spare mobility. These curves are restricted to low mobilities because
the Beta class draws spares to replace failures only from the immediately
surrounding area.
Since an important consideration is the worst failure patterns, a plot is shown of the
lowest number of failures sustained before system failure vs. mobility for the Gamma
Class strategies. (See figure 7.) These curves agree very closely with those of figure 5,
thereby substantiating the conclusion even for the worst case.
Figure 8 shows the Minimum Percentage of Failures Sustained versus Number of
Spares per Function Block for the three different length β Class systems. These curves,
like those for class Gamma, show a gain over the multiple-line system for each advance in
mobility.
B. RELIABILITY VS. TIME CURVES
The reliability of a system as a function of time is the probability (P) that the system
will be operating correctly at that time, or, out of a given sample, s, P × s of these will be
operating correctly. From the production run printout of the computer program, it is
possible to plot the percentage of the systems which are operating versus total operating
time. This plot closely approximates the reliability curve associated with a particular
strategy. The plots made here represent one minus the cumulative sum of the bars of the
graph for number of systems failed versus time. For each interval of time in which failures
occur a step function is subtracted from the curve corresponding to the number of systems
which failed in that interval. This process produces a curve which is a series of discrete
steps, starting at 1 and going to 0 as time increases. Smoothing out this curve would result
in a curve which is identical in form to the standard s-shaped reliability versus time curve
which is common to redundant systems.
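The step curve described above, one minus the cumulative sum of the bars, can be sketched as follows. The function name and the example counts are illustrative assumptions.

```python
from itertools import accumulate


def reliability_from_bars(counts):
    """Fraction of systems still operating at the end of each time interval.

    counts[i] is the number of simulated systems that failed in interval i.
    """
    total = sum(counts)
    return [1.0 - failed / total for failed in accumulate(counts)]


# Hypothetical example: four runs failing in intervals 0, 1, 2 and 2.
example_curve = reliability_from_bars([1, 1, 2])
```

The curve starts at 1 before the first interval and reaches 0 at the end of the last interval in which a failure was recorded.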
As was mentioned in the introduction to this section, this type of curve would be an
excellent comparative tool if accurate estimates of the switching circuit failure rates could be
made using completed system designs. Because the designs are not yet available, the usefulness
of these curves is restricted to that of investigating which strategies are best under
certain limiting failure rate conditions. Even under these conditions, the reliability versus
time curves are very useful because they provide a universal means of comparing all strategies
in all classes.
Figure 6. Average Number of Failures Withstood (as Percent of Beta Systems)
Versus Number of Spares per Block (curves for 24, 48 and 96 functions per system)
Figure 7. Minimum Number of Failures (as Percent of Gamma 1 Systems)
Versus Number of Moves Per Spare (curves for 24, 48 and 96 functions per system)
Figure 8. Minimum Number of Failures (as Percent of Beta Systems)
Versus Number of Spares Per Block (curves for 24, 48 and 96 functions per system)
Examples of these curves for the Beta and Gamma Class strategies are shown in
figures 9 and 10. The following comments indicate some of the significant features of
these curves.
1. Beta Class Reliability Curves
The reliability curves for the three members of the class are shown in figure 9.
The curve for an order-three, multiple-line redundant system is also shown. These curves
show a significant gain in reliability of all three strategies of the Beta Class over the
redundant case. The effective gain will not be as great in reality because perfect switching has
been assumed in plotting the curves.
With the limited amount of switching allowed to strategy β1, an increase in
MTBSF of approximately 100% results. As more switching capability is allowed to the
system the reliability continues to increase, showing that strategy β3 provides a significant
increase in reliability over either β1 or β2 and a very significant increase over the
multiple-line redundant case.
2. Gamma Class Reliability Curves
Figure 10 illustrates the reliability curves for four gamma class strategies.
Illustrated are the limiting cases, 1 move per spare and 23 moves per spare*, as well as a
multiple-line redundant system. Two strategies of intermediate mobility are also shown.
These curves, again, show that the introduction of a minimal amount of switching
capability, 1 move per spare, causes a significant gain in reliability and operating time over
the redundant system. It is obvious, also, that the first few increases in mobility capability
of the spares induce further noticeable gains in reliability over the one move per spare case.
As additional mobility is granted to the system, the reliability gained begins to diminish.
This is illustrated by the fact that as much gain in reliability is attained by increasing
mobility from one to three moves per spare as is gained by going from three to twenty-three
moves per spare. This also reflects the flattening effect observed in the curves of percent
of Failures Sustained versus Mobility of the System, wherein the additional mobility after a
certain point bought no additional gain in reliability.
* 24 Function System
Figure 9. Percent of Systems Operating (Beta Class) Versus Time
Figure 10. Percent of Systems Operating (Gamma Class 1) Versus Time
V. SUMMARY AND CONCLUSION
Before self-repairing systems can be implemented, many feasible switching strategies
must be considered in an effort to determine the most effective manner to manipulate the
redundant or "spare" blocks. The extreme complexity of the reliability expressions associated
with these strategies has resulted in the use of a computer simulation program for comparing
the effectiveness of the strategies. Rather than proceeding to write separate programs for
each strategy, a more general program has been written which employs a small number of
subroutines, each of which describes an entire class of strategies. Input data determines
which class subroutine is being used and which strategy in a particular class is being simulated.
Although this generalized program is a great improvement over the approach of an individual
program for each strategy, it still requires additional programming each time a new
class subroutine is added. At this time, the change to a more general program, whose simulation
strategy can be completely determined from input data, does not seem to merit the
programming time which would be required.
The present program includes subroutines for three classes of switching strategies.
Each class subroutine contains a great deal of flexibility, thereby including many individual
strategies. This method facilitates easy comparison between members of a class. This
comparison allows immediate elimination of many possible strategies as obviously uneconomical.
For example, the flattening out of the Percent of System Failed versus Spare Mobility
curves (figures 5 through 8) indicates that all possible strategies on the flat part of the curves
cannot be optimum strategies.
From the results of the simulation program, curves for Percent of Systems Failed
versus Spares Mobility have been plotted for the Gamma Class strategies. These curves
have been referenced to that of a multiple-line majority voted system because this particular
technique has been the most effective of the passive, failure masking, circuit level redundancy
techniques. In all cases these curves show not only that great gains can be realized over
the multiple-line redundant scheme but that by far the greatest part of these gains is realized
for the first few moves allowed to the spare function blocks. Beyond the range of relatively
limited mobility, little or no gain in the average number of failures absorbed is realized by
the additional mobility allowed to the spares. This is an encouraging result since the great
majority of the gain due to self-repair can be retained without the use of an exorbitant amount
of switching circuitry.
In the β and γ classes of self-repair strategies the degree of failure masking is the
same as that for a multiple-line redundant system of the same order of redundancy. This
is due to the fact that no "repair" is made until an ambiguity is present on the output of a
stage. This event corresponds to redundant system failure which activates the switching
mechanism and the "repair" is effected. However, until the failure is "repaired" no
failure masking is present, and incorrect information may be transmitted to the next stage.
The α class strategies provide additional failure masking because repairs can be
initiated by the first occurrence of a failure in any stage. However, because this class implies
a higher order of redundancy it cannot be compared to order-three multiple-line
redundancy as the β and γ classes have been.
The curves of figures 9 and 10 show a very definite gain in reliability for the self-repair
strategies over multiple-line redundant systems. The curves for the Beta Class strategies
show an increase in reliability for each increase in "repair" capability. Strategy β3 yields
the highest reliability but even strategy β1 shows a significant gain over the multiple-line
system. The reliability curves for the Gamma Class show essentially the same result with
respect to the multiple-line case. However, investigation of the curves shows that increasing
the "repair" capability produces gains for the first few increases after which the magnitude
of the gain diminishes. These curves tend to bear out the conclusions drawn from Percent
System Failed versus Spares' Mobility curves which flattened out after a certain mobility
was reached. The gains illustrated here must be considered as ideal because the switching
circuitry for self-repair is here assumed to be perfectly reliable. More realistically, the
gains obtainable will be a function of the switching circuitry complexity and will not be as
great as shown here.
VI. FUTURE STUDIES
All of the computer simulation results discussed in this report have been based on
the assumption that the switching circuitry was perfectly reliable. Efforts are now being
made to determine the range of allowable failure rates which can be associated with each
strategy for it to be of maximum effectiveness. These ranges are to be studied as a function
of the failure rates of the associated signal processor blocks. As a result, before actual
system designs are begun, information specifying the optimum switching strategy corresponding
to a given signal processor failure rate should be available.
From the sample and production simulation run printouts it has become obvious that
many of the spare function blocks do not experience as many switching operations as they
have the capability for. When all spares are assigned a uniform mobility some reach their
limit and, in doing so substantially extend the life of the system. However, in many cases
when system failure has occurred, there are many spares remaining which have not been
used to any great extent. In order to capitalize on this phenomenon a class of strategies, γ2,
is being developed which will assign different mobilities to the spares in a stage. Class γ2
will be simulated by a new subroutine which is being written for the computer program.
When data is available, comparisons will be made between this and the other classes.
Additional classes will be simulated in a similar manner as they are developed.
None of the strategies considered so far have permitted spares to return to previous
locations. It is possible that removal of this restriction might add to the failure absorption
capability of a system. This area certainly should be explored in this study series.
Although little has been said about the physical switching techniques to be employed,
it has been tacitly assumed that the failure detection and replacement circuitry would be
combined as much as possible. It has been suggested that these two phases of the repair
function might profitably be separated and made almost completely independent from a circuit
viewpoint. This is another area which should be given careful attention.
The Alpha class strategies have not been thoroughly investigated to determine the
optimum degree of spare overlap (i. e., two sets of spares serving some of the same
functional region). The information from this investigation should influence the design of
new strategy classes as well as indicating the optimum strategy for the Alpha class.
VII. APPENDIX
A. CLASS α
Illustrated in figure A-1 is an α class strategy wherein each spare can "repair"
failures in one row and either of two stages. Spare "1" can "repair" stages 1 or 2;
"2" can "repair" 3 or 4, etc. Each spare can repair failures only in its own row. This
can be expanded such that, for example, three spares can each repair function blocks in any
of ten stages or, in general, r spares for n stages. Overlapping of spares' capability may
help guard against "lumped" failures.
Many different strategies and system repair capabilities can be developed by simply
varying r and n or by overlapping possible individual spare "repair" ranges.
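The non-overlapping assignment in the example (spare 1 serves stages 1 and 2, spare 2 serves stages 3 and 4) can be written as a small mapping. The function name and the span parameter are assumptions for illustration; overlapping ranges would need a different rule.

```python
def stages_served(spare, span=2):
    """Stages that a given spare may repair when ranges do not overlap.

    spare: 1-based spare number within a row
    span: number of consecutive stages each spare can serve
    """
    first = (spare - 1) * span + 1
    return list(range(first, first + span))
```

Changing span varies how many stages each spare covers, which is one of the ways the paragraph above suggests for generating different strategies.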
Figure A-1. Alpha Class Self-Repair
B. CLASS β
There are presently three specific strategies of β Class. The major difference
between these strategies is the number of spare function blocks which can replace a given
failure.
1. Class β1 (Figure A-2)
Class β1 allows only one "spare" for a given failure response. For example,
function block "H" is given capability as a spare for stage #4. Figure A-2a shows the
system before failures occur. When one function block, J, in stage #4 fails, no switching
results other than the elimination of the failure. (See figure A-2b.) When the second failure,
say K, occurs in stage #4, function block "H" will move into stage #4 (see figure A-2c)
and resolve the ambiguity caused by the failure. After the failed block has been eliminated,
block "H" remains in stage #4.
Figure A-2a. Beta Class Self-Repair (operation of Class β1 strategy: system before failure)
Figure A-2b. First Failure (no response)
Figure A-2c. Second Failure Response
It is possible that one function block will remain working alone without system
failure. For example, if function block "G" failed before "K", function block "I" will
carry the load for stage 2 after "H" switches until it fails. (See figure A-3.) System failures
occur when a lone operating function in a stage fails or when no spare is available to resolve
an ambiguity. Failure of this system could occur when function blocks "E" and "G" have failed
and failure of block "H" or "I" occurs (figure A-4), since for this strategy, block "E" is the
only spare capable of "repairing" a failure in stage #3.
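The Class β1 rules described above can be sketched as a small simulation. The sketch rests on several assumptions that are not in the report: stages are numbered from 0, each stage starts with three blocks identified as (home_stage, slot), slot 1 of the preceding stage is the single designated spare, and the first stage has no spare.

```python
def simulate_beta1(failure_sequence, n_stages=5):
    """Return the number of failures withstood before system failure.

    failure_sequence: iterable of (home_stage, slot) block identifiers.
    """
    # location[block] is the stage that the block currently serves.
    location = {(s, k): s for s in range(n_stages) for k in range(3)}
    alive = set(location)
    withstood = 0
    for block in failure_sequence:
        if block not in alive:
            continue  # block already failed; nothing further happens
        stage = location[block]
        working_before = sum(1 for b in alive if location[b] == stage)
        alive.discard(block)
        if working_before == 1:
            return withstood  # a lone operating block has failed
        if working_before == 2:
            # Two blocks now disagree: an ambiguity that needs the spare.
            spare = (stage - 1, 1)
            if stage == 0 or spare not in alive or location[spare] != stage - 1:
                return withstood  # no spare available
            location[spare] = stage  # the spare moves in and stays
        withstood += 1
    return withstood
```

With G, H, I as blocks (2, 0), (2, 1), (2, 2) and J, K as (3, 0), (3, 1), the sequence G, J, K, I reproduces the case in which "I" is left working alone after "H" has switched, and the sequence E, G, H reproduces the case in which no spare is available.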
Figure A-3. Third Failure Response
Figure A-4. Catastrophic Failure Sequence (no spare available)
2. Strategy β2 (Figure A-5)
Strategy β2 is similar to β1, but it allows one additional function block to replace
failures in a given stage. In strategy β2, function block "M", in addition to "H", is
given the capability of replacing failed blocks in stage #4. Strategies β1 and β2 operate
identically through the first two failures. When the third failure in stage #4 occurs, block
"M", if still operative, will switch into stage #4 in the same fashion as did function block
"H" in Class β1. This move is labeled "2nd response" in figure A-5. System failure in
strategy β2 occurs in the same manner and under the same conditions as in strategy β1.
Figure A-5. Beta 2 & 3 Strategy
3. Strategy β3 (Figure A-5)
Strategy β3 extends the scheme one step further. Here, a third function
block is allowed to move in addition to the two responses allowed to strategy β2. In this
strategy the ability is imparted to function block "G" in stage 3 to replace failed blocks in
stage #4. This is the 3rd response shown in Figure A-5. Again, failure occurs in the
identical fashion to the other two strategies.
C. GAMMA (γ) CLASS
Gamma Class is divided into two parts: Class γ1, where all spare function blocks have
the same mobility, and Class γ2, where one spare in each stage has a greater mobility than
the other.
1. Class γ1 (Figure A-6)
As in Beta Class strategies, the first failure in a stage of a Gamma Class system
evokes no response from the system. The second failure creates an ambiguity on the output
of the stage. This activates the switching mechanism to switch block "H" into stage 4, thereby
dissolving the ambiguity. (See figure A-6b.) The second failed block is now identified and
switched out of the system. Block "H" remains in stage 4 to detect subsequent errors. If
another failure occurs in stage 4, for example block "L", block "G" from stage 3 will switch
into stage 4 in the same manner as did block "H". This leaves no error detecting capability
in stage 2. To overcome this, block "E" from stage 2 switches into stage 3 to fill the void created
by the switch of block "G". (See figure A-6c.)
Figure A-6a. Gamma 1 Strategy - First Failure (no response)
Figure A-6b. Second Failure Response (second failure in stage No. 4)
Figure A-6c. Third Failure Response
Figure A-6d. Single Block Operation
Now if a failure should occur in stage 2, say block "D", a spare function block, "B",
from stage 1 will switch to stage 2 and the failed block "D" will be switched from the system.
(See figure A-6d.) As additional failures are sustained this process continues until a limit
is reached. The end to this process can be reached in one of two ways:
1) A limit can be set for the mobility of a particular function block.
In this case, once a function block has reached its limit it can no longer act as a spare for
failures in the stage following it. If a critical failure occurs and all possible spares have
failed or reached their limits the system fails. Voids which cannot be filled due to spares
reaching their limit remain as voids but the system continues to operate until the remaining
function block fails. This limit sequence is illustrated in figure A-7a. Block "A" has a
mobility of 3 and after a given failure pattern the system appears as in figure A-7a. Block "A"
has reached its limit. Upon the occurrence of a critical failure in stage #4, block "A" cannot
act as a spare for this stage. The ambiguity remains on the output of stage 1 and the
system is considered failed. However, if the critical failure occurred in stage 2 rather than
stage 4, block "M", since it hasn't reached its limit, would switch into stage 2 and resolve
the ambiguity. This leaves a void in stage 1. Function block "G" cannot switch into stage 1;
hence, the void remains and the system works properly as long as the remaining block in
stage 1 does not fail.
Figure A-7a. Function Block Limit
2) Another failure mechanism can exist for class γ. When the system
has sustained a large number of failures such that the number of remaining spares is equal to
the number of stages, this second mechanism becomes effective. When an additional
failure occurs, each spare function block will respond once: the initial one will resolve the
ambiguity and the others will fill the successive voids which appear in the immediately preceding
stages. Since there is now one less spare than there are stages, a void must remain somewhere
in the system. If the next failure is in the stage which contains the void or that stage
for which the void would have been a spare, the system goes down. For example, referring
to figure A-7b, if a function block in stage #4 fails, block "D" will switch into #4 to correct for the
failure. Block "A" will fill the void for block "D", block "M" for "A" and block "H" for block
"M". The process stops here. There is a void in stage 5. Now failure in stage 1 or stage
5 will cause system failure. Class γ1 allows uniform mobility to each spare function
block in the system.
h
STAGE NO.
F7 F'7 F-I F'1 F7
LI LI LJ LJ L/
E] E] I-q l-q IZ!I 2 3 4 5
Figure A-To. Marginal Operation
Many different strategies are contained under the heading of Class γ1. These
differ primarily in the limit assigned to the mobility of the spare function blocks. A
particular strategy may be identified by specifying "n" in the statement "n moves per spare."
The value of n prescribes where a given function block will reach its limit and therefore
controls the differences between the various strategies of Class γ1.
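
The Class γ1 switching rules can be sketched as a small simulation. The Python sketch below is illustrative only and is not part of the original study. It assumes that the stages form a closed ring (inferred from the Figure A-7b example, in which block "H" from stage 5 fills the void in stage 1), that a spare is drawn only from the immediately preceding stage, that each stage holds one fixed block in addition to its spares, that a stage left with a single block is a void, and that the void-filling cascade stops when it would return to the stage where the failure occurred. The names `Block`, `Gamma1System`, `make_stage`, and the block labels `X` and `b1` through `b5` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    moves_left: int  # remaining moves; 0 means the block cannot act as a spare


def make_stage(spares, base, n):
    """One stage: spares with n moves each plus one fixed block (assumed layout)."""
    return [Block(s, n) for s in spares] + [Block(base, 0)]


class Gamma1System:
    def __init__(self, stages):
        self.stages = stages   # list of stages, each a list of working Blocks
        self.failed = False

    def _draw_spare(self, target):
        """Move one block from the preceding stage into `target`.

        Returns the donor stage index, or None if no spare is available.
        """
        donor = (target - 1) % len(self.stages)   # ring: stage 1 draws from the last stage
        blocks = self.stages[donor]
        if len(blocks) < 2:
            return None                           # a lone block is never taken
        for b in blocks:
            if b.moves_left > 0:                  # block has not reached its limit
                blocks.remove(b)
                b.moves_left -= 1
                self.stages[target].append(b)
                return donor
        return None

    def fail(self, stage_no, name):
        """Fail block `name` in stage `stage_no` (1-based) and apply the switching rules."""
        if self.failed:
            return
        origin = stage_no - 1
        blocks = self.stages[origin]
        blocks[:] = [b for b in blocks if b.name != name]
        if not blocks:
            self.failed = True        # the lone remaining block of a void stage failed
            return
        if len(blocks) >= 2:
            return                    # first failure in a stage: no response
        donor = self._draw_spare(origin)
        if donor is None:
            self.failed = True        # critical failure with no spare: ambiguity remains
            return
        # Fill successive voids in the preceding stages, stopping before the ring closes.
        while len(self.stages[donor]) == 1 and (donor - 1) % len(self.stages) != origin:
            donor = self._draw_spare(donor)
            if donor is None:
                break                 # void cannot be filled and remains a void


# Marginal operation of Figure A-7b: one spare per stage, n = 3 moves per spare.
system = Gamma1System([
    make_stage(["M"], "b1", 3),
    make_stage(["A"], "b2", 3),
    make_stage(["D"], "b3", 3),
    make_stage(["X"], "b4", 3),
    make_stage(["H"], "b5", 3),
])
system.fail(4, "X")
sizes = [len(s) for s in system.stages]   # stage 5 is left with a void
```

Under these assumptions, a subsequent call such as `system.fail(1, "b1")` or `system.fail(5, "b5")` sets `system.failed`, which corresponds to the statement that a failure in stage 1 or stage 5 will cause system failure.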
2. Class γ2
Unlike the Gamma 1 Class, which assigns the same mobility to all spare function
blocks, Gamma 2 Class allows the two spare function blocks to differ from one another in
mobility. Figure A-8 will assist in the description of the switching processes which occur
for strategy Gamma 2. The members of the top row are assigned a mobility of 3, those of the
middle row a mobility of 2.
The first failure in a stage will evoke no response aside from the elimination of
the failed block from the system. Upon failure of the second function block in a stage (stage 4),
the spare will be drawn from the next stage (stage 3). Block "G", which has the greater mobility,
will switch from stage 3 to stage 4. (See Figure A-8a.) This is the only switch which will
occur. Since there are two function blocks remaining in stage 3, the void created by the
switch will not be filled.

Figure A-8a. Gamma 2 Strategy - First Failure

The next failure occurring in stage 4 will require another spare
to be switched into the stage. This spare is drawn from the next stage which has a spare with
high mobility and which is within range to supply the need; i.e., block "D" from stage 2 will
switch into stage 4. (See Figure A-8b.) This leaves another void which is not filled and which
need not be filled.

Figure A-8b. Gamma 2 Strategy

In the system described in Figure A-8, the next failure in stage 4 cannot
draw the high mobility spare "A", because it is out of range for stage 4. In this case the lower
mobility spare from stage 3, spare "H", is used. This leaves a void in stage 3 which must be
filled, since there is only one remaining operating function block in that stage. This void
is filled as though it were a failure; if a high mobility spare is available it will be switched,
i.e., function block "A" will switch to stage 3. (See Figure A-8c.) This process continues
until either a failure occurs and no spare is available or a lone remaining function block in a
stage fails. System failure occurs at this point.

Figure A-8c. Gamma 2 Strategy - Third Failure Response
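
The Gamma 2 spare selection described above can be sketched in the same way. The Python sketch below is illustrative only and is not part of the original study. The report assigns mobilities of 3 and 2 but does not state how mobility maps to range, so the sketch uses a hypothetical `reach` (the number of stages ahead of its home stage that a block may serve) of 2 for the top row and 1 for the middle row; these values were chosen only because they reproduce the sequence of Figures A-8a through A-8c. The ring arrangement of stages, the names `Block`, `Gamma2System`, `stage`, and all block labels other than "A", "D", "G", and "H" are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    home: int    # 0-based index of the stage the block starts in
    reach: int   # assumed: stages ahead of `home` the block may serve; 0 = fixed


def stage(i, top, mid, bottom):
    """One stage: high mobility spare, lower mobility spare, fixed block (assumed layout)."""
    return [Block(top, i, 2), Block(mid, i, 1), Block(bottom, i, 0)]


class Gamma2System:
    def __init__(self, stages):
        self.stages = stages
        self.failed = False
        self.log = []   # (block name, from stage number, to stage number)

    def _pick(self, target):
        """Choose a donor stage and block for `target`, or return None."""
        n = len(self.stages)
        classes = sorted({b.reach for s in self.stages for b in s if b.reach > 0},
                         reverse=True)
        for reach in classes:                  # high mobility spares are tried first
            for dist in range(1, n):           # nearest preceding stage first
                donor = (target - dist) % n
                if len(self.stages[donor]) < 2:
                    continue                   # a lone block is never taken
                for b in self.stages[donor]:
                    in_range = 0 < (target - b.home) % n <= reach
                    if b.reach == reach and in_range:
                        return donor, b
        return None

    def _switch(self, target):
        pick = self._pick(target)
        if pick is None:
            return False
        donor, b = pick
        self.stages[donor].remove(b)
        self.stages[target].append(b)
        self.log.append((b.name, donor + 1, target + 1))
        if len(self.stages[donor]) == 1:
            self._switch(donor)   # void is filled as though it were a failure, if possible
        return True

    def fail(self, stage_no, name):
        """Fail block `name` in stage `stage_no` (1-based) and apply the switching rules."""
        s = stage_no - 1
        blocks = self.stages[s]
        blocks[:] = [b for b in blocks if b.name != name]
        # No blocks left, or a critical failure for which no spare can be drawn.
        if not blocks or (len(blocks) == 1 and not self._switch(s)):
            self.failed = True


system = Gamma2System([
    stage(0, "A", "m1", "f1"),
    stage(1, "D", "m2", "f2"),
    stage(2, "G", "H", "f3"),
    stage(3, "t4", "m4", "f4"),
    stage(4, "t5", "m5", "f5"),
])
for name in ["f4", "t4", "m4", "G"]:   # four successive failures in stage 4
    system.fail(4, name)
print(system.log)
# [('G', 3, 4), ('D', 2, 4), ('H', 3, 4), ('A', 1, 3)]
```

With these assumed reach values the log follows the text: "G" and then "D" switch into stage 4, "H" is used once "A" is out of range, and "A" then fills the void left in stage 3.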
