Research on failure free systems  final report by unknown
NASA CONTRACTOR REPORT
NASA CR-105
RESEARCH ON FAILURE FREE SYSTEMS
Prepared under Contract No. NASw-572 by
THE WESTINGHOUSE ELECTRIC CORPORATION
Baltimore, Md.
for
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION • WASHINGTON, D. C. • NOVEMBER 1964
https://ntrs.nasa.gov/search.jsp?R=19650001742 2020-03-17T00:20:46+00:00Z
NASA CR-105
RESEARCH ON FAILURE FREE SYSTEMS
Distribution of this report is provided in the interest of information
exchange and should not be construed as endorsement by NASA of
the material presented. Responsibility for the contents resides
in the author or organization that prepared it.
Prepared under Contract No. NASw-572 by
THE WESTINGHOUSE ELECTRIC CORPORATION
Baltimore, Maryland
for
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
For sale by the Office of Technical Services, Department of Commerce,
Washington, D.C. 20230-- Price $3.50
TABLE OF CONTENTS
PURPOSE
SUMMARY
CONCLUSIONS AND RECOMMENDATIONS
Appendix 1 - Design and Testing of Redundant Systems
Appendix 2 - Reliability of Imperfect Redundant Systems
Appendix 3 - A Survey of Components for Adaptive Restoring Circuits
Appendix 4 - Transor Analysis
Appendix 5 - Comparison of Dynamic and Threshold Restorers
Appendix 6 - Self Repair Techniques
PURPOSE
This final report is prepared in accordance with the requirements of Contract NASw-572,
"Research on Failure Free Systems", between the National Aeronautics and Space
Administration and the Westinghouse Electric Corporation (reference WGD-38521). The
research that is reported herein has the general objective of the advancement of the state-
of-the-art in the design of highly reliable electronic systems associated with the national
space effort. The design objectives which are studied are those which permit the proper
operation of systems to be relatively independent of the effects of individual component or
module failures within systems. The scope of this objective includes the use of the more
conventional techniques of multiple-line, majority voted redundancy, as well as the study of
self-repair and advanced voting techniques. The research has been divided into the following
major tasks:
TASK 1: IMPLEMENTATION
TASK 2: ADVANCED VOTING TECHNIQUES
TASK 3: SELF REPAIR TECHNIQUES
SUMMARY
TASK 1 - IMPLEMENTATION.
This portion of the study is concerned with developing suitable circuits, systems, and
testing techniques for use with currently available redundancy techniques. The circuit and
system design is expected to be suitable for general use in spaceborne or ground support
equipment, free from extremely detrimental failure modes, and compatible with whatever
testing techniques are to be applied. The testing techniques are expected to be suitable for
a wide variety of applications. They are, therefore, similarly varied according to the purpose
of the testing, the system configuration involved, and the information which is available
for the test. The testing of redundant systems represents a unique problem, since individual
component or module failures do not indicate their occurrence by affecting the system
performance. The various purposes for testing are indicated by the following types of diagnostic
tests which have been considered:
The verification that all signal-processing elements are working properly, or
additionally that the voters are capable of transmitting a correct signal, or
further, that all signal processors and voters work properly under all possible
design conditions. This may be further extended to include the verification that
any additional hardware which is added for the testing is also capable of proper
operation. This range of test requirements is also encountered when the purpose
of the tests is not only to detect any failures, but to locate these failures to facil-
itate repair or replacement in redundant systems where repair is desired, or
systematic maintenance is used.
Another type of testing is referred to as "statistical measure of quality", which obtains
a limited amount of information concerning the failure pattern existing within the system to
estimate the reliability of the system. Many different types of tests can be used to obtain
this information, depending on the confidence required for the reliability estimate, the cost
of obtaining the information, and the type of analysis which will be applied to that information.
Much of the preliminary work necessary for the determination of suitable circuitry for
redundant systems has been described in an earlier Company A report, "Failure Effects
in Redundant Systems"¹. The report describes in detail the effects of catastrophic component
failures which were induced into a laboratory model of a portion of a typical redundant system.
Potentially serious detrimental failures which might occur are discussed. A major
portion of the report is concerned with the random failure simulations and their results.
Briefly, a computer program generates random failure lists using available reliability data
for each part. Each failure list includes all component failures which might have occurred
in a typical system which had been operated for the specified time interval, and therefore
simulates the actual testing of such a system. The indicated failures are induced into the
system, which is then tested to determine whether it is capable of performing all of its design
functions, or if it has failed. This actual test result can be compared with the analytical
result which would have been obtained with the same group of failures, to test the validity of
the assumptions used for the analytical result. These tests showed that the most common
analytical model is excessively pessimistic for a well-designed system. For these tests it
predicted more system failures than actually occurred by a ratio of more than 2:1. The
reasons for this departure, and more accurate analytical models, are discussed. A new technique
is described which permits the reliability of a redundant system to be estimated by the
product of exponentials, using the failure rates of the components or modules involved.
Finally, several circuit design considerations are discussed.
1. A. R. Helland, W. C. Mann, "Failure Effects in Redundant Systems", Westinghouse
Report EE 3351, March 1963.
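The random failure list procedure described above can be illustrated with a short sketch. It is only a sketch under stated assumptions, not the program used in the earlier report: the order-three stages, the perfect voters, the failure rates, and the rule that a stage is lost as soon as two of its three replicas have failed (taken here as a stand-in for a simple, pessimistic analytical model) are all hypothetical choices made for illustration.

```python
import math
import random

def stage_reliability(p):
    """Probability that at least 2 of 3 replicas survive, each with probability p."""
    return 3 * p**2 - 2 * p**3

def analytic_reliability(rates, t):
    """Assumed simple model: product over stages of the 2-of-3 survival probability."""
    r = 1.0
    for lam in rates:
        r *= stage_reliability(math.exp(-lam * t))
    return r

def random_failure_list(rates, t, rng):
    """One simulated history: (stage, replica) pairs that failed before time t."""
    failures = []
    for stage, lam in enumerate(rates):
        p_fail = 1.0 - math.exp(-lam * t)
        for replica in range(3):
            if rng.random() < p_fail:
                failures.append((stage, replica))
    return failures

def system_survives(failures, n_stages):
    """Assumed rule: a stage is lost once two of its three replicas have failed."""
    failed_per_stage = [0] * n_stages
    for stage, _ in failures:
        failed_per_stage[stage] += 1
    return all(count < 2 for count in failed_per_stage)

def simulate(rates, t, trials=20000, seed=1):
    """Fraction of simulated systems still working at time t."""
    rng = random.Random(seed)
    ok = sum(system_survives(random_failure_list(rates, t, rng), len(rates))
             for _ in range(trials))
    return ok / trials

# Hypothetical failure rates (per hour) for a four-stage system.
rates = [1e-4, 2e-4, 5e-5, 1e-4]
t = 1000.0
print("analytic :", analytic_reliability(rates, t))
print("simulated:", simulate(rates, t))
```

Because the simulated systems here are judged by the same rule as the analytical model, the two figures agree closely. In the laboratory tests described above, the judgment was made on real hardware, which is where the departure from the analytical model appeared.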
The results of implementation studies as part of the research on failure free systems
have been previously published in special technical reports. Two major areas of interest
are discussed in Special Technical Report No. 3, "Circuits and Circuit Testing for Space-
borne Redundant Digital Systems". The entire report is reproduced as Appendix 1 of this
final report. The first portion of the report is concerned with efficient initial design and
contains a discussion of several possible circuit implementations. The latter portion is con-
cerned with the diagnostic testing of a multiple line, majority logic redundant system. Sev-
eral techniques are described for detecting and locating failures within an operating redundant
system to greatly increase reliability. The report is summarized below.
Section I contains a discussion of the general problems concerned with the design and
testing of redundant systems. These problems include the most appropriate choice of circuit
implementation, special design requirements, and the realization of high system reliability
with available circuits.
Section II contains a discussion of the possible use of magnetics to reduce the total
power consumption and provide non-volatile storage in redundant spaceborne systems. Mag-
netics appear to be most useful for applications requiring memory associated directly with
simple forms of logic, or for non-volatile data storage when the data is altered at very slow
rates, but is not recommended for general logic use.
Section III contains descriptions and comparisons of types of semiconductor circuits
suitable for use in redundant systems. Since integrated circuits offer many important ad-
vantages for redundant systems, they are chosen as a basis for system design with semicon-
ductor circuitry. Since custom design of integrated circuits is not especially practical for
low volume operation, the circuit design problem includes the choice of the most suitable
type of available circuits. Integrated Diode-Transistor Logic elements were chosen as the
most appropriate for general use. A majority voter restoring element, which is not subject
to the detrimental failure modes found to be characteristic of conventional elements, is de-
signed using positive logic D-TL NAND elements.
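As an illustration of how a majority function can be formed from NAND elements alone, the sketch below checks the familiar two-level NAND form of a three-input majority. It is not the voter of Appendix 1, whose particular arrangement for avoiding detrimental failure modes is not reproduced here; the function names are hypothetical.

```python
from itertools import product

def nand(*inputs):
    """Positive-logic NAND: output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1

def majority_voter(a, b, c):
    """Two-level NAND form of the three-input majority function.

    NAND(NAND(a,b), NAND(b,c), NAND(a,c)) equals (a AND b) OR (b AND c) OR (a AND c).
    """
    return nand(nand(a, b), nand(b, c), nand(a, c))

# Exhaustive check against the definition "at least two of three inputs are 1".
for a, b, c in product((0, 1), repeat=3):
    assert majority_voter(a, b, c) == int(a + b + c >= 2)
print("two-level NAND voter matches the majority function for all 8 input combinations")
```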
The discussion of Section IV is concerned with the testing of redundant systems. Various
solutions to the problem of failure detection within a redundant system are discussed in this
section; some are more suitable for simple failure detection, others also provide information
concerning the location of any failures. The failure detection tests alone are expected to be
most suitable for initial acceptance and verification tests to indicate that all parts are work-
ing. The combined detection and location techniques are most applicable to systems where
additional information is required to facilitate repair or replacement of individual parts of
the system.
It is shown that failure location and maintenance of a redundant system does not require
the test equipment and operator skill which are usually required to maintain a conventional
non-redundant system. Techniques are described which permit a redundant system to be
systematically maintained to provide much higher operational reliability than possible with-
out maintenance. It is shown that a major portion of the maintenance may be performed dur-
ing normal system operation.
The partial testing of imperfect redundant systems to estimate future reliability is dis-
cussed in part two of Special Technical Report No. 4, "Transor Decision Functions and Sta-
tistical Measure of Quality". The second part of the report is reproduced as Appendix 2 of
this final report.
The objective of this portion of the study has been to develop a test philosophy from
which a good statistical estimate of the probability of mission success could be made from a
limited amount of test data. Several possibilities have been formulated. The failure masking
characteristics of redundant systems prohibit the use of simple test programs which merely
determine the performance capability of the system at the time of test. Such programs
cannot differentiate between systems containing many component failures with correspondingly
many stages vulnerable to succeeding failures, or few component failures with few vulnerable
stages. Because the probability of mission success after the time of test is heavily influenced
by the component failure pattern existing at the time of test, a test program must be devised
from which mission reliability can be predicted with a reasonably high degree of confidence.
The general complexity and microminiature size of modern systems generally precludes the
possibility of testing each signal processor in each stage.
In the proposed extension of this study the various philosophies will be considered in
more detail, and an effort will be made to evaluate the usefulness of each one with the pur-
pose of determining which of the candidate philosophies provides the most accurate estimate
of probability of mission success for a fixed cost of testing.
TASK 2 - ADVANCED VOTING TECHNIQUES
This study is concerned with advancing the state-of-the-art in developing new restoring
circuits for use in redundant systems. Several advanced voting techniques have been studied
as part of the research on failure free systems. The results of the Adaline-Neuron study
and the initial results of the Transor study have been previously published as special techni-
cal reports. Further study of Transor and a new dynamic restorer (the Hamming Distance
Restoring Circuit) has been conducted, but the results have not been previously published.
These results are, however, contained in Appendix 5 of this report.
The results of the study of the Adaline-Neuron adaptive voter with continuously variable
input weighting have been previously published as Special Technical Report Number 1, "A
Survey of Adaptive Components for Use in Failure Free Systems". It is reproduced as Appen-
dix 3 of this report. Briefly, it concludes that suitable analog memory devices are not cur-
rently available for use in this class of adaptive voters, although the mercury cell integrator
with photoelectric readout is apparently the most suitable technique.
Since the Adaline-Neuron adaptive voter requires an analog memory for each input,
the selection of a suitable input device is important to realize a practical adaptive voter.
Several types of analog memory devices were surveyed in order to evaluate their suitability
for use in implementing an adaptive voter for redundant systems. It is desirable that the
devices be simple, reliable, relatively linear, and store the analog variable weighting for a
relatively long time. It was found that most of the available devices which have been de-
veloped for pattern recognition or learning machines are too complex, unstable, or unreliable
for use in adaptive voters.
Devices which were included in the survey were the Device 1
plated resistor, the solion iodine ion cell, the mercury cell integrator (with either
capacitive or photoconductive readout), the MAD magnetic integrator, the orthogonal core
integrator, the second harmonic magnetic integrator, and the magnetostrictive integrator. The
mercury cell integrator with photoconductive readout appears to be the most suitable device
among those which were surveyed. It incorporates an electroplating technique for providing
the continuously variable input weighting for adaptive voters, with relatively good stability,
reversibility, and permanent storage. Since it is a four terminal device with electrical cur-
rent as the input and electrical resistance as the output, it is relatively simple and generally
compatible with conventional circuitry. It is, however, currently in a relatively early state
of its development as a device for general use. It appeared that any detailed circuit design
for adaptive voters should not be undertaken before the expected progress in the development
of more effective cells is accomplished.
The proposed continuation of the development of this class of adaptive voters includes
monitoring the state of the art in the development of more effective devices, followed by the
design and breadboard construction of at least one Adaline-Neuron adaptive restorer, or pre-
ferably a small redundant subsystem using these restorers, in order to demonstrate their
effectiveness in redundant systems.
The objective of the Transor study portion of the research was to evaluate the Transor
Restoring Circuit for possible use as a replacement for threshold voters in redundant systems.
In the process of performing this evaluation, another dynamic restorer, the Hamming Dis-
tance Restoring Circuit, was invented. The study was extended to include an evaluation of
both circuits.
The initial portion of this study has been reported in part one of Special Technical Re-
port No. 4, "Transor Decision Functions and Statistical Measure of Quality" which is repro-
duced as Appendix 4 of this final report. In that report, analytical reliability expressions
for systems using Transor restorers are obtained for the case when signal processors are
restrained by certain failure mode assumptions. An appendix to that report shows how the
probability of occurrence of various failure modes might be computed. The results of later
portions of this work are presented in Appendix 5 of this final report. In these results, gen-
eral reliability expressions for the Transor and the Hamming Distance Restoring Circuit are
obtained which are relatively free of restrictive assumptions. A computer simulation pro-
gram which was developed for use in the evaluation, is described and some results obtained
from the program are discussed. Finally, the conclusion is drawn that the Hamming Dis-
tance Restoring Circuit is always superior to the Transor but that it is as good as or better
than the threshold voter only in certain failure mode environments.
TASK 3 - SELF REPAIR TECHNIQUES.
This study is concerned with the development of new, more efficient means for employ-
ing redundant equipment. Using these techniques, a system may be designed to absorb more
internal failures without system failure than is possible with the same amount of fixed,
multiple-line redundant equipment. The results of this study have been previously published
as Special Technical Report No. 2, "Self Repair Techniques for Failure Free Systems".
The report is reproduced as Appendix 6 of this final report.
As a part of the effort to develop hyper-reliable systems, Company A has devised a
class of techniques for using redundant blocks of circuitry more effectively than has been
done previously. The systems using these techniques are similar to the familiar multiple-line,
majority-voted redundant systems, except that blocks of circuitry are allowed to shift around
as component failures leave certain subsystem functions more vulnerable than others to succeeding
failures. The object of this phase of the study has been to devise several general patterns
in which systems could be organized to absorb relatively large numbers of internal failures
without system failure and to develop a means for evaluating the effectiveness of the various
patterns for performing this function.
Three broad classes of organization patterns have been developed, and several specific
patterns within each class have been examined. A versatile computer simulation program
has been written from which approximate reliability vs. time curves and a variety of other
pertinent information about each pattern can be directly obtained. Both the patterns which
have been developed and the computer program are described in detail in Appendix 6.
A three-part program has been proposed for future study in this area. In the first part,
the computer simulation program will be used as an evaluation tool for establishing a set of
rules for designing optimum or near-optimum self-repairing systems. The rules will be primarily
concerned with the organizational patterns to be used and with the maximum allowable
ratio of repair circuitry complexity to signal processor complexity. Secondly, an implementa-
tion study has been proposed to determine effective means for implementing the organization
patterns which have been and will be devised. Finally, an appropriate study vehicle will be
selected and designed in sufficient detail that a breadboard model could be constructed
from the specifications produced. Such a vehicle design is required in order to verify the
usefulness of both the organizational pattern theories and the implementation techniques
which are being developed.
CONCLUSIONS AND RECOMMENDATIONS
TASK 1 - IMPLEMENTATION
1. Design of Redundant Systems
Redundancy is a powerful tool for achieving extended reliability, but effective design
is required to achieve the reliability goals with a minimum of additional complexity. Although
magnetic logic is often cited as having several advantages applicable to spaceborne computers,
the use of magnetic logic is limited to special applications. Magnetic logic is not particularly
suited for general logic use in redundant systems, due to the lack of steady output signals,
low speed capability, high peak power requirements, and the complexity required for general
logic functions. It appears that no proven magnetic restoring element exists which is suitable
for general use in redundant systems. Magnetic logic does, however, offer non-volatile
storage and very low average power for slow speed operation. Magnetic devices appear to
be suited to special applications where certain logic functions, such as transfer and OR,
are intermixed with the memory function, and very low speed capability is acceptable. It is
useful for low speed shift registers, counters, and timers which consume negligible stand-
by power.
Integrated semiconductor circuitry offers many desirable characteristics for use
in redundant spaceborne systems, including small size, reduced weight and power consump-
tion and high frequency capability. A comparison of the currently available integrated logic
elements indicates that diode-transistor logic (D-TL) is the most suitable for general logic
use in redundant spaceborne systems. A majority voting restorer, designed using inter-
connected NAND elements, has been described which is not subject to the detrimental failures
of more conventional restoring elements.
2. Testing of Redundant Systems
It is a characteristic of redundant systems that they offer a high reliability for a
period of time after the initially failure free condition, and that the system reliability decreases
rapidly when internal failures are present. It is therefore important to insure that no initial
failures exist in a redundant system to obtain maximum system reliability. Since an initially
failure free, order three system can withstand any single failure, as well as a relatively
large number of randomly scattered failures, it offers very high reliability for the period
of time when the probability of individual failures is low. Techniques are described which
permit even higher reliability by the use of systematic maintenance of a redundant system.
It has been shown that a relatively simple technique called singular rank testing
may be used to determine that all of the replicated signal processors in a redundant system
are working properly, and that the majority voters are sufficiently failure free to insure that
the system is not vulnerable to single failures. The system is monitored to determine if each
individual rank is able to perform all system functions correctly, in a manner similar to the
verification of a non-redundant system. This testing places no restrictions on system size
or configuration. A somewhat more complicated testing procedure, referred to as interwoven
rank testing, has been described which will completely test all voters to insure that they will
make correct decisions for all possible input combinations.
Although a redundant system is more complex than its conventional counterpart,
failure location within a working system does not require the operator skill and simulation
equipment usually required to locate failures in a non-redundant system. Since a working
redundant system always has at least one correct signal available at each stage in the system,
these correct signals may be used as a basis of comparison. A difference detector on the
signal processor outputs to restorers may be used to indicate either permanent or sporadic
failures among these signal processors. The failure location techniques described may be
performed during normal operation, since they do not jeopardize system operations.
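The comparison principle described above can be sketched as follows. This is an illustrative sketch only: taking the majority of the replica outputs as the reference signal, and counting disagreements per replica over a sequence of samples, are assumptions made here, and the function names are hypothetical rather than taken from Appendix 1.

```python
def majority(bits):
    """Reference signal: the value held by more than half of the replicas."""
    return int(sum(bits) * 2 > len(bits))

def difference_detector(replica_outputs):
    """Return the indices of replicas whose output differs from the majority."""
    reference = majority(replica_outputs)
    return [i for i, bit in enumerate(replica_outputs) if bit != reference]

def locate_failures(samples):
    """Count disagreements per replica over a sequence of sampled outputs.

    A replica that disagrees repeatedly suggests a permanent failure;
    an isolated disagreement suggests a sporadic one.
    """
    counts = {}
    for outputs in samples:
        for i in difference_detector(outputs):
            counts[i] = counts.get(i, 0) + 1
    return counts

# Hypothetical samples from one order-three stage during normal operation.
# Replica 2 is stuck at 1; replica 0 shows a single sporadic error.
samples = [(1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 1)]
print(locate_failures(samples))
```

Since the detector only observes signals that are already present at the restorer inputs, a check of this kind does not disturb the signal path, which is consistent with the statement that failure location may proceed during normal operation.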
3. Reliability of Imperfect Redundant Systems
The mission reliability of an operating redundant system which contains internal
failures depends strongly on the number and location of initial circuit failures, as well as
the failure rates of the circuits which make up the system.
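The dependence of mission reliability on the initial failure pattern can be made concrete with a small calculation. The sketch assumes order-three stages with perfect voters, identical and independent replicas with a constant failure rate, and the worst case in which a failed replica always delivers a wrong signal; these assumptions and all numerical values are hypothetical and are not taken from Appendix 2.

```python
import math

def stage_mission_reliability(failed, p):
    """Probability that one order-three stage survives the mission.

    failed: replicas already failed at the start of the mission.
    p: probability that a working replica survives the mission.
    """
    if failed == 0:
        return 3 * p**2 - 2 * p**3   # any two of the three must survive
    if failed == 1:
        return p**2                  # both remaining replicas must survive
    return 0.0                       # the stage has already lost its majority

def mission_reliability(initial_failures, lam, mission_time):
    """Product of stage reliabilities, given the initial failures per stage."""
    p = math.exp(-lam * mission_time)
    r = 1.0
    for failed in initial_failures:
        r *= stage_mission_reliability(failed, p)
    return r

lam, mission_time = 1e-4, 500.0      # hypothetical failure rate (per hour) and mission length
clean = mission_reliability([0] * 10, lam, mission_time)
scattered = mission_reliability([1] * 3 + [0] * 7, lam, mission_time)
print("no initial failures       :", clean)
print("three stages with one each:", scattered)
```

Both systems in this example would pass a functional test at the start of the mission, since every stage still has a working majority, yet their mission reliabilities differ noticeably.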
One very important task is the design of simple and efficient tests to be performed
at the beginning of a mission. These tests are required to obtain the information required for
the reliability estimates. A maximum amount of information is desired from a minimum
number of tests. The work which has been done will provide a basis for future efforts in
this area.
Several tests are proposed that may be made just before a mission is to begin to
determine, at least approximately, the mission reliability without complete information on
the state of the system. Procedures are proposed for using the results of the tests to
estimate the mission reliability with varying degrees of accuracy. A procedure for making
the decision on the useability of the system without estimating the mission reliability is also
presented.
Although a basis for future study has been provided, the details of these procedures
are still to be worked out and the accuracy of their results is still uncertain. It is recommended
that efforts be made to develop an appropriate measure for comparing the techniques
so that they may be evaluated relative to a common scale.
TASK 2 - ADVANCED VOTING TECHNIQUES
1. Components for Adaptive Restorers
A survey has been conducted of several devices which are potentially suitable for
use in the Adaline-Neuron adaptive voter. The survey concludes that none of the suggested
devices were sufficiently developed to justify the immediate circuit implementation of an
adaptive voter.
In general, magnetic devices do not appear to be suitable for use in adaptive voters,
due to their environmental sensitivity and the complexity required for useful operation. Similarly,
electrochemical devices do not appear to have sufficient simplicity, stability and compatibility
with electronic circuitry to justify their use in adaptive voters.
The mercury cell integrator with photoelectric readout appears in principle to offer
the most attractive approach because of its simplicity, stability and general compatibility
with conventional circuitry. Since the output is essentially a variable resistance proportional
to the integral of the control input current, the device offers the possibility
of providing a simple interface with standard circuitry. The mercury cell integrator
is, however, still in a rather primitive state of development. It is recommended
that detailed circuit design not be undertaken until further device development is
completed, that present effort on the design of an adaptive voter be restricted to
monitoring the state of the art in device development, and that detailed circuit design begin
when more suitable devices become available.
2. Threshold and Dynamic Restorers
The majority voting class of threshold restorers are the most commonly used
restorers in present technology. Because the majority voter requires a majority of correct
inputs to provide a correct output, its error-correcting capability is limited. Since many
circuit failures result in steady-state outputs, restorers which detect only changes in input
states offer the capability of deriving a correct output with less than a majority of working
inputs. Restorers which detect changes in input states are referred to as dynamic restoring
circuits.
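The principle that a change-detecting restorer can recover a signal from fewer than a majority of working inputs can be illustrated with a toy model. The sketch below is not the Transor or the Hamming Distance Restoring Circuit, whose designs are given in Appendices 4 and 5; the rule used (follow whichever lines changed, hold the previous output otherwise) is a hypothetical simplification that is only valid when failed lines are stuck and never make transitions.

```python
def majority_restorer(prev_inputs, inputs, prev_output):
    """Threshold restorer: output follows the majority of the present inputs."""
    return int(sum(inputs) * 2 > len(inputs))

def change_following_restorer(prev_inputs, inputs, prev_output):
    """Toy dynamic restorer: only lines that changed state are allowed to vote.

    Stuck lines never change, so they never vote. If no line changed, or the
    changed lines are evenly split, the previous output is held.
    """
    changed = [new for old, new in zip(prev_inputs, inputs) if old != new]
    ones = sum(changed)
    zeros = len(changed) - ones
    if ones > zeros:
        return 1
    if zeros > ones:
        return 0
    return prev_output

def run(restorer, input_sequence, initial_output=0):
    """Apply a restorer to successive input samples and collect its outputs."""
    outputs = []
    prev_inputs = input_sequence[0]
    out = initial_output
    for inputs in input_sequence:
        out = restorer(prev_inputs, inputs, out)
        outputs.append(out)
        prev_inputs = inputs
    return outputs

# One working line carries the signal; the other two lines are stuck at 1.
signal = [0, 1, 1, 0, 1, 0]
lines = [(s, 1, 1) for s in signal]
print("signal          :", signal)
print("majority        :", run(majority_restorer, lines))
print("change-following:", run(change_following_restorer, lines))
```

In this example the initial output is assumed to be correct. In general a restorer of this kind has no basis for its output until the first transition arrives on a working line, and a failed line that does make transitions can mislead it.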
The mission of this part of the Failure Free Systems Study has been to evaluate
the potential usefulness of one proposed dynamic restoring circuit implementation, the Transor.
The results of Section IV have shown that there are certain environments in which Transor
can be used to advantage in improving system reliability. For example, the maximum error
restoring capability of Transor is shown to be R-1 failures of R redundant lines in an environment
free from transitional failures. This is a significant improvement over the majority
threshold restoring capability under the same conditions. There is need for caution, however,
for in environments where symmetrical transitional errors are possible, error correlation
may make Transor performance inferior to threshold.
During the course of the study of Transor Restoring Circuits, a new class of
restoring circuits was conceived. This class, called "Hamming Distance Restoring Circuits"
is similar to Transor in many ways. It was compared with Transor analytically and by
simulation. From the results obtained by manipulating the analytical reliability expressions
for the Transor and Hamming Distance Restoring Circuits, it may be concluded that the
output of a Hamming Distance Circuit is more reliable than that of the Transor in order-five
redundant systems. This conclusion holds for any ratio of steady-state to transient error
probability or any asymmetry (tendency toward "ones" or "zeros") of error probabilities.
From comparison of the simulation curves, it may be concluded that the threshold
circuit is more reliable than either of the dynamic restoring circuits until the ratio of the
probability of steady-state errors to the probability of transient error exceeds approximately
seven to one. Above this ratio, the dynamic restoring circuit outputs are more reliable.
Further comparison reveals that the difference in the reliability curves tends to stabilize or
slightly decrease as the ratio becomes much larger than 7:1. The stabilizing effect is more
pronounced as the order of redundancy is increased from five to seven.
Also, it may be concluded that in the early life, high reliability region with
approximately a seven to one probability ratio, an order five system using Hamming Distance
Restorers may be as reliable as an order seven system using threshold voters.
Since the improvement available from Transor is limited, and since the Hamming
Distance Restorer is normally superior, further study of the Transor is not justified.
TASK 3 - SELF REPAIR TECHNIQUES
Before self-repairing systems can be implemented, many feasible switching strategies
must be considered in an effort to determine the most effective manner to manipulate the
redundant or "spare" blocks. The extreme complexity of the reliability expressions associated
with these strategies has resulted in the use of a computer simulation program for comparing
the effectiveness of the strategies. The present program includes subroutines for three
classes of switching strategies. Each class subroutine contains a great deal of flexibility,
thereby including many individual strategies. This method facilitates easy comparison
between members of a class. This comparison allows immediate elimination of many
possible strategies which are obviously uneconomical. For example, the flattening out of the
Percent of System Failed versus Spare Mobility curves indicates that none of the strategies
on the flat part of the curves can be optimum strategies.
From the results of the simulation program, curves for Percent of Systems Failed
versus Spare Mobility have been plotted for the Gamma Class Strategies. These curves have
been referenced to that of a multiple-line majority voted system because this particular
technique has been the most effective of the passive, failure masking, circuit level redundancy
techniques. In all cases these curves show not only that great gains can be realized over the
multiple-line redundant configuration, but that by far the greatest part of these gains is
realized for the first few moves allowed to the spare function blocks. Beyond the range of
relatively limited mobility, little or no gain in the average number of failures absorbed is
realized by the additional mobility allowed to the spares. This is an encouraging result
since the great majority of the gain due to self-repair can be retained without the use of an
exorbitant amount of switching circuitry.
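The flavor of such a comparison can be conveyed by a toy simulation. It is not the program of Appendix 6 and does not implement the Alpha, Beta, or Gamma class strategies. The model is hypothetical throughout: stages are arranged in a ring, each stage has three active blocks and two idle spares, "mobility" is taken to mean how many stages away a spare may be moved, the next failure strikes any live block with equal likelihood, and the switching is perfectly reliable.

```python
import random

def nearest_spare(spares, stage, mobility):
    """Index of the closest stage holding an idle spare within reach, else None."""
    n = len(spares)
    for distance in range(mobility + 1):
        for candidate in ((stage + distance) % n, (stage - distance) % n):
            if spares[candidate] > 0:
                return candidate
    return None

def failures_absorbed(n_stages, mobility, rng, spares_per_stage=2):
    """Block failures survived before some stage drops below two working blocks."""
    active = [3] * n_stages
    spares = [spares_per_stage] * n_stages
    absorbed = 0
    while True:
        # The next failure strikes one live block, active or idle, at random.
        live = [("active", s) for s in range(n_stages) for _ in range(active[s])]
        live += [("spare", s) for s in range(n_stages) for _ in range(spares[s])]
        kind, stage = rng.choice(live)
        if kind == "spare":
            spares[stage] -= 1
        else:
            active[stage] -= 1
            donor = nearest_spare(spares, stage, mobility)
            if donor is not None:        # perfect switching is assumed
                spares[donor] -= 1
                active[stage] += 1
        if active[stage] < 2:
            return absorbed
        absorbed += 1

def fixed_majority_absorbed(n_stages, rng, order=5):
    """Same amount of equipment used as fixed order-five majority-voted stages."""
    working = [order] * n_stages
    absorbed = 0
    while True:
        live = [s for s in range(n_stages) for _ in range(working[s])]
        stage = rng.choice(live)
        working[stage] -= 1
        if working[stage] * 2 < order:   # fewer than three of five still working
            return absorbed
        absorbed += 1

def mean(values):
    return sum(values) / len(values)

rng = random.Random(0)
trials, n_stages = 500, 10
print("fixed order-five:", mean([fixed_majority_absorbed(n_stages, rng) for _ in range(trials)]))
for mobility in range(6):
    runs = [failures_absorbed(n_stages, mobility, rng) for _ in range(trials)]
    print("mobility", mobility, ":", mean(runs))
```

Under these assumptions the printed averages give a rough picture of how the number of failures absorbed varies with the reach allowed to the spares. The figures carry no weight beyond the toy model itself.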
All of the computer simulation results have been based on the assumption that the
switching circuitry was perfectly reliable. There is a need to determine the range of allowable
failure rates which can be associated with each strategy for it to be of maximum effectiveness.
These ranges should be studied as a function of the failure rates of the associated signal
processor blocks. As a result, information specifying the optimum switching strategy
corresponding to a given signal processor failure rate should be available before actual system
designs are begun.
It has become obvious that many of the spare function blocks do not experience as many
switching operations as they are capable of performing. When all spares are assigned mobility,
those which use their mobility extend the life of the system substantially. However, in many
cases when system failure has occurred, there are many spares remaining which have not
been used to any great extent. In order to try to capitalize on this phenomenon, a class of
strategies should be investigated which would assign different mobilities to the spares in a
stage.
The curves show a very definite gain in reliability for the self-repair strategies over
multiple-line redundant systems. The curves for the Beta Class strategies show an increase
in reliability for each increase in "repair" capability. Strategy Beta-3 yields the highest
reliability but even strategy Beta-1 shows a significant gain over the multiple-line system.
The reliability curves for the Gamma Class show essentially the same result with respect to
the multiple-line case. However, investigation of the curves shows that increasing the "repair"
capability produces gains for the first few increases, after which the magnitude of the gain
diminishes. These curves tend to bear out the conclusions drawn from Percent System
Failed versus Spares Mobility curves which flattened out after a certain mobility was reached.
The gains illustrated here must be considered as ideal because the switching circuitry for
self-repair is here assumed to be perfectly reliable. More realistically, the gains obtainable
will be a function of the switching circuit complexity and will not be as great as shown here.
Although little has been said about the physical switching techniques to be employed, it
has been tacitly assumed that the failure detection and replacement circuitry would be
combined as much as possible. It has been suggested that these two phases of the repair
function might profitably be separated and made almost completely independent from a
circuit viewpoint. This is another area which should be given careful attention.
None of the strategies considered so far have permitted spares to return to previous
locations. It is possible that removal of this restriction might add to the failure absorption
capability of a system. This area certainly should be explored further.
The Alpha Class strategies have not been thoroughly investigated to determine the
optimum degree of spare overlap (i.e., two sets of spares serving some of the same
functional region). The information from this investigation should influence the design of new
strategy classes as well as indicating the optimum strategy for the Alpha class.
In general, investigations to date have shown that self-repair techniques can be much
more powerful than presently available redundancy techniques. Further studies are expected
to show effective ways to apply the techniques to real equipment needs.
Appendix 1
DESIGN AND TESTING OF REDUNDANT SYSTEMS
by
H. Brinker
A. R. Helland
September 1963
ABSTRACT
This report describes the results of the study on the implementation of majority logic
redundancy. Most of the work concerns spaceborne systems, but some portions are more
applicable to ground support equipment. The report is concerned with the initial design
of the system as well as the testing of redundant systems.
The possible use of magnetic logic to reduce the total power consumption and provide
non-volatile storage is discussed. Magnetics seems to be most useful for non-volatile
memory and simple forms of logic where the data rate is very low. Various types of
semiconductor logic are described and compared for use in redundant systems. Integrated
Diode-Transistor Logic elements are chosen as the most suitable for general use.
Several methods of testing redundant systems are discussed and described in the section
on detection and location of failures. Various solutions to the failure detection problem
are discussed in this section. Some are more suitable for simple failure detection; others
also provide information concerning the location of any failures. It is shown that
maintenance of a redundant system greatly increases system reliability and reduces the
test equipment and operator skill which are usually required to maintain a conventional
system. Techniques are described which permit a major portion of the maintenance to be
performed during normal system operation.
TABLE OF CONTENTS
I. INTRODUCTION
II. MAGNETIC LOGIC
   A. Introduction
   B. Dynamic Storage and Sequential Logic
   C. Hybrid Devices
   D. All-Magnetic Logic
   E. Summary and Conclusions
III. SEMICONDUCTOR LOGIC
   A. Introduction
   B. Classification of Basic Types of Logic
   C. Comparison of Logic Types
   D. Description of Logic Types
   E. Logic Selection
   F. Majority Voter Design
IV. FAILURE TESTING OF REDUNDANT SYSTEMS
   A. Introduction
   B. Singular Rank Testing
   C. Interwoven Rank Testing
   D. Circuit Implementations
V. SUMMARY & CONCLUSIONS
LIST OF FIGURES
Figure   Title
1        OR Gate
2        Negation
3        Block Diagram, AND Function
4        SRI MAD Shift Register
5        Device 2 Flux States
6        Device 2 Shift Register
7        R-TL Resistor-Transistor Logic (+NOR)
8        DC-TL Direct Coupled-Transistor Logic (+NOR)
9        R-DC-TL Resistor-Direct Coupled-Transistor Logic
10       NS-DC-TL Non-Saturated-Direct Coupled-Transistor Logic
11       D-TL Diode-Transistor Logic (+NAND)
12       NS-D-TL Non-Saturated-Diode-Transistor Logic
13       T-TL Transistor-Transistor Logic
14       Speed-Power Performance
15       Majority Element with Input Isolation
16       Reliability of Conventional vs. Redundant Systems
17       Singular Rank Testing
18       Interwoven Rank Testing
19       Interwoven Rank Testing
20       Signal Processor Output Control
21       Difference Detector
I. Introduction
Past studies of redundancy techniques and consideration of the basic characteristics of
some redundancy techniques have yielded interesting insights and problems. Many of these
considerations are in the area of engineering method. Others concern the design of redundant
systems with high reliability and other desirable characteristics. This section is intended
to review some of these considerations and to preview some of the thoughts behind the
discussion in later sections.
The report itself deals primarily with some of the problems which are encountered in
designing and testing useful redundant digital systems. Some of these problems are at least
comparable to non-redundant design; others are rather unique to redundant systems. Possible
solutions for these problems, as well as more detailed problem descriptions, are contained
in appropriate sections of the report.
Circuit and system design must reflect the fact that redundancy is only a tool to realize
reliability. The proper use of redundancy is often a more efficient and powerful technique
to realize a reliability requirement than are the more conventional techniques such as
conservative design or component selection. Redundancy is, however, most powerful when
used in conjunction with techniques that increase basic reliability.
It is important to recognize that a redundant system is expected to operate with relatively
large numbers of random failures. Since conventional systems usually fail when any of their
parts fail, it is relatively unimportant what effects these failures have, except when
repair is desired.
Circuits for redundant systems, however, must be designed so that the effects of individual
component failures are minimized, and usually limited to the circuits in which the failure
occurs. This does not imply, however, that redundancy includes "useless" parts. Each part
of the system must contribute to the assurance that the system will perform all of its
functions properly.
The use of redundancy will alter the characteristics and performance of the system.
Redundancy will usually increase design complexity, power requirements and dissipation,
signal propagation time, size and weight, number of interconnections, and initial cost.
Redundancy, therefore, emphasizes the need for continuing development of low-power
circuitry, micro-miniaturization, and interconnection techniques. The type of circuitry
which is used to implement a redundant system must be carefully chosen to meet the system
requirements without incurring excessive costs. Whenever there is a need for high
reliability, the circuitry should be chosen to have a high basic reliability, low
sensitivity to parameter variations, and low power dissipation to minimize temperature
stress. In addition, specific systems have special requirements which must be considered
in the system design as well as the choice and design of the circuitry. For example, the
total available power is often severely limited for spaceborne equipment, although the
processing rate is usually quite low. It is usually desirable to provide some means of
testing to verify that all parts of the redundant system are working to insure that all of
the reliability initially designed into the system is available for the duration of the
mission. The system and the circuitry therefore must be designed so that accurate and
meaningful tests may be applied to verify that the parts are working. When extended
lifetime is desired and repair is possible, a redundant system may be systematically
repaired to greatly increase the expected time between system failures. If a system is
completely repaired prior to each mission in which it is used, it will exhibit the high
mission reliability characteristic for each mission. Such systems must be designed so that
complete, efficient tests may be periodically applied to these systems which will verify
that all the parts are working properly, or that will facilitate maintenance procedures
which will return the system to the initially perfect condition. It is important for this
type of maintenance that all failures be detectable; otherwise these undetectable failures
will tend to accumulate. These accumulated failures will eventually tend to dominate the
system behavior by causing additional system failures.
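The effect of repair before each mission can be illustrated numerically. The following is a minimal sketch only, assuming a triplicated function with a perfect majority voter and exponentially distributed module lifetimes; the failure rate, the mission length and all function names are hypothetical and are not taken from this study.

```python
import math

def module_reliability(failure_rate, hours):
    """Probability that one module survives `hours` (exponential lifetime assumed)."""
    return math.exp(-failure_rate * hours)

def voted_triple_reliability(r):
    """Three copies and a perfect majority voter: works if at least 2 of 3 copies work."""
    return 3 * r**2 - 2 * r**3

def mission_reliability(mission_index, failure_rate, mission_hours, repaired):
    """Probability of completing mission number `mission_index` (1-based), given that
    the system was still working when that mission began."""
    if repaired:
        # Every failed module is replaced before launch, so each mission starts as new.
        return voted_triple_reliability(module_reliability(failure_rate, mission_hours))
    # Without repair, module failures accumulate from one mission to the next.
    end = voted_triple_reliability(
        module_reliability(failure_rate, mission_index * mission_hours))
    start = voted_triple_reliability(
        module_reliability(failure_rate, (mission_index - 1) * mission_hours))
    return end / start

if __name__ == "__main__":
    RATE, HOURS = 1e-4, 1000.0   # hypothetical values, for illustration only
    for k in range(1, 6):
        print(k,
              round(mission_reliability(k, RATE, HOURS, repaired=True), 4),
              round(mission_reliability(k, RATE, HOURS, repaired=False), 4))
```

Under these assumptions the repaired system shows the same reliability on every mission, while the unrepaired system degrades from the second mission onward as undetected failures accumulate.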
Many failures may be detected as they occur in a redundant system. These may be repaired
while the system is in operation to obtain a very low system failure rate compared to the
failure rate for the parts of the system. Periodic maintenance must be performed in
addition to the continuous monitor and repair described above to detect those failures
which cannot be detected during regular operation of the system.
Systems which will be maintained must therefore be designed both with the capability for
detecting all failures and facilitating the maintenance and repair procedures. With proper
design, many of these failure detection, maintenance and repair procedures may be
accomplished during operation of the system.
The following sections of this report will discuss the problems associated with circuit
design, choice of the type of circuitry, failure detection, and maintenance of redundant
systems. This report describes the results of the study of these problems and possible
solutions. The results are summarized in the Summary and Conclusions section of this report.
II. Magnetic Logic
A. Introduction
The past decade has witnessed the development of a variety of mag-
netic devices suitable for performing storage and logic in digital com-
puters. Perhaps the most important application of magnetics to digital
technology has been provided by the development of large capacity, random
access memory systems composed of ferrite cores. Advances in techniques
for performing logic have received some attention, but to date magnetic
logic does not appear to be widely accepted as a superior replacement for
the conventional transistorized counterpart. This general reluctance to
utilize the special attributes of magnetic logic is often justified by
several difficulties inherent to the device characteristics and system
configuration.
Much of the magnetic logic research has been motivated by the
potential ability of magnetic devices to provide higher reliability at
lower cost while consuming negligible standby power. These attributes are
understandably important in any large electronic system, especially in space
applications where reliability must be high and available power is invariably
low. To evaluate the potential ability of magnetic logic schemes to
provide these advantages a discussion of some of the more promising approaches
appears to be in order. An all inclusive survey and treatment of the
myriad of suggested approaches could easily fill a book.* It appeared reasonable
therefore to restrict the detailed discussion to the more popular approaches and to
provide references for others. Of particular interest are those devices which utilize
magnetic components which are either commercially available or in an advanced state
of development.
* Meyerhoff, A. J. (ed.), Digital Applications of Magnetic Devices, New York: John Wiley
and Sons, Inc. (1960).
B. Dynamic Storage and Sequential Logic
The state of a magnetic device is determined by the direction of remanent flux.
Information stored is not directly accessible and a clock or read pulse must be used to
determine the state. The read process in most schemes also destroys the information which
was stored. An output signal is available only for that portion of the read cycle during
which dynamic flux change is in progress and thus level output and asynchronous operation
are not obtainable. The ripple-carry binary counter, the parallel adder, and many familiar
digital configurations are not directly amenable to magnetic implementation. In contrast,
the powerful combinational logic approach utilized in conventional computers consists of a
cascade of compatible logic modules which form complex functions simultaneously during
the interim between clock pulses. In a magnetic logic machine using dynamic logic this is
not possible and operations involving OR, AND, transfer, buffering, negation and delay
require several clock periods to generate a particular function. This step by step process
usually consumes considerable time which may be further extended if the magnetic logic
modules are limited in fan-in and fan-out and thus require additional operations.
C. Hybrid Devices
The principle involved in using square loop material to store a remanent flux has been
known for some time. With the development of small toroidal structures employing sintered
ceramic ferrites and ferromagnetic tape materials, magnetic devices began to demonstrate
practical utility. The magnetic shift register has received the most attention primarily
because of its general utility and simple configuration and has been the subject of much
of the magnetic literature. Although the shift register plays an important part in most
digital systems, several additional devices are required in order to provide the variety
of logical operations required by typical computer systems.
The task of performing general logic requires circuitry capable of
being arranged to perform any Boolean output function of a set of input
variables. In order to provide this operation a complex function is usually
formed by using logic modules to perform OR, AND, negation, storage, delay,
etc. If gates are to be connected in various configurations the devices
used must provide a clearly identifiable "1" and "0" state, unilateral
information transfer and the capability for fan-in and fan-out. To meet
these requirements with magnetic devices has not been an easy task.
A major difficulty which impeded rapid development of devices to meet these requirements
has been the inherent bilateral nature of simple magnetic structures. In the early devices
this was largely overcome by combining diodes with simple toroids to achieve unilateral
information flow. Obvious limitations in impedance levels, fan-in and fan-out drive
capabilities necessitated in many cases the further inclusion of resistors for tailoring
impedance levels, capacitors for temporary storage and transistors for power gain.
Although this hybrid logic approach led to the development of a number of clever magnetic
devices, the potential of achieving high reliability at low cost is seriously challenged
by the requirement for using non-magnetic components and the more complex wiring and
system organization which becomes necessary. An excellent survey of a wide variety of
hybrid devices has been provided by Haynes.1 One such approach, parallel transfer
core-diode logic, will be used as a vehicle for describing the principles of dynamic logic
and to indicate the operation of a typical practical device.
Shown in figure 1 is the OR gate, the simplest of logical functions which may be
implemented with magnetic cores and diodes. The _ and O notations denote cores of the same
rank, i.e. threaded by a series connected, current driven clock line. The two phase clock
system effects readout and transfer of data by driving the core to the "0" state. If a
core was previously in the "1" state the clock, in driving the core to the "0" state,
causes the core to switch and provides an output sufficient to drive the next core to the
"1" state. If a core was previously in the "0" state a negligibly small output occurs when
the clock drive is applied. Diodes are shown to prevent output loading when a core is
being set. Additional components such as resistors for tailoring impedance levels and
diodes to prevent reverse data transfer may be required in a practical design. It should
be noted also that the core output windings must contain more turns than core inputs in
order to allow a transmitting core to set a receiving core, which also tends to prevent
reverse data transfer.
Figure 1 OR Gate (input cores X and Y, output X+Y, clock A and clock B)
Operation is initiated by reading inputs X and Y into the cores. The phase A clock then
transmits the state of each of the input cores into a dual winding storage core. If the
storage core was set by any of the transmitting input cores, a readout signal is generated
when the storage core is reset by the phase B clock.
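The two phase operation just described may be sketched behaviorally. The sketch below is an illustration only: it treats each core as an ideal one-bit store, ignores diodes, turns ratios and drive tolerances, and uses names (`Core`, `or_gate`) that are hypothetical rather than taken from the report.

```python
class Core:
    """Idealized square-loop core holding a remanent "1" or "0"."""

    def __init__(self):
        self.state = 0

    def set(self):
        self.state = 1

    def clock(self):
        """Drive the core to "0"; an output pulse (1) occurs only if it held a "1"."""
        pulse = self.state
        self.state = 0
        return pulse


def or_gate(x, y):
    """Return the phase B readout for inputs x and y (each 0 or 1)."""
    x_core, y_core, storage = Core(), Core(), Core()
    # Read the inputs into the input cores.
    if x:
        x_core.set()
    if y:
        y_core.set()
    # Phase A: both input cores are driven to "0"; any output pulse sets the storage core.
    pulses = [x_core.clock(), y_core.clock()]
    if any(pulses):
        storage.set()
    # Phase B: the storage core is reset; a pulse appears only if it had been set.
    return storage.clock()


if __name__ == "__main__":
    for x in (0, 1):
        for y in (0, 1):
            print(x, y, or_gate(x, y))
```

Note that the readout is destructive in this model: after phase B every core is back in the "0" state, which is why the result exists only as a pulse during the clock period.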
The AND function is not as easily implemented unless a coincident
current threshold technique is employed to set the storage core. This
technique does not appear to be sufficiently reliable however, due to the
associated threshold and drive tolerances normally encountered in a typical
system. A more conventional system employs the principle of logical
negation in combination with the OR gate to provide the AND function.
For example, consider the negation arrangement of figure 2.
Figure 2 Negation (dummy core acting as "1" generator, X input core and inhibit core, clock A and clock B)
The upper core is used as a "1" generator which in the absence of an input from the X core
causes the inhibit core to be set by the phase A clock. The phase B clock will then
generate an output whenever the X signal is absent and thus represents the negation of the
input. When both the "1" generator and X input signal appear simultaneously at the inhibit
windings they effectively cancel each other and the inhibit core remains in the "0" state.
The phase B clock in driving the inhibit core to the "0" state will not generate an output
signal for this case.
The principle by which the AND function may be performed is based on the well known logic
relation $\overline{\bar{X} + \bar{Y}} = X \cdot Y$. A block diagram of a typical AND gate
scheme is shown in figure 3.
Figure 3 Block Diagram, AND Function (X and Y are each negated, the negated signals are ORed, and the result is negated again to give X · Y)
Since each of the logic modules requires two clock periods and each operation is performed
in sequence, the output signal is seen to appear six clock periods after the inputs were
applied. If the resultant output of the AND function is to be further combined with other
AND-OR operations it becomes evident that the total number of clock periods required may
become prohibitive.
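The six clock period figure follows directly from the block diagram, and can be checked with a short behavioral sketch. This is an illustration under stated assumptions: the two input negations are taken to operate in parallel, every module is charged two clock periods, and the function names are hypothetical.

```python
CLOCKS_PER_MODULE = 2  # one phase A and one phase B period per logic module


def negation(x):
    """Inhibit core: set by the "1" generator unless an X pulse cancels it."""
    one_generator = 1
    return 1 if (one_generator and not x) else 0


def or_module(a, b):
    """Storage core set by either transmitting core."""
    return 1 if (a or b) else 0


def and_gate(x, y):
    """AND built from negation and OR; returns (output, clock periods used)."""
    clocks = 0
    # Stage 1: negate X and Y (assumed to run side by side in the same two periods).
    not_x, not_y = negation(x), negation(y)
    clocks += CLOCKS_PER_MODULE
    # Stage 2: OR of the negated inputs.
    either_absent = or_module(not_x, not_y)
    clocks += CLOCKS_PER_MODULE
    # Stage 3: negate the result, giving X AND Y.
    out = negation(either_absent)
    clocks += CLOCKS_PER_MODULE
    return out, clocks


if __name__ == "__main__":
    for x in (0, 1):
        for y in (0, 1):
            print(x, y, and_gate(x, y))
```

Cascading further AND-OR levels in this model simply adds two clock periods per sequential module, which is the growth the text describes as potentially prohibitive.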
In view of the system complexity and speed limitations suggested by the simple example
described, magnetic logic is seen to introduce problems of system organization which are
alien to conventional DC level logic. As far as cost and reliability are concerned, the
prospect of winding cores with several turns and the large number of cores and connections
required do not appear to provide a significant cost advantage. In the hybrid approach the
use of additional components such as diodes and resistors appears to seriously negate the
basic reliability inherent to the magnetic material. These difficulties notwithstanding,
several companies are active in the manufacture of magnetic logic modules. The major
emphasis has been placed on the usefulness of the magnetic shift register to provide cost,
size and power advantages over the conventional approach. Magnetic shift registers
employing the hybrid approach have been successfully applied to a wide range of airborne
equipment. Sequential programmers, counters and timers operating at low clock rates
represent the majority of applications. When operating at shifting rates higher than 10 kc
however, the advantage that the magnetic shift register has in consuming negligible
standby power is obscured by a power requirement which is often greater than that of the
solid state counterpart. A leading supplier of hybrid magnetic logic modules and shift
registers is currently marketing a 10 bit shift register which requires a maximum average
power of 0.4 watts to operate at 10 kc and 3.7 watts at 750 kc. Since it appears
reasonable to assume that these power requirements are reflected also to general logic
systems, the application of hybrid magnetic logic to power-limited environments is limited
to systems whose shift rate is very low.
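The two quoted operating points can be put on a per-shift basis with simple arithmetic. The sketch below only divides the quoted maximum average power by the shift rate; it assumes nothing about how power varies between the two points, and the function name is hypothetical.

```python
REGISTER_BITS = 10

# (shift rate in cycles per second, maximum average power in watts), as quoted in the text.
OPERATING_POINTS = [(10e3, 0.4), (750e3, 3.7)]


def energy_per_shift_microjoules(rate_hz, power_w):
    """Upper bound on the energy drawn per shift of the whole register."""
    return power_w / rate_hz * 1e6


if __name__ == "__main__":
    for rate_hz, power_w in OPERATING_POINTS:
        per_shift = energy_per_shift_microjoules(rate_hz, power_w)
        print(f"{rate_hz / 1e3:.0f} kc: {power_w} W, "
              f"{per_shift:.2f} microjoules per shift, "
              f"{per_shift / REGISTER_BITS:.2f} per bit")
```

On these figures the energy per shift is lower at the higher rate, yet the total power still rises roughly ninefold, which is the quantity that matters in a power-limited vehicle.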
D. All-Magnetic Logic
The obvious limitations of the hybrid approaches in reliability and cost have to some
extent motivated an effort to develop systems using only magnetic material and connecting
wire. Several novel approaches were developed which made use of magnetic device geometry
to achieve coupling isolation, flux gain and unilateral information flow. One of these
devices is the Multi-Aperture Device (MAD),2,3 a three aperture ferrite structure similar
to the Transfluxor.4 Input-output isolation is possible because the flux stored around the
minor output aperture may be sensed non-destructively without affecting stored flux about
the input aperture.
Shown in figure 4 is a typical MAD shift register developed at Stanford Research Institute.
Figure 4 S.R.I. MAD Shift Register (O and E elements with ADV. O→E, CLEAR O, ADV. E→O and CLEAR E lines)
An advance current is applied to the parallel connection of output and input aperture
windings in order to effect information transfer from the transmitting core to the
receiving core. In accordance with the state of the flux stored around the transmitting
aperture and the resultant magnetic threshold thereby established, the advance current
will divide between the input and output windings. If the transmitting aperture is in the
"0" or cleared state the advance current will divide equally, thus not exceeding the
magnetic threshold of either aperture. If a "1" were stored the output aperture with its
lower threshold is swamped by the advance current and the transmitter switches flux
locally about its output aperture with low values of current. By voltage or impedance
steering the majority of advance current will flow through the receiver input aperture
causing it to exceed its setting threshold and be set. In time as the flux switching is
completed, both currents will return to their nominally equal values.
Since the read-out and transfer process is nondestructive to the state of the core, a
clear line threading the major aperture is required to return the core to the reset
condition. In order to provide information flow from left to right a basic four clock
cycle is required with the following sequence: ..., ADV. O→E, CL. O, ADV. E→O, CL. E, ...
The ADV. O→E pulse switches flux locally about the output aperture of the O element and
causes the E element to be set. The CL. O pulse then clears the O element and in so doing
switches flux through the output winding. This results in a loop current flow that
negatively sets the E element receiver without affecting the flux state about the output
aperture of the E element. Note that neither the ADV. O→E nor CL. O pulse causes any flux
to be switched in the output leg of the E element, thus eliminating the need for a diode
to prevent backward data transfer. In this manner unilateral data transfer is possible
using only MAD devices and conducting wire.
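The four clock cycle can be followed with a small state model. This is a behavioral sketch only: each element is reduced to a set or cleared bit, flux thresholds and loop currents are ignored, and the list layout (O elements at positions 0, 2, 4, ... and E elements at positions 1, 3, 5, ...) is an assumption made for the illustration.

```python
O_ELEMENTS, E_ELEMENTS = 0, 1  # starting list position of each kind of element


def advance(cores, source):
    """Non-destructive transfer: each set source element sets its right-hand neighbour."""
    for i in range(source, len(cores) - 1, 2):
        if cores[i]:
            cores[i + 1] = 1


def clear(cores, kind):
    """Clear pulse on the major aperture of every element of one kind."""
    for i in range(kind, len(cores), 2):
        cores[i] = 0


def four_clock_cycle(cores):
    """Apply ADV. O->E, CL. O, ADV. E->O, CL. E once and return the register."""
    advance(cores, O_ELEMENTS)
    clear(cores, O_ELEMENTS)
    advance(cores, E_ELEMENTS)
    clear(cores, E_ELEMENTS)
    return cores


if __name__ == "__main__":
    register = [1, 0, 0, 0, 0, 0]  # a single "1" stored in the leftmost O element
    for _ in range(3):
        print(four_clock_cycle(register))
```

In this model a stored "1" moves two elements to the right per cycle, that is, one bit position in a two element per bit register, and leaves nothing behind in the elements it has passed.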
Thus far the discussion has been devoted to techniques for achieving unilateral data
transfer with the S.R.I.-MAD approach. The problem of achieving reasonable flux gain and
fan-out is one which could not be solved in a practical sense with the simple transfer
scheme previously discussed. H. D. Crane has done much of the work in arousing interest in
the all-magnetic MAD approach. In a paper5 describing the design of a moderate sized
computing system using S.R.I.-MAD devices however, the basic transfer gate had to be
seriously modified in order to operate in the system. Problems inherent to the flux
threshold relationship between receiving and transmitting apertures, flux gain, fan-out as
well as flux decay and build-up in circulating loops made such modifications necessary. As
a consequence the revised gate module required flux doubling and clipping operations in
addition to the previously described clear and advance cycles. The complexity involved in
the resultant device implementation appears to be a serious encumbrance. The system chosen
to demonstrate the ability of all-magnetic devices took the form of a decimal arithmetic
unit with the ability of performing addition, subtraction, and multiplication. The system
was made exclusively of modules which perform either the two input OR function or the two
input OR with negation (NOR).
Rather than describe the complex details of the S.R.I.-MAD logic gates it appears more
reasonable to present an alternate approach to the design of MAD devices developed by
Company I. In this approach a priming operation is performed to reverse the flux stored
about the transmitting aperture prior to readout. The readout process in this case is
destructive and resets the core. The priming operation provides an adequate flux level
which, when reversed by the clear or transfer operation, delivers an output pulse to set
the next core through its major aperture. Since data flow is from minor aperture to major
aperture and since the state of a core is not disturbed by reverse currents flowing
through a minor aperture, the possibility of reverse data flow is prevented.
The flux conditions present for the various states of a typical MAD element of this type
(referred to as Device 2) are shown in figure 5.
Figure 5 Device 2 Flux States: a) reset or cleared state, b) set state, c) set core after priming, d) reset core after priming
In the cleared state (figure 5a) the core is saturated in the clockwise direction by a
previously generated advance current which threads the major aperture. Upon application of
an input signal threading the inner portion of the major aperture, the flux nearest the
major aperture is reversed thus providing the set condition shown in figure 5b. This
read-in operation does not affect the flux linking the output aperture and thus a diode is
not required to block data transfer to receiving cores. In order to obtain an output from
a properly set core it is necessary to provide a prime current as shown in figure 5c to
reverse the flux stored about the output aperture. Priming current is of a lower magnitude
than the advance current and because of its slow rate of change is not sufficient to cause
the core linked by the output winding to be disturbed. Once a core has been set and
primed, the application of an advance current causes a flux reversal about the output
aperture. This in turn provides an induced voltage of sufficient magnitude to drive the
next core to the set condition. If the core was initially in the reset condition it will
remain in this condition after priming (figure 5d). For this case, the application of the
advance current does not provide a flux reversal and thus no output occurs.
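The set, prime and advance behavior of a single element may be summarized as a small state machine. The sketch is an idealization: flux patterns are reduced to two flags, and the case of an advance applied to a set but unprimed core is assumed here to give no usable output, which the text implies but does not state outright. The class and method names are hypothetical.

```python
class Device2Core:
    """Idealized element with destructive readout and a separate priming step."""

    def __init__(self):
        self.is_set = False   # flux near the major aperture reversed by an input
        self.primed = False   # flux about the output aperture reversed by priming

    def apply_input(self):
        """Input pulse through the major aperture puts the core in the set state."""
        self.is_set = True

    def prime(self):
        """Priming changes the output aperture flux only if the core is set."""
        self.primed = self.is_set

    def advance(self):
        """Advance (clear) pulse: returns 1 if an output pulse is delivered, then resets."""
        pulse = 1 if (self.is_set and self.primed) else 0
        self.is_set = False
        self.primed = False
        return pulse


if __name__ == "__main__":
    core = Device2Core()
    core.apply_input()
    core.prime()
    print(core.advance())   # set and primed core delivers an output pulse
    core.prime()
    print(core.advance())   # the readout was destructive, so no further output
```

Chaining such elements, with each output pulse applied as the input of the next core, gives the PRIME and ADVANCE sequencing used in the shift register arrangements described next.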
Device 2 elements may be connected in a variety of shift register configurations including
parallel input-parallel output, parallel input-serial output, serial input-serial output,
etc. Such shift registers take the form of 2 core-per-bit arrays and require a two clock
system in combination with a priming source. A typical serial input-serial output shift
register section is shown in figure 6.
Figure 6. Device 2 Shift Register (O and E elements with PRIME, ADV. O→E and ADV. E→O lines)
The propagation of a "1" from left to right proceeds by activating clock and prime signals
in the following sequence: ... PRIME, ADV O→E, PRIME, ADV E→O, PRIME, ADV O→E, ....
AMP-MAD shift registers require relatively high values of pulse current for performing
advance, prime and set operations. Nominal operating level for the advance current is 2 to
3 amperes in a typical design. Prime and set pulse currents are lower, being 100 ma and
250 ma respectively. Because of the requirement for slow priming and in order to keep
average power dissipation at reasonable levels, these shift registers are limited to
repetition rates of 10 Kc. A typical driver, which utilizes a capacitive storage-discharge
scheme and dual Shockley diodes for triggering the advance currents, requires an average
power of 5.3 watts to drive a 10 bit shift register at 10 Kc. A 10 bit shift register with
its associated driver requires a package occupying approximately 9 cubic inches.
The implementation of general logic operations using MAD devices is not easily
accomplished, due to the difficulty of achieving logical inversion and reasonable fan-out
without an imposing complexity. The treatment of much of the general logic capabilities of
MAD devices is reported in rather implicit terms by the current literature. The OR
function may be provided relatively simply by threading additional windings about the
input aperture if care is taken in preventing reverse information transfer. The negation
operation may be achieved by extending the current inhibiting and "one" generator
technique described in the hybrid approach to the MAD topology. Perhaps the most difficult
problem which faces the all-magnetic logic designer is that of providing fan-out. This
arises from the fact that all the power which is used to provide inputs to receiving cores
comes from the clock source. Power gain in the ordinary sense is not available except in
those hybrid schemes which use transistors to provide regeneration. A MAD device with a
reliable fan-out of two is sufficient, however, to allow the performance of general
logical operations requiring much greater fan-out. This may be accomplished by utilizing
additional clock pulses to sequentially transfer data in a "tree" wiring arrangement until
the original single core data is available simultaneously in several cores. As far as
fan-out is concerned, it appears that the hybrid approach using transistors provides an
important advantage over the all-magnetic techniques which necessarily require
considerable device and system complexity to achieve the same result.
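The cost of the "tree" arrangement can be estimated with a short calculation. The sketch assumes that every core in the tree drives exactly two cores and that each level of the tree requires one further sequential transfer; the function names are hypothetical, and the number of clock pulses per transfer depends on the device and is left out.

```python
def transfer_levels(required_copies, fan_out=2):
    """Number of sequential transfer levels needed before the data sits in
    at least `required_copies` cores at the same time."""
    copies, levels = 1, 0
    while copies < required_copies:
        copies *= fan_out
        levels += 1
    return levels


def cores_in_tree(levels, fan_out=2):
    """Total cores used by a full tree of the given depth, including the source core."""
    return sum(fan_out ** level for level in range(levels + 1))


if __name__ == "__main__":
    for needed in (2, 4, 8, 16):
        depth = transfer_levels(needed)
        print(needed, depth, cores_in_tree(depth))
```

Under these assumptions a fan-out of eight costs three additional transfer levels and fifteen cores, where a transistor stage in a hybrid scheme could drive the same loads directly.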
E. Summary and Conclusions
The foregoing description of magnetic logic has not attempted to
describe the variety of possible approaches. The techniques for accomp-
lishing general logical operations have been implicit, reflecting the treat-
ment of the current literature. Examples from two general classes of
magnetic devices have been described to provide a basic understanding of
the techniques involved. If the approaches described may be regarded as
typical, then some conclusions about their utility may reasonably be expected
to apply in a general sense.
Information regarding transfer and shifting operations is covered in
considerable detail by current literature, but the treatment of general
magnetic logic schemes has been seriously neglected. This suggests the
degree of difficulty which has been encountered in the design of practical
devices. Complex clock programming and device configurations are necessary to
achieve operations which conventional designers have come to consider as
trivial. In general, magnetic devices do not display a natural ability
for performing logic. The primary attribute of magnetic devices is that
of non-volatile storage, the ability of a core to remain in a particular
state indefinitely without further application of energy. This feature is
an important consideration in power limited environments such as space
vehicles where the standby power between clock pulses may be made to approach
negligible values. If the clock processing rate exceeds approximately 10 Kc,
however, the average power required often exceeds that of a conventional
transistorized counterpart. This limits the application of magnetic shift
registers, timers, etc. to equipment with low clock rates.
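The clock rate limit follows from the fact that a magnetic stage draws energy only when it is clocked, so its average power grows with the clock rate. The Python sketch below shows the arithmetic under stated assumptions: the transistor stage is taken to dissipate a constant power regardless of clock rate, the function names are hypothetical, and the numerical values are placeholders chosen for illustration, not data from this report.

```python
def magnetic_average_power(energy_per_pulse_j, clock_rate_hz, standby_power_w=0.0):
    """Average power of a clocked magnetic stage: standby power plus
    the energy delivered per clock pulse times the pulse rate."""
    return standby_power_w + energy_per_pulse_j * clock_rate_hz


def break_even_clock_rate(energy_per_pulse_j, transistor_power_w, standby_power_w=0.0):
    """Clock rate at which the magnetic stage draws the same average
    power as a transistor stage of constant dissipation (assumed)."""
    return (transistor_power_w - standby_power_w) / energy_per_pulse_j


if __name__ == "__main__":
    # Placeholder values, not measurements: 1 microjoule per clock pulse
    # and a 10 milliwatt transistor stage.
    energy = 1e-6
    p_transistor = 10e-3
    print(break_even_clock_rate(energy, p_transistor))  # about 1e4 pulses per second
    print(magnetic_average_power(energy, 100.0))        # well below 10 mW at 100 c/s
```

Below the break-even rate the magnetic stage has the lower average power; above it the transistor stage does.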
Recent advances in low power microminiaturized devices are seriously
challenging the magnetic attribute of zero standby power while providing
higher speed, smaller size and the greater utility of combinational DC
logic. NASA's Lewis Research Center is sponsoring much of the work in this
important area. Operating speeds of several newly developed circuits are
approaching 100 Kc at power levels in the microwatt range. A complete
logic system with a power consumption of 10 microwatts per stage is
anticipated for space application using micropower logic circuits. With the
basic reliability of microminiaturized devices constantly improving by
virtue of an industry-wide effort, the role of magnetic logic appears to
be fading.
Another advantage claimed for magnetic devices is the reliability in-
herent in the use of magnetic material and connecting wire. It is assumed
here that magnetic parameters affected by temperature have been compensated
for by proper design and that clock current amplitude and rise time are
within the limits of proper operation. Under these conditions the basic
mechanism of magnetic storage and switching appears devoid of any known
failure mode. This reliability is, however, obscured by the large number
of connections required by the device configuration and the complexity
inherent to the system organization. The reliability of a magnetic system
depends upon the connective paths and the clock pulse drivers.
Simplicity and low cost are often claimed as virtues of magnetic
devices because of the simplicity and low cost of the basic cores utilized.
It should be noted, however, that the task of providing several turns about
the various apertures and connecting cores in a configuration to perform
the basic logical operations of AND, OR and negation is not generally
amenable to automated assembly. The extensive amount of hand wiring and
soldering appears to represent an item of considerable cost.
The physical size of magnetic devices is generally one or two
orders of magnitude larger than that of their microminiaturized counterparts.
Advances in thin film magnetic logic hold some promise for a significant
size reduction, but developments in this area have not been extensively
reported to date.
The flexibility of magnetic devices is seen to be severely limited
by the dynamic logic approach and the difficulty of achieving reliable fan-out
in the absence of active devices. The flexibility of conventional
DC logic systems is evidently superior because of the power gain and the
inherent signal level standardization.
After considering the attributes of magnetic devices for performing
general logic, the popular core techniques do not appear to provide an
evident superiority in power consumption, reliability, simplicity, cost,
size and flexibility over the conventional solid state circuit approach.
Indeed, the requirements of performing the logical operations characteristic
of digital computers appear to be at variance with the capabilities of
magnetic logic. The applications which are best suited to magnetic imple-
mentation are those in which the operations to be performed are not clearly
separated into "logic" and "memory". A strong case can be made for magnetic
circuits applied to the performance of integrated storage and transfer
operations required by a variety of digital processing functions. Most
appropriate are the low speed operations inherent in input-output, inter-
face and peripheral equipment. Typical applications include shift registers,
programmers, timers, sequencers, etc. where the magnetic modules perform
entire functions rather than discrete operations of storage and logic.
In these special applications where speed is low, the advantages in simpli-
city, reliability, cost and power to be gained through the use of magnetic
circuits should not be neglected. In general applications, however, the
presently developed magnetic circuits do not appear satisfactory due to the
several problems inherent in their use.
III. Semiconductor Logic
A. Introduction
In contrast with the numerous disadvantages and the general
unavailability of magnetic logic devices, conventional semiconductor logic
has been used widely. Logic modules are commercially available for
construction of general logic systems. Integrated semiconductor circuits
offer an order of magnitude reduction in size compared to magnetic logic
modules; they do not require high voltage or high peak power pulses.
They operate at frequencies many times greater than comparable magnetic
logic requiring the same average power, and provide the convenience of
steady voltage outputs.
Integrated semiconductor circuits offer a significant size and
power reduction compared to discrete component semiconductor circuits.
The rapid acceptance of integrated semiconductor logic elements attests
to the advantages of their use. Therefore, integrated circuits have been
chosen as more suitable for spaceborne digital applications than the dis-
crete component circuitry. The circuit design problem is then translated
to the problem of the choice of suitable types of circuitry and logic.
A variety of such elements is available with predictable characteristics
for a wide range of operating environments. The selection by the Air
Force of integrated circuitry for use in the improved Minuteman is a
significant factor in the availability of reliable integrated circuits and
appropriate reliability data. There is also a large amount of government
and industry effort devoted to research and development of new and improved
integrated circuits.
The low weight and power consumption of integrated circuits offer
an important compensation for the increase in the number of circuits required
for redundant design of spaceborne equipment. It is expected that advances
in integrated circuit technology will allow more complex circuits to be
included within a single package to further decrease size and weight.
Integrated circuits also offer significantly improved reliability performance;
it is expected that the reliability of a single chip containing an entire
function can be shown to approach that of a single discrete transistor.
The low power consumption characteristic also tends to increase reliability
by reducing temperature stress. The significant reduction in the number of
interconnections is also an important factor in reliability improvement.
Most integrated logic modules are available in the form of a univer-
sal gate function (NAND or NOR). These logic elements are quite appropriate
for the construction of the restoring function required for a multiple line
majority voted redundant system. Several types of logic available for the
universal gate function have been studied. Each basic type is described
below; those commonly available are compared for suitability for use in
spaceborne redundant systems. One of these is chosen as particularly suit-
able.
B. Classification of Basic Types of Logic
It appears that most of the common types of transistor logic (TL)
may be classified according to three basic coupling schemes used for the
universal gate function. They are described below.
I. Linear impedance coupling to an input transistor may be used
to form R-TL, as shown in figure 7. This type of logic is generally not
available in integrated circuit form.
Figure 7 R-TL Resistor-Transistor Logic (+NOR)
II. Direct coupling to a multiple output transistor array (DC-TL)
may be used as shown in figure 8. It is commonly used in the more practical
modified forms, such as R-DC-TL (type II-A) shown in figure 9. An impedance
is inserted in each input line to improve operational characteristics.
Although this type of logic is sometimes referred to as resistor coupled-
transistor logic, its operation is not the same as R-TL, described above.
Figure 8 DC-TL Direct Coupled-Transistor Logic (+NOR)
Figure 9 R-DC-TL Resistor-Direct Coupled-Transistor Logic
Type II-B coupling involves current switching and output buffering
to prevent saturation of the input transistors. This type of logic is
sometimes referred to as emitter coupled-transistor logic (EC-TL) or current
mode-transistor logic (CM-TL). One type of non-saturated-direct coupled-
transistor logic (NS-DC-TL), which uses an emitter-follower output buffer,
is shown in figure 10.
Figure 10 NS-DC-TL Non-Saturated-Direct Coupled-Transistor Logic
III. Diode coupling uses non-linear input summing to form the
logical AND or OR function. The most common form of D-TL is shown in
figure 11, which performs the positive logic NAND (AND-NOT) function.
Saturation of the output transistor may be prevented by limiting the
minimum saturation voltage, as shown in figure 12. This results in a more
constant "zero" output voltage, and diverts excess base current to improve
transient response.
Figure 11 D-TL Diode-Transistor Logic (+NAND)
Figure 12 NS-D-TL Non-Saturated-Diode-Transistor Logic
Type III-A coupling, shown in figure 13, is a variation referred to
as T-TL which uses transistor coupling to obtain improved response.
Logic operation is equivalent to D-TL when inverse transistor gain (_ i )
is low; coupling transistor action removes stored charge during turn-off,
and generally permits the elimination of the output transistor base bias
resistor.
Figure 13 T-TL Transistor-Transistor Logic
C. Comparison of Logic Types
A comparison of the types of circuits described above is shown in
the table below for five types which are commercially available. They are
arranged in the table in increasing order of the number of equivalent com-
ponents required for a 3-input universal gate function. A larger number
of components generally increases fabrication complexity and increases
power dissipation. The general characteristics of these logic con-
figurations are discussed and compared in the paragraphs following the
table.
The isolation and speed-power rankings for the three saturated
logic types were obtained from "The Changing Prospective in Microcircuits",
Electronic Design, February 15, 1963, p. _6. This article describes the
result of a study of different types of logic for single substances
conducted by PSI. The author observes that no one logic type is superior to
all others for every application, but rather that the characteristics of
each type must be considered according to the particular over-all system
requirements.
The isolation ranking is a qualitative measure of the
input loading, the isolation between inputs, noise immunity, and varia-
tion of input loading with parameter changes, internal failures, and out-
put loading. Logic types with the highest isolation are ranked first;
those with lower isolation are ranked in increasing order. The
non-saturated logic types are inserted into the original ranking by a
comparison of their general characteristics with those of the three saturated
logic types.
The speed-power ranking is a quantitative measure of the product
of propagation delay and power dissipation of the different logic types
when similar components and techniques are used in fabrication. This
characteristic varies considerably according to the design and technology
used for the construction of actual circuits. Logic types with the lowest
power-speed product are ranked first; those with higher power-speed
products are ranked in increasing order. The non-saturating logic types
are inserted into the ranking order indicated according to available data.
TABLE I COMPARATIVE RANKING OF AVAILABLE LOGIC TYPES
Name       Function for   Type of    Number of    Speed-Power   Isolation
           + Logic        Coupling   Components   Ranking       Ranking
T-TL       NAND           III-A      3            1             4
D-TL       NAND           III        5            3             2
NS-D-TL    NAND           III        6            2             3
R-DC-TL    NOR            II-A       7            5             5
NS-DC-TL   NOR            II-B       9            4             1
D. Description of Logic Types
Resistor-transistor logic (R-TL) is a basic scheme for providing
the NOR function for NPN positive logic. The resistors are used for linear
input summing into the output transistor, which is normally biased off
unless at least one input is present. The bias may be increased to provide
either the inverse majority or the NAND output. The addition of speed-up
capacitors to the input resistors, although significantly improving transient
response, is not sufficient to reduce the power-speed product to that
available with other types of logic. The bilateral interconnection may create
interaction problems between inputs; performance of the device is sensitive
to variations of the input resistors, biasing, and transistor gain. The
difficulty of fabricating an integrated resistor-capacitor combination for
each input further decreases the suitability of this type of logic.
Direct coupled-transistor logic (DC-TL) is a theoretically simple
method of performing the NOR function for NPN positive logic. Inputs are
applied directly to transistor bases; the common collector is the output.
Actual operation, however, is limited by the high sensitivity to parameter
variations, input current "hogging" and low input impedance which limits
fan-in and fan-out, and the low noise margin. These severe limitations
have resulted in the actual use of a modified version (R-DC-TL) which includes
a low impedance resistor-capacitor combination on each input to reduce the
sensitivity to noise, parameter variations, and current "hogging". This
modification increases power dissipation, propagation delay, and fabrication
complexity. Since the fan-out capability of most NPN positive logic NOR
schemes is derived from the output collector resistor, the power
dissipation must be increased to allow fan-out capability regardless of
whether the fan-out is used or not.
The basic DC-TL scheme may be modified to provide non-saturated
input logic (NS-DC-TL). The common emitter resistor reduces the problems
of input current "hogging", and increases input impedance so that this
type of logic offers high input isolation. Various methods may be used
to provide outputs; both the OR and NOR may be provided conveniently.
Good matching of components and close tolerance on a special reference
voltage supply are required. The clocking function may be obtained by
controlling the negative voltage supply by gating or a sinusoidal voltage.
A two phase clock is required for flip-flop functions more complex than
simple storage. An additional transistor, which shares a common collector
with other input transistors, is required for each input. The voltage
difference between the "1" and "0" level is usually very small, resulting
in reduced DC stability and noise margin. NS-DC-TL offers high speed oper-
ation at the expense of high power dissipation.
Diode-transistor logic (D-TL) is probably the most popular type of
integrated circuit logic, due to its similarity to discrete component
circuitry and the excellent operating characteristics. D-TL circuitry
operates with wide parameter variations to minimize the possibility of
malfunction due to drift failure. Actual failure testing has shown that
redundant D-TL is not sensitive to most catastrophic failures. D-TL is
most commonly available as NPN positive logic NAND integrated circuits.
The newer versions of commercially available D-TL circuits offer about the
lowest power-speed product available for circuits operating at moderate
speeds and with good noise margins. Consideration of integrated circuit
characteristics has significantly reduced the number of individual
isolated components compared to the number of discrete components required
for an equivalent circuit. The entire input diode array, as well as one
level-shifting diode, may be constructed as one multiple-emitter transistor.
Each additional input merely requires an additional emitter connection.
Transistor-transistor logic (T-TL) is a simplified variation of
D-TL employing transistor coupling directly to the base of the output
transistor. The elimination of one coupling diode reduces the noise margin
and voltage swing to about the equivalent of DC-TL. Input isolation is
similar to D-TL, except that inverse gain of the coupling transistor allows
some "hogging" of input current. The inverse gain cannot be reduced without
increasing the offset voltage of the coupling transistor; increased offset
voltage, in turn, decreases DC stability and noise margin. Increased
speed at low power levels is possible because the coupling transistor
removes stored charge from the output transistor to reduce turn-off time.
The output inverter of D-TL may be designed to prevent saturation
to reduce excess drive and stored-charge effects. This may be accomplished
by limiting the minimum "0" output voltage by a base to collector clamp
to prevent saturation of the output transistor, as shown above for
non-saturated diode-transistor logic (NS-D-TL). The increased "0" output
voltage will, however, be more constant with increases in output loading,
if sufficient gain is available. Logic operation is equivalent to D-TL
with increased speed and lower power dissipation under comparable
conditions. Additional gain may be easily obtained for D-TL by
substituting an emitter follower for the final level shifting diode.
The speed-power performance of some of the commonly available
logic elements is shown in figure 14. This figure
shows the advertised performance characteristics of different logic types
available from different suppliers.
Figure 14 Speed-Power Performance (plot not reproduced; horizontal axis: average power dissipation, P, in mW; points labeled by logic type and supplier)
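The speed-power characteristic used for ranking is the product of propagation delay and power dissipation, so it can be computed directly from a supplier's advertised figures. The Python sketch below shows the arithmetic and the ranking rule (lowest product first). The function names are hypothetical and the gate values are placeholders for illustration; they are not read from figure 14.

```python
def speed_power_product_pj(delay_ns, power_mw):
    """Product of propagation delay (ns) and power dissipation (mW).
    1 ns x 1 mW = 1e-9 s x 1e-3 W = 1e-12 J, so the result is in picojoules."""
    return delay_ns * power_mw


def rank_by_speed_power(gates):
    """Return gate names ordered from lowest to highest speed-power product.
    `gates` maps a name to a (delay_ns, power_mw) pair."""
    return sorted(gates, key=lambda name: speed_power_product_pj(*gates[name]))


if __name__ == "__main__":
    # Hypothetical gates; the values are placeholders, not data from figure 14.
    example = {
        "gate X": (60.0, 5.0),
        "gate Y": (25.0, 20.0),
        "gate Z": (100.0, 2.0),
    }
    for name in rank_by_speed_power(example):
        print(name, speed_power_product_pj(*example[name]), "pJ")
```

Because test conditions such as temperature, fan-out and supply voltage are not standardized between suppliers, a ranking of this kind is only as comparable as the figures fed into it.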
The wide variation of performance characteristics for
different suppliers of the same logic types is due to several causes:
differences of circuit parameter design, lack of standard test conditions
(temperature, fan-out, voltages, etc.), as well as the rapidly improving
technology in this field. Two recently announced improved versions of
previous elements (Company A D-TL and Company D R-DC-TL) are indicated
in the figure. The rapid rate at which improvements have been made in
the field of integrated circuits makes it impractical to make an arbitrary
decision to use only one logic element for all future spaceborne redundant
systems. General characteristics, as well as the specific requirements
of redundant systems, may be used to make recommendations, however,
based on available information. The general characteristics discussed
below may be used as a guide to the choice of circuits, even though
exact requirements may vary.
Since systematic redundancy is most efficient and powerful when
the basic elements are highly reliable, the realization of high system
reliability with minimum weight and power penalties requires circuitry with
high basic reliability. High circuit reliability, especially for extended
periods of time, is usually realized when the circuit configuration is such
that proper operation is not excessively sensitive to parameter variation
or environmental extremes. High speed performance does not appear to be
a particular requirement for most spaceborne systems; low power dissipation
is a much more desirable characteristic. Available power (and total
energy) is often limited on space missions; the additional circuitry
required to reduce the probability of system failure will further emphasize
this problem. The power required by individual circuits must be held to
a minimum to keep total power within available limits. The reliability
performance of most integrated circuits depends on the temperature stress.
The use of low power circuitry is an important factor in reducing the
temperature stress, which, in turn, improves the basic reliability and
performance characteristics of the individual elements.
Although T-TL offers high speed at low power levels, its
sensitivity to parameter variation, noise, and input current "hogging"
has reduced the general suitability of T-TL. This sensitivity appears to
be a major disadvantage because the individual circuits in a redundant
spaceborne system are required to operate reliably despite severe
environmental variations and the occurrence of failures within the system. Since
inverse transistor action can limit the input voltage signal, failures
within the circuit or on the output may affect the inputs. This transfer
of failure effects to inputs would be a serious disadvantage in redundant
systems, where the effect of failures must be minimized.
DC-TL appears to be even more sensitive to parameter variations
and failure effects, except for the various modifications which are used
to reduce this problem. Positive NOR logic appears to be particularly
vulnerable to output failures resulting in failure of input signals. This
occurs because the transistor turn-on current is obtained from inputs; any
input must be able to provide sufficient drive to cause the output to be
"0" for proper operation. Fan-out capability is obtained by providing
each output with the ability to drive several inputs. If actual failures
may cause all of the inputs to a circuit to be overloaded, then any other
circuit receiving any of these inputs is also effectively failed.
Additional fan-out capability is usually reflected in increased power
consumption, which, in turn, increases reliability problems.
In contrast, the turn-on current for positive NAND logic is obtained
within each logic element. This drive current is diverted to a low
impedance input whenever any input is "0". Fan-out capability is provided
by the output transistor gain, and may be increased without significantly
increased power requirements. Since drive current is provided by each
circuit, rather than by inputs, failures within a NAND circuit usually
do not affect proper operation of inputs. The back-to-back diode
coupling also offers good isolation characteristics. Actual failure testing
has verified that failure effects in D-TL are usually limited to the
circuit in which the failure occurs.
Limited testing for both the transient effect of
high gamma radiation and the permanent effect of integrated neutron flux
has shown that D-TL integrated circuits are more resistant to radiation
than forms of DC-TL.6 The transient effects of high gamma radiation appear
to be primarily due to the leakage of the collector isolation diode. DC-TL
is more susceptible because the larger number of common-collector
transistors used creates a larger junction area. DC-TL was seriously affected at
gamma levels of 10^6 to 10^7 R/sec, while one company's D-TL withstood an
order of magnitude increase. The same company's D-TL also showed more
resistance to integrated neutron flux, but no microcircuits showed damage
at ordinarily expected dosages. At a flux dose of 2.8 x 10^14 neutrons/cm^2
(equivalent to about 100 years of continuous exposure in the Van Allen belts),
one company's elements failed, another showed waveshape deterioration, while
another microcircuit brand and discrete component D-TL showed no noticeable
effects.
E. Logic Selection
Integrated D-TL circuitry appears to be the most appropriate type
of logic for general use in redundant logic systems for spacecraft missions.
It has been chosen for the general advantages of features described above,
and particularly for its suitability for use in redundant spaceborne
equipment, which requires both high immunity to noise and parameter variation,
as well as reasonably low power dissipation. These requirements are
generally not met by the various forms of DC-TL. Although T-TL logic
is equivalent to D-TL, currently available elements are too sensitive to
input current "hogging" to be suitable for use in redundant systems.
D-TL is known to have high noise immunity, good input-to-output
isolation, good compatibility with other circuitry and relatively low power
consumption. D-TL is particularly insensitive to drift failures; failure
testing has shown that the effect of most catastrophic failures is not
especially harmful in redundant logic networks. The speed capability of
available integrated D-TL circuits appears to exceed the requirements of
most spaceborne systems. Some of this excess speed capability may be
traded for lower power requirements by reducing the power supply voltages.
Power dissipation could be further reduced by a redesign of present D-TL
circuits to use higher resistance values. High resistance is a difficult
problem in present circuits, since the characteristically low resistivity
of diffused resistors requires a large area for high resistance
values. The use of thin film resistors and capacitors on the silicon block
in which the semiconductors are diffused, as planned by Westinghouse for
the near future, would permit circuit design for significantly lower power
dissipation without the large areas and narrow strip layout required for
totally diffused circuitry. Such single-chip hybrid circuits are not
presently available for general logic use.
It is expected that the positive logic NAND function will be
used, since this permits logic design of functions as the sum of products,
which is convenient for reduction and simplification by familiar methods.
The NAND circuits shown are particularly versatile, since the collector
outputs may be connected together to form AND-OR-NOT logic functions
directly. R-S flip-flops may be formed by interconnected NAND elements;
formation of more complex functions such as a compatible counter element
requires a large number of NAND elements and a two-phase clock. The majority
voter is not a commercially available element, but it is easily constructed
from NAND elements.
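The convenience of the NAND function for sum of products design can be illustrated with a small logic model. The Python sketch below is an illustration only, with hypothetical function names: it builds the conventional three-input majority function AB + BC + AC from two levels of NAND gates, and settles a cross-coupled NAND pair as an R-S flip-flop with active-low set and reset inputs. It models logic values only, not circuit behavior, and it is not the isolated-input voter discussed in the next section.

```python
from itertools import product


def nand(*inputs):
    """Positive logic NAND: the output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1


def majority_nand(a, b, c):
    """Three-input majority as a sum of products, AB + BC + AC,
    realized with two levels of NAND gates."""
    return nand(nand(a, b), nand(b, c), nand(a, c))


def rs_latch_step(set_n, reset_n, q, q_n):
    """Settle a cross-coupled NAND pair with active-low set and reset.
    Returns the new (q, q_n) state."""
    for _ in range(3):  # a few passes are enough for the pair to settle
        q = nand(set_n, q_n)
        q_n = nand(reset_n, q)
    return q, q_n


if __name__ == "__main__":
    for a, b, c in product((0, 1), repeat=3):
        print(a, b, c, majority_nand(a, b, c))
```

The first NAND level forms the complemented product terms and the second NAND level sums them, which is why a sum of products expression maps onto NAND elements without further manipulation.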
F. Majority Voter Design
Failure testing has shown that particular care must be used for
the design of restoring elements so that failures on one input to the
restorer do not cause failures on other inputs, and the failures in the
restoring elements do not cause failure of a majority of inputs. This
testing has shown that a conventional majority element (whether constructed
as the minimum discrete component circuit, or of interconnected NOR or NAND
elements) may experience failures which either cause immediate failure of
the entire set of restorers, or which would cause the same result if a
single input error occurs.7 If such effects are overlooked, the system
reliability may be seriously degraded. Shown in figure 15 is a three
input majority element using NAND elements which cannot cause an entire
set of restorers to fail due to any single failure.
Figure 15 Majority Element with Input Isolation (inputs A, B, C; output MAJ(A,B,C))
The NAND implementation shown utilizes common output logic so that
the voter requires only two more gates than conventional majority voters,
and retains a two element input to output propagation delay. NOR
implementation, however, would require a total of eight gates and four element
input to output propagation delay to obtain input isolation for NPN positive
logic. It is expected that the isolated input majority element shown will
be more reliable in normal operation (all inputs alike) than a more conven-
tional configuration, since very few single failure modes can cause the
output to disagree with the inputs when all inputs are identical.
If higher orders of redundancy are used, then each input is
provided with isolation gates. Since component redundancy is not used to
protect against single failures, a simple test consisting of monitoring
the logic output while applying all combinations of logic inputs will
completely test the operation of the circuit. A custom-packaged majority
voter would significantly reduce the size and weight of a redundant system
when compared to one using individual packages. The packaging of this
majority voter is of particular importance because it is used repetitively
in a redundant system.
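The claim that applying all input combinations completely tests the voter can be checked on a logic model. The Python sketch below is a hedged illustration: it uses the conventional four-gate NAND voter rather than the isolated-input circuit of figure 15, it assumes a single stuck-at fault on a gate output as the failure model, and the node and function names are hypothetical.

```python
from itertools import product

NODES = ("g_ab", "g_bc", "g_ac", "out")  # gate outputs that may be faulted


def nand(*inputs):
    """Positive logic NAND: the output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1


def voter(a, b, c, fault=None):
    """Conventional NAND majority voter. `fault` is None or a
    (node, value) pair that forces the named gate output to that value."""
    def node(name, value):
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    g_ab = node("g_ab", nand(a, b))
    g_bc = node("g_bc", nand(b, c))
    g_ac = node("g_ac", nand(a, c))
    return node("out", nand(g_ab, g_bc, g_ac))


def exhaustive_test_passes(fault=None):
    """Apply all eight input combinations and compare the output
    with the majority function."""
    return all(
        voter(a, b, c, fault) == int(a + b + c >= 2)
        for a, b, c in product((0, 1), repeat=3)
    )


def undetected_faults():
    """List the single stuck-at faults that the exhaustive test misses."""
    faults = [(name, value) for name in NODES for value in (0, 1)]
    return [f for f in faults if exhaustive_test_passes(f)]


if __name__ == "__main__":
    print(exhaustive_test_passes())  # fault-free voter
    print(undetected_faults())       # faults that escape the test
```

In this model every single stuck-at fault changes the output for at least one input combination, because the voter contains no redundant gates that could mask it.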
IV. Failure Testing of Redundant Systems
A. Introduction
1. Characteristics of Redundant Systems
The outstanding attribute of a redundant system is that of
providing high reliability for a longer period of time than the
non-redundant counterpart. Typical reliability curves depicting this
relationship for a simple system are shown in figure 16. It is assumed here that both
systems begin operation with all circuits, subsystems, wiring, etc. in a
failure free condition.
Figure 16 Reliability of Conventional vs. Redundant Systems (plot not reproduced; reliability from 0 to 1.0 against operating time for the redundant and conventional systems, with the MTBF of the conventional system marked on the time axis)
The statistical relationship between reliability and operating
time is derived by assuming that failures occur at a constant rate and are
inherently random and independent. After some period of operation without
maintenance, the reliability of a typical multiple line, majority voted
redundant system falls off and becomes lower than that of the non-redundant
version. This behavior is normal since the greater number of components
subject to statistical failure eventually causes the majority voters to have
incorrect outputs. The initially flat portion of the redundant system
reliability curve is the characteristic which is exploited to provide high
mission reliability.
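The shape of the two curves can be reproduced with the standard constant failure rate model. The Python sketch below is an illustration under stated assumptions: the redundant system is taken to be a single order three (triplicated) stage with an ideal voter, so that it works when at least two of three independent channels work, and the function names are hypothetical. Under these assumptions the curves cross where the channel reliability is one half, at an operating time of MTBF times ln 2.

```python
import math


def simplex_reliability(t, mtbf):
    """Reliability of the non-redundant system with a constant
    failure rate of 1 / mtbf."""
    return math.exp(-t / mtbf)


def tmr_reliability(t, mtbf):
    """Reliability of three independent channels with an ideal
    majority voter (assumed): at least two of three must work."""
    r = simplex_reliability(t, mtbf)
    return 3 * r**2 - 2 * r**3


def crossover_time(mtbf):
    """Operating time at which both systems have reliability 0.5."""
    return mtbf * math.log(2)


if __name__ == "__main__":
    mtbf = 1.0
    for t in (0.05, 0.1, 0.5, crossover_time(mtbf), 1.0):
        print(round(t, 3),
              round(simplex_reliability(t, mtbf), 4),
              round(tmr_reliability(t, mtbf), 4))
```

For operating times that are short compared with the MTBF the redundant value stays close to one, which is the flat portion of the curve; beyond the crossover the non-redundant system is the more reliable of the two.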
Since current spaceborne equipment is unattended after mission
commencement, it is important to assure that the equipment is in perfect
working order "before launch". It may not always be practical to completely
test each part of a redundant system after final assembly and installation
into a space vehicle, and thus the term "before launch" includes diagnostic
testing before final assembly. It will be shown that a redundant system
may be conveniently diagnosed for the presence of failures after final
assembly and installation in a space vehicle. This may be accomplished
during the pre-launch test period when the vehicle is about to begin its
mission. Essentially the technique employed is that of removing the failure
masking effects of redundancy and testing the replicated systems separately.
The function of these tests is initially to detect the occurrence
of a failure and secondly to determine its location. The tests would be
useful in deciding whether the equipment should be finally assembled and
installed into the space vehicle or if the equipment is free of failures
and ready for launch. The goal here is to assure that all of the initial
failure protection which has been designed into the system is available.
In a non-redundant system the best one can do is to test the system
and then hope that no failures occur. The statistical nature of failure
occurrence, however, offers little assurance that a failure will not occur
just after mission commencement. This occurrence often precipitates total
mission failure in a non-redundant system. The redundant counterpart is
obviously better suited to tolerate random failures. Further, a typical
order three redundant system which has been diagnosed to be free of failures
prior to mission commencement is not vulnerable to single failures and thus
offers a high degree of assurance of mission success.
Further tests would be utilized to isolate and locate the failure.
The goal here is to effect repair and thus return the system to perfect
working order. Since this may consume considerable time and involve special
repair or replacement facilities, a duplicate system, which has been found
free from failure, may be required to expedite scheduled installation into
the space vehicle.
For redundant systems which receive maintenance the purpose of
diagnostic testing is again to detect and locate failures. The goal,
however, is to return the system to perfect working order and thus assure the
highest possible reliability during the entire operational life of the
equipment. In order for periodic maintenance to be effective it follows that the
period between maintenance checks should be sufficiently short so that the
reliability for the maintenance period is high. The probability of operation
repeatedly traverses the initially flat portion of the redundant reliability
curve.
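The effect of the maintenance interval may be illustrated numerically. The
following sketch is not taken from this report. It assumes an order-three
majority-voted system with perfect voters, identical and independent ranks,
a constant failure rate, and maintenance which returns the system to perfect
working order at the end of each interval. The failure rate and mission
length are hypothetical values chosen only for illustration.

```python
import math

def simplex_reliability(rate, hours):
    """Reliability of one non-redundant rank with a constant failure rate."""
    return math.exp(-rate * hours)

def order_three_reliability(rate, hours):
    """Probability that at least two of three identical ranks survive.
    Assumes perfect voters and independent failures (illustrative only)."""
    r = simplex_reliability(rate, hours)
    return 3 * r**2 - 2 * r**3

def maintained_reliability(rate, mission_hours, interval_hours):
    """Mission reliability when perfect repair occurs after every interval."""
    periods = mission_hours / interval_hours
    return order_three_reliability(rate, interval_hours) ** periods

RATE = 1e-4       # hypothetical failures per hour for one rank
MISSION = 10_000  # hypothetical mission length in hours

# Shorter intervals keep each period on the flat part of the curve.
for interval in (10_000, 1_000, 100):
    print(interval, round(maintained_reliability(RATE, MISSION, interval), 4))
```

Under these assumptions the mission reliability rises as the interval is
shortened, because each period ends before the redundant reliability curve
leaves its initially flat portion.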
The general problem of diagnostic testing is to provide suitable
test facilities and methods which are effective in determining whether a
failure has occurred, and to determine its location. In a redundant system
the implementation of test facilities entails many considerations, ranging
from basic system configuration to the details of circuit design. In a
conventional non-redundant system, test provisions are all too often given
only token consideration. Although the test features provided may be
ineffective or inconvenient, the diagnosis, failure location and repair of the
equipment are often made possible through the ingenuity of an experienced
technician. A redundant system similarly encumbered imposes a much more
difficult task. Thus the need for integrating system configuration and test
facilities in the initial design stages becomes extremely important.
2. Testing of Conventional Systems
The techniques for detecting a failure in a redundant system
represent a problem which is alien to the test philosophy of conventional
systems. In a non-redundant system the effect of a failure is rather
dramatic and is usually evidenced by either partial or total system failure,
or obvious changes in operational behavior. This simplifies the problem of
detecting an error, but is small consolation to the user who loses the
service of a system without warning, perhaps at some crucial moment. Total
system failure usually indicates the failure of a major function, such as
a power supply or clock generator. Changes in operational behavior and
partial failures normally provide symptoms which, when analyzed, are valuable
in converging on the failure location. In a redundant system the effect of
a non-critical failure is not evidenced by any change in system behavior.
This means that the effect of a failure does not provide gross symptoms
which may be used to indicate its occurrence or determine its location.
The solution to this unique problem is suggested through several avenues of
approach which represent diagnostic routines and implementation schemes
unique to redundant systems.
Before considering the unique demands which a redundant system
imposes on the required test facilities, it is useful to consider some
approaches which are applicable to digital systems in general. These
general approaches include waveshape monitoring and the application of
various stresses to enhance the chance of detecting present or potential
failures. The combination of general approaches with the specific
approaches to be suggested appears to offer a more inclusive repertoire of
techniques from which to choose.
In a conventional system a failure of some circuit or sub-system
normally provides an indication of its occurrence by the resultant changes
in operational behavior. These are usually designated as catastrophic
failures. Degraded components which are not sufficiently marginal to cause
circuit failure are more difficult to detect because there is no indication
of a change in system behavior. Often, however, a degraded component may
be detected at the circuit test point level by changes in normal wave-shape.
At the component level the degradation may be considered as a failure. At
the circuit level this condition represents an impending failure.
Understandably it is important to detect and repair impending failures since it
is very likely that the circuit will soon fail. This is one of the more
important aspects of periodic maintenance of non-redundant systems. Often
the system may be operated normally and the various test points monitored
to detect marginal voltages, wave shapes or rise times. This represents
a very time consuming procedure and is severely limited in effectiveness
by the number of test points which are provided. Many marginal components
are then essentially undetectable.
Another problem which often arises is when a failure in circuit
operation becomes sporadic. In this case the system may operate normally
for most of the time making the location of the fault a difficult task.
As so often happens, just as maintenance personnel are in the process of
converging on the fault location, the fault disappears and the system
operates normally. The problem here is that the fault is not present long
enough to allow an adequate diagnosis of the difficulty.
A more powerful approach for locating impending and sporadic failures
involves the application of stress to the system. This will often
precipitate a circuit failure by subjecting components to a condition which
magnifies any degradation. Consider now the two general classes of approaches
for imposing system stress--environmental and electrical. Environmental
stress may be typically sub-divided into temperature, humidity, pressure,
vibration, shock, radiation, etc. The application of one or a combination
of these environmental stresses is seen to present three main problems:
1) the size, complexity and cost of the facilities required, 2) the
difficulty of performing measurements in an alien and often dangerous en-
vironment, and 3) the possibility of subjecting components to unnecessary
stresses and thus causing unwarranted damage or destruction.
Temperature stress is perhaps the most popular approach because of
its utility in causing parameter changes in resistance, capacitance, leakage,
gain, threshold, etc. A second advantage is the small amount of additional
facilities which are required. Often, temperature stress may be conven-
iently applied by controlling the system cooling to increase or decrease
operational temperature. Component variations caused by temperature stress
often make circuit operation marginal when such changes are beyond the
normal specified design limits. Thus the degradation of a component which
has become only slightly marginal at normal operating temperature, and which
is indicative of impending failure, may be magnified by temperature stress
to precipitate
circuit failure. This method is often used, for example, in testing tran-
sistors for leakage current degradation at elevated temperatures. In a
system test the increased leakage current of degraded transistors causes
circuits to become sufficiently marginal to effect circuit failure.
The remaining types of environmental stress are difficult to impose
on a system without test facilities of vast complexity. For this reason
they are not readily amenable to system testing but find greater utility
at the component or sub-system level. A case in point is the development
of highly reliable components, i.e., by carefully controlled production
followed by extensive testing under a variety of environmental and elec-
trical conditions.
Electrical stress is a more convenient method for detecting
marginal components and impending failures. A convenient method for
stressing an entire system simultaneously is that of marginal voltage testing.
In this approach the system power supply voltages are varied to combinations
of maximum and minimum levels for which the circuits were designed. When
all defective components, modules or sub-systems have been detected and
replaced the system power supplies are returned to their nominal values.
Marginal voltage testing is often combined with simulation routines and
static and dynamic measuring techniques to provide an inclusive test program.
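The enumeration of supply combinations in marginal voltage testing may be
sketched as follows. This is an illustration only and is not part of the
original study. The supply names, the voltage limits, and the functions
`marginal_voltage_cases`, `run_marginal_test` and `example_check` are
hypothetical. A real test would replace `example_check` with a functional
check of the equipment.

```python
from itertools import product

# Hypothetical supply rails and design limits in volts (not from the report).
SUPPLY_LIMITS = {
    "+12V": (11.4, 12.6),
    "-6V": (-6.3, -5.7),
}

def marginal_voltage_cases(limits):
    """Yield every combination of minimum and maximum level for each supply."""
    names = list(limits)
    for levels in product(*(limits[name] for name in names)):
        yield dict(zip(names, levels))

def run_marginal_test(limits, system_passes):
    """Return the supply settings under which the system check fails."""
    return [case for case in marginal_voltage_cases(limits)
            if not system_passes(case)]

def example_check(case):
    # Stand-in for a real functional test: fails only when +12V is low.
    return case["+12V"] >= 11.5

failures = run_marginal_test(SUPPLY_LIMITS, example_check)
print(failures)
```

With two supplies there are four combinations to apply. The number doubles
with each additional supply, which is one reason the method is usually
automated.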
Simulation programs provide a form of electrical stress which is
seen to exercise the variety of operational functions which a system may be
required to perform under actual operating conditions. Often however, a
simulation technique may subject the system to operational speeds which are
not encountered in normal system operation. This might be accomplished by
varying the frequency of system clock generators to either increase or
decrease the speed of operation. In a spaceborne sequencer, for example,
it may be necessary to speed up the occurrence of time events by several
orders of magnitude in order to test all functions in some reasonable test
period. In other applications increasing the speed of operations to the
maximum design limit is often useful for magnifying the effect of marginal
components. For example this technique is seen to be useful in determining
degradation in capacitive coupling circuits.
A reduction in operating speed does not usually subject the system
to stress but is useful in ascertaining that some normally fast sequence
of operations is being performed correctly. Here, the reduction of clock
rate is utilized to allow the operation sequence to be conveniently monitored.
The general approaches discussed are primarily useful in precipitating
static failures which are impending or sporadic. DC failures and
catastrophic failures are usually immediately apparent from the manner in which
the system behaves. When only a portion of the system fails in the static
state it often provides symptoms which may be used in diagnosing the
location of the failure. If a failure occurs near the "front end" of a
system, the majority of outputs will usually become static. In this case
the symptoms are not sufficiently explicit to allow an adequate diagnosis.
Simulation equipment then becomes useful in determining the failure location.
This is accomplished by applying suitable signals at the various subsystem
inputs and monitoring outputs for the presence of the correct response.
3. Failure Detection in Redundant Systems
The problem of detecting a failure in a redundant system is
usually more difficult than in the conventional counterpart, because the
effects of non-critical failures do not provide gross symptoms of their
occurrence. This difficulty in diagnosing a failure is amply compensated
by the vast improvement in reliability which a redundant system provides.
Since a conventional system normally provides little indication
of an impending failure, the only available resort by which the system
quality may be diagnosed is the application of stress. It is, however, an
inconclusive test of the system's ability to perform reliably. In a
redundant system the application of stress to components and circuits for the
purpose of detecting impending failures is not of significant value because
the effects of individual failures are masked by the system configuration.
Although redundant systems are able to tolerate failures without causing
total system failure, it is often desirable to diagnose the system to detect
any internal failures. It will be shown that the application of conditions
which reduce the ability of a redundant system to withstand internal
failure acts like stress by modifying the configuration so that the failure
masking effects are removed. In this manner, failures which are present
will be indicated by the behavior of the system. The following paragraphs
will describe techniques for detecting and locating failures in redundant
systems.
An order-three, multiple-line, majority-voted redundant shift
register system will be used to demonstrate basic approaches. This is done
for ease of explanation and is not intended to suggest that the approaches
may not be extended directly to more general system configurations, or to
higher-order redundant systems. It may be noted that the testing of
redundant systems will involve a hierarchy of tests: first testing
the signal processing parts, then testing the restoring elements,
and finally testing the hardware added for the initial testing function
itself. The extent and complexity of this hierarchy will depend on the
confidence which is required of the tests and the degree of automation
desired. It appears impossible, however, that perfectly reliable
operation can ever be expected from any hierarchy of imperfect equipment
monitoring other equipment. Although these testing methods are intended
to make a significant contribution to the techniques available for testing
redundant equipment, it is expected that further work in this area will
result in further improvements. The accuracy and complexity of the tests
should be balanced to obtain efficient system operation.
Often, the problem of failure detection is directly connected
with the requirement for determining the location to facilitate
maintenance repairs. Therefore, some of the more complete testing methods will
include combined detection and location. Although failure location
techniques are usually more complex than the basic failure detection techniques,
they often include complete failure detection capability in order to locate
all failures which might exist in a redundant system. Failure location
techniques also provide effective methods to detect and locate failures
in the failure detection and location circuitry itself.
Basic failure detection will probably be most useful as a
verification technique to indicate that at least a major portion of a
redundant system is failure free. This will assure that the failure
protection which has been designed into a redundant system is available to
prevent system failure. Simple failure detection techniques are also
expected to be a preliminary technique which will indicate if any failures are
present in a maintained redundant system, so that further corrective
action may be undertaken. It is important that all failures be detectable
in a maintained redundant system, so that failures are not allowed to
accumulate and degrade system reliability.
4. Failure Location in Redundant Systems
If a failure is known to exist in a redundant system, it is
often desirable to obtain further information concerning the location of
the failure. This is generally required so that the module containing the
failure may be repaired or replaced. Although it is very desirable to be
able to detect any failure to permit maintenance, it is only necessary to
locate failures to within the smallest replaceable module. Therefore, the
requirements of failure detection depend strongly on the contents of the
smallest replaceable module. If entire subsystems are contained in a module,
then each subsystem could be provided with independent failure detection
hardware. This would be sufficient to locate failures within the replace-
able module. It is possible that the requirement for test points at each
replaceable module to permit failure location may in turn determine the
practical size and contents of the module. If the test points and con-
nections occupy a large space compared to the basic module, then the volume
efficiency is rather poor, and a larger replaceable module might be more
practical.
If repairs are expected to be made while the system remains in
operation, then the module which contains the failure must not include the
remaining replications of that function. This is necessary to permit the
system to operate while the module containing the failure is removed.
If the entire module is to be replaced when it contains a failure, then the
failure location technique must be sufficiently accurate to determine which
module contains the failure. This module may then be replaced without
interruption of normal system operation. Maintained redundant systems
which are continuously monitored and repaired require a combined failure
detection and location technique which may be applied without altering the
operational characteristics of the system. It will be shown that relatively
complete testing may be accomplished during system operation. This is pos-
sible because the most frequent and harmful failures usually cause signal
disagreements at the inputs to the voters. These signals may then be
compared, either automatically or with the use of test points, to detect
and locate these failures. Certain system configurations are amenable to
controls which allow complete failure detection and location with access only
to the signals at the inputs to the voters. More generally applicable
techniques require access both to the voter inputs and outputs. These tech-
niques, as well as the implementation circuitry required, are described in
the following paragraphs.
5. Signal Comparison in Maintained Systems
The location of a failure in a conventional system requires
that a handbook be provided to indicate the correct wave shape and binary
sequence to be expected at each location. This is in addition to sim-
ulation equipment which may be required to place portions of the system
into dynamic operation. The redundant system masks the effect of individ-
ual failures and thereby makes the task of detecting their occurrence more
difficult. It will be shown, however, that the masking effects of a
redundant configuration may be conveniently removed by controlling the
outputs of the signal processors. This is essentially a gross system
approach whereby the occurrence of a failure is indicated by forcing the
system to assume various vulnerable configurations. If the system is
allowed to either operate normally, or in some configuration for which
all operations are performed correctly, the detection and location of
failures may be conveniently accomplished by examining replicated elements
for signal disagreement.
In many respects, the location of failures in a redundant
system is a much easier task than in the conventional system counterpart.
This is because an improper signal may be determined by comparison with
its replicated versions. If a redundant system is operating correctly
in an overall system sense, then the correct signal of each monitored
element is available at least at a majority of associated test points.
This is seen to eliminate the tedious task of monitoring elaborate wave
shapes and sequences. Maintenance personnel are then presented with a
system which, in principle, contains an integral handbook of normal
signals to be expected at the various locations. The system may be permitted
to operate normally, without simulation equipment, performing operations
whose binary sequence at any single location is so complex that one could
not hope to describe them adequately in any handbook. This suggests the
possibility that maintenance personnel need not be completely familiar
with the detailed operation of the system.
The determination of an error could be provided by a difference
detector in combination with a suitable indicator. A technician would be
required only to monitor the various test points in some prescribed sequence
until arriving at the location of a signal disagreement. He would not be
required to possess any special knowledge of what constitutes a correct or
incorrect waveshape, binary sequence or repetition rate. Also, most
difference detector devices which might be employed will signal any large
departure from normal signals, and may include memory to indicate the location
of transient or sporadic failures. From this we may conclude that the
training requirements for maintenance personnel may be appreciably reduced,
thus providing redundant systems with a distinct maintenance cost advantage
over the more conventional counterpart. This attribute alone might become
a significant factor in evaluating the total utility of a redundant system
which is periodically maintained.
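The action of a difference detector with memory may be sketched in software
as follows. This is only an illustration under stated assumptions and does
not describe the circuitry of this report. It assumes binary signals, an odd
order of redundancy, and that the majority of the replicated signals at a
test point is correct. The names `minority_ranks` and `DifferenceDetector`
are hypothetical.

```python
def minority_ranks(samples):
    """Return the ranks whose sample disagrees with the majority value.
    `samples` maps a rank label to a binary value at one test point."""
    values = list(samples.values())
    majority = 1 if sum(values) * 2 > len(values) else 0
    return {rank for rank, value in samples.items() if value != majority}

class DifferenceDetector:
    """Compares replicated signals and remembers any disagreement seen."""

    def __init__(self):
        self.latched = {}  # test point -> ranks that have ever disagreed

    def observe(self, test_point, samples):
        bad = minority_ranks(samples)
        if bad:
            # The memory keeps a transient disagreement visible afterwards.
            self.latched.setdefault(test_point, set()).update(bad)
        return bad

detector = DifferenceDetector()
detector.observe("file 2", {"A": 1, "B": 1, "C": 1})
detector.observe("file 2", {"A": 0, "B": 1, "C": 0})  # sporadic error in rank B
detector.observe("file 2", {"A": 0, "B": 0, "C": 0})
print(detector.latched)
```

The latched record identifies the rank and test point of the disagreement
even though the signals agree again on the following sample, and no knowledge
of the correct sequence is required of the observer.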
In order to reduce the total system failure rate, periodic
maintenance must be conducted at a sufficiently short interval so that
individual failures are not so probable that system reliability is appreciably
degraded. In addition, if system failure occurs it might be necessary to
employ simulation equipment to place portions of the system back into
operation. The advantage of not requiring simulation equipment to locate
individual failures is an important feature of a maintained redundant system.
Thus the function of periodic maintenance is not only to assure high system
reliability during the life of the equipment, but also to eliminate the
requirement for simulation equipment to locate failures.
Thus far in our discussion of maintained redundant systems, it has
been implied that the signal comparison equipment is usually externally
applied to the appropriate test points in much the same manner as an
oscilloscope or voltmeter is used in a conventional system. As indicated
previously, it may be undesirable to provide these test points at every
signal processor and voter output in the system. This may be due to the
lack of access to the signals, the physical size of the test points in
comparison to the circuitry being monitored, or the signal loading caused
by test point leads. In some applications it may therefore be desirable
to provide error detection and display as an integral part of the system.
Integral signal comparators may be desirable, for example, in a maintained
redundant system which is continuously monitored during operation and in
which each failure is repaired as soon as it is detected. This maintenance
philosophy allows a much higher system reliability than is available with
periodic maintenance. With proper design it appears feasible to remove and replace
defective modules without disturbing the operation of the system.
Since signal comparators will indicate only when signal
disagreement occurs during the normal system operation, more extensive tests are
required to detect and locate such failures as might occur in signal
processors which are not used in some modes of system operation, some
of the failures in voters, and failures that might occur in the control and
signal comparison circuitry. This suggests a maintenance philosophy of
continuous monitoring combined with periodic complete testing as follows: Signal
processor outputs are continuously monitored during the operation of the
system for the indication of the more frequent and harmful failures which
cause incorrect signals. These failures are located and may be repaired
without interrupting normal system operation. Periodically the normal
operation of the system is shut down to allow the system to be completely
exercised and the otherwise undetectable failures to be located and repaired.
In contrast, the periodically maintained system is allowed to accumulate
failures, even though they may be easily detectable, until the end of a
scheduled maintenance period. Continuous monitoring and repairing is
therefore a very powerful technique for detecting and repairing most failures
as they occur, without seriously impairing the ability of the system to
operate continuously while individual failures are repaired.
B. Singular Rank Testing
1. Detection of Signal Processor Failures
An obvious method for detecting failures in a typical redundant
system is to separate and reconnect the replicated parts to create
individual, independent systems. Each system may then be separately diagnosed
for the presence of failures in the conventional manner. This would require
that the basic system be provided with a large number of special switching
circuits which accomplish a separation. Such an approach is somewhat
impractical because of the expense, complexity and reliability degradation
which the additional circuitry and wiring would impose. As will be shown,
a much simpler means is available to provide a pseudo-separation of
replicated systems without requiring an elaborate switching mechanization.
As an example, consider the simple redundant configuration shown
in figure 17. Each of the complete replications of the non-redundant system
is hereafter referred to as a rank of the system. Each rank normally
consists of the components of the non-redundant equivalent system, separated
by the majority-voting restorers. Each of the signal processing elements
(indicated by blocks) within the same rank is designated with the same
capital letter; each of the majority voting restorers (indicated by circles)
within the same rank is designated with the same lower case letter.
Figure 17 Singular Rank Testing
The corresponding replications of the same signal processors are
hereafter referred to as being on the same file of the system. Each element
in the file normally performs the same function, and is designated with the
same number. Each signal processor file corresponds to an individual function
of the non-redundant system. If a signal processor file has a restoring file
associated with it, the restoring file may be assigned the same number.
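The rank and file nomenclature may be summarized by a small labelling
sketch. It is offered only as an aid to the reader. The function name
`label_elements` and the choice of which files carry restorers are
hypothetical.

```python
from string import ascii_uppercase

def label_elements(order, files, restored_files):
    """Build rank and file labels: capital letters for signal processors,
    lower case letters for voters, and one shared number per file."""
    grid = {}
    for rank in ascii_uppercase[:order]:
        for number in range(1, files + 1):
            has_voter = number in restored_files
            grid[(rank, number)] = {
                "processor": f"{rank}{number}",
                "voter": f"{rank.lower()}{number}" if has_voter else None,
            }
    return grid

# An order-three system of four files with restorers after files 2 and 4.
grid = label_elements(order=3, files=4, restored_files={2, 4})
print(grid[("B", 2)])
```

Elements sharing a letter lie in the same rank, and elements sharing a
number lie in the same file.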
It will be assumed that the order of redundancy is uniform
throughout the portion of the system which is being tested and that the
only interconnections between ranks occur at the inputs to restorers.
Singular rank testing will assume that there are no restrictions on system
size, configuration, or uniformity of direction of signal flow. These
characteristics are chosen to be compatible with current redundancy synthesis
techniques.
Suppose that the control lines shown in figure 17 provide a
means of causing each output of the rank signal processors to assume
either the "1" state, the "0" state or "N" (normal operation). In effect,
the outputs of the A and B rank blocks have been forced to assume definite
DC failure states. The mechanization to accomplish this is described in
part D of this section, and will be shown to entail only slight modification
to the normal circuitry. Consider the effect of causing all the A and B
rank signal processors to assume static complementary states, allowing
the C rank signal processors to operate normally, and allowing the system
to operate with its normal inputs. Under the condition that
all A and B blocks are in complementary states the input to each voter
consists of "1", "0" and the output of the preceding C rank signal
processor. This means that the dynamic signal predominates and causes this
signal to appear at the output of the voters. If all voters operate cor-
rectly, the system is equivalent to a non-redundant system, and may be
completely exercised in the same manner as the non-redundant system
to verify that all signal processing blocks in rank C are functioning
correctly. This test should also yield identical results if the
complementary states of the A and B rank blocks are reversed. If an
incorrect final output results for both tests it indicates that at least
one failure is present in the C signal processors, the c voters or
combinations of both. If only one test is successful, then a failure is
evidently present in one or more of the c voters.
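The C rank test and the interpretation of its two results may be
illustrated by a small simulation. This sketch makes several assumptions
which are not drawn from the report: each signal processor is modelled as a
single-bit inverter, a signal processor failure is modelled as a stuck
output, the test control is assumed to override the output even of a failed
processor, and a voter failure is modelled by substituting another function
for the majority function. All function names are hypothetical.

```python
def majority(a, b, c):
    return 1 if a + b + c >= 2 else 0

def run_system(x, files, control, stuck=None, voters=None):
    """Propagate one input bit through `files` stages of an order-three system.
    control: rank -> 0, 1 or "N" (forced output state or normal operation).
    stuck:   (rank, file) -> value, modelling a failed signal processor.
    voters:  (rank, file) -> function replacing that majority voter."""
    stuck = stuck or {}
    voters = voters or {}
    lines = {rank: x for rank in "ABC"}          # the input fans out to all ranks
    for f in range(files):
        outs = {}
        for rank in "ABC":
            value = 1 - lines[rank]              # processor modelled as an inverter
            value = stuck.get((rank, f), value)  # internal failure, if any
            if control[rank] != "N":
                value = control[rank]            # test control forces the output
            outs[rank] = value
        lines = {rank: voters.get((rank, f), majority)(outs["A"], outs["B"], outs["C"])
                 for rank in "ABC"}
    return lines

def reference(x, files):
    """Output of the equivalent non-redundant system."""
    for _ in range(files):
        x = 1 - x
    return x

def check_rank_c(files, **faults):
    """Run both complementary settings; True means the C output was correct."""
    results = []
    for a, b in ((1, 0), (0, 1)):
        control = {"A": a, "B": b, "C": "N"}
        results.append(all(
            run_system(x, files, control, **faults)["C"] == reference(x, files)
            for x in (0, 1)))
    return results

print(check_rank_c(3))                                             # no failures
print(check_rank_c(3, stuck={("C", 1): 0}))                        # C processor failed
print(check_rank_c(3, voters={("C", 0): lambda a, b, c: b & c}))   # c voter failed
```

In this model a failed C rank signal processor causes both tests to fail,
whereas the particular voter failure chosen (the A input of one c voter
behaving as a permanent "0") causes only one of the two tests to fail.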
Success of either of the above tests is sufficient to verify that
all C rank signal processors are failure free. It should be noted that the
presence of a correct output for both complementary test conditions does
not verify with certainty that the c voters are failure free. This is
because each voter was subjected to less than the maximum possible number of
input signal combinations. Consider the various combinations of input signals
and the correct response of a three input majority voter in the table
below. States 1 and 2 represent the case when A="1", B="0", and C="N"; states
3 and 4 represent the case when the static signals on A and B are reversed.
All signals are the same for states 5 and 6. States 7 and 8 occur when
C disagrees with the other two inputs.
State No.    A    B    C    Output
   1)        1    0    1      1
   2)        1    0    0      0
   3)        0    1    1      1
   4)        0    1    0      0
   5)        0    0    0      0
   6)        1    1    1      1
   7)        1    1    0      1
   8)        0    0    1      0
Only the first four of the eight combinations were verified by the test
conditions described. States 5 and 6 are trivial, however, since they
contain the combinational states of 2, 4 and 1, 3 respectively. If a
majority voter makes a "1" output decision for inputs consisting of two
"1"'s and a "0", it will make the same decision for an input of three "1"'s.
Similarly, if a majority voter makes a "0" output decision for inputs
consisting of two "0"'s and a "1", it will make the same decision for an input
of three "0"'s. From this it appears reasonable to assume that if the
majority voter operates correctly for the first four states it will operate
correctly for states 5 and 6. Thus the combinations which have not been
tested and hence explicitly verified are states 7 and 8.
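The accounting of voter input states may be checked by direct enumeration.
The sketch below is illustrative only. It treats the two unanimous inputs
as implied by the tested states, following the argument just given, and
lists the combinations which remain unverified after the C rank test.

```python
from itertools import product

def majority(a, b, c):
    return 1 if a + b + c >= 2 else 0

# Inputs (A, B, C) seen by a voter while rank C is tested: A and B are
# forced to complementary states and C takes both values in normal operation.
applied = {(a, 1 - a, c) for a, c in product((0, 1), repeat=2)}

all_inputs = set(product((0, 1), repeat=3))
unanimous = {(0, 0, 0), (1, 1, 1)}      # states 5 and 6, taken as implied
not_verified = all_inputs - applied - unanimous

for a, b, c in sorted(not_verified):
    print(a, b, c, "->", majority(a, b, c))
```

The two remaining combinations are those in which C disagrees with both A
and B, which correspond to states 7 and 8.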
The tests conducted thus far have verified that all C rank blocks
operate correctly and that the voters operate correctly for six of the eight
possible input signal conditions. The A and B ranks may be similarly tested
with the result that the correct operation of all signal processing blocks
may be verified. This test philosophy is seen to be an approach for
isolating each rank of a multiple line configuration and thus determining the
presence of any failures which would jeopardize the ability of the system
to mask out future failures. The ranks are not operated simultaneously and
independently, but rather one rank at a time is effectively removed from
the multiple line configuration and separately diagnosed for the presence
of failures.
The success of all of these tests has verified the proper operation
of all signal processors. These tests have not completely verified the
condition of the voters as was described by the example of the C rank tests.
However, the following voter input-output operation has been verified with
certainty: All voters will make correct decisions if the input from the
rank in which the voter is located agrees with at least one of the other
inputs.
The condition which has not been verified is whether
a voter will make a correct decision when the input from the rank in which
the voter is located is in disagreement with the majority of the remaining
inputs (both remaining inputs for order three redundancy). It should be
noted, however, that the complete set of singular rank tests will result in
the application of all possible combinations of inputs to the voters. These
tests are therefore sufficient to verify that any undetectable voter failures
cannot combine with further single failures to cause an order three system
to fail.
There are, however, a very limited number of component failures which
can occur in the majority voter which cannot be detected with singular rank
testing. These involve the failure of two of the input diodes for the three
input D-TL voter. If the voter has a conventional minimum design, singular
rank testing will indicate if either of these diodes is shorted. Due to
the additional input isolation, the occurrence of these input diode shorts
cannot be detected in the isolated input voter which has been shown in figure
1%. If either of these undetectable diode shorts has occurred in the isolated
input voter, the result is that the voter output is a "1" whenever the input
from the rank in which the voter is located is a "1". The majority function
is performed for all other inputs. The occurrence of either one of these
diodes being open cannot be detected for either the minimal design or the
isolated input voters. The result of this condition is that the output
of the isolated input voter is "0" whenever the input from the rank in
which the voter is located is a "0"; if the input to a minimal design voter
is a "1", the voter output is a "1". If one of the diodes shorts and the
other opens, then the voter output is controlled by the input from the rank
in which the voter is located, although the diode short could be detected if
the minimal design voter is used. Therefore the existence of undetectable
failures cannot introduce additional errors, but may cause signal processor
errors to propagate through the restorers.
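The effect of these undetectable failures can be illustrated with a behavioral model. The sketch below is an assumption-laden simplification, not a circuit simulation: the mode names "short", "open" and "short+open" are hypothetical labels for the isolated input voter behaviors described above, and the voter is otherwise treated as an ideal majority element. It checks that a degraded voter still agrees with the majority whenever its own-rank input agrees with at least one other input, so that the only wrong decisions occur when the own-rank input is itself in error.

```python
from itertools import product


def majority(a, b, c):
    """Ideal three-input majority vote on binary inputs."""
    return 1 if a + b + c >= 2 else 0


def degraded_voter(own, x, y, mode):
    """Behavioral model of an isolated input voter with an undetected failure.

    own  -- input from the rank in which the voter is located
    x, y -- inputs from the two other ranks
    mode -- hypothetical label: "none", "short", "open" or "short+open"
    """
    vote = majority(own, x, y)
    if mode == "short":        # output is "1" whenever own-rank input is "1"
        return 1 if own == 1 else vote
    if mode == "open":         # output is "0" whenever own-rank input is "0"
        return 0 if own == 0 else vote
    if mode == "short+open":   # output is controlled by the own-rank input
        return own
    return vote


def wrong_decisions(mode):
    """Input states (own, x, y) where the degraded voter departs from majority."""
    return [s for s in product((0, 1), repeat=3)
            if degraded_voter(*s, mode) != majority(*s)]


for mode in ("none", "short", "open", "short+open"):
    print(mode, wrong_decisions(mode))
```

In every mode the listed states are ones in which the own-rank input disagrees with both other inputs, which is consistent with the statement that such failures introduce no additional errors but can pass signal processor errors through the restorer.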
The above analysis has shown that the occurrence of undetectable
failures tends to cause the output of the voter to be dominated by the
signal from the rank in which it is located. In the worst possible case
(complete dominance caused by one diode open and the other diode shorted
in every voter in every restoring file when these failures are undetectable),
the restorers have been effectively replaced by conductive paths from the
output of the signal processor in the previous file to the input of each
following signal processor in the same rank. The result is equivalent to
eliminating the restoring file completely (except that the reliability of the
signal processors is reduced by the additional voter circuitry). Although
it is extremely improbable that such conditions would predominate in a
system recently constructed from completely tested parts, the system becomes
more vulnerable to further failures if they are allowed to accumulate.
2. Detection and Location of Voter Failures
It may be desirable to have some means for detecting the
presence of any failures within the system. One example in which some
method of complete testing is desirable is a maintained system which is
expected to operate reliably for extended periods of time. If such a method
is convenient, signal comparison may be combined with singular rank testing
to detect and locate all voter failures. Since the combined singular rank
tests result in the application of all possible inputs to the voter, the
outputs of all voters in a restoring file may be compared for agreement while
the inputs are applied. All voters are failure free if no output
disagreements occur while all combinations of input signals are applied.
Since the only purpose of reversing the complementary states of the
two ranks not being tested in an order three system was to gain additional
information concerning the voters, voter comparison testing eliminates the
need for interchanging the complementary states associated with each rank
test. This requires, however, that a systematic method be used to assure
that the complete set of tests results in the application of all possible
combinations of inputs to the voters, except the trivial cases when all
inputs are the same. This condition will be met if the following rule is
followed during singular rank testing: As each of the ranks is completely
exercised as an individual non-redundant system, the particular pair of
complementary DC states of the remaining two signal processors is chosen so
that the state of either rank does not duplicate its DC state during any
previous testing of the other ranks. Since the choice of the pair of
complementary DC states for the testing of the first rank is arbitrary,
either of two alternate sequences may be used for the complementary DC
states; the states in one sequence will be complements of those in the other.
Thus it may be shown that only three tests (one for each rank) are required
for complete singular rank testing with signal comparison. If each test is
successful in demonstrating that the system will perform the entire set of
functions for which it was designed, all signal processors are verified to
be failure free and the voters are capable of transmitting a correct dynamic
signal for some of the possible input states. If, in addition, all voters
make the same decision while the proper sequence of controls is applied
during the above tests, the voters are verified to be failure free.
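The rule for choosing the DC states can be checked by enumeration. The following Python sketch is one possible reading of the rule and rests on stated assumptions: ranks are labeled a, b, c, "N" marks the rank carrying the dynamic signal, and the function names are hypothetical. It builds a three-test sequence in which no rank repeats a DC state it held in an earlier test, then confirms that the tests apply all six input combinations other than the two trivial cases.

```python
from itertools import product

RANKS = ("a", "b", "c")  # assumed rank labels


def build_schedule(first_pair=(0, 1)):
    """Return three tests; each maps a rank to "N" (dynamic) or a DC state."""
    used = {r: set() for r in RANKS}  # DC states each rank has already held
    schedule = []
    for tested in RANKS:
        others = [r for r in RANKS if r != tested]
        # The first pair is arbitrary; later pairs must repeat no earlier state.
        candidates = [first_pair] if not schedule else [(0, 1), (1, 0)]
        for pair in candidates:
            states = dict(zip(others, pair))
            if all(states[r] not in used[r] for r in others):
                break
        else:
            raise ValueError("no complementary pair satisfies the rule")
        for r in others:
            used[r].add(states[r])
        states[tested] = "N"
        schedule.append(states)
    return schedule


def applied_inputs(schedule):
    """All (A, B, C) voter input states applied as the dynamic signal toggles."""
    seen = set()
    for test in schedule:
        for signal in (0, 1):
            seen.add(tuple(signal if test[r] == "N" else test[r] for r in RANKS))
    return seen


NONTRIVIAL = set(product((0, 1), repeat=3)) - {(0, 0, 0), (1, 1, 1)}
for pair in ((0, 1), (1, 0)):
    print(pair, applied_inputs(build_schedule(pair)) == NONTRIVIAL)
```

Both choices of the first pair give complete coverage in three tests, and the two resulting sequences are complements of each other.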
3. Detection and Location of Control and Comparator Failures
The basic concepts of singular rank testing may be extended
to verifying that the controls used for singular rank testing are operating
correctly. Rather than allowing each rank to operate individually, each
rank is individually controlled by the singular rank testing controls. If
the controls are working properly, a signal comparison on the output of
each signal processing file should indicate a disagreement whenever the
dynamic signal on the remaining ranks is in disagreement with the DC state
of the rank being controlled. In the case where difference detectors are
used on the output of all signal processor files, this testing will also test
these difference detectors. The detectors should indicate a difference at
each signal processor file whenever the signal on the controlled rank
disagrees with the dynamic signals. If the signal comparison of the signal
processors is accomplished while complementary DC states are applied to
each pair of ranks, as described above, all possible input combinations
involving disagreements are applied, and the difference detectors should
give a continuous indication. If signal disagreements are noted for each
signal processing file while all of the ranks are being controlled (either
individually, in pairs, or for all possible input combinations involving
disagreements, but not when the entire system is allowed to operate without
signal processor failures) then the associated singular rank control
circuitry is verified to be failure free.
4. Summary
It may be concluded that singular rank testing techniques are
a very powerful tool for verifying that a redundant system does not contain
internal failures. This testing would be valuable for use in acceptance
tests which verify that all the reliability designed into a redundant system
is available, or as the failure testing for continuously monitored and
repaired systems with periodic complete verification, or in a system which
is only periodically diagnosed to determine if any repairs are needed. The
basic singular rank testing is a simple and effective method to allow a
redundant system to be tested as if it were a non-redundant system to verify
that all signal processors are operating correctly, and that the restorers
will introduce no additional errors. This is equivalent to verifying that
an order three system is not vulnerable to single failures. Basic singular
rank testing techniques may be combined with signal comparison to detect and
locate failures which may exist in the signal processors, the restorers, the
control equipment, and any signal processor difference detectors.
Failure detection and location are often directly associated
problems; failure location techniques are also effective failure detection
techniques when they are available. It is expected that basic singular
rank testing will be used as an effective and efficient technique for
verifying that a redundant system is nearly failure free for regularly
scheduled maintenance, or for relatively simple acceptance tests. The more
complete detection and location techniques are expected to be used for the
more thorough maintenance checks where any failures would be repaired, or for
complete final tests after assembly. Signal comparison on all signal
processor outputs may be used to continuously monitor and locate most failures
in a continuously maintained system. These tests can be designed as part of
almost any majority voted, multiple line system with a uniform order of
redundancy throughout the portion being tested. No special signal simulation
equipment is required, except the normally required inputs. The
equipment required for the tests is described in more detail in part D of
this section.
C. Interwoven Rank Testing
1. Complete Failure Detection
In some systems it may be desirable to completely diagnose a
redundant system without the use of the signal comparison and failure
location technique described above. In some cases, it is possible to perform
this diagnosis without the requirement for any of the test points
necessary for signal comparison. One such technique, which will be described
in the following paragraphs, is referred to as interwoven rank testing.
It represents an extension of the singular rank testing, since the signal
paths are interwoven between the ranks to form an equivalent non-redundant
system in which the signal is switched from one rank to another at the
restoring files. This is possible only if the system configuration has a
sufficient degree of regularity. The example will assume that the system has
restorers on the output of every signal processing file, and that these files
may be assigned odd and even numbers in such a manner that odd files receive
inputs only from even files, and likewise that even files receive inputs
only from odd files. These restrictions are in addition to the assumptions
on which singular rank testing is based. It will also be shown that the
controls used for failure detection may be used to locate voter failures
without requiring test points or difference detectors on the output of the
voters. Comparison of signal processor outputs is sufficient to continually
monitor signal processors and locate all voter failures.
Shown in figures 18 and 19 are six replications of the previously
discussed configuration, with the exception that the two control lines for
each rank individually determine the state of the odd and even numbered
signal processors. If the two control lines for each rank were connected,
the system would be identical to the one used in describing singular
rank testing. Consider that the control lines and associated signal
processors are placed in the following states: AO="0", AE="1", BO="N", BE="0",
CO="1", CE="N", as shown in figure 18a. If an input signal is applied to
the first file of signal processors, the signal flow will take the path
shown by the arrows. This is because the two remaining signal processors
in each file have been placed in complementary static states. If all signal
processors and voters in the path operate correctly the final output of the
Nth processor (NC) will be the correct output signal. Reversing the states
of control lines AO, AE, BE, CO should also provide the same result since
this causes the pairs of signal processors in each file to assume the
opposite complementary condition. The system may be completely exercised
as a non-redundant system for either of the above DC states.

Figure 18 Interwoven Rank Testing (Figures 18a, 18b, 18c)
Figure 19 Interwoven Rank Testing (Figures 19a, 19b, 19c)
Consider now the various combinations of input signals which the
1c voter was subjected to as a result of the above tests. An examination
of figure 18a reveals that these combinations are as follows:

State No.   A   B   C   Output
3)          0   1   1   1
8)          0   0   1   0
7)          1   1   0   1
2)          1   0   0   0

Note that the tests have verified that the voter operated correctly for the
two signal states which could not be confirmed by the basic singular rank
tests. This was the uncertain condition that a voter will make a correct
decision when the signal processor preceding it in the same rank is in
disagreement with the other two signal processors. Thus far our tests have
verified the above uncertain condition for all odd numbered c rank voters,
as well as all even numbered b rank voters. A total of four different input
states have been verified for each of these voters. The remaining voters
in these ranks may be similarly verified by the test conditions shown in
figure 18b. The a rank voters are verified by the arrangement shown in
figure 18c and figure 19a. This is seen to be a mirror image extension of
the B-C rank tests.
At this point in the tests, the correct operation of all signal
processors has been verified. The input signal combinations to which
the voters were subjected are tabulated as follows:

Rank a voters    Rank b voters    Rank c voters
A  B  C          A  B  C          A  B  C
0  1  1          0  1  1          0  1  1
0  0  1          0  0  1          0  0  1
1  1  0          1  1  0          1  1  0
1  0  0          1  0  0          1  0  0
                 1  0  1
                 0  1  0

Note that the b rank voters have been verified for six of the eight possible
signal combinations while the a and c ranks were examined for only four.
Since the signal condition of all "1"s or all "0"s was previously shown to
be trivial, it is evident that the b rank voters have been completely tested
for proper operation under all combinations of input signals. The reason
that only the b rank voters have been completely verified, and not the a or
c rank voters, is that the b rank voters provided a common
signal path in the tests involving the c rank voters and the a rank voters.
The a and c rank voters may be completely verified by the tests shown in
figures 19b and 19c. This is seen to cause the dynamic signal path to be
interwoven between the a and c ranks.
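The coverage argument can be restated as a small bookkeeping exercise. The sketch below is illustrative: it takes the tabulated combinations as data, assigning the two additional rows to the b rank voters as the text indicates, and uses hypothetical names. It reports which non-trivial input combinations remain to be applied to the voters of each rank before the tests of figures 19b and 19c.

```python
from itertools import product

# All input states (A, B, C) except the two trivial unanimous cases.
NONTRIVIAL = set(product((0, 1), repeat=3)) - {(0, 0, 0), (1, 1, 1)}

# Combinations tabulated in the text after the tests of figures 18a-18c and 19a.
COMMON = {(0, 1, 1), (0, 0, 1), (1, 1, 0), (1, 0, 0)}
tested = {
    "a": set(COMMON),
    "b": COMMON | {(1, 0, 1), (0, 1, 0)},  # extra rows belong to rank b
    "c": set(COMMON),
}


def remaining(rank):
    """Non-trivial input states not yet applied to the voters of a rank."""
    return sorted(NONTRIVIAL - tested[rank])


for rank in ("a", "b", "c"):
    print(rank, len(tested[rank]), "applied; remaining:", remaining(rank))
```

Under these assumptions the a and c rank voters each lack the states (0, 1, 0) and (1, 0, 1), which are the states the a-c interwoven tests are intended to supply.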
Interwoven rank testing may therefore be used as an all inclusive
procedure for detecting any failures of voters or signal processors without
requiring access to any test points within the system. The system is reduced
to sets of equivalent non-redundant systems by appropriate controls. It is
then completely exercised and tested to determine if all functions are
performed correctly. The success of all tests verifies that all signal
processors and voters are failure free. If any of the tests result in an
incorrect output, then some failure is present in the system. The detection
of a failure gives very little information concerning its location within
the system.
Although interwoven rank testing does not require access to
test points within the system, it is a more elaborate approach which requires
a degree of regularity in the system configuration as well as the
establishment of twelve separate test conditions for an order three system,
instead of the three required for singular rank testing and voter signal
comparison. The system should be completely exercised for each of these tests
to verify that the system is failure free if all tests are successful.
2. Failure Detection and Location for Maintenance
The alternate file controls described above may be used to
detect and locate failures during normal system operation. Signal
comparators are required only on the output of every signal processing file.
If a difference detector is integrally connected with each processor
file, then the correct operation of the signal processors may be
continuously monitored for maintenance purposes. If only test points are
available, they may be periodically tested for signal disagreement. Any
disagreement on the output of a signal processor will indicate that there
is a failure in that signal processor or the voter which precedes it. This
failure may be repaired during system operation if the other replicated
signal processors and voters in that file continue to operate correctly. If
a module consists of one signal processor and the voter which provides its
input, then repair is accomplished by replacing that module. This procedure
is useful for detecting and locating failures which cause errors, but is
not sufficient for determining the location of some failures within the
voters. If all signal processors are failure free, the voter portion of
the modules may be completely tested by imposing various combinations of
signals at the voter inputs and examining the associated signal processor
outputs for signal disagreement. To locate all possible voter failures, it
is necessary to provide a means of examining signal processor outputs while
subjecting the associated voters to the various combinations of input signals.
This may be accomplished by controlling separately the odd and even files of
the system or sub-system under test, as described in the previous paragraphs
and illustrated in figure 18. For example, suppose that the odd files are
allowed to operate normally and that each one of the three signal processors
in the even files is in turn placed in each of the static DC states. The
outputs of the odd files are monitored for signal disagreement during each
of the successive tests. Any disagreement on the output of an odd file
signal processor will indicate that there is a failure in the voter which
provides the input to that processor. Similarly, the outputs of the even
files are monitored for each of the successive tests. Signal disagreement
should be indicated whenever the control signal disagrees with the correct
signal on the other processors in that file. If this indication does not
occur, then either the control to that file is not effective, or there is a
failure in the difference detector. The above testing is then repeated with
the role of the odd and even files interchanged, each successive test
examining the signal processors for disagreement. With proper design, any
failures in the voters, the difference detectors, or the control hardware
may be repaired while the system is in operation. Removal or disablement
of one replicated voter or processor will not seriously jeopardize system
reliability if the remaining replications of voters and processors continue
to operate correctly.
D. Circuit Implementations
1. Control Circuitry
Consider now the mechanization for controlling the output
of several signal processors with a single control line. A typical signal
processor output is shown in figure 20. The circuitry shown is seen to be
in the usual form of D-TL NAND gates. The base return resistor may be
connected to the emitter ground return if the associated transistor is
representative of the low leakage silicon devices found in integrated
circuitry. Since this resistor is normally connected to ground by a discrete
connective path, it is a relatively simple matter to provide it with a
separate external connection.
Figure 20 Signal Processor Output Control
Suppose further that the base return resistor is chosen to be equal to or
less than RA. If the resistor is connected to ground potential the circuitry
will operate normally. If it is connected to the +E supply, QO will conduct
and saturate regardless of the signals present on the inputs 1, 2, - - - N.
This is seen to be the condition where the control line potential forces the
signal processor output to assume the "0" state. If the control line is
connected to an equal potential of opposite polarity (-E), transistor QO will
be cut off, thus causing the output to assume the "1" state regardless of the
signals present on inputs 1, 2, - - - N. The method described to implement
the required control function is one of several possible approaches. It is an
approach which represents a simple modification to existing circuitry and
requires only a single control line which is grounded in normal operation.
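The logical effect of the single control line can be summarized in a few lines of code. The following Python sketch models only the logic level behavior described above, not the circuit itself; the names `processor_output`, `"GND"`, `"+E"` and `"-E"` are hypothetical labels chosen for illustration, and the signal processor output stage is assumed to be an ideal NAND gate.

```python
def nand(inputs):
    """Ideal NAND of binary inputs: "0" only when every input is "1"."""
    return 0 if all(inputs) else 1


def processor_output(inputs, control="GND"):
    """Logic-level model of a signal processor output with one control line.

    control -- "GND": normal operation (output follows the NAND function)
               "+E":  output transistor saturated, output forced to "0"
               "-E":  output transistor cut off, output forced to "1"
    """
    if control == "GND":
        return nand(inputs)
    if control == "+E":
        return 0
    if control == "-E":
        return 1
    raise ValueError(f"unknown control state: {control!r}")


# With the control line grounded the gate operates normally; otherwise the
# output is held in a static DC state regardless of the inputs.
for control in ("GND", "+E", "-E"):
    print(control, [processor_output(bits, control) for bits in ((1, 1), (0, 1))])
```

Connecting one such control line to every signal processor of a rank gives the three states used in rank testing: normal ("N"), static "0", and static "1".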
Another alternative requires control of both the base return line
and the emitter ground line, but does not restrict the value of the base
return resistor and does not require a negative voltage supply. The
same method described above is used to cause the "0" output, i.e., to connect
the control line to a voltage which is sufficiently positive to cause
the output to saturate. For most circuits, +E will be of sufficient magnitude
for this purpose. To effect a "1" output, the emitter ground line
may be removed, so that the output cannot be a low impedance to ground,
regardless of input signals. This approach may be particularly useful when
it would be undesirable to reduce the base return resistor below RA, or in
circuits where the base input diode, DB, is replaced by an emitter follower
to increase base current drive. This approach places little restriction on
circuit configuration or values and the test power supplies, but requires two
separate control lines, both of which are grounded in normal operation.
2. Difference Detector Circuit
Shown in figure 21 is a typical discrete component difference
detector which may be utilized in the foregoing tests. The output level
is a logical "0" only if all inputs are identical. Any disagreement of
input signals will cause the first transistor to conduct and thus cause
the second transistor to assume the "1" state (cut off). The circuit is
seen to perform the functional operation of "exclusive OR" for two inputs.
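As a logic-level illustration of the detector function, the Python sketch below returns "0" only when all inputs are identical and "1" otherwise. It is a functional model under the assumption of ideal binary inputs, and the function name is hypothetical; for two inputs it reduces to the exclusive OR.

```python
from itertools import product


def difference_detector(*inputs):
    """Return 0 when all binary inputs agree, 1 on any disagreement."""
    return 0 if len(set(inputs)) <= 1 else 1


# Two-input case: identical to exclusive OR.
for a, b in product((0, 1), repeat=2):
    print(a, b, difference_detector(a, b))

# Three-input case, as on the outputs of one signal processing file.
for bits in product((0, 1), repeat=3):
    print(bits, difference_detector(*bits))
```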
Figure 21 Difference Detector
The output of the difference detector may be used to trigger a flip-flop
in order that any momentary disagreement of input signals may be displayed.
This would be useful in detecting any sporadic errors which might
otherwise remain unnoticed. As previously mentioned, the difference
detectors might be combined with suitable indicators and packaged as an
integral part of the system circuitry. This would eliminate any loading
effects due to the use of test leads and external test equipment in monitoring
test points. In addition this would provide maintenance personnel with
a simultaneous display of the condition of the system and the location of
faulty modules.
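The role of the flip-flop can be sketched in software as a latch that remembers any disagreement until it is reset. This is an illustrative model only; the class name `LatchedDetector` and its methods are hypothetical, and ideal binary samples are assumed.

```python
class LatchedDetector:
    """Difference detector followed by a set-only latch (flip-flop model)."""

    def __init__(self):
        self.latched = 0  # 1 once any disagreement has been observed

    def sample(self, *inputs):
        """Record one set of inputs; return the latched indication."""
        if len(set(inputs)) > 1:  # inputs are not all identical
            self.latched = 1
        return self.latched

    def reset(self):
        """Clear the indication, e.g. after the faulty module is replaced."""
        self.latched = 0


detector = LatchedDetector()
history = [(1, 1, 1), (0, 0, 0), (0, 1, 0), (1, 1, 1)]  # one momentary error
print([detector.sample(*bits) for bits in history])
```

The indication stays set after the momentary disagreement in the third sample, so a sporadic error remains visible to maintenance personnel even though the inputs agree again afterwards.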
V. Summary and Conclusions
1. General
It has been shown that the special features of a redundant
configuration impose unique requirements on the design of functional
circuitry and the facilities required for test. Redundancy is a powerful
tool for achieving extended reliability, but it should not be encumbered
with circuitry which is inherently unreliable or contains particular failure
modes which prevent the associated system configuration from operating
independently. An appreciation of this philosophy allows the achievement
of reliability goals with a minimum of additional complexity. Effective
circuit design is required to obtain the desired balance between complexity
and reliability in redundant systems.
2. Magnetic Logic
Although magnetic logic is often cited as having several
features particularly applicable to spaceborne computers, the disadvantages
of magnetic logic strictly limit its usefulness in general logic
systems, and particularly for redundant spaceborne systems. Some basic
disadvantages are listed below:
1. Lack of compatible steady output signals.
2. Excessive power consumption for speeds
comparable to low-power microcircuitry.
3. Extensive peripheral equipment, including
high current drivers.
4. Limited fan-out and gain characteristics.
5. High peak power requirements.
6. Indeterminate reliability performance due to
extensive hand wiring with fine wire and numerous
connections, as well as unavailability of accurate
reliability data.
7. Complexity required for general logic functions.
8. Lack of suitable restoring element for use in
redundant systems.
Magnetic logic does, however, offer non-volatile storage and
reduced average power for low computing speeds. Magnetic devices appear
to be suited to special applications where certain logic functions, such
as transfer and OR, are intermixed with the memory function, and very low
speed capability is acceptable.
3. Integrated Semiconductor Logic
Integrated semiconductor circuitry offers many character-
istics which are desirable for circuits to be used in redundant space-
borne systems. Some general features of integrated semiconductor logic
when compared to other commonly available logic systems are:
1. Significantly reduced size, weight, and power consumption.
2. Availability of general logic elements, as well as
special purpose circuits.
3. Predictable operating characteristics over wide
environmental variations.
4. Availability of accurate reliability data.
5. Extensive research and development for new integrated
circuits.
6. High frequency capability.
7. Compatibility with synthesis and testing techniques
for redundant systems.
A comparison of the currently available integrated logic elements
indicates that diode-transistor logic (D-TL) is the most suitable for use
in redundant spaceborne systems. D-TL offers excellent operating
characteristics, such as easily distinguished "1" and "0" states resulting in
high DC stability and compatible output signals, high noise immunity,
self contained drive current, allowable parameter tolerances, input
isolation, and other characteristics which permit efficient redundant design.
D-TL frequency capability exceeds the requirements of most spaceborne
systems, and D-TL requires relatively low power, so that total power
dissipation and temperature stress are minimized.
A majority voting restorer, designed using interconnected NAND
elements, has been described which is not subject to the detrimental
failures of conventional majority voters.
4. Failure Testing
It is a characteristic of redundant systems that they offer a
high reliability for a period of time after the initially failure free
condition, and that the system reliability decreases rapidly when internal
failures are present. It is therefore important to insure that no initial
failures exist in a redundant system to obtain maximum system reliability.
This reliability may be required for a single time interval without further
maintenance, such as for spaceborne systems, or it may be required for
repeated time intervals, where the system is restored to the initially
perfect condition prior to each interval. The latter method may be used
to obtain high mission reliability by maintaining a redundant system
which is used repetitively, such as the ground support and launch equipment
used prior to and during each mission. Since an initially failure
free order three system can withstand any single failure, as well as a
relatively large number of randomly scattered failures, it offers high
reliability for the period of time when the probability of individual
failures is low. Techniques are described which permit even higher
reliability by combining periodic maintenance with continuous maintenance of
a redundant system.
It has been shown that a relatively simple test referred to as
singular rank testing may be used to determine that all of the replicated
signal processors are working properly. If the signal processor fails
whenever any of its parts fail, success of the singular rank tests will
verify that all signal processors are failure free. Success of singular
rank testing will also verify that the majority voters are sufficiently
failure free to insure that the system is not vulnerable to single failures.
Singular rank testing effectively isolates each rank of the replicated
non-redundant system by forcing each remaining pair of replicated ranks to
have static complementary binary outputs. System output is monitored to
determine if each individual rank is able to perform all system functions
correctly, in a manner similar to the verification of a non-redundant
system. Singular rank testing is expected to be the most efficient and
effective method for diagnosing equipment which has been recently assembled
from completely tested modules, since the probability that the few
undetectable failures might have occurred since complete testing is very low.
A somewhat more complicated testing procedure, referred to as interwoven
rank testing, has been described which will completely test all voters
to insure that they will make correct decisions for all possible input
combinations. It has been shown that the failure detection procedures may
be accomplished by controlling one or more normally grounded common lines
for each of the replicated ranks of the system, without altering the logic
design or including any additional hardware except to provide access to
these lines. Singular rank testing places no restrictions on system size
or configuration.
The characteristics of redundant systems have been shown to introduce
unique properties to the problem of failure location and faulty module
replacement. Although a redundant system is more complex than its
conventional counterpart, failure location within an operating system does not
require the operator skill and simulation equipment usually required to
locate failures in a non-redundant system. Since an operating redundant
system always has at least one correct signal available at every point in
the system, these correct signals may be used as a basis of comparison to
other versions of the nominally identical signal. A difference detector
on the signal processor outputs to restorers may be used to indicate
failures among these signal processors. If the detector includes memory,
it will also detect and locate transient or sporadic failures. These same
difference detectors may be used for the somewhat more difficult task of
locating those failures in the voters which do not cause errors when all
voter inputs are identical, as well as verification that the test controls
are actually capable of proper operation. The method which has been
described uses the same types of control as singular and interwoven rank
testing, and does not jeopardize system operation if all signal processors
are operating correctly.

BIBLIOGRAPHY
1. J. L. Haynes, "Logic Circuits Using Square-Loop Magnetic Devices:
A Survey," IRE Trans. on Elec. Computers, Vol. EC-10, No. 2 (June 1961).
2. H. D. Crane, "A High Speed Logic System Using Magnetic Elements and
Connecting Wire Only," Proc. IRE, Vol. 47, pp. 63-73 (Jan. 1959).
3. D. R. Bennion and H. D. Crane, "Design and Analysis of MAD Transfer
Circuitry," Proc. 1959 Western Joint Computer Conf., San Francisco,
Calif., pp. 21-36 (March 1959).
4. J. A. Rajchman, "The Transfluxor," Proc. IRE, Vol. 44, pp. 321-332
(March 1956).
5. H. D. Crane, "Design of an All-Magnetic Computing System," IRE Trans.
on Elec. Computers, Vol. EC-10, No. 2 (June 1961).
6. "Aviation Week and Space Technology," Aug. 19, 1963, pp. 93-103.
7. A. R. Helland and W. C. Mann, "Failure Effects in Redundant Systems,"
Westinghouse Report EE-3351 (March 1963).
8. Report No. NADC-EL-6319, Micro-Notes No. 3, "Information on Micro
Electronics for Navy Avionics Equipment" (June 1963).
Appendix 2
RELIABILITY OF IMPERFECT REDUNDANT SYSTEMS
by
R.S. Bray
P.A. Jensen
C.G. Masters
September 1963
TABLE OF CONTENTS
I. INTRODUCTION .......................... 2-1
II. MISSION RELIABILITY ....................... 2-2
III. PROCEDURES FOR ESTIMATING THE SYSTEM RELIABILITY ..... 2-6
A. Estimation of the Expected Value of Mission Reliability with only
the Information that the System is Operating at t1 ........ 2-6
B. Estimation of the Expected Value of Mission Reliability with
Tests at t1 Helping to Establish the Circuit Failure Rates .... 2-8
C. Improvement of the Estimate Through Failure State Tests ..... 2-9
D. Determining the Mission Reliability of Large Systems ....... 2-12
E. Using Tests to Determine Both the Failure States of the System
and Failure Rates of the Circuits at t1 ............ 2-16
IV. TEST OF THE HYPOTHESIS THAT MISSION RELIABILITY IS GREATER
THAN A REQUIRED VALUE .................... 2-17
V. CONCLUSIONS AND RECOMMENDATIONS .............. 2-19
I. INTRODUCTION
The problem of the pre-launch testing of spaceborne electronic systems is becoming
more severe as the systems increase in complexity while decreasing in physical size. The
testing problem will soon become much worse as systems are made redundant and in-flight
tests are used to determine the successive actions of deep space probes. Tests can no
longer be made adequately on the basis of a strict "working" or "failed" criterion because a
redundant system may contain many internal failures and still be operating at- the time of
test. Such a system might easily have a much lower probability of successfully completing
a mission than a functionally identical non-redundant system.
In addition, the large number of subsystems in a complex redundant network will make
complete check-out (i. e. tests of each subsystem) virtually impossible. Consequently, a
new method must be devised which will permit a statistical estimate to be made of the proba-
bility of mission success (reliability). This estimate must be based on the results of a
limited amount of testing and should be as accurate as possible.
II. MISSION RELIABILITY
The problem may be stated more specifically as follows. A test of a redundant machine
will be made at some time t_1. (It is expected that some failures will be found in the equipment,
and the object of the test is merely to determine the number and pattern of the failures in the
system.) From the test data, an estimate is made of the probability that the redundant system
under test will operate successfully throughout a mission which begins at time t_1 and ends at
time t_2, given that the system is operating at t_1. This probability is defined as the mission
reliability (R) and is a function of the system organization, the state of the system at t_1, the
failure rates of the parts of the system, the starting time (t_1) of the mission, and the mission's
duration, t_2 - t_1. At some time t_0, which is earlier than both t_1 and t_2, all circuits in the
system are assumed perfect. As time progresses they are assumed to fail in a random manner with a
constant failure rate. At t_1, when the system is ready to begin the mission, the system must
be in one of a finite number of possible failure states. The failure states are determined by
the number and location of failed circuits in the system. For example, consider the multiple-
line redundant network of figure Q-1. A restoring circuit, indicated by a circle, will make a
correct decision if at least two of its inputs are correct.
[Figure Q-1. A Two Stage Example of a Redundant System (Stage A, Stage B)]
Assume, for simplicity of explanation, that the restoring circuits of this system are
perfectly reliable and that only signal processing circuits, indicated by rectangles, can fail.
The possible failure states of this system are listed in columns 2 and 3 of Table 1.
TABLE 1

  1         2             3             4                                 5
Failure   Failures in   Failures in   R_i(t_2) *, **                    P_i(t_1) ***
State     Stage A       Stage B
  1         0             0           [p_m^3 + 3p_m^2(1-p_m)]^2         [p^3][p^3]
  2         0             1           [p_m^3 + 3p_m^2(1-p_m)] p_m^2     [p^3][3p^2(1-p)]
  3         0             2           0                                 [p^3][3p(1-p)^2]
  4         0             3           0                                 [p^3][(1-p)^3]
  5         1             0           [p_m^3 + 3p_m^2(1-p_m)] p_m^2     [3p^2(1-p)][p^3]
  6         1             1           p_m^4                             [3p^2(1-p)][3p^2(1-p)]
  7         1             2           0                                 [3p^2(1-p)][3p(1-p)^2]
  8         1             3           0                                 [3p^2(1-p)][(1-p)^3]
  9         2             0           0                                 [3p(1-p)^2][p^3]
 10         2             1           0                                 [3p(1-p)^2][3p^2(1-p)]
 11         2             2           0                                 [3p(1-p)^2][3p(1-p)^2]
 12         2             3           0                                 [3p(1-p)^2][(1-p)^3]
 13         3             0           0                                 [(1-p)^3][p^3]
 14         3             1           0                                 [(1-p)^3][3p^2(1-p)]
 15         3             2           0                                 [(1-p)^3][3p(1-p)^2]
 16         3             3           0                                 [(1-p)^3][(1-p)^3]

* R_i(t_2) is the probability of correct system operation at time t_2 given the ith failure
state exists at t_1.
** All the p_m's in this column are probabilities that a circuit is successful at t_2, given
it was successful at t_1.
*** All the p's in this column are probabilities that a circuit is successful at t_1, given
it was successful at t_0.
For each of the failure states of Table 1, the reliability of the system can be calculated
at t_2. This is done as follows: If the failure rate, λ, of a circuit is constant and known, the
probability that a circuit is successful at t_2, given it is successful at t_1, is the exponential:
$$ p_m = e^{-\lambda (t_2 - t_1)} \qquad (1) $$
For the system to be successful at the end of the mission, two or three circuits in each
stage must be successful. The probability that the system meets this requirement depends
on the failure state of the system at t_1 and the value of p_m. For instance, for failure states
3, 4, 7, 8 and 9-16, the probability of correct system operation must be zero because there
are too many failures at t_1. Because R_i is defined as this probability, given the system is
in the ith state at t_1:
$$ R_i = 0 \quad \text{for } i = 3, 4, 7, 8, 9\text{-}16 $$
For failure state 1, the reliability is the probability that two or three circuits in each of
the two stages are successful at t_2. Thus:
$$ R_1 = [p_m^3 + 3 p_m^2 (1 - p_m)]^2 $$
The reliability of the system for other failure states is shown in column 4 of Table 1.
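As a minimal sketch of how equation (1) and column 4 of Table 1 might be evaluated numerically, the Python fragment below computes p_m and the state reliabilities R_i for the two stage example. The function names, the failure rate, and the mission times are illustrative assumptions and do not come from this report.

```python
import math


def survival_probability(failure_rate, t1, t2):
    """Equation (1): probability that a circuit good at t1 is still good at t2."""
    return math.exp(-failure_rate * (t2 - t1))


def stage_reliability(p_m, failed_at_t1):
    """Probability that a three-circuit stage still has two good circuits at t2."""
    if failed_at_t1 == 0:
        # Either all three circuits survive, or exactly two of the three survive.
        return p_m**3 + 3 * p_m**2 * (1 - p_m)
    if failed_at_t1 == 1:
        # Both remaining circuits must survive the mission.
        return p_m**2
    # Two or three failures at t1: the restoring circuit cannot decide correctly.
    return 0.0


def state_reliability(p_m, failed_a, failed_b):
    """R_i for a failure state with the given failures in Stage A and Stage B."""
    return stage_reliability(p_m, failed_a) * stage_reliability(p_m, failed_b)


# Hypothetical numbers: failure rate per hour, mission from 100 h to 200 h.
p_m = survival_probability(1.0e-3, 100.0, 200.0)
for failed_a, failed_b in [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2)]:
    print(failed_a, failed_b, round(state_reliability(p_m, failed_a, failed_b), 4))
```

Because the restoring circuits are assumed perfect and the stages fail independently, the state reliability is simply the product of the two stage terms.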
Column 5 of Table 1 lists the probabilities that the particular failure states will be
present at t_1. The factor p in this column is the probability of success of a circuit at t_1
given the circuit was successful at t_0. These probabilities will find use in later discussions.
Two things must be known if the mission reliability of the system is to be determined
with 100% confidence: the failure state of the system and the failure rates of the circuits
(needed to calculate p_m). For large systems both these factors may be very difficult or
impossible to determine exactly. To find the failure state of a system, the failure state of
each stage must be known. This may require a considerable amount of testing, probably a
test of all circuits in the system. The failure rates of the circuits can only be determined
exactly with a test of an infinite number of circuits all operating under the same environments
as the circuits in the system. Of course, with limited testing allowed at t_1 it is improbable
that the exact failure state of the system can be found. Estimates and their accuracy are the
subject of the remainder of this report.
III. PROCEDURES FOR ESTIMATING THE SYSTEM RELIABILITY
In the study of this problem, several ways have been proposed to estimate a system's
mission reliability with varying degrees of accuracy and varying levels of confidence. Four
of these are described below.
A. ESTIMATION OF THE EXPECTED VALUE OF MISSION RELIABILITY WITH ONLY
THE INFORMATION THAT THE SYSTEM IS OPERATING AT t_1
Using the design failure rates* one can estimate the mission reliability with only the
information that the system is operating successfully at t_1. This is done using the equations
representing the reliability of the system at time t given only that all circuits are operating
successfully at time 0. The system reliability R(t) can be written as the probability of
successful operation from time 0 to time t. The reliability of the system of figure Q-1 is:
$$ R(t) = \{ p(t)^3 + 3 [p(t)]^2 [1 - p(t)] \}^2 \qquad (2) $$
where
$$ p(t) = e^{-\lambda t} $$
A plot of R(t) for the redundant system of figure Q-1 is shown in figure Q-2a.
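A curve of the kind described could be tabulated from equation (2) as in the sketch below. The failure rate used here is a hypothetical value chosen only for illustration; the rate behind figure Q-2 is not stated in the text.

```python
import math


def circuit_reliability(failure_rate, t):
    """p(t): probability that a single circuit survives from time 0 to time t."""
    return math.exp(-failure_rate * t)


def system_reliability(failure_rate, t):
    """Equation (2): two independent stages, each needing two of three circuits."""
    p = circuit_reliability(failure_rate, t)
    stage = p**3 + 3 * p**2 * (1 - p)
    return stage**2


FAILURE_RATE = 1.0e-3  # hypothetical failures per hour, not taken from the report

# Tabulate R(t) every hundred hours, matching the time axis of figure Q-2.
curve = [(hours, system_reliability(FAILURE_RATE, hours)) for hours in range(0, 1001, 100)]
for hours, reliability in curve:
    print(f"{hours:5d} h   R = {reliability:.3f}")
```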
* The design failure rates are those assigned to the circuits during the design of the system.
They are generally derived from controlled life testing of components similar to those
used in the circuits or from field tests of similar components.
[Figure Q-2. Reliability vs Time for a Redundant System. A) With No Test at t_1. B) With a Test Determining the Success of the System at t_1. Both panels plot reliability (0 to 1.00) against time in hundreds of hours (1 to 10).]
If one tests the system at a time t_1 and finds it to be working successfully, this infor-
mation can be used to adjust the system reliability for time greater than t_1 to take account of
the condition of success at t_1. A curve must now be determined which gives the reliability
of the system given successful operation at t_1. This is expressed as:
$$ R[t \mid R(t_1)] $$
For t < t_1, the reliability must be unity, because it is assumed that once a system fails
it stays failed. Then:
$$ R[t \mid R(t_1)] = 1, \quad t \le t_1 \qquad (3) $$
For t > t_1, the reliability is:
$$ R[t \mid R(t_1)] = \frac{R(t)}{R(t_1)}, \quad t > t_1 \qquad (4) $$
This is derived from the definition of conditional probability:
$$ P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} $$
Here A is success at t and B is success at t_1; because a failed system stays failed, success
at t implies success at t_1, so P(A and B) = R(t).
A plot of equations (3) and (4) is shown in figure Q-2b for a particular t_1 and the system
shown in figure Q-1.
Using equation (4) the mission reliability can be written:
$$ R(t_2, t_1) = \frac{R(t_2)}{R(t_1)} \qquad (5) $$
Thus, the mission reliability can be determined simply by using the reliability equations of
the system and the design failure rates of the circuits of the system.
The question now arises: of what value is this result? First, assuming the failure
rates used in the calculation of R are perfect, if a large number of systems were constructed
and run until t_1, approximately R(t_1) x 100% of them would be working. Throwing away all
systems that were failed at t_1 and continuing the test until t_2, approximately
R(t_2, t_1) x 100% of the population of systems working at t_1 will be working at t_2.
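The population interpretation above can be checked with a small simulation. The sketch below is an assumed illustration, not a procedure from the report: it draws exponential circuit lifetimes for many copies of the two stage system, discards the copies that have failed at t_1, and compares the surviving fraction at t_2 with equation (5). The failure rate, times, trial count, and seed are all hypothetical.

```python
import math
import random


def system_reliability(failure_rate, t):
    """Equation (2) for the two-stage system of figure Q-1."""
    p = math.exp(-failure_rate * t)
    return (p**3 + 3 * p**2 * (1 - p)) ** 2


def mission_reliability(failure_rate, t1, t2):
    """Equation (5): reliability at t2 given that the system works at t1."""
    return system_reliability(failure_rate, t2) / system_reliability(failure_rate, t1)


def system_works(lifetimes, t):
    """True if every stage still has at least two circuits alive at time t."""
    return all(sum(life > t for life in stage) >= 2 for stage in lifetimes)


def simulate(failure_rate, t1, t2, trials, seed=0):
    """Fraction of the systems working at t1 that are still working at t2."""
    rng = random.Random(seed)
    working_at_t1 = 0
    working_at_t2 = 0
    for _ in range(trials):
        # Two stages of three circuits, each with an exponential lifetime.
        lifetimes = [[rng.expovariate(failure_rate) for _ in range(3)] for _ in range(2)]
        if system_works(lifetimes, t1):
            working_at_t1 += 1
            if system_works(lifetimes, t2):
                working_at_t2 += 1
    return working_at_t2 / working_at_t1


predicted = mission_reliability(1.0e-3, 100.0, 200.0)
observed = simulate(1.0e-3, 100.0, 200.0, trials=20000)
print(f"equation (5): {predicted:.4f}   simulated: {observed:.4f}")
```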
No information was given for this estimate about the failure state of the system at t_1,
except that the system was in one of the failure states for which the system is successful.
For the example, these are states 1, 2, 5 and 6. This limited information about the failure
state makes it necessary to approximate the mission reliability by an expected value given
that the system is in one of the four successful failure states. The approximation has a con-
siderable effect on the accuracy of the estimate which is described in detail in Section IIIC
of this report.
B. ESTIMATION OF THE EXPECTED VALUE OF MISSION RELIABILITY WITH TESTS
AT t_1 HELPING TO ESTABLISH THE CIRCUIT FAILURE RATES
Another problem which threatens the validity of the R calculated by this method is the
uncertainty of the failure rates of the components of the system. The failure rates used in
design are derived from a variety of sources and are almost surely not exactly accurate for
any operational system. A realistic way to use design failure rates is to assign confidence
limits to their values. With these one can say with a certain confidence that the failure
rates of his parts are within a region determined by his confidence limits. This data is often
available with design failure rates. Using the two extremes of failure rates, upper and lower
confidence limits can be calculated for the mission reliability. The statement can then be
made with a certain confidence that the mission reliability is within the interval of its con-
fidence limits. It is instructive to point out that if the failure rates of all parts are perfectly
known, there is 100% confidence in the calculated value of mission reliability. If, however,
the failure rates are uncertain, as is always the case, confidence limits should be indicated
for the mission reliability which reflect the uncertainty of the failure rates.
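A minimal sketch of this bounding step follows, assuming the two stage system of figure Q-1 and a single failure rate shared by all circuits. The interval endpoints and mission times are hypothetical, and the function names are illustrative.

```python
import math


def system_reliability(failure_rate, t):
    """Equation (2) for the two-stage system of figure Q-1."""
    p = math.exp(-failure_rate * t)
    return (p**3 + 3 * p**2 * (1 - p)) ** 2


def mission_reliability(failure_rate, t1, t2):
    """Equation (5): reliability at t2 given that the system works at t1."""
    return system_reliability(failure_rate, t2) / system_reliability(failure_rate, t1)


def mission_reliability_limits(rate_lower, rate_upper, t1, t2):
    """Evaluate equation (5) at both extremes of the failure-rate interval."""
    at_lower_rate = mission_reliability(rate_lower, t1, t2)
    at_upper_rate = mission_reliability(rate_upper, t1, t2)
    return min(at_lower_rate, at_upper_rate), max(at_lower_rate, at_upper_rate)


# Hypothetical confidence interval on the failure rate (per hour).
low, high = mission_reliability_limits(0.5e-3, 2.0e-3, 100.0, 200.0)
nominal = mission_reliability(1.0e-3, 100.0, 200.0)
print(f"mission reliability between {low:.3f} and {high:.3f} (nominal {nominal:.3f})")
```

Evaluating only the two extremes relies on the mission reliability varying monotonically with the failure rate over the interval, which is the case for this example.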
Estimation of the mission reliability of the system using the failure rates used in design
has one serious failing. These failure rates often do not accurately describe the actual com-
ponents. The design failure rates may have been determined under different environmental
conditions than those of the system in use, or components in the system may have been subjected
to different manufacturing conditions than those used to derive the design failure rates.
These and other factors might cause the circuits in the system to have different failure
rates than those predicted in original design. Tests performed at t_1 can be used to deter-
mine if the actual failure rates are indeed different from design failure rates. If they are
different the tests will be used to estimate the actual failure rate.
The first task is to test the null hypothesis that the actual average failure rates are
the same as those used in design. To do this, the system must be split into groups of
circuits with each group comprised of circuits of identical design. Using the design failure
rates, the number of failures that can be expected in each group at t_1 is calculated.
This expected number is n(1 - p_j), where p_j = e^{-λ_j t_1} is the probability that a circuit
with design failure rate λ_j* survives to t_1, and n is the number of circuits in the group.
About this expected value one can construct an interval specifying the number of failures he
is willing to observe at t_1 and still accept the hypothesis that the actual failure rate is that
used in design.
The next step in the procedure is to test the circuits. If possible, all circuits are
tested** and the numbers of failures recorded. If the number of failures at t_1 in n samples
is within this interval the design failure rate is used to calculate the mission reliability. If
the number of failures is not within the interval a new failure rate is calculated using the
observed data at t_1. The mean of this new failure rate is λ_o and is determined from the
equation
$$ \lambda_o = -\frac{\ln (x/n)}{t_1} $$
where x is taken here to be the number of the n tested circuits still operating at t_1, so that
x/n estimates e^{-λ_o t_1}.
Confidence limits are placed on this calculated rate and the extremes of the confidence
interval are used to calculate confidence limits on the estimates of the mission reliability of
the system.
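The acceptance test and the re-estimation step might be sketched as follows. The report does not say how the acceptance interval is to be constructed; the equal-tail binomial interval used here is an assumption, as are the significance level, the group size, and the function names. The estimate treats the fraction of circuits still operating as an estimate of e^{-λ t_1}.

```python
import math


def failure_probability(failure_rate, t1):
    """Probability that a circuit good at t0 = 0 has failed by t1."""
    return 1.0 - math.exp(-failure_rate * t1)


def binomial_pmf(n, k, q):
    """Probability of exactly k failures among n circuits, each failing with probability q."""
    return math.comb(n, k) * q**k * (1.0 - q) ** (n - k)


def acceptance_interval(n, design_rate, t1, alpha=0.05):
    """Equal-tail interval of failure counts (an assumed construction)."""
    q = failure_probability(design_rate, t1)
    half = alpha / 2.0
    lower, tail = 0, 0.0
    for k in range(n + 1):  # trim at most alpha/2 from the low side
        if tail + binomial_pmf(n, k, q) > half:
            lower = k
            break
        tail += binomial_pmf(n, k, q)
    upper, tail = n, 0.0
    for k in range(n, -1, -1):  # trim at most alpha/2 from the high side
        if tail + binomial_pmf(n, k, q) > half:
            upper = k
            break
        tail += binomial_pmf(n, k, q)
    return lower, upper


def estimated_failure_rate(n, failures, t1):
    """Failure rate implied by the observed surviving fraction at t1."""
    surviving = n - failures
    if surviving == 0:
        raise ValueError("no surviving circuits: the rate estimate is unbounded")
    return -math.log(surviving / n) / t1


def rate_for_mission(n, failures, design_rate, t1, alpha=0.05):
    """Keep the design rate if the count is acceptable, otherwise re-estimate."""
    lower, upper = acceptance_interval(n, design_rate, t1, alpha)
    if lower <= failures <= upper:
        return design_rate
    return estimated_failure_rate(n, failures, t1)


# Hypothetical group: 100 identical circuits, design rate 0.001 per hour, t1 = 100 h.
print(acceptance_interval(100, 1.0e-3, 100.0))
print(rate_for_mission(100, 30, 1.0e-3, 100.0))
```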
The question immediately arises, "Why test the null hypothesis at all if test data is to
be accepted in preference to the design failure rates ?" This is done because under the con-
dition that the null hypothesis is met, the correspondence of the two sources of failure rate
estimates would result in a higher confidence in the final estimate than either source alone
can provide. When the null is rejected and the test data alone is used, the confidence in the
estimate is reduced.
C. IMPROVEMENT OF THE ESTIMATE THROUGH FAILURE STATE TESTS
In this reliability estimation procedure a more accurate estimate is obtained by testing
at t_1 to determine the failure state of the system. If the failure state were known exactly and
the failure rates of the circuits were accurate, the mission reliability of the system could be
calculated with no equivocation. Thorough testing at t_1 could determine exactly the failure
state of the system, but since thorough testing is not of interest in this study the failure state
will be known imperfectly. One will have a number of alternatives each with a certain pro-
bability given the results of the tests.
* λ_j = design failure rate of the jth type of circuit.
** Note: if the system is too large to permit complete testing, a random sample of each
type of circuit is taken and the number of failures observed in the sample is used to
estimate the actual failure rates.
Consider again the example of figure Q-1. Each stage of the system has four failure
states: zero, one, two, or three failed circuits. If no information is available at t_1, not
even that the system is operating, every stage may be in any one of these states. Thus there
are 4^2 = 16 possible failure states of the system. They have been listed in column 1 of Table 1.
Associated with the ith failure state is a probability P_i which is the probability that the sys-
tem is in this state at t_1 given that all circuits were successful at t_0. Thus, with no
information at t_1 on the condition of the system, the probability that the system is in the
state in which no circuits have failed is
$$ P_1 = p^6 $$
The factor p is the probability of success of a circuit at t_1. The probability of the
failure state in which one circuit is failed in Stage B is
$$ P_2 = 3 p^5 (1 - p) $$
The probabilities of occurrence of the states given no information on the condition of the
system at t_1 are listed in column 5 of Table 1.
Associated with each of the failure states is a reliability of the system at t_2 given that
the system is in the failure state at t_1. This is written as R_i(t_2) and is shown for each state
in column 4 of Table 1.
The reliability of the system is written as the sum over all i of the product of the
probability of the ith failure state and the mission reliability given that the system is in the
ith state at t_1. Thus:
$$ R(t_2) = \sum_{\text{all } i} P_i R_i $$
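The sum over failure states can be carried out mechanically for the two stage example. The sketch below enumerates the sixteen states in the order of Table 1 and, as a consistency check, compares the sum with equation (2) evaluated at t_2, where the probability that a circuit survives from t_0 to t_2 is the product p p_m. The function names are illustrative, and the values p = p_m = 0.9 are those of the worked example later in this section.

```python
from itertools import product
from math import comb


def stage_state_probability(p, failed):
    """Probability that exactly `failed` of the three circuits in a stage are failed at t1."""
    return comb(3, failed) * p ** (3 - failed) * (1 - p) ** failed


def stage_reliability(p_m, failed):
    """Probability that the stage still has two good circuits at t2."""
    if failed == 0:
        return p_m**3 + 3 * p_m**2 * (1 - p_m)
    if failed == 1:
        return p_m**2
    return 0.0


def enumerate_states(p, p_m):
    """Rows (failures in A, failures in B, P_i, R_i) in the order of Table 1."""
    rows = []
    for failed_a, failed_b in product(range(4), repeat=2):
        prob = stage_state_probability(p, failed_a) * stage_state_probability(p, failed_b)
        rel = stage_reliability(p_m, failed_a) * stage_reliability(p_m, failed_b)
        rows.append((failed_a, failed_b, prob, rel))
    return rows


def unconditional_reliability(p, p_m):
    """R(t2) as the sum over all states of P_i times R_i."""
    return sum(prob * rel for _, _, prob, rel in enumerate_states(p, p_m))


states = enumerate_states(0.9, 0.9)
total = unconditional_reliability(0.9, 0.9)

# Equation (2) at t2: a circuit survives from t0 to t2 with probability p * p_m.
q = 0.9 * 0.9
closed_form = (q**3 + 3 * q**2 * (1 - q)) ** 2
print(f"sum over states: {total:.6f}   equation (2): {closed_form:.6f}")
```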
If tests are made at t_1 that give some information on the condition of the system, the
number of possible failure states is markedly reduced, and the reliability estimate available
at t_1 is much more accurate. For instance, if one tests the system of figure Q-1 and finds it
functioning correctly at t_1, each stage must have no more than one circuit failure. Thus,
only four states are possible after this test. These are states 1, 2, 5 and 6. The probability
that the system is in a particular state must be adjusted to account for the known condition
that the system functions at t_1. Thus, for the example the probability of being in state 1 with
no failures is:
$$ P'_1 = \frac{P_1}{\sum_{i = 1, 2, 5, 6} P_i} \qquad (6) $$
The denominator in equation (6) is the probability that the system is in one of the four
possible states.
In general, a test to establish the failure state will leave only a set of possible failure
states. Assume the test determines the state of the system to such an extent that the only
possible failure states are included in the set I. If P'_i is the probability of being in the ith
failure state given the results of the tests, then:
$$ P'_i = 0 \quad \text{for } i \notin I $$
That is, if a state is not in the set I its probability is zero.
If a state is possible then:
$$ P'_i = \frac{P_i}{\sum_{j \in I} P_j} \quad \text{for } i \in I \qquad (7) $$
The mission reliability for a particular failure state, R_i, does not change, hence the
mission reliability given the results of the test can be written in general as:
$$ R_M = \sum_{i \in I} \left[ \frac{P_i}{\sum_{j \in I} P_j} \right] R_i \qquad (8) $$
For the example
$$ R_M = \frac{1}{P_1 + P_2 + P_5 + P_6} \left[ P_1 R_1 + P_2 R_2 + P_5 R_5 + P_6 R_6 \right] \qquad (9) $$
More extensive tests at t_1 will further reduce the number of failure states which can
exist. For instance, if a test reveals that at least one circuit in the network is failed, the
failure state which has no errors is eliminated, changing considerably the expected mission
reliability. For this example P'_1 = 0, and states 2, 5 and 6 are the only members of the set I.
To illustrate the value of testing to determine the failure state at t_1, consider the
example. The probability that a circuit operates until t_1 is p(t_1) = 0.9 and the probability
it lasts until t_2, given it was successful at t_1, is p_m(t_2) = 0.9. The system is that shown
in figure Q-1 and the restoring circuits are assumed perfectly reliable. Say that in reality
one circuit is failed in one stage and the circuits in the other stage are all successful, but
this information is unknown to the tester. This is the information to be gained at t_1 through
the tests. Table 2 lists the reliability one would predict with different amounts of infor-
mation about the condition of the system at t_1. The wide variation in the result indicates the
importance of testing at t_1.
This section does not propose the detailed procedures for testing a system at t_1. It
should, however, indicate the importance of making these tests and the calculations required
to utilize the information gained from the test to estimate the system reliability.
TABLE 2

Test Results at the                           Predicted System       Corresponding
Mission's Start (t_1)                         Mission Reliability    Risk of Failure
1. No information at t_1, not even            0.821                  0.179
   that the system is working.
2. Tests show that the system is              0.867                  0.133
   working at t_1.
3. Tests show that the system is              0.770                  0.230
   working but that at least one
   circuit is failed.
4. Tests show that exactly one                0.788                  0.212
   circuit in the system is failed
   at t_1.
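The four cases of Table 2 can be recomputed from equation (8) with p = p_m = 0.9. The sketch below is an illustrative reconstruction with assumed function names; its results (about 0.820, 0.868, 0.769 and 0.787) agree with the tabulated values to within roughly 0.002, a difference that may reflect rounding in the original hand calculation.

```python
from itertools import product
from math import comb


def state_table(p, p_m):
    """Map (failures in A, failures in B) to (P_i, R_i) for the system of figure Q-1."""
    def state_prob(failed):
        return comb(3, failed) * p ** (3 - failed) * (1 - p) ** failed

    def stage_rel(failed):
        if failed == 0:
            return p_m**3 + 3 * p_m**2 * (1 - p_m)
        return p_m**2 if failed == 1 else 0.0

    return {
        (a, b): (state_prob(a) * state_prob(b), stage_rel(a) * stage_rel(b))
        for a, b in product(range(4), repeat=2)
    }


def mission_reliability_given(states, possible):
    """Equation (8): `possible(a, b)` says whether a state belongs to the set I."""
    kept = [(prob, rel) for (a, b), (prob, rel) in states.items() if possible(a, b)]
    total = sum(prob for prob, _ in kept)
    return sum(prob * rel for prob, rel in kept) / total


def working(a, b):
    """The system operates at t1 only if each stage has at most one failure."""
    return a <= 1 and b <= 1


states = state_table(0.9, 0.9)
cases = {
    "no information": lambda a, b: True,
    "system working": working,
    "working, at least one circuit failed": lambda a, b: working(a, b) and a + b >= 1,
    "exactly one circuit failed": lambda a, b: a + b == 1,
}
results = {name: mission_reliability_given(states, test) for name, test in cases.items()}
for name, value in results.items():
    print(f"{name:40s} R_M = {value:.3f}   risk = {1 - value:.3f}")
```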
D. DETERMINING THE MISSION RELIABILITY OF LARGE SYSTEMS
The example of the last section is a small two stage system. One might well ask if it
is feasible to enumerate all of the possible failure states of a large system for the determina-
tion of the mission reliability. Indeed, with no information at t_1 on whether or not an n stage
system is operating correctly, there are 4^n possible failure states of the system. As n in-
creases, the number of possible failure states increases exponentially.
The purpose of the tests at t_1 is to eliminate large numbers of these states in the manner
shown for the example and hence obtain a better estimate of the mission reliability. The use
of equation (8) provides this estimate but it requires, in its present form, separate considera-
tion of each failure state. This is impractical for all but the smallest systems.
This problem is circumvented by first putting the mission reliability equation in a
more general form. The mission reliability of the system given the results of the test at t_1
is a conditional probability which can be written:
$$ R_M = \frac{\text{Prob. (test results at } t_1 \text{ and successful system operation at } t_2)}{\text{Prob. (test results at } t_1)} \qquad (10) $$
Equation (8) is a representation of this equation for small systems.
The form equation (10) takes depends on the characteristics of the system under study
and the type of test to which it is subject at t_1. For example, consider an n stage order-
three multiple-line system which has perfect voters. For simplicity assume all the stages
are identical with equally reliable circuits. For illustrative purposes assume the stages are
arranged in a chain as in figure Q-3.
[Figure Q-3. Chain of n Multiple-Line Stages]
The first type of test to which the system of figure Q-3 is subjected is a simple test to
determine its operability. Is the system failed or successful at t_1? Given the system is
successful at t_1, the mission reliability will now be determined.
Because the system is working at t_1, each stage must be in one of two states, either
three circuits successful or two circuits successful and one failed. Then the system may be
in any one of 2^n possible states. Using equation (8) to evaluate the mission reliability would
be a rather tedious and time consuming process if n were a sufficiently large value since both
the numerator and denominator of this equation have 2^n terms. However, because of the in-
dependence of the stages of the multiple line system, it isn't necessary to carry out this
operation. The probability that each stage is successful at t_1 is independent of the condition
of all other stages and can be written:
$$ p^3 + 3 p^2 (1 - p) \qquad (11) $$
Since they are all identical the probability that all the stages are successful at t_1 is:
$$ [p^3 + 3 p^2 (1 - p)]^n \qquad (12) $$
This term is the probability that the system is in a successful failure state at t_1 and is
the denominator for equation (10) when the test consists only of determining the operability of
the system.
The probability that a single stage is operating at t_2 can be written:
$$ \{ p^3 [p_m^3 + 3 p_m^2 (1 - p_m)] + 3 p^2 (1 - p) [p_m^2] \} \qquad (13) $$
Since the stages are independent the probability that the system is operating at t_2 is:
$$ \{ p^3 [p_m^3 + 3 p_m^2 (1 - p_m)] + 3 p^2 (1 - p) [p_m^2] \}^n \qquad (14) $$
This term is equivalent to the numerator of equation (10). Using the terms (12) and (14)
the mission reliability can be determined for this system. Given that the system is successful
at t_1 the probability that the system is successful at t_2 is:
$$ R_M = \frac{\{ p^3 [p_m^3 + 3 p_m^2 (1 - p_m)] + 3 p^2 (1 - p) [p_m^2] \}^n}{[p^3 + 3 p^2 (1 - p)]^n} \qquad (15) $$
Note that for this determination of the mission reliability the separate failure states have not
been enumerated. The calculation of mission reliability for this system has been a relatively
simple procedure.
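Equation (15) is straightforward to evaluate for any number of stages. The sketch below uses an illustrative function name and computes the ratio stage by stage before raising it to the power n, which is algebraically the same as equation (15) and avoids dividing two very small numbers when n is large.

```python
def chain_mission_reliability(p, p_m, n):
    """Equation (15) for a chain of n identical order-three stages with perfect voters."""
    # Expression (11): a stage is operating at t1.
    stage_ok_at_t1 = p**3 + 3 * p**2 * (1 - p)
    # Expression (13): a stage is operating at t2.
    stage_ok_at_t2 = (
        p**3 * (p_m**3 + 3 * p_m**2 * (1 - p_m))
        + 3 * p**2 * (1 - p) * p_m**2
    )
    # The ratio of the nth powers in (15) equals the nth power of the stage ratio.
    return (stage_ok_at_t2 / stage_ok_at_t1) ** n


# With p = p_m = 0.9 and n = 2 this reproduces case 2 of Table 2 (about 0.868).
for stages in (1, 2, 10, 100):
    print(stages, round(chain_mission_reliability(0.9, 0.9, stages), 4))
```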
Other tests at t_1 will result in different forms for the mission reliability equation (10).
For instance, assume the system of figure Q-3 is subjected to a different test. This test sub-
divides the system into three nonredundant ranks as shown in figure Q-4.
Each rank will be tested individually. If a rank fails it can be inferred that one or more
circuits in the rank are failed. If a rank is successful it can be inferred that all circuits in
the rank are successful.
At t_1 the information is given that the system is operating correctly and that 0, 1, 2 or
3 of the ranks have failed. Now equations must be developed that determine the mission
reliability of the system given the results of the test at t_1.
[Figure Q-4. System Divided Into Three Nonredundant Ranks (Rank 1, Rank 2, Rank 3)]
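The numerators and denominators of the mission reliability equation for the various
test results are shown in Table 3.

TABLE 3

Test result: 0 ranks failed
   Prob. (test result at t_1):
   $$ Y_0 = [p^3]^n $$
   Prob. (test result at t_1 and successful system operation at t_2):
   $$ Q_0 = [p^3 (p_m^3 + 3 p_m^2 (1 - p_m))]^n $$
   Mission reliability: Q_0 / Y_0

Test result: 1 rank failed
   Prob. (test result at t_1):
   $$ Y_1 = [p^2 (1 - p) + p^3]^n - Y_0 $$
   Prob. (test result at t_1 and successful system operation at t_2):
   $$ Q_1 = [p^2 (1 - p) p_m^2 + p^3 (p_m^3 + 3 p_m^2 (1 - p_m))]^n - Q_0 $$
   Mission reliability: Q_1 / Y_1

Test result: 2 ranks failed
   Prob. (test result at t_1):
   $$ Y_2 = [2 p^2 (1 - p) + p^3]^n - Y_0 - 2 Y_1 $$
   Prob. (test result at t_1 and successful system operation at t_2):
   $$ Q_2 = [2 p^2 (1 - p) p_m^2 + p^3 (p_m^3 + 3 p_m^2 (1 - p_m))]^n - Q_0 - 2 Q_1 $$
   Mission reliability: Q_2 / Y_2

Test result: 3 ranks failed
   Prob. (test result at t_1):
   $$ Y_3 = [3 p^2 (1 - p) + p^3]^n - Y_0 - 3 Y_1 - 3 Y_2 $$
   Prob. (test result at t_1 and successful system operation at t_2):
   $$ Q_3 = [3 p^2 (1 - p) p_m^2 + p^3 (p_m^3 + 3 p_m^2 (1 - p_m))]^n - Q_0 - 3 Q_1 - 3 Q_2 $$
   Mission reliability: Q_3 / Y_3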
Compared to enumerating all the failed states possible with the particular results of a
test, these equations are relatively simple. If the assumption that all circuits are equally
reliable is removed, the equations for mission reliability are very similar to these except in-
stead of raising a single term to the power n as in these equations, a product of n factors
will be taken. This should be a simple matter on a computer.
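As an indication of how simple the computation is, the sketch below evaluates the Table 3 entries for identical circuits. It reads Y_k as the probability that a particular set of k ranks fails while the system still operates at t_1, which is an interpretation inferred from the form of the equations; the subtraction coefficients 1, 2, 3, 3 of the table then appear as binomial coefficients. Function names and numerical values are illustrative assumptions.

```python
from math import comb


def rank_test_terms(p, p_m, n):
    """Return the lists [Y_0..Y_3] and [Q_0..Q_3] of Table 3 for an n-stage chain."""
    one_failed = p**2 * (1 - p)  # a given circuit of a stage failed, the other two good
    all_good_survives = p**3 * (p_m**3 + 3 * p_m**2 * (1 - p_m))
    y, q = [], []
    for k in range(4):
        # Every stage is either all good or has its failure in one of the k chosen ranks;
        # then remove the cases in which fewer than k of those ranks actually failed.
        y_k = (k * one_failed + p**3) ** n - sum(comb(k, j) * y[j] for j in range(k))
        q_k = (k * one_failed * p_m**2 + all_good_survives) ** n - sum(
            comb(k, j) * q[j] for j in range(k)
        )
        y.append(y_k)
        q.append(q_k)
    return y, q


def rank_test_mission_reliability(p, p_m, n, ranks_failed):
    """Q_k / Y_k, or None when k failed ranks cannot occur in a working n-stage system."""
    if ranks_failed > n:
        # Each failed rank needs its own stage, since a working stage has at most one failure.
        return None
    y, q = rank_test_terms(p, p_m, n)
    return q[ranks_failed] / y[ranks_failed]


# Hypothetical case: four stages with p = p_m = 0.9.
for k in range(4):
    print(k, round(rank_test_mission_reliability(0.9, 0.9, 4, k), 4))
```

For large n the subtractions involve nearly equal quantities, so a production calculation would need more care with rounding than this sketch takes.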
If the restriction that the restoring circuits be perfectly reliable is removed, the
mission reliability equation will not be changed significantly unless the stages are intercon-
nected in such a manner that they are no longer independent. The techniques used to calculate
system reliability in this section are invalid if the stages are not independent. Techniques
have been developed to determine the reliability of such systems* and these must be used in
determining the mission reliability.
The equation describing the mission reliability for a system will depend on both the
tests performed at t_1 and the characteristics of the system. These factors will surely be
known prior to the test, so equations can be developed to evaluate the mission reliability
which take into account the possible failure states of the system without exhaustive enumeration.
E. USING TESTS TO DETERMINE BOTH THE FAILURE STATE OF THE SYSTEM AND
FAILURE RATES OF THE CIRCUITS AT t_1
In technique C, tests were made at t_1 to determine the possible failure states of the
system. In technique B tests were made to establish the actual failure rate of the circuits of
the system. It should be possible to design tests which give information regarding both these
parameters.
The tests will establish the failure rates of the circuits at t_1, and these will be used in
carrying out the reliability calculations described for technique C. It takes little imagination
to see that in the course of tests to determine the failure rate a great deal will be learned about
the failure state of the system. For instance, as soon as one failure is found the probability that
the system is in the state with no circuit failures drops to zero, probably decreasing the
mission reliability appreciably.
The details of this technique have not been developed, but generally it proposes to use
the tests at t_1 to indicate both these parameters and thereby increase markedly the accuracy
of the mission reliability estimate.
* Jensen, P.A., W.C. Mann and M.R. Cosgrove, "The Synthesis of Redundant Multiple-
Line Networks", First Annual Report Contract NONR 3842 (00), May 1, 1963.
IV. TEST OF THE HYPOTHESIS THAT THE MISSION RELIABILITY IS
GREATER THAN A REQUIRED VALUE
This method is separated from the others because it does not explicitly estimate the
reliability of a system. Instead it finds, through measurements at the beginning of the
mission, the probability that the system will not meet a given mission reliability specification.
The user of the system must specify the minimum mission reliability. He must also
specify the maximum chance he is willing to take that the system does not meet this goal when
his tests indicate that it will. It is assumed that the system is not acceptable if the probability
that it does not meet the reliability specification is above the given value, and is acceptable
otherwise.
The first step in this procedure is to determine the failure rates that the circuits of
the system must have to just meet the mission reliability goal. These failure rates are
called the maximum failure rates, λ_m. For a system in which many circuits have the same
failure rate this does not seem to be too imposing a problem. For example, consider a system
where all circuits have the same failure rate. If the starting time and duration of the mission
are known, the mission reliability can be expressed as a function of the failure rate, λ, alone.
Equation (5) can then be set equal to the required mission reliability and solved for the failure
rate. A cut and try method may be required for the solution.
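One systematic form of the cut and try method is bisection. The sketch below applies it to the two stage system of figure Q-1, assuming that all circuits share one failure rate and that mission reliability decreases as the failure rate increases (which holds for this example). The stage reliability p^3 + 3p^2(1 - p) is written as p^2(3 - 2p) so that the ratio in equation (5) can be formed without dividing two vanishing numbers. Names, bracket limits, and numerical values are illustrative assumptions.

```python
import math


def mission_reliability(failure_rate, t1, t2):
    """Equation (5) with equation (2) for the two-stage system of figure Q-1."""
    p1 = math.exp(-failure_rate * t1)
    p2 = math.exp(-failure_rate * t2)
    p_m = math.exp(-failure_rate * (t2 - t1))
    # Stage reliability is p^2 (3 - 2p); the ratio of its values at t2 and t1
    # therefore reduces to p_m^2 (3 - 2 p2) / (3 - 2 p1).
    stage_ratio = p_m**2 * (3 - 2 * p2) / (3 - 2 * p1)
    return stage_ratio**2


def maximum_failure_rate(required, t1, t2, upper=1.0, steps=200):
    """Largest failure rate in [0, upper] that still meets the required mission reliability."""
    if mission_reliability(upper, t1, t2) >= required:
        raise ValueError("upper bracket already meets the requirement; enlarge it")
    low, high = 0.0, upper
    for _ in range(steps):
        mid = 0.5 * (low + high)
        if mission_reliability(mid, t1, t2) >= required:
            low = mid  # requirement still met: the maximum rate is at least mid
        else:
            high = mid  # requirement missed: the maximum rate is below mid
    return low


# Hypothetical specification: mission from 100 h to 200 h, required reliability 0.95.
rate_m = maximum_failure_rate(0.95, 100.0, 200.0)
print(f"maximum failure rate: {rate_m:.3e} per hour")
```

Repeating the calculation for a range of starting times t_1 with the duration held fixed would give the curve of starting time against maximum failure rate mentioned in the text.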
The maximum failure rate is a function of both the starting time, t_1, and the duration,
t_2 - t_1, of the mission. However, if the duration of the mission is known, it is possible to
plot a curve of mission starting time against the maximum failure rate.
Once the maximum failure rate is known it only remains to determine if the actual
failure rate of the circuits of the system is less than or equal to this value. This will be de-
termined by testing n of the circuits at t_1 and counting the number of failed circuits. Call the
number of failed circuits X_1. With this data and by using the maximum failure rate, an upper
bound on the probability that the true failure rate is greater than the maximum failure rate can
be determined.
If the fact that a majority of the circuits in a stage must be operative at t_1 is neglected,
the success of a circuit in the system may be considered a Bernoulli trial with probability of
success e^{-λ t_1}. The probability distribution of the total number of circuit failures in n
circuits is then binomial. This distribution or the associated density function can be plotted
for any number of samples. One such plot appears in figure Q-5.
The probability distribution of the number of failures at time t_1 can be plotted using the
calculated maximum failure rate.
Figure Q-5. Sample Distribution
Some maximum number of failures Y will be chosen such that there is a probability of δ
that the number of failed circuits observed at t_1, X_1, will be less than Y if the failure rate of
the circuits is λ_m. The quantity δ is determined from the binomial:
$$ \delta = \sum_{h=0}^{Y-1} \binom{n}{h} \left( e^{-\lambda_m t_1} \right)^{n-h} \left( 1 - e^{-\lambda_m t_1} \right)^{h} \qquad (16) $$
For failure rates greater than λ_m the probability that less than Y failures occur must be
less than δ. So if X_1 is less than Y, with confidence 1 - δ the statement can be made that
the actual failure rate must be less than the maximum failure rate. Now the statement can
be made with confidence 1 - δ that the reliability of the system is greater than the mini-
mum reliability specified by the user.
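Equation (16) and the choice of Y can be sketched as follows. The report does not prescribe how Y is selected; taking the largest Y whose δ does not exceed the risk the user is willing to accept is an assumption made here, as are the function names and numerical values.

```python
import math


def delta(threshold, n, max_rate, t1):
    """Equation (16): probability of fewer than `threshold` failures among n circuits."""
    survive = math.exp(-max_rate * t1)
    return sum(
        math.comb(n, h) * survive ** (n - h) * (1 - survive) ** h
        for h in range(threshold)
    )


def largest_threshold(n, max_rate, t1, max_risk):
    """Largest Y for which delta does not exceed the acceptable risk."""
    threshold = 0
    while threshold < n + 1 and delta(threshold + 1, n, max_rate, t1) <= max_risk:
        threshold += 1
    return threshold


def system_acceptable(observed_failures, n, max_rate, t1, max_risk):
    """Accept the system when the observed failure count X1 is less than Y."""
    return observed_failures < largest_threshold(n, max_rate, t1, max_risk)


# Hypothetical test: 100 circuits examined at t1 = 100 h, maximum rate 0.001 per hour,
# and a 5 percent chance accepted of passing a system that does not meet the goal.
Y = largest_threshold(100, 1.0e-3, 100.0, 0.05)
print("Y =", Y, " delta =", round(delta(Y, 100, 1.0e-3, 100.0), 4))
print(system_acceptable(2, 100, 1.0e-3, 100.0, 0.05))
```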
This method leads to the statement that, with a confidence (1 - δ), the
probability that the system will succeed is at least R. The information used to compute R might be
used to compute the expected time to system failure instead. The object of the test would
then be to confirm or reject the hypothesis that the expected life would exceed the mission
time with a confidence (1 - δ). This modification has not been carefully examined but it
appears to reduce the number of probabilistic statements from two to one.
This procedure again uses no information on the failure state of the system except that
the system is successful at the beginning of the mission. The effect of this on the accuracy
of the results has already been discussed in Section IIIC.
V. CONCLUSIONS AND RECOMMENDATIONS
It is the nature of a redundant system to withstand a number of internal failures and
still perform its function successfully. This is an extremely desirable property for increas-
ing life or providing high reliability, but it makes it unreasonable to base the decision -
whether or not to carry out a mission with the system - only on the fact that the system is
operating at the beginning of the mission.
This decision should be based on the probability that the system will complete the
mission successfully. There are two major factors affecting the probability which are im-
perfectly known at the beginning of the mission. First, the number and location of initial
circuit failures has a very significant effect on the probability that the system will operate
throughout the mission. Second, the mission reliability depends heavily on the failure rates
of the circuits which make up the system. There is little accurate information concerning
either of these factors when it is time to make the decision.
The report proposes that certain tests be made just before the mission is to begin to
determine, at least approximately, these unknowns. It proposes some procedures for using
the results of the tests to estimate the mission reliability with varying degrees of accuracy.
A procedure for making the decision on the usability of the system without estimating the
mission reliability is also presented.
It should be noted that the details of these procedures are still to be worked out and
the accuracy of their results is still uncertain. The work here reported will provide the
basis for future studies on the subject.
No attempt has been made to evaluate the relative usefulness of these procedures. It
is recommended that efforts be made to develop an appropriate measure for comparing the
techniques so that they may be evaluated relative to a common scale.
One very important area of study neglected by this report is the design of simple and
efficient tests to be performed at the beginning of the mission to obtain the information re-
quired for the reliability estimates. As much information as possible must be gained from
a minimum number of tests. A small amount of basic work has been done in this area, and
it will be the subject of future efforts.
Appendix 3
A SURVEY OF COMPONENTS FOR ADAPTIVE RESTORING CIRCUITS
by
H. Brinker
TABLE OF CONTENTS
Introduction
1. Electrochemical Devices
   a. Device 1
   b. Solion
   c. Mercury Cell
2. Magnetic Devices
   a. MAD Integrator
   b. Orthogonal Core Integrator
   c. Second Harmonic Integrator
   d. Magnetostrictive Integrator
3. Conclusion
References
LIST OF FIGURES
Figure 1    Comparison of Adaptive and Majority Voting Techniques
Figure 2    Adaptive Voter
Figure 3    Device 1 Cell
Figure 4    Device 1 Integrator
Figure 5a   Solion Tetrode and Output Characteristics
Figure 5b   Solion Tetrode Connected as an Integrator
Figure 6    Mercury Cell Integrator (Capacitive Readout)
Figure 7    Multiple Aperture Device (MAD)
Figure 8    MAD Integrator
Figure 9    Orthogonal Core
Figure 10   Second Harmonic Integrator
Figure 11   Magnetostrictive Integrator
Introduction
The Adaline Neuron¹ is an adaptive logic device which may be trained
to recognize certain classes of input patterns. The device output is a
binary signal which classifies particular combinations of input signals
into two categories. An output decision is determined by a threshold
element whose input is the linear sum of the products of each input and
its associated variable weight. During adaption the weights are changed
so as to make the output decision agree with the desired response. By
following a simple set of rules after each application of input signal
combinations, the device is caused to converge to an optimum state for
properly categorizing the set of input patterns.
Although training rules for a single layer system have been formulated
by Widrow,¹,² new adaptive theory is required if systems of two or more
cascaded layers are to be properly trained to perform complex functions of
adaptive behavior and pattern recognition. The question of whether such
devices may be connected in complex arrays and demonstrate brain-like
behavior has generated considerable interest. Such applications appear to
be philosophical and subject to considerable controversy. The primary
concern of the present study is the usefulness of the Adaline neuron
approach in implementing the adaptive voting elements of a redundant
system.
The chart of Figure 1 shows how adaptive voters may extend the reliability
of a conventional redundant system, allowing a system using 9 replicas
to outperform a conventional system using 35 replicas of each function.
The Adaline neuron has received considerable quantitative study in
application to pattern recognition. When modified as shown in Figure 2,
and applied as an adaptive voter, the training rules become quite simple
since the desired output is determined by a voting of the weighted inputs.
Initially, all weights (gains) are made equal. The decision element will
then provide an output in accordance with the states of the majority of
binary, replicated input signals. If input errors are independent and
random the adaptive voter, by progressively adjusting its weights to assign
high weights to reliable inputs and low weights to failed or unreliable in-
puts, may derive correct information from a small minority of correct inputs.
Figure 1 Comparison of Adaptive and Majority Voting Techniques
(curves: 35-input majority voter; 9-input adaptive voter, 2 working inputs required)
Figure 2 Adaptive Voter
In this manner the effect of errors caused by input failures may be negated,
allowing a correct decision to be made under a high probability of input
signal failure. The simple, fixed majority voter will make output decision
errors when more than half of the inputs fail or are in error. The adaptive
voter, by masking out input errors as they occur, may tolerate failures until
only two correct inputs out of the original group are present.
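The contrast between the two voters can be put in rough numerical terms. The following Python sketch is an illustration only and is not part of the original analysis. It assumes that each input works independently with the same probability p, that adaption is ideal, and that the adaptive voter is correct whenever at least two inputs are working. All function names are hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent inputs are working,
    each working with probability p (binomial tail sum)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def majority_voter_reliability(n: int, p: float) -> float:
    """Fixed majority voter: correct while more than half of the n inputs work."""
    return prob_at_least(n // 2 + 1, n, p)


def adaptive_voter_reliability(n: int, p: float, min_working: int = 2) -> float:
    """Idealized adaptive voter (assumption): correct while at least
    min_working inputs are still working."""
    return prob_at_least(min_working, n, p)
```

Under these assumptions, `adaptive_voter_reliability(9, 0.4)` is about 0.93 and exceeds `majority_voter_reliability(35, 0.4)`. The comparison depends on the value of p assumed, and the model ignores the time the voter needs to adapt.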
In order to provide automatic adaption it is necessary to continuously
compare the output decision with each binary input and to incrementally
increase or decrease each input weight according to whether agreement or
disagreement exists. Assuming that input errors or failures occur randomly
and that the automatic adaptive process can negate an unreliable input
before other failures occur, the adaptive voter offers the possibility of
realizing system reliability of unprecedented excellence.
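The adaption rule just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the circuit of Figure 2: inputs are coded as +1 and -1, weights are clamped to the range 0 to 1, the step size is arbitrary, and all names are hypothetical.

```python
def adaptive_vote(inputs, weights):
    """Weighted vote over binary inputs coded as +1 / -1; returns +1 or -1.
    A zero total is resolved as +1 (an arbitrary tie-break)."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else -1


def adapt(inputs, weights, decision, step=0.1, w_min=0.0, w_max=1.0):
    """Raise the weight of each input that agrees with the decision and lower
    the weight of each input that disagrees, clamped to [w_min, w_max]."""
    return [
        min(w_max, w + step) if x == decision else max(w_min, w - step)
        for x, w in zip(inputs, weights)
    ]


def run_voter(input_history, step=0.1):
    """Process a sequence of input vectors, adapting after every decision."""
    weights = [1.0] * len(input_history[0])  # all weights equal initially
    decisions = []
    for inputs in input_history:
        decision = adaptive_vote(inputs, weights)
        decisions.append(decision)
        weights = adapt(inputs, weights, decision, step)
    return decisions, weights
```

As a usage example, take five inputs of which three fail one at a time, each failed input being modeled (an assumption) as delivering the complement of the correct signal. With a step of 0.25 and at least four bit times between failures, the voter keeps deciding correctly on the two remaining good inputs, whereas a fixed majority vote over the same five inputs would be wrong.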
Inherent in the basic design of an adaptive voter is the requirement for
a variable weighted device which performs integration and displays relatively
permanent memory. These special characteristics have stimulated considerable
effort toward the development of suitable adaptive components. Devices which
display variable weight with memory generally utilize phenomena involving atomic
translation or rotation. The following represents a survey of the more
promising techniques which have been suggested by researchers. The first three
devices described exploit electrochemical effects while the remaining devices
utilize magnetic domain phenomena.
1. Electrochemical Devices
a. Device 1
Device 1,³ an electrolytic device developed at Stanford University
by Widrow, is an electronically adjustable resistor whose rate of change of
resistance is controlled by application of d-c current to a third electrode.
It consists of a sealed plating cell containing an electrolytic bath, a
resistive substrate upon which metal is deposited, and a metal source
electrode. A typical configuration indicating the placement of electrodes and
electrolyte in a small plastic enclosure is shown in Figure 3. Two leads
are attached to the substrate, and the resistance between these leads can be
reversibly controlled by passing plating current into the third electrode.
The conductance of the device is changed by plating metal onto or stripping
metal from the substrate, so that the stored conductance follows the integral
of the plating current. Conductance is sensed nondestructively by applying a
low voltage a-c signal and measuring the resultant current flow.
The normal d-c drop between source and substrate is typically 0.2
volt at a plating current of 0.2 ma. The substrate resistance changes
from 30 ohms to 2 ohms in 10 seconds with this magnitude of plating current.
The a-c sensing voltage applied is usually 0.1 volt RMS. A typical
implementation of Device 1 with associated transformer coupled sensing and
d-c plating circuitry is shown in Figure 4.
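The quoted figures can be turned into a rough integrator model. The Python sketch below is illustrative only. It assumes that conductance is strictly linear in the plated charge, which the source does not state and which the nonlinear plating effects reported for the device would violate. The constant k and both function names are hypothetical.

```python
def plating_constant(r_start=30.0, r_end=2.0, current=0.2e-3, seconds=10.0):
    """Conductance change per coulomb (siemens per coulomb) implied by the
    quoted figures, ASSUMING conductance is linear in plated charge."""
    return (1.0 / r_end - 1.0 / r_start) / (current * seconds)


def resistance_after(r_start, current, seconds, k):
    """Substrate resistance after passing a constant plating current.
    Positive current plates metal (conductance rises); negative strips it."""
    conductance = 1.0 / r_start + k * current * seconds
    if conductance <= 0:
        raise ValueError("charge removed exceeds the plated metal in this model")
    return 1.0 / conductance
```

With the default figures, k comes to roughly 233 siemens per coulomb, and the model gives 3.75 ohms after 5 seconds at 0.2 ma. Reversing the current for 10 seconds returns the 2 ohm substrate to 30 ohms, which reflects the reversibility described in the text.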
Although Device 1 models are commercially available at a cost of approximately
$50 per cell, their application in a practical system is somewhat cumbersome.
Transformer coupled circuits are usually required in order to present a balanced
load to the plating current source and to provide the low voltage drop across the
substrate. The substrate resistance is usually less than 100 ohms, and the a-c
voltage drop must be kept below 3/4 volt in order to prevent the formation of gas
in the cell. Some difficulty has been reported in keeping the substrate material
free of dimensional imperfections, which in turn cause nonlinear plating effects.
Long term stability is apparently affected by chemical reactions taking place
between plating material and electrolyte. To date Device 1 models are available
in sample quantities, and it is difficult to predict ultimate large scale
production costs, repeatability and reliability.
Figure 3 Device 1 Cell
(labels: container filled with plating solution; signal leads)
Figure 4 Device 1 Integrator
b. Solion
The solion is a fluid-state device which functions by controlling
and monitoring a reversible electrochemical "redox" reaction. The term
redox refers to a chemical reaction in which oxidation and reduction occur
simultaneously. The redox system used in solions consists of two electrodes
immersed in an electrolyte containing both the oxidized and reduced species
of an ion. The system is completely reversible in that oxidation can occur
at either electrode while an equivalent amount of the same element is reduced
at the opposite electrode. Iodine is the reacting element most commonly used.
A simplified drawing of a solion tetrode and its output characteristics
is shown in Figure 5a. The tetrode has a platinum electrode at each end of a
glass tube and two perforated platinum electrodes separating the tube into
three compartments. The reservoir, containing the input electrode, is the
largest compartment. The integral compartment, containing the common
electrode, is made very small so that an equilibrium distribution of the iodine
may be quickly reached. The compartment between the shield and readout
electrodes serves to separate the two electrodes. The output characteristics of
a solion tetrode are similar to those of a vacuum tube pentode, and show a
transconductance of 40,000 micromhos at an output current of 500 microamperes.
A solion tetrode connected as an integrator is shown in Figure 5b.
By controlling the charge transferred between the two input electrodes,
a change in conductivity proportional to the integral of the input current
may be obtained between the output electrodes. In this manner the device
may be utilized as an integrator, providing an output current proportional
to the integral of the input current. Because of the concentration
potential, the input impedance of the solion tetrode is on the order of 1000
ohms, and therefore a relatively high impedance signal source is required
in order to avoid integration errors. At constant temperature, solions are
reported to be stable to within 1% over a period of several days.
Figure 5a Solion Tetrode and Output Characteristics
Figure 5b Solion Tetrode Connected as an Integrator
(circuit symbol electrodes: I input, S shield, R readout; input signal from a current source)
A practical problem in the use of solion tetrodes arises from the
requirement of providing an isolated battery potential between input and
shield electrodes to prevent iodine diffusion between the reservoir and
integral compartments. The primary application demonstrated for the solion
tetrode to date has been as a low level d-c amplifier with a time constant of
20 seconds. Because of the inherent practical problems of precision design,
isolated supply voltages and discharging effects of parallel outputs,
the solion appears to offer little promise as a practical adaptive component.
c. Mercury Cell
Another novel approach to variable gain with memory is the
mercury cell integrator, an electrochemical device which provides
visual and electrical readout of the integral of an applied current. The
integrating element consists of a capillary tube filled with two columns
(electrodes) of mercury separated by a gap of aqueous electrolyte of metallic
salt. Two different methods have been used to provide electrical readout.
The first method, called capacitive readout, is shown functionally in
Figure 6. The d-c input signal electroplates mercury across the gap at a
rate which is a direct function of the input signal amplitude, thus causing
the gap or bubble of electrolyte to move. The outside of the capillary is
covered by a vapor-deposited conductive sheath. The mercury electrodes and
sheath, separated by a thin glass wall, provide a capacitance of approximately
20 pF. In application, an a-c signal is connected across the electrodes and
superimposed on the d-c input signal. The a-c signal will divide in accordance
with the capacitance existing between the upper mercury column and
sheath, and the capacitance between sheath and lower grounded column of
mercury. The excitation signal thus provides a signal at the sheath which is
a direct function of the length of the ungrounded electrode. An auxiliary
amplifier and detector in turn provide a proportional d-c signal of proper
level to operate other related devices.
Figure 6 Mercury Cell Integrator (Capacitive Readout)
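The capacitive readout amounts to a capacitive voltage divider. The Python sketch below is a simplified illustration. It assumes that each column-to-sheath capacitance is proportional to the length of mercury under the sheath, that the electrolyte gap is negligibly short, and that the amplifier does not load the sheath. The function and parameter names are hypothetical.

```python
def sheath_signal(upper_len, lower_len, v_excitation=1.0, c_per_unit_len=1.0):
    """A-c voltage at the sheath for the divider formed by the upper (driven)
    mercury column and the lower (grounded) mercury column."""
    c_upper = c_per_unit_len * upper_len  # driven column to sheath
    c_lower = c_per_unit_len * lower_len  # sheath to grounded column
    # Two capacitors in series: the sheath is the midpoint, so the fraction
    # of the excitation appearing there is c_upper / (c_upper + c_lower).
    return v_excitation * c_upper / (c_upper + c_lower)
```

Because the total mercury length is fixed in this model, the sheath signal varies linearly with the length of the ungrounded column: equal columns give half the excitation, and a column three times as long as the grounded one gives three quarters.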
The device provides reversible integration, relatively stable
memory, direct visual readout and a linearity better than 0.1 percent.
Input control current is limited to +5 ma d-c. The integration time from
minimum to maximum output signal is approximately 100 minutes at maximum
control current. This time is ultimately limited by the maximum voltage
which may be dropped across the electrolyte without causing the formation
of gas.
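As a rough worked figure, assuming the control current is held constant at its 5 ma limit for the full 100 minutes, the charge passed through the cell over one full-scale integration is as follows. The symbol Q is introduced here for illustration and does not appear in the source.

$$ Q = I\,t = (5 \times 10^{-3}\ \mathrm{A})\,(100 \times 60\ \mathrm{s}) = 30\ \mathrm{C} $$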
A typical capacitive readout integrator now commercially available
occupies approximately 0.5 cu. in., but prices range around $130 per unit.
Although displaying excellent stability and predictable operation, such devices
will require considerable price reduction before application becomes practical.
The integration time, although relatively long, may not present a serious
limitation for systems which display slow adaptive behavior, as would be the
case in adaptive voting elements.
Another technique for sensing the position of the bubble utilizes
a light source and a photo-conductor whose resistance is inversely
proportional to the amount of light passed by the transparent electrolyte. As
the bubble moves out of line with the light source and photo-conductor
target area, the light becomes progressively blocked by the mercury columns,
causing the photo-conductor resistance to increase. This technique allows
faster integration because the bubble need only be displaced by its own
height to effect a change from maximum to minimum light intensity at the
photo-conductor. A typical photoelectric integrator commercially available
occupies 1 cu. in. and requires 300 milliwatts to power an integral
incandescent lamp. Output resistance varies over the range from 25K ohms to
350K ohms. Quantity prices are expected to fall below $15 per unit, thus
providing a reasonably inexpensive adaptive component. The use of an
incandescent lamp for the light source, however, imposes a serious life and
reliability problem. The use of a more reliable light source and a substantial
size reduction will be necessary before application becomes practical.
2. Magnetic Devices
Various techniques have been suggested for providing variable gain and
non-destructive readout with magnetic devices. The phenomenon utilized in
such devices is the ability of magnetic materials to store a
remanent flux which is sensed in a non-destructive manner. Suggested devices
provide the capability for a partial switching of magnetic domains
under a volt-second impulse as the basic incrementing source. Suitable
magnetic materials include ferrites and tape wound cores which are
characterized by a square hysteresis curve. Most of the devices to be described
utilize the same basic type of incrementing technique and differ primarily
in the manner by which the stored flux is sensed.
a. MAD Integrator
A diagram of a typical multi-aperture device⁷ is shown in Figure 7.
In this device flux can be switched around the minor aperture by means of an
a-c drive winding without disturbing the flux linking and stored around the
main aperture. Initially the flux around the main aperture is set to cause
saturation in either a clockwise or counterclockwise direction. A momentary
reversal of the magnetizing force driving the main aperture will cause a
partial reversal of the flux. The amount of flux reversal is determined by
the magnitude and duration of the drive and the value of the hold current.
The purpose of the hold winding is to retain a portion of the core saturated
in the original direction of magnetization and thereby assure partial
switching of the flux. The amount of flux alternately switched around
the small aperture is then proportional to the flux which has been switched
around the main aperture. The output voltage will consist of a signal
whose voltage integral is proportional to the amount of flux trapped in
the common area between the two flux paths. Several cycles of carrier
drive may be required before this condition stabilizes. Care must be
taken to limit the carrier drive to values less than the magnetizing force
required to disturb the remanent flux around the main aperture.
Incrementing of the remanent flux is usually implemented by means of a
smaller core of like magnetic material. The smaller core provides the
appropriate amount of volt-second drive to increment the storage core in
equal steps at various settings of remanent flux. Brain⁸ has indicated that
it is essential that incrementing always occur at a constant reference phase
with respect to the carrier drive unless carrier drive is removed. If this
is not done the size of the incremental flux change will depend on the
vector sum of the switching and carrier signals. A typical scheme for
realizing integrator operation is shown in Figure 8.
Figure 7 Multiple Aperture Device (MAD)
(windings shown: sense, adapt, output, hold)
The physical requirement of providing a number of hand wound turns
about the various apertures dictates to a large extent the cost of the de-
vice. Large driving currents, a moderate amount of timing during incre-
menting and relatively low output signal amplitude necessitate peripheral
circuitry of considerable complexity. The resultant degradation in the
basic reliability of the approach then becomes an imposing problem.
Figure 8 MAD Integrator
(labels: set level, saturation, bias hold at 225 ma, read out, 75 kc/s drive at 0.2 amp)
b. Orthogonal Core Integrator
The magnitude and direction of a stored flux may be sensed⁹ by applying
a magnetic field orthogonally to the direction of stored flux. This
causes the remanent flux vector to rotate, generating a voltage proportional
to its rate of change and hence its magnitude. The application of a read
or sensing field at right angles to the stored or written flux minimizes the
interaction of the sense drive with the stored flux magnetic path. At the
termination of the read drive the flux vector returns to its original
preferred orientation by virtue of domain elasticity. A typical orthogonal
core configuration is shown in Figure 9. The flux level stored in the core
is altered by pulsing the output winding in a manner similar to the
incrementing techniques previously discussed. The output signal consists of
either positive or negative pulses, depending upon the direction of the stored
flux, with an amplitude proportional to the magnitude of the remanent flux.
Practical problems similar to those associated with the multiaperture
device previously discussed again make physical implementation cumbersome.
Figure 9 Orthogonal Core
(labels: sense and adapt winding, drive winding, ferrite core, iron flux return)
c. Second Harmonic Integrator¹⁰
Nondestructive readout of remanent flux may be obtained by reducing
the sensing drive to a value insufficient to cause irreversible switching.
Since magnetic cores are generally non-linear, the output voltage will
contain harmonics of the drive current. In particular, the even harmonic
voltage for certain core materials is found to be proportional to the net
remanent flux level. The second-harmonic generator shown in Figure 10
consists of a pair of tape wound cores driven from an r-f sinusoidal
power source. The output winding is arranged so that the fundamental
component of drive voltage cancels out, leaving a second harmonic distortion
voltage proportional to the remanent flux in the cores.
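The cancellation of the fundamental and the dependence of the second harmonic on stored flux can be illustrated numerically. The Python sketch below is a toy model and not the circuit of Figure 10. It assumes a static tanh nonlinearity as a stand-in for each core, represents the stored flux by a small bias added to the drive, and takes the output as the sum of two identical cores driven in opposite sense. All names are hypothetical.

```python
import cmath
import math


def harmonic_amplitude(samples, k):
    """Amplitude of the k-th harmonic of one period of uniformly spaced samples."""
    n = len(samples)
    acc = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
    return 2.0 * abs(acc) / n


def core_pair_output(bias, drive=1.0, n=256):
    """Summed response of two identical cores driven in opposite sense.
    tanh() is a stand-in for the core nonlinearity (an assumption);
    bias stands for the stored (remanent) flux level."""
    out = []
    for i in range(n):
        s = drive * math.sin(2.0 * math.pi * i / n)
        out.append(math.tanh(bias + s) + math.tanh(bias - s))
    return out
```

In this model the fundamental cancels for any bias, and for small bias the second harmonic amplitude doubles when the bias doubles, for example when comparing `harmonic_amplitude(core_pair_output(0.02), 2)` with `harmonic_amplitude(core_pair_output(0.01), 2)`. A real core pair is never perfectly matched, which is why the text notes that the cancellation is difficult to achieve in practice.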
By passing a direct current through the output winding the remanent
flux level may be altered. Due to an interaction between the d-c adapt
current and the r-f drive, the rate of change of the remanent flux with
respect to the adapt current is constant and reversible. Tape-wound cores
have been found to provide the best performance and, because of their higher
permeability, require fewer turns. The associated driving, sensing and
timing circuitry tends to be rather elaborate, however. The cancellation of
the fundamental driving frequency is difficult to achieve in practice, so
that the desired output signal appears against a background of noise. This
low level signal must in turn be amplified in order to provide a signal
compatible with the associated solid state circuitry which it must ultimately
control. A separately switched driving source for each pair of
cores is required in order to provide the individual binary signal inputs
whose weights are to be altered. Since the sinusoidal drive currents tend
to be on the order of 10 to 100 or more milliamperes, the driving and
peripheral circuitry is necessarily elaborate.
Figure 10 Second Harmonic Integrator
(labels: RF drive, adapt current)
d. Magnetostrictive Integrator
The direction and magnitude of the net remanent flux in a magnetostrictive
core may be sensed if the core is excited mechanically.¹¹ Figure 11
shows a simplified scheme for implementing a magnetostrictive storage
system using an ultrasonic delay line to excite several magnetostrictive
toroids. The driving source for the sonic delay line is a piezoelectric
transducer. Input to each of the toroids is provided by means of narrow width
pulses through a separate write coil wound concentrically with the read
coil. If the frequency and rms amplitude of the stress wave are maintained
at constant value, the open circuit output of the read coil is approximately
proportional to the flux stored in the individual toroids. Although
this effect has been demonstrated experimentally by Nagy¹¹ and others, the
basic peculiarities of magnetic domain behavior, especially under the
influence of mechanical excitation, are only crudely understood.
The experimental systems fabricated to date are rather large owing
to the structural requirements of acoustical devices and the associated
electronic circuitry necessary to provide proper timing, current driving
and voltage amplification. At best, considerable experimental work is
necessary to show that magnetostrictive storage offers any real advantage
over more conventional electro-magnetic approaches. Indeed, the sensing
of remanent flux by acoustical means rather than by non-destructive
electrical drive appears to inject an unwarranted interface complexity.
Figure 11 Magnetostrictive Integrator
(labels: sonic delay line, magnetostrictive toroids)
3. Conclusion
As a result of the foregoing survey it became apparent that none of the
suggested adaptive devices were sufficiently developed to justify the selec-
tion of a practical approach for immediate circuit implementation of an
adaptive voter. An explicit evaluation was not attempted owing to the
superficial treatment of the various devices by academic researchers.
The magnetic devices, with their known sensitivity to temperature stress,
appear to offer the least hope for providing analog memory with long term
stability. The requirement for providing carefully controlled incrementing
with relatively large drive currents, coupled with the small output signals
and associated amplification, appears to dictate an imposing amount of
peripheral circuitry. The degradation in reliability as a result of this
complexity represents a liability which makes practical application doubtful
for redundant systems.
References
1) B. Widrow and M. E. Hoff, "Adaptive Switching Circuits," Technical
Report No. 1553-1, Stanford Electronics Laboratories, June 1960.
2) B. Widrow, "Adaptive Sampled-Data Systems - A Statistical Theory
of Adaption," 1959 WESCON Convention Record, part 4.
3) B. Widrow, "An Adaptive 'Adaline' Neuron Using Chemical Memistors,"
Technical Report No. 1553-2, Stanford Electronics Laboratories,
October 1960.
4) "An Introduction to Solions," Texas Research and Electronic Corp.,
Dallas, June 1961.
5) "D-C Amplifier Uses Fluid-State Tetrode," Electronic Products
Magazine, October 1962.
6) "Capacitive Readout Integrator," Technical Brochure, Curtis
Instruments, Inc., Mount Kisco, New York.
7) J. A. Rajchman and A. W. Lo, "The Transfluxor," Proceedings of the
I.R.E., March 1956.
8) A. E. Brain, "The Simulation of Neural Elements by Electrical
Networks Based on Multi-Aperture Magnetic Cores," Proceedings of the
I.R.E., January 1961.
9) J. K. Hawkins and C. J. Munsey, "A Magnetic Integrator for the
Perceptron Program," Annual Summary Report, Publication No. U-603,
Aeronutronics, Newport Beach, Cal., July 30, 1960.
10) H. S. Crafts, "A Magnetic Variable Gain Component for Adaptive
Networks," SEL-62-147, Technical Report 1851-2, Stanford Electronics
Laboratories, December 1962.
11) G. Nagy, "Analogue Memory Mechanisms for Neural Nets," Cognitive
Systems Research Program, Contract No. NONR 401(40), Report No. 3,
Cornell University, Ithaca, New York, August 31, 1962.
Appendix 4
TRANSOR ANALYSIS
by
R. S. Bray
P. A. Jensen
C. G. Masters
September 1963
TABLE OF CONTENTS
I. INTRODUCTION
II. RESTORING CIRCUIT MODELS
   A. The Transor Decision Function
   B. The Threshold Decision Function
III. FAILURE MODES
   A. Transor Restoring Circuit Vulnerability
   B. Threshold Restoring Circuit Vulnerability
IV. RELIABILITY ANALYSIS
   A. Transor Reliability Defined
   B. Output Modes Defined
   C. Upper Bound on Transor Reliability
   D. Transor Reliability for Strictly Asymmetric Failure Modes
   E. Transor Reliability for Mutually Exclusive Output Failure Modes
   F. Transor Reliability for Symmetrical Environment
V. CONCLUSION
BIBLIOGRAPHY
I. INTRODUCTION
In recent years many novel schemes have been proposed to improve digital system
reliability through the use of "redundant" equipment. Several of these, patterned after
a concept of Von Neumann,¹ require a "restoring organ," "restorer" or "voter" to be placed
after each set of redundant signal processors which perform a particular subsystem
function. A restoring organ receives an input from each member of the associated set of
processors. From these nominally identical input signals, the restoring organ produces
an estimate of the correct subsystem output based on one or more specified decision
criteria. It should be noted that the restorer does not perform any data processing
function but acts as an error correcting transmission channel connecting two signal
processors.
It has been shown in the literature² that the theoretically most efficient restoring
organ is one that is capable of adapting itself to changes in the reliability of its inputs.
Specifically, for threshold type organs it has been shown that the optimum use of n
unreliable versions of the same signal could be achieved by dynamically weighting each
input in accordance with its relative reliability. Inputs which have a past history of
being more reliable are given the heavier vote weights, and the unreliable inputs the
lighter vote weights. The ideal restoring organ would sense the unreliable inputs and
decide on the optimal vote weights. By efficiently tailoring the restoring organ to its
ever-changing environment, significant improvement could be achieved over the presently
popular majority restoring circuits.
In studying adaptive restoring organs, Company A has shown³ that circuit
implementation of adaptive restoring organs for the specific requirements of redundant
spaceborne systems is not yet practical. The complex circuitry required under the present
"state of the art" to perform the adaptive function results in machines too cumbersome and
unreliable to compete with less sophisticated redundant systems. This does not mean,
though, that the present restoring organs used in redundancy techniques are adequate and
cannot be improved upon.
The purpose of this study is to investigate a new restoring organ proposed by
Company A, called the Transor.⁴ A characteristic of many failed subsystems is their
tendency to have steady-state outputs as their dominant failure mode. In Transor,
steady-state outputs are automatically deweighted by detecting only changes in states
rather than the absolute states themselves. In an environment where the probability of
steady-state failure is relatively high, a restoring organ which ignores its steady-state
inputs can derive a correct output with less than a majority of working inputs.
¹,²,³,⁴ See Bibliography.
The salient characteristics of the Transor restoring organ are best shown by contrasting
them to the corresponding characteristics of a majority restoring organ. The majority
organ was chosen as a reference base because of its similarity in function to the Transor
and because it is presently the most widely used restoring organ.
II. RESTORING CIRCUIT MODELS
A. THE TRANSOR DECISION FUNCTION
To be consistent with the terminology adopted by one group of investigators, the
term "restoring circuit" will be used to denote one functional unit of a restoring organ or
restorer. A very general block diagram of a Transor restoring circuit having binary inputs
(x1, x2, ..., xR) and an output z is shown in figure T-1.
Figure T-1. Transor Restoring Circuit
(blocks: sum, change detector, output memory; output z)
Some of the salient characteristics of a Transor Restoring circuit are noted below:
1) It has memory
2) It operates only on the number of changes in the states of
individual inputs between two adjacent bit times, (t - 1) and
(t).
3) It is a binary voting element with a binary output.
4) It has two thresholds, not necessarily of the same magnitude,
which combine with the states of the input at (t - 1) and ( t )
to determine the element output.
The functional relationship describing the Transor decision function can be stated as
follows:
$$ z(t) = f\left[\, z(t-1);\ (x_1, x_2, \ldots, x_R)(t);\ (x_1, x_2, \ldots, x_R)(t-1);\ T_1,\ T_0 \,\right] \qquad (1) $$
The binary Ones appearing on the inputs during each bit time are summed and the total
is compared with the number present during the previous bit time. If the change is positive
and greater than a given threshold T1, the output z is forced to a binary One. If the
change is negative and greater in magnitude than a second threshold T0, the output is
forced to a binary Zero. If neither threshold is exceeded, the output does not change from
its previous state. This operation may be summarized by the following decision rule
statements.
$$ \sum_{i=1}^{R} x_i(t) - \sum_{i=1}^{R} x_i(t-1) > T_1 \;\Rightarrow\; z(t) = 1 \qquad (2) $$
$$ \sum_{i=1}^{R} x_i(t) - \sum_{i=1}^{R} x_i(t-1) < -T_0 \;\Rightarrow\; z(t) = 0 \qquad (3) $$
$$ -T_0 \le \sum_{i=1}^{R} x_i(t) - \sum_{i=1}^{R} x_i(t-1) \le T_1 \;\Rightarrow\; z(t) = z(t-1) \qquad (4) $$
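The Transor decision rule described above can be sketched in Python as follows. This is an illustrative sketch and not the circuit of figure T-1. It follows the prose statement of the rule: the output changes only when a threshold is strictly exceeded. The function names and the threshold values used in the example are assumptions.

```python
def transor_step(prev_inputs, inputs, prev_output, t1, t0):
    """One bit time of the Transor decision rule.
    prev_inputs and inputs are sequences of binary values (0 or 1)."""
    change = sum(inputs) - sum(prev_inputs)
    if change > t1:       # enough inputs rose: force a One
        return 1
    if change < -t0:      # enough inputs fell: force a Zero
        return 0
    return prev_output    # neither threshold exceeded: hold the output


def transor_run(input_history, t1, t0, initial_output=0):
    """Apply the rule over a sequence of input vectors; the first vector only
    supplies the reference state for the first change measurement."""
    outputs = []
    prev_inputs = input_history[0]
    z = initial_output
    for inputs in input_history[1:]:
        z = transor_step(prev_inputs, inputs, z, t1, t0)
        outputs.append(z)
        prev_inputs = inputs
    return outputs
```

As a usage example, consider three inputs of which two are stuck at One and one works correctly. With both thresholds set to 0.5 (an assumed value), the stuck inputs contribute no change and the output follows the single working input, whereas a majority vote over the same inputs would remain at One.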
B. THE THRESHOLD DECISION FUNCTION
The threshold model* consists of a black box having a certain number of binary inputs
(x1, x2, ..., xR) and an output z. At any bit time (t) the state of the output line z is a
function of the state of the input lines and the threshold T. A general relationship similar
to equation (1), but describing the threshold decision function, may be delineated by the
following expression.
$$ z(t) = f\left[\, (x_1, x_2, \ldots, x_R)(t);\ T \,\right] \qquad (5) $$
If the output z can assume either a Zero or One state, the threshold restoring circuit
makes a decision to force its output to the One state under the following decision rule:
$$ \sum_{i=1}^{R} x_i(t) \ge T \;\Rightarrow\; z(t) = 1 \qquad (6) $$
and to the Zero state when
$$ \sum_{i=1}^{R} x_i(t) < T \;\Rightarrow\; z(t) = 0 \qquad (7) $$
* The majority gate is a threshold model with T = (R+1)/2, where R is the number of inputs.
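For comparison with the Transor, the threshold decision function can be sketched in Python. The sketch takes the One condition as "greater than or equal to T", which is the reading under which T = (R+1)/2 yields a majority gate for an odd number of inputs. The function names are hypothetical.

```python
def threshold_vote(inputs, threshold):
    """Memoryless threshold decision: One if the number of Ones on the
    inputs reaches the threshold, Zero otherwise."""
    return 1 if sum(inputs) >= threshold else 0


def majority_vote(inputs):
    """Majority gate as the special case T = (R + 1) / 2."""
    return threshold_vote(inputs, (len(inputs) + 1) / 2)
```

Unlike the Transor, this function depends only on the present input states, so an input stuck at One always contributes to the sum.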
III. FAILURE MODES
A. TRANSOR RESTORING CIRCUIT VULNERABILITY
Before the reliability of any Transor network can be expressed in a meaningful
mathematical form, the failure modes of the individual subsystems appearing at the
Transor's inputs must be explicitly stated.
A characteristic of Transor is its ability to differentiate between transitional and
steady-state failures. This property creates failure modes different from those of
threshold decision. Specifically, a signal processor is assumed either to be working
correctly or to have failed into one of the following modes:
1. The transitional mode, in which extra Ones and/or extra Zeros
appear at the output, and
2. The steady-state mode, in which the output permanently remains
in a single state.
A transition (figure T-2) is defined as the rise or fall of a pulse during its switching
time. The restoring circuit executes a decision by vector summing the change
in input pulses on the R redundant lines during the vote interval and a decision is made
according to the decision rules (2) through (4). The term "extra One" implies a One has
appeared on a signal processor's output when it should have been a Zero. By going to the
wrong state a signal processor creates a wrong transition which is voted by the Transor.
Wrong transitions can occur through diode failures in the gating section of diode-transistor
type signal processors. These failures sporadically generate "extra" Ones or "extra"
Zeros as a function of the information at the gate's inputs.

Figure T-2. Transition Intervals

To illustrate, consider a three input Transor voting on the output of a network of
redundant AND gates. The state of the binary inputs may be represented by the state
vector S_i(t) below.

$$ S_i(t) = \begin{bmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \end{bmatrix} $$
In figure T-3 a diode is assumed to have opened in branch (1) of two of the gates causing
those branches to appear as Ones. An erroneous One will appear at the gate's output
whenever a correct Zero appears on those inputs and correct Ones appear on the remainder
of the inputs. However, if all the input diodes open or an output element fails, the gating
function will be destroyed, and the output will assume a steady-state. A method for
determining the probability that a signal processor will fail into either of these two modes is
discussed in Appendix I.
Figure T-3. Generation of Wrong Transitions in Redundant AND Gates
Because transitions are vector quantities their occurrence in the wrong direction
may threaten Transor performance in three ways:
1. Wrong transitions cancelling correct transitions.
2. Wrong transitions occurring while the correct inputs remain in
the same state (a series of Ones or Zeros). During this time
the correct inputs have lost their voting power.
3. Wrong transitions temporarily simulating steady-state failures.
Wrong transitions produced by "extra Ones" and/or "extra Zeros" over a sequence of bit
times can result in "error correlation" and create a variety of failure modes, subject to
the nominally correct input states to the Transor for the considered sequence.
Figure T-4 shows this more clearly when state vectors are used to represent the inputs
to a five input Transor. Inputs x1 and x2 are assumed to have failed and to be capable of
randomly producing wrong transitions in either direction, i.e., extra Ones or Zeros.
No inputs are assumed failed to a steady-state. For definiteness all inputs at time (t)
may be assumed correct. In the following bit times (proceeding to the right) several
failure patterns are possible for each nominally correct input state. At (t+1) the states
(2), (3), (4), and (5) are considered among the possible states (four other possible states
including (1) have been omitted as repetitious). Observe that sequence (1) → (2)
is the most damaging because only the wrong transitions have any voting power. For a
threshold set as low as two this would result in a wrong decision. The sequence (1) → (5)
represents a possibility in which both erroneous inputs have temporarily "stuck" in one
state simulating a temporary steady-state. The sequences (1) → (3) and (1) → (4) are
the most likely possibilities in which one of the failed inputs is temporarily correct. In
the next bit time (t+2) transitions to the possible states (3), (4), (5), (6) and (7) are
considered (again repetitions are omitted). Shown here are the cancellation effects
caused by the introduction of errors on the previous bit time, demonstrating the "error
correlation" inherent in Transor. The sequence (2) → (5) is the most damaging because
any threshold greater than one would have resulted in a wrong decision. Observe the
tradeoff conflict created by the necessity for setting the threshold at a value greater than
two in the sequence (1) → (2) and the same threshold at a value less than two in the
sequence (2) → (5) in the following bit time. Clearly there must exist an optimum
threshold. Inclusion in figure T-4 of transitions from states (4) and (5) would have
produced no new failure modes since they are but the duals of (2) and (3).
Figure T-4. Possible Sequences of Input States for a Five Input
Transor Over Two Bit Times
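The threshold conflict described for the sequences (1) → (2) and (2) → (5) can be
illustrated numerically. The sketch below is not a reproduction of figure T-4: the three
input vectors are assumed values chosen only to match the verbal description (x1 and x2
failed, all inputs correct at time t), and the function names are hypothetical. A
threshold is again treated as met when the change equals it.

```python
def transor_output(prev_inputs, curr_inputs, prev_output, t1, t0):
    """One Transor decision: vote on the change in the number of Ones."""
    change = sum(curr_inputs) - sum(prev_inputs)
    if change >= t1:
        return 1
    if change <= -t0:
        return 0
    return prev_output


# Assumed input states at t, t+1 and t+2; the first two entries are x1 and x2.
STATES = [
    [0, 0, 0, 0, 0],  # t:   all inputs correct (Zero)
    [1, 1, 0, 0, 0],  # t+1: correct value still Zero, x1 and x2 emit extra Ones
    [0, 0, 1, 1, 1],  # t+2: correct value is One, x1 and x2 move the wrong way
]
CORRECT = [0, 1]      # correct outputs at t+1 and t+2


def run(threshold):
    """Return the Transor outputs at t+1 and t+2 for T1 = T0 = threshold."""
    z = 0
    outputs = []
    for prev, curr in zip(STATES, STATES[1:]):
        z = transor_output(prev, curr, z, threshold, threshold)
        outputs.append(z)
    return outputs


for threshold in (1, 2, 3):
    print("T =", threshold, "outputs:", run(threshold), "correct:", CORRECT)
```

Under these assumed vectors a threshold of one or two yields a wrong One at t+1, while a
threshold of three holds the output at Zero through t+2, where a One is correct. No single
setting is right at both bit times, which is the tradeoff the text points to.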
B. THRESHOLD RESTORING CIRCUIT VULNERABILITY
A threshold restoring circuit makes a decision at time (t) by summing the number of
binary Ones appearing momentarily at its inputs. The decision is independent of the input
state at time (t - 1). By virtue of decision rule (6) if the number of errors appearing on
the restorer's inputs is greater than the threshold T the restorer makes the wrong output
decision. As opposed to Transor, the threshold device cannot differentiate between pure
wrong transitions and steady-state failures so that both failure modes may be lumped
together. To illustrate, consider a three-input threshold restoring circuit whose threshold
is set at two (T = 2). For definiteness assume that x1 and x2 at time (t) are in error and in
the same state and x3 is correct as indicated below.
$$ \left( x_1(t),\ x_2(t),\ x_3(t) \right) = \left( \bar{x},\ \bar{x},\ x \right) \;\Rightarrow\; z(t) = \bar{x} $$

Under this condition a wrong decision will be made. This may be considered a "worst case"
failure mode because the alternate situation is possible where x1 and x2 have failed into
opposite steady states.

$$ \left( x_1,\ x_2,\ x_3 \right) = \left( 1,\ 0,\ x_3 \right) \;\Rightarrow\; z = x_3 $$

In this case the errors nullify each other and the restoring circuit's output will follow
the single correct input (x3). In most reliability analyses the "worst case" is assumed, and
any two failures in a set of restoring circuit inputs are assumed to cause system failure.
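The two three-input cases above can be checked with a short sketch of the threshold
decision rules (6) and (7). The function names are illustrative, and the majority setting
T = (R+1)/2 is taken from the footnote to the threshold model.

```python
def threshold_output(inputs, threshold):
    """Force a One when at least `threshold` inputs are One, otherwise a Zero."""
    return 1 if sum(inputs) >= threshold else 0


def majority_threshold(r):
    """Majority threshold T = (R + 1) / 2 for an odd number of inputs R."""
    return (r + 1) // 2


T = majority_threshold(3)  # T = 2 for three inputs
correct = 1                # assumed correct state carried by x3

# x1 and x2 both in error and in the same state: the wrong value wins the vote.
print(threshold_output([0, 0, correct], T))
# x1 and x2 failed into opposite states: the errors cancel and z follows x3.
print(threshold_output([1, 0, correct], T))
```

The second case follows x3 whichever state x3 takes, since the opposed failures always
contribute exactly one One to the sum.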
IV. RELIABILITY ANALYSIS
A. RELIABILITY DEFINED
In keeping with the usual concept of reliability, the reliability of a Transor restoring
circuit will be defined as the probability that it never makes a wrong decision during its
mission time. For analysis purposes the Transor itself is assumed perfectly reliable,
i.e., a wrong decision is never made through component failure within the Transor itself.
In part III it was shown that errors appearing on the Transor inputs in a particular bit
time could be correlated with errors that appeared on adjacent bit times to produce unique
failure modes. Two of these were:
(1) Cancellation effects
(2) Simulated steady-state
In the following discussion it will be shown how these failure modes may be "built in"
to reliability models by using multinomial expansions. Analytical models formulated in this
manner may be easily compared with models for threshold reliability.
B. OUTPUT MODES DEFINED
Any output of a binary signal processor can be classified into one of six mutually
exclusive classes over the element's mission time. These are:
1) Correct
2) Continuous Zero state
3) Continuous One state
4) Extra Ones but no extra Zeros
5) Extra Zeros but no extra Ones
6) Both extra Ones and Zeros.
Moreover the output of a system composed of binary signal processors may be defined
by the six mutually exclusive classes above. Each of these classes will be assigned the
following probability measures in conformance with the Transor decision rules.
1) p; the probability that the output is correct.
2), 3) qs; the probability that the output is either a continuous Zero or a
continuous One.
4) q1; the probability that the output generates extra Ones, but not extra Zeros.
5) q0; the probability that the output generates extra Zeros, but not extra Ones.
6) q10; the probability that the output generates both extra Ones and Zeros randomly.
Note that the measure qs is the result of the union of classes (2) and (3). The transitional
probabilities q1, q0 and q10 are defined to represent the probabilities that a particular
set of components, whose failure will cause wrong transitions to be generated randomly,
will fail.
C. UPPER BOUND ON TRANSOR RELIABILITY
An upper bound on reliability is easily obtained by excluding all but steady-state failures
from the environment. If β is a random variable denoting the number of correct transitions
(or working inputs) and γ the number of inputs failed to a steady-state, a probability
density function may be defined over the sample space as

$$ \phi(\beta, \gamma) = \frac{R!}{\beta!\,\gamma!}\; p^{\beta}\, q_s^{\gamma} $$

Since Transor ignores steady-state failures the only criterion for a correct decision
is that

$$ \beta \ge T_0, \qquad \beta \ge T_1 $$

The corresponding limits on γ are

$$ \gamma \le R - T \tag{8} $$

where T = T0 = T1. The reliability is

$$ \sum_{\beta = T}^{R} \binom{R}{\beta}\, p^{\beta}\, (1 - p)^{R - \beta} \tag{9} $$

In an environment capable of producing only steady-state failures, the maximum
reliability and error correction capability is obtained by setting T = 1. This is the optimum
threshold. From equation (8) we see that Transor can correct at best R - 1 failures in
an order R redundant system.
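The upper bound can be evaluated directly as a binomial tail sum over the number of
working inputs. The sketch below assumes that form of equation (9); the function name and
the values R = 5 and p = 0.9 are hypothetical and serve only to show how the bound falls as
the threshold is raised.

```python
from math import comb


def transor_upper_bound(r, t, p):
    """Probability that at least t of r inputs are working (steady-state failures only)."""
    return sum(comb(r, b) * p**b * (1 - p) ** (r - b) for b in range(t, r + 1))


# Hypothetical order 5 redundancy with input reliability p = 0.9.
for t in range(1, 6):
    print("T =", t, "bound =", transor_upper_bound(5, t, 0.9))
```

With T = 1 the sum reduces to 1 - (1 - p)^R, the probability that at least one input still
works, which corresponds to the R - 1 correctable failures noted above.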
D. TRANSOR RELIABILITY FOR STRICTLY ASYMMETRIC FAILURE MODES
Excluding from the mutually exclusive ways an environment can fail class (6) and
either class (4) or (5) limits transitional failure modes to states (2), (3), (4) and (5) in
figure T-4. Of these the sequence (1) → (2) is the "worst case". For definiteness let it be
assumed that Transor inputs may produce only extra Zeros and steady-state failures. Let
α0 be a random variable denoting the number of wrong transitions to the Zero state.
The density function on this sample space is

$$ \phi(\beta, \gamma, \alpha_0) = \frac{R!}{\beta!\,\gamma!\,\alpha_0!}\; p^{\beta}\, q_s^{\gamma}\, q_0^{\alpha_0} $$

A wrong decision will be made unless

$$ \alpha_0 \le T_0 - 1 $$

Since it is necessary that

$$ \beta \ge T_0 $$

the limits on γ must be

$$ \gamma \le R - T_0 - \alpha_0 $$

The reliability is

$$ \sum_{\alpha_0 = 0}^{T_0 - 1}\ \sum_{\gamma = 0}^{R - T_0 - \alpha_0} \frac{R!}{(R - \alpha_0 - \gamma)!\,\gamma!\,\alpha_0!}\; p^{\,R - \alpha_0 - \gamma}\, q_s^{\gamma}\, q_0^{\alpha_0} \tag{10} $$
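A minimal numerical sketch of equation (10) follows, read as a double multinomial sum over
the number of wrong Zero transitions α0 and the number of steady-state failures γ. The
function name and the probabilities p = 0.90, qs = 0.07, q0 = 0.03 are hypothetical
illustration values, not data from this study.

```python
from math import factorial


def transor_asymmetric(r, t0, p, qs, q0):
    """Evaluate the multinomial sum of equation (10) for threshold t0."""
    total = 0.0
    for a0 in range(t0):                    # alpha_0 = 0 .. T0 - 1
        for g in range(r - t0 - a0 + 1):    # gamma = 0 .. R - T0 - alpha_0
            b = r - a0 - g                  # remaining inputs are working
            coef = factorial(r) // (factorial(b) * factorial(g) * factorial(a0))
            total += coef * p**b * qs**g * q0**a0
    return total


# Hypothetical order 5 redundancy; p + qs + q0 = 1 in this environment.
for t0 in range(1, 6):
    print("T0 =", t0, "reliability =", transor_asymmetric(5, t0, 0.90, 0.07, 0.03))
```

Setting q0 to zero removes every term except α0 = 0, and the sum collapses to the
steady-state-only upper bound of part C, which gives a quick consistency check.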
E. TRANSOR RELIABILITY FOR MUTUALLY EXCLUSIVE OUTPUT FAILURE MODES
The scope of the environment considered in part D can be broadened to include both
the mutually exclusive classes (4) and (5). Each input may be failed to either steady-state,
extra Ones or extra Zeros (but not both). The failure modes (figure T-5) may be represented
in a manner similar to figure T-4; inputs x1 and x2 are assumed failed in one of the four
mutually exclusive ways listed above.

Figure T-5. Possible Sequences for a Five-Input Transor with Mutually
Exclusive Output Failure Modes

The sample space may be described by the density function

$$ \phi(\beta, \gamma, \alpha_0, \alpha_1) = \frac{R!}{\beta!\,\gamma!\,\alpha_0!\,\alpha_1!}\; p^{\beta}\, q_s^{\gamma}\, q_0^{\alpha_0}\, q_1^{\alpha_1} $$

The sequence (1) → (2) in figure T-5 implies that a Transor will make a wrong
decision unless

$$ \alpha_0 \le T_0 - 1 \tag{11} $$

and its dual

$$ \alpha_1 \le T_1 - 1 \tag{12} $$

From the sequences (1) → (3) and (1) → (4) respectively

$$ \beta + \alpha_1 \ge T_1 \tag{13} $$

$$ \beta + \alpha_0 \ge T_0 \tag{14} $$

for a correct decision. However examination of the sequences (3) → (4) and (4) → (3)
shows that inequalities (13) and (14) do not represent "worst cases". "Error correlation"
between the bit times (t+1) and (t+2) has produced a temporary steady-state. A correct
decision will be made only if

$$ \beta \ge T_0 \tag{15} $$

$$ \beta \ge T_1 \tag{16} $$

From (15) and (16)

$$ \gamma \le (R - T_0) - \alpha_1 - \alpha_0 \tag{17} $$

$$ \gamma \le (R - T_1) - \alpha_1 - \alpha_0 \tag{18} $$

Of these last two inequalities the number of allowable steady-state failures, γ, will be
governed by the highest threshold, T0 or T1.
The reliability will take the form

$$ \sum_{\alpha_0 = 0}^{T_0 - 1}\ \sum_{\alpha_1 = 0}^{T_1 - 1}\ \sum_{\gamma = 0}^{R - T_0 - \alpha_0 - \alpha_1} \frac{R!}{(R - \alpha_0 - \alpha_1 - \gamma)!\,\alpha_0!\,\alpha_1!\,\gamma!}\; p^{\,R - \alpha_0 - \alpha_1 - \gamma}\, q_0^{\alpha_0}\, q_1^{\alpha_1}\, q_s^{\gamma} \tag{19} $$

where T0 is assumed > T1.
F. TRANSOR RELIABILITY FOR A SYMMETRICAL ENVIRONMENT
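A symmetrical environment utilizing Transor decision will be defined as the mutually
exclusive classes (1), (2), (3) and (6). Wrong transitions may occur in both directions and at
random. Therefore α0 = α1 = α and T0 = T1 = T. The density function on this sample
space may be written as

$$ \phi(\beta, \gamma, \alpha) = \frac{R!}{\beta!\,\gamma!\,\alpha!}\; p^{\beta}\, q_s^{\gamma}\, q_{10}^{\alpha} $$

From figure T-4 it can be seen that a wrong decision will be made unless

$$ \alpha \le T - 1 \tag{20} $$

and

$$ \beta - \alpha \ge T \tag{21} $$

From (21), since β = R - α - γ,

$$ \gamma \le R - T - 2\alpha \tag{22} $$

Therefore the reliability for the symmetrical environment is

$$ \sum_{\alpha = 0}^{T - 1}\ \sum_{\gamma = 0}^{R - T - 2\alpha} \frac{R!}{(R - \alpha - \gamma)!\,\alpha!\,\gamma!}\; p^{\,R - \alpha - \gamma}\, q_{10}^{\alpha}\, q_s^{\gamma} \tag{23} $$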
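The symmetrical model lends itself to a simple threshold scan. The sketch below evaluates
equation (23) with α running to T - 1 and γ running to R - T - 2α, the limit given by
inequality (22). The function name and the probabilities p = 0.90, qs = 0.07, q10 = 0.03
are assumed values for illustration only.

```python
from math import factorial


def transor_symmetric(r, t, p, qs, q10):
    """Evaluate the multinomial sum of equation (23) for threshold t."""
    total = 0.0
    for a in range(t):                        # alpha = 0 .. T - 1
        for g in range(r - t - 2 * a + 1):    # gamma = 0 .. R - T - 2*alpha, per (22)
            b = r - a - g                     # remaining inputs are working
            coef = factorial(r) // (factorial(b) * factorial(a) * factorial(g))
            total += coef * p**b * q10**a * qs**g
    return total


# Hypothetical order 5 redundancy; scan every threshold and report the best one.
results = {t: transor_symmetric(5, t, 0.90, 0.07, 0.03) for t in range(1, 6)}
for t, value in results.items():
    print("T =", t, "reliability =", value)
print("best threshold under these assumptions:", max(results, key=results.get))
```

A scan of this kind is one way to look for the optimum threshold when the output error
probabilities of the signal processors are known at the design stage.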
V. CONCLUSION
The dynamic characteristics of the Transor decision function make this type restoring
circuit unique to the present art. The mission of this part of the Failure Free Systems
Study has been to evaluate the potential usefulness of the Transor as a restoring circuit.
Primarily because it is most commonly used in present redundant equipment, the threshold
type restoring circuit has been chosen as the reference point for the evaluation. It has
been hypothesized that, if it can be shown that the Transor failure masking capability com-
pares favorably to that of the threshold restoring circuit, further development, including the
construction of a breadboard model, should be justified.
The results of section IV have shown that there are certain environments in which
Transor can be used to advantage in improving system reliability. For example, the
maximum error restoring capability of Transor is shown to be R-1 failures of R redundant
lines in an environment free from transitional failures. This is a significant improvement
over the majority threshold restoring capability under the same conditions. There is need
for caution, however, for in environments where symmetrical transitional errors are
possible, error correlation may make Transor performance inferior to threshold. From the
reliability models, a tradeoff may be determined in terms of the output error probabilities
of the environment.
The work done up to this point represents only a first step in Transor decision study.
Work yet to be done includes: (1) a general Transor reliability model incorporating all the
possible failure modes and (2) a decision rule for determining an optimum threshold.
In addition to continuing the analytical effort described in this report, a computer sim-
ulation program is being written to aid in the task (1) effort. This will be a relatively simple
but versatile program designed to accommodate any set of restricting assumptions including
those made in the four models derived in this report. The results of this report have shown
a solution to task (2) would be desirable because of the tradeoffs between different failure
modes. If the error probabilities of the signal processor outputs are known in the design
stage maximum reliability can be bought for zero additional cost by a judicious choice of
the thresholds.
VI. APPENDIX
Determination of the Reliability Parameters p, qs, q0, q1, q10 in a Signal
Processor.
In section IV it was shown that reliability models could be formulated in terms of the
output error probabilities of a set of redundant signal processors. This section describes
a method for determining these probabilities.
Consider a set X* which has for its members the n components of a signal processor.
Each member (component) has two possible states:
$x_i$; the i-th member is working.
$\bar{x}_i$; the i-th member has failed.
Let each component have a reliability

$$ P(x_i) = e^{-\lambda_i t} $$

and a probability of failure

$$ P(\bar{x}_i) = 1 - e^{-\lambda_i t} $$

The probability measure on the sample space of X may be partitioned into the canonical
form

$$ 1 = P(x_1 \cap x_2 \cap \cdots \cap x_n) + P(\bar{x}_1 \cap x_2 \cap \cdots \cap x_n) + P(x_1 \cap \bar{x}_2 \cap x_3 \cap \cdots \cap x_n) + \cdots + P(\bar{x}_1 \cap \bar{x}_2 \cap \cdots \cap \bar{x}_n) \tag{24} $$

Briefly, the method requires determining the correspondence between groups of the terms
in (24) and the individual terms in

$$ 1 = p + q_s + q_0 + q_1 + q_{10} \tag{25} $$

Obviously the parameter p, the probability that the signal processor output is correct, is

$$ p = P(x_1 \cap x_2 \cap \cdots \cap x_n) $$

The remaining $2^n - 1$ terms in (24) are mapped into the four remaining parameters in (25) by
partitioning the set X into subsets whose members are defined by those components whose
failure will result in one of the four mutually exclusive events described in part IV.
Specifically let
Xss be the set whose failure results in either a steady-state Zero or One.
X1 be the set whose failure results in extra Ones.
X0 be the set whose failure results in extra Zeros.
X10 be the set whose failure results in extra Ones and Zeros.
Since each component may fail by shorting or opening, these two modes will determine
membership in one or more of the above sets. If the probability of a component shorting
given that it has failed, $P(x_i^s \mid \bar{x}_i)$, is $p_i$, then the joint probability of $x_i$ failing and
shorting is

$$ P(\bar{x}_i \cap x_i^s) = P(x_i^s) = p_i \left( 1 - e^{-\lambda_i t} \right) $$

Let the probability of an $x_i$ opening given that it has failed be $P(x_i^o \mid \bar{x}_i)$.
Then

$$ P(x_i^s \mid \bar{x}_i) + P(x_i^o \mid \bar{x}_i) = 1 $$

and

$$ P(x_i^o \mid \bar{x}_i) = 1 - p_i $$

Also since for each $x_i$ the events working, shorted or opened are mutually exclusive the
probability of a component not shorting is

$$ P(\bar{x}_i^{\,s}) = P(x_i \cup x_i^o) = 1 - p_i \left( 1 - e^{-\lambda_i t} \right) $$

* Summary of all the notation to be used is included on the last page of this appendix.
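The component-level probabilities above reduce to a few lines of arithmetic. In the sketch
below the function name, the dictionary keys and the numerical values of the failure rate,
mission time and short fraction are all hypothetical; only the formulas come from the
text, assuming an exponential failure law with rate λ.

```python
from math import exp


def component_probabilities(failure_rate, mission_time, p_short):
    """Split a component's state probabilities into working, shorted and opened.

    failure_rate: assumed constant failure rate (lambda) of the component.
    mission_time: elapsed time t, in units consistent with failure_rate.
    p_short: probability that the component shorts, given that it has failed.
    """
    working = exp(-failure_rate * mission_time)
    failed = 1.0 - working
    shorted = p_short * failed
    opened = (1.0 - p_short) * failed
    return {
        "working": working,
        "shorted": shorted,
        "opened": opened,
        "not_shorted": 1.0 - shorted,   # union of working and opened
    }


# Hypothetical diode: lambda = 1e-6 per hour, 1000 hour mission, 30% of failures are shorts.
probs = component_probabilities(1e-6, 1000.0, 0.3)
for name, value in probs.items():
    print(name, value)
```

The three mutually exclusive states sum to one, and the "not shorted" probability equals
the sum of the working and opened terms, as the last expression in the text requires.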
To illustrate the technique a NAND gate will be analyzed using the test results contained in
an earlier report (reference 5).

Figure AT-1. NAND GATE
The pertinent results are included below.
1. AND gate input diodes; CR1, CR2, CR3
A. OPEN - Any open circuit input is equivalent to a logical "one" on that input; it
cannot inhibit the AND gate.
B. SHORT - A shorted diode will not affect the ability to perform the AND function if
that input has low impedance to ground in the "zero" state and high impedance to a
positive voltage in the "one" state. The line with a shorted diode is no longer
isolated from other inputs; that line is shorted to the AND gate output and may,
therefore, be an incorrect "zero".
2. AND gate resistor; R4
A. OPEN - The AND gate has no voltage available to drive current into the transistor
base, so the NAND gate output remains a "one".
B. SHORT- This will cause a low impedance path from the +12 volt power supply
through the input diodes to all of the inputs to the gate. If any of these inputs
are from NAND gate transistors which are conducting, that input will also be a
low impedance to ground. A low impedance path then exists from the power
supply to ground, and a high current will flow through the diode and transistor
according to the magnitude of the impedance of the power supply and components
involved. In the tests observed, this current was not sufficient to damage the
transistor or diode and did not blow the fuse on the power supply. However, if
any inputs are from flip-flops, the clamp diode will turn on when the voltage
exceeds the clamp voltage. A low impedance path then exists from the +12 volt
power supply through the shorted AND gate resistor, the input diode, and the clamp
diode, and may seriously overload the clamp voltage supply, depending on how the clamp
voltage is derived. In the tests observed, this current was sufficient to cause
both the input diode and clamp diode to short and the clamp voltage to rise
toward +12 volts.
3. Input resistor - capacitor; R5, C9
A. Resistor SHORT- The transistor base voltage will be the AND gate output.
This will normally cause the transistor to conduct, so that the output will
be "zero" for any logic input.
B. Resistor OPEN- This will cause the transistor to be off, so that the output will
be a "one" for all logic inputs.
C. OPEN C9 - This does not adversely affect operation, unless the switching time is
critical, in which case NAND gate turn-on time was increased from 65 nanoseconds
with C9 to 80 nanoseconds without C9; turn-off time was increased from 25 to 45
nanoseconds in one approximate measurement with a constant load on the output
of the circuit. The turn-on time was measured as the time from the input going
positive above +1.6 v. until the output goes to +1.6 v. from the "one" state. The
turn-off time was measured as the time from the input going negative below +2.4 v.
until the output goes to +2.4 v. from the "zero" state.
4. Base bias resistor, R6
A. OPEN - This will normally cause the transistor to conduct, so that the output will
be "zero" for any logic input, except that when the AND gate voltage is going
negative from the "one" state, this voltage change is coupled across C9 and will
turn the transistor off until the transient effect has ended.
B. SHORT- The short of the base resistor may cause damage to the output transistor,
since -12 volts on the base exceeds the maximum rating of 5 volts for VEB O. The
output voltage will depend on the failure mode, if any, of the transistor. In three
multiple failure tests that included short of the base bias resistor in a NAND gate,
two transistors shorted base to collector, which resulted in a -12 volt output;
one transistor shorted collector to emitter, which resulted in a "zero" output.
The -12 volt output had no significantly different effect on the following circuitry
than a normal "zero" output.
5. Collector (output) resistor, R8.
A. OPEN- The removal of the output resistor does not affect the logical operation
of the circuit, since any loads are also to positive voltage sources. The output
rise time will be somewhat slower but the output will turn off faster because the
output voltage in the "one" state is lower and the load current is less.
B. SHORT- The output voltage will be +6 volts; the current in the transistor will be
high if the transistor is conducting. This current was not sufficient to cause
permanent damage to the transistor in the observed tests.
6. Transistor, T7
The transistor may fail into any of several possible modes, but the circuit output
will usually be a "one" unless a low impedance path exists from the output to ground,
such as when the collector is shorted to emitter, or if the transistor is otherwise
forced to remain conducting from collector to emitter.
From the test results the component failures may be categorized (below) into their
effects on the NAND gate's output.
I Components Causing Failure into Steady State "1"
1) R4 open
2) R5 open
3) T7 (most modes result in a "1")
II Components Causing Failures into Steady State "0"
1) R5 short
2) R6 short
3) R6 open
4) CR1 and CR2 and CR3 open (together)
III Component Failures that will Produce Transitional Extra "Ones"
1) CR1 or CR2 or CR3 open
2) CR1 and CR2 open
3) CR1 and CR3 open
4) CR2 and CR3 open
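From the three categories above may be formed the mutually exclusive sets

Set $X_s$, with the probability of each member $X_s(i)$:

$$
\begin{aligned}
X_s(1) &: x_4^{o}, & P[X_s(1)] &= (1 - p_4)\left(1 - e^{-\lambda_4 t}\right) \\
X_s(2) &: \bar{x}_5, & P[X_s(2)] &= 1 - e^{-\lambda_5 t} \\
X_s(3) &: \bar{x}_6, & P[X_s(3)] &= 1 - e^{-\lambda_6 t} \\
X_s(4) &: \bar{x}_7, & P[X_s(4)] &= 1 - e^{-\lambda_7 t} \\
X_s(5) &: x_1^{o} \cap x_2^{o} \cap x_3^{o}, & P[X_s(5)] &= \left[ (1 - p)\left(1 - e^{-\lambda t}\right) \right]^{3}
\end{aligned}
$$

The probability of a steady-state failure is

$$ q_s = \sum_{i=1}^{5} P[X_s(i)] - \sum_{i \ne j} P[X_s(i, j)] + \sum_{i \ne j \ne k} P[X_s(i, j, k)] - \sum_{i \ne j \ne k \ne l} P[X_s(i, j, k, l)] + P[X_s(1, 2, 3, 4, 5)] $$

Set $X_0$, with the probability of each member $X_0(i)$:

$$
\begin{aligned}
X_0(1) &: (x_1^{o} \oplus x_2^{o} \oplus x_3^{o}) \cap \bar{x}_4^{\,o} \cap x_5 \cap x_6 \cap x_7 \\
P[X_0(1)] &= 3\,(1 - p)\left(1 - e^{-\lambda t}\right) e^{-2\lambda t} \left[ 1 - (1 - p_4)\left(1 - e^{-\lambda_4 t}\right) \right] e^{-(\lambda_5 + \lambda_6 + \lambda_7) t} \\
X_0(2) &: (x_1^{o} \cap x_2^{o} \,\oplus\, x_2^{o} \cap x_3^{o} \,\oplus\, x_1^{o} \cap x_3^{o}) \cap \bar{x}_4^{\,o} \cap x_5 \cap x_6 \cap x_7 \\
P[X_0(2)] &= 3 \left[ (1 - p)\left(1 - e^{-\lambda t}\right) \right]^{2} e^{-\lambda t} \left[ 1 - (1 - p_4)\left(1 - e^{-\lambda_4 t}\right) \right] e^{-(\lambda_5 + \lambda_6 + \lambda_7) t}
\end{aligned}
$$

The probability of an extra zero is

$$ q_0 = \sum_{i=1}^{2} P[X_0(i)] $$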
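As a rough numerical companion to the sets Xs and X0, the sketch below computes qs and q0
for the NAND gate. Several things here are assumptions rather than results of the study:
component failures are taken as statistically independent, so the union of the five
steady-state events is computed as one minus the product of their complements (equivalent
to the inclusion-exclusion sum under independence); the three input diodes share one
failure rate and one short fraction; and every numerical value is hypothetical.

```python
from math import exp, prod


def nand_gate_parameters(lam_diode, lam_r4, lam_r5, lam_r6, lam_t7,
                         p_diode, p_r4, t):
    """Return (qs, q0) for the NAND gate under independent component failures."""
    def failed(lam):
        return 1.0 - exp(-lam * t)

    diode_open = (1.0 - p_diode) * failed(lam_diode)   # one input diode open
    diode_ok = exp(-lam_diode * t)                     # one input diode working
    r4_open = (1.0 - p_r4) * failed(lam_r4)            # AND gate resistor open

    # Members Xs(1) .. Xs(5): R4 open, R5 failed, R6 failed, T7 failed, all diodes open.
    steady = [r4_open, failed(lam_r5), failed(lam_r6), failed(lam_t7), diode_open**3]
    qs = 1.0 - prod(1.0 - s for s in steady)

    # Members X0(1) and X0(2): exactly one or exactly two diodes open, R4 not open,
    # and R5, R6, T7 all working.
    rest_ok = (1.0 - r4_open) * exp(-(lam_r5 + lam_r6 + lam_t7) * t)
    q0 = (3 * diode_open * diode_ok**2 + 3 * diode_open**2 * diode_ok) * rest_ok
    return qs, q0


# Hypothetical failure rates (per hour), short fractions, and a 1000 hour mission.
qs, q0 = nand_gate_parameters(1e-6, 5e-7, 5e-7, 5e-7, 2e-6,
                              p_diode=0.3, p_r4=0.1, t=1000.0)
print("qs =", qs)
print("q0 =", q0)
```

The two values could then be fed, together with p, into the asymmetric reliability model
of part IV.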
Observe from the set X0 that transitional errors will be caused by fewer than three of
the input diodes failing through opening. In actuality the probability of a wrong transition
for the member X0(1) in the set X0 is the joint probability:

P(i-th diode open ∩ "0" on the i-th input ∩ n-1 diodes working ∩ "1's" on the n-1 diodes
∩ no steady-state failures)
= P(i-th diode open) · P(n-1 diodes working) · P("0" on i-th input ∩ 1's on n-1 inputs |
i-th diode open ∩ n-1 working) · P(no steady-state failure)

The third term in the joint probability expression is the conditional probability express-
ing the fact that a wrong transition is a function of the information appearing at the gate
inputs in any bit time. For all practical purposes this term may be set equal to unity due to
the tremendous speed at which information is processed and the resulting short time between
occurrence of all possible input states. This same reasoning may be applied to the other
member X0(2).
Note that a NAND gate possesses an asymmetric environment because there are no
failure modes that can result in the exclusive classes X1 or X10.
Thus the reliability of a Transor voting on the output of a network of redundant NAND gates
can be defined by equation (10) in part IV.
The following notation was used in this appendix.
1) $x_i$; the event that the i-th component is working correctly.
2) $\bar{x}_i$; the event that the i-th component has failed.
3) $P(x_i)$; the probability of the defined event (1).
4) $P(\bar{x}_i) = 1 - P(x_i)$
5) $x_i^s$; the event that the i-th component has shorted.
6) $x_i^o$; the event that the i-th component has opened.
7) $P(x_i^s)$; the probability of (5).
8) $P(x_i^o)$; the probability of (6) $= 1 - P(x_i) - P(x_i^s)$, because the probability
space of each component is the logical union $x_i \cup (\bar{x}_i \cap x_i^s) \cup (\bar{x}_i \cap x_i^o)$.
9) $\bar{x}_i^{\,s}$; the event that the i-th component has not shorted.
10) $\bar{x}_i^{\,o}$; the event that the i-th component has not opened.
11) $P(\bar{x}_i^{\,s})$; the probability of (9).
12) $P(\bar{x}_i^{\,o})$; the probability of (10).
13) $P(x_i^s \mid \bar{x}_i)$; the probability of the i-th component shorting given that it has
failed $= p$.
14) $P(x_i^o \mid \bar{x}_i)$; the probability of the i-th component opening given that it has
failed $= 1 - p$.
BIBLIOGRAPHY
1) J. von Neumann, "Probabilistic Logics and the Synthesis of Reliable
Organisms from Unreliable Components," in Automata Studies, Ed.
C. E. Shannon and J. McCarthy, Princeton University Press, 1956.
2) W. H. Pierce, "Adaptive Vote-Takers Improve the Use of Redundancy,"
Redundancy Techniques for Computing Systems, Ed. R. H. Wilcox and
W. C. Mann, Spartan Books, 1962. July 17, 1961
3) "A Survey of Adaptive Components for Use in Failure Free Systems,"
Special Technical Report No. 1, NASw-572, Aug. 1963.
4) W. C. Mann, "Restorative Processes for Redundant Computing Systems,"
Redundancy Techniques for Computing Systems, Ed. R. H. Wilcox and
W. C. Mann, Spartan Books, 1962.
5) A. R. Helland, W. C. Mann, "Failure Effects in Redundant Systems,"
Report No. EE-3351, Westinghouse Electronics Division, 1963.
Appendix 5
COMPARISON OF DYNAMIC AND THRESHOLD RESTORERS
by
C. G. Masters
I_ S. Bray
December 1963
TABLE OF CONTENTS
Section
I. INTRODUCTION
II. DESCRIPTION OF DYNAMIC RESTORING CIRCUITS
    A. Review of the Transor Decision Function
    B. Description of the Hamming Distance Restoring Function
    C. Comparison of Transor and the Hamming Distance Restoring Circuit
III. REVIEW OF THE ANALYTICAL EFFORTS
    A. Signal Processor Assumptions
    B. Classification of Failure Effects
    C. Class Probability Measure
    D. Analytical Models
        1. Multinomial Model for a Dynamic Restoring Circuit
        2. The Transor Model
        3. The Hamming Distance Restoring Circuit Model
        4. The Threshold Restoring Circuit Model
    E. Threshold Parameters as a Bound on Dynamic Parameters
    F. A Comparison of Transor and the Hamming Distance Restoring Circuit
IV. SIMULATION PROGRAM
V. DISCUSSION OF RESULTS
    A. Simulation Results
    B. Curves Discussion
VI. CONCLUSIONS

LIST OF FIGURES
Figure
1. Block Diagram of the Transor
2. Block Diagram of the Hamming Distance Restoring Circuit
3. Possible Five-Input Sequences for Two Failures
4. Typical Histogram
5. Approximation to Reliability Curve
6. Transor Order 5 Redundancy
7. Comparison of Transor and Hamming Distance
8. Comparison of Threshold and Hamming Distance
9. Comparison of Order 7 Threshold and Order 5 Hamming Distance
I. INTRODUCTION
The basic function of a restoring circuit is discussed in Part One of Special Technical
Report No. 4 which is contained in Appendix 4 of this report. The Transor is described in
that report as a device which is potentially useful for performing the restoring function.
Because it is sensitive only to changes in the states of its inputs, a restoring circuit of this
type appears to have advantages over the common threshold voter in environments where
most failures result in steady state inputs to the restorers. Of course, such a circuit should
be inferior to the threshold voter when failures result in transient errors.
The original goal of this study was the determination of the ratio of probability of
steady-state errors to probability of transient errors for which any decrease in the ratio
will make the use of threshold voter advantageous compared to the Transor. In the process of
performing the study, a new dynamic restoring circuit has been developed which has obvious
advantages over the Transor for certain input failure pattern conditions. The invention of the
Hamming Distance Restoring Circuit caused a shift in the primary goal to include evaluation
of both it and the Transor relative to each other, as well as to the threshold voter.
Section II of this report includes a brief review of the Transor and describes the
Hamming Distance Restoring Circuit. Section III reviews the analytical techniques which have
been used in searching for tools to evaluate the two restorers. Section IV describes the
computer simulation program which was used in the evaluation. Sections V and VI contain the
results which have been obtained and the conclusions which can be drawn from these results.
II. DESCRIPTION OF DYNAMIC RESTORING CIRCUITS
A. REVIEW OF THE TRANSOR FUNCTION
The Transor is described in detail in Appendix 4. A brief review of the Transor function
is given here to ease the discussion of the Hamming Distance function and to facilitate a
rough comparison of the salient features of each.
A block diagram of the Transor Restoring Circuit with binary inputs (x1, x2, ..., xR)
is shown in figure 1. The functional relationship between the output Z, the inputs, and the
thresholds T0 and T1 is expressed in general as

$$ Z(t) = f\left[\, Z(t-1);\ (x_1, x_2, \ldots, x_R)(t);\ (x_1, x_2, \ldots, x_R)(t-1);\ T_0;\ T_1 \,\right] \tag{1} $$

The specific function summarized by this relationship may be described as follows.
The number of binary "ones" appearing on the Transor inputs during each bit time (t) is
summed and compared with the number present during the previous time period (t-1). If
the change is positive and equals or exceeds a given threshold T1, the output Z is forced to a
binary "one". If the change is negative and its magnitude equals or exceeds a second threshold,
T0, the output is forced to a binary "zero". If neither threshold is reached, the output does
not change from its previous state. This operation may be completely specified by the
following decision rule statements:

$$ \sum_{i=1}^{R} x_i(t) - \sum_{i=1}^{R} x_i(t-1) \ge T_1 \;\Rightarrow\; Z(t) = 1 $$

$$ \sum_{i=1}^{R} x_i(t) - \sum_{i=1}^{R} x_i(t-1) \le -T_0 \;\Rightarrow\; Z(t) = 0 $$

$$ -T_0 < \sum_{i=1}^{R} x_i(t) - \sum_{i=1}^{R} x_i(t-1) < T_1 \;\Rightarrow\; Z(t) = Z(t-1) $$
Figure 1. Block Diagram of the Transor
B° DESCRIPTION OF THE HAMMING DISTANCE RESTORING CIRCUIT DECISION
FUNCTION
A block diagram for a Hamming Distance Restoring Circuit with binary inputs (Xl,
x 2 .... XR) is shown in figure 2. The functional relationship between the output Z, the in-
puts, and the threshold T can be expressed in a form similar to that of Transor:
(t)_ xl(t-1) " x2(t) - x2(t-1) : x2(t) - 1)z(t) = f [ z(t-1)' 1 ' x2(t-
(t) (t- 1) ]
x R - x R ; T
Figure 2. Block Diagram of the Hamming Distance Restoring Circuit (one state change detector per input, a threshold T, and an output memory)
Again, this relationship summarizes a rather complicated function. In the same manner
as the Transor, the output of the Hamming Distance Restoring Circuit tends to remain in
the Z(t-1) state unless the number of state changes on its inputs exceeds some threshold. In
the latter case, however, the direction of state changes is not considered and output state
change decisions are made without any consideration of the absolute states of the inputs. Thus, the
output at time t, Z(t), is always dependent upon Z(t-1) and the Hamming distance between the
two input vectors (x_1, x_2, ..., x_R)(t) and (x_1, x_2, ..., x_R)(t-1). This relationship is completely
specified by the following rule statements*:
$$ T > \sum_{i=1}^{R} \left| x_i(t) - x_i(t-1) \right| \;\Rightarrow\; Z(t) = Z(t-1) $$
$$ T \le \sum_{i=1}^{R} \left| x_i(t) - x_i(t-1) \right| \;\Rightarrow\; Z(t) = \overline{Z(t-1)} $$
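A minimal Python sketch of these two rule statements follows. The name `hamming_step` is hypothetical, and the sketch assumes, as the surrounding text implies, that reaching the threshold inverts the stored output.

```python
def hamming_step(z_prev, x_prev, x_now, t):
    """Apply the Hamming Distance restoring rule for one bit time.

    The output toggles when at least t inputs changed state between
    bit times, regardless of the direction of each change.
    """
    # Hamming distance between the two input vectors.
    distance = sum(a != b for a, b in zip(x_prev, x_now))
    return z_prev if distance < t else 1 - z_prev


# Two rising and two falling inputs: four state changes, so the output toggles.
print(hamming_step(0, [0, 0, 0, 1, 1], [1, 1, 0, 0, 0], t=2))
```

Since the rule depends only on Z(t-1) and the count of changes, an output that starts in the wrong state stays inverted, which is the initialization issue raised in the comparison that follows.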
C. COMPARISON OF TRANSOR AND THE HAMMING DISTANCE RESTORING CIRCUIT
The outstanding characteristic of the Hamming Distance Restoring Circuit which
differentiates it from the Transor is that it ignores information about the absolute state of its
inputs. This characteristic can be used to advantage because the input from a signal
processor producing both erroneous "ones" and "zeros" cannot cancel the influence of a working
processor input as it can in the Transor case. This may be illustrated by considering the
following input pattern for two bit times. Suppose that input 3 is failed to a steady state
"zero", that inputs 1 and 2 represent the correct information, and that inputs 4 and 5 are
producing both extra "ones" and "zeros" at these bit times.
INPUTS                        x_i(t-1)    x_i(t)
1 (correct)                      0           1
2 (correct)                      0           1
3 (failed)                       0           0
4 (incorrect)                    1           0
5 (incorrect)                    1           0

OUTPUTS                       Z(t-1)      Z(t)
Threshold (majority, T=3)        0           0
Transor                          0           0
Hamming Distance Restorer        0           1

* The function $\sum_{i=1}^{R} |x_i(t) - x_i(t-1)|$ is a measure of the difference between vectors x(t)
and x(t-1) which applies frequently in information theory. The conception of this measure
is credited to R. W. Hamming of Bell Telephone Laboratories.
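The output table can be reproduced with a short, self-contained Python sketch. The majority threshold T = 3 is taken from the table. The Transor thresholds (T_0 = T_1 = 2) and the Hamming threshold (T = 2) are assumptions borrowed from the order-five settings used elsewhere in this appendix, and all function names are hypothetical.

```python
X_PREV = [0, 0, 0, 1, 1]   # inputs 1..5 at time t-1
X_NOW = [1, 1, 0, 0, 0]    # inputs 1..5 at time t


def threshold_vote(x, t=3):
    """Static threshold voter: output "one" when at least t inputs are "one"."""
    return 1 if sum(x) >= t else 0


def transor(z_prev, x_prev, x_now, t0=2, t1=2):
    """Transor rule with assumed thresholds T_0 = T_1 = 2."""
    change = sum(x_now) - sum(x_prev)
    if change > t1:
        return 1
    if change < -t0:
        return 0
    return z_prev


def hamming(z_prev, x_prev, x_now, t=2):
    """Hamming Distance rule with assumed threshold T = 2."""
    distance = sum(a != b for a, b in zip(x_prev, x_now))
    return z_prev if distance < t else 1 - z_prev


results = {
    "threshold": threshold_vote(X_NOW),
    "transor": transor(0, X_PREV, X_NOW),
    "hamming": hamming(0, X_PREV, X_NOW),
}
print(results)
```

Only the Hamming Distance rule follows the two correct inputs to "one": the net change in the number of "ones" is zero, but four inputs changed state.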
Actually, the states indicated by inputs 4 and 5 need not necessarily occur as a result
of component failures. For example, if no provision is made for synchronization,
corresponding elements of a redundant binary counter may become permanently out of phase as the
result of either noise or the initially random states due to application of power. For this
example, the net change in the number of "ones" is zero, but the total number of state changes
is four. It cannot be said from this one example that the Hamming Distance Restorer can
always withstand more input failures, but grounds for further consideration have certainly been
established.
It should be noted at this point that ignoring the absolute state of the inputs provides the
major advantage of the Hamming Distance Restorer but it is also a disadvantage. Because the
output Z is not directly related to the absolute states of the inputs, the output state must be
set to the correct initial state before operation is begun or it has only a chance, perhaps 50%,
of being correct. If it is not initially correct, Z(t) will always be in the state opposite to the
correct one. Transor, on the other hand, will converge to the correct value after a small
number of bit times because of its dependence on the direction of state changes.
The remaining sections of this report will describe the efforts which have been made
to evaluate both Transor and the Hamming Distance Restoring Circuits. These evaluations
are referenced to the commonly used threshold voter. The results of the evaluations are
discussed in Section V. The conditions under which one of the dynamic restoring circuits
might be more powerful than the threshold voter are established.
III. REVIEW OF THE ANALYTICAL EFFORTS
A. SIGNAL PROCESSOR ASSUMPTIONS
To clarify the description of the analysis of the various restoring circuits, it seems
advisable to summarize the assumptions which have been made concerning the signal
processors which provide inputs to the restoring circuits. Each processor is assumed to be
composed of a set of components, all of which must work properly in order for the
processor output to be correct. It is assumed that the i-th component of the set has a probability
of failure during the differential interval Δt which is proportional to the interval length.
This probability can be expressed as λ_i Δt. This implies that the reliability (the
probability that the i-th component does not fail during a time interval, t) is given by the expression
$$ R_i(t) = e^{-\lambda_i t} \tag{3} $$
Because correct operation of all components is required for correct processor opera-
tion and assuming independence of failures between signal processors, the reliability of a
processor composed of N components is equal to the product of the component reliabilities.
Therefore:
$$ R_s = \prod_{i=1}^{N} R_i = \prod_{i=1}^{N} e^{-\lambda_i t} = e^{-\sum_{i=1}^{N} \lambda_i t} \tag{4} $$
Similarly, if the set of components is partitioned into M subsets and a reliability
computed for the j-th subset, the processor reliability would be the product of subset reliabilities.
Mathematically, this is expressed as
$$ R_s = \prod_{j=1}^{M} R_j \tag{5} $$
and
$$ R_s = \prod_{j=1}^{M} \exp\left[ -\left( \sum_{i=1}^{n_j} \lambda_i \right) t \right] \tag{6} $$
where n_j is the number of components in the j-th subset and λ_i is the failure rate of the i-th
component of the subset. The subsets into which the components are partitioned correspond
to the class of processor output errors which failure of the component will cause. The
classification of errors is discussed in this section.
If all failure modes of a component caused only errors of one class, the assumption
could be made that each component was completely associated with one of the class subsets.
In general, this is not true. For example, if the output transistor of a binary signal
processor is shorted (emitter to collector), the output would probably become permanently fixed at
the "zero" level. If, however, the transistor is open circuited, the output of the processor
would probably become permanently fixed at the "one" level. Because subsets are established
by classification of output error types, the above transistor cannot be uniquely associated with
any subset. To make an association, some artificial method must be used to assign to each
subset only that "portion" of a component which will cause that particular class of output
error. Although the components cannot be physically divided in the required manner, they can
be analytically split by multiplying the total failure rate of the component by the conditional
probability of the occurrence of each possible failure mode. This procedure produces a
number which can be considered the failure rate of a smaller component or subcomponent whose
failure results in only one of the possible classes of output errors.
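The analytical split described above amounts to one multiplication per failure mode. The following Python sketch illustrates it for the transistor example. The total failure rate and the conditional mode probabilities are invented placeholder numbers, not data from this study, and the function name is hypothetical.

```python
def split_failure_rate(total_rate, mode_probabilities):
    """Split a component failure rate into per-mode subcomponent rates.

    mode_probabilities maps each output error class to the conditional
    probability of that failure mode, given that the component has failed.
    """
    if abs(sum(mode_probabilities.values()) - 1.0) > 1e-9:
        raise ValueError("conditional mode probabilities must sum to one")
    return {mode: total_rate * prob for mode, prob in mode_probabilities.items()}


# Hypothetical output transistor: 2e-6 failures per hour in total,
# 60% shorts (continuous "zero") and 40% opens (continuous "one").
sub_rates = split_failure_rate(
    2e-6, {"continuous zero": 0.6, "continuous one": 0.4}
)
print(sub_rates)
```

The subcomponent rates always add back up to the rate of the physical component, so the total processor failure rate is unchanged by the split.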
It should be noted at this point that the failure probabilities of the smaller subcomponents
described above are not independent of the operational state of all other similar components,
as are the original circuit components. This may be illustrated by referring to the previous
example. If the transistor in the example were split into two subcomponents representing
the short and open failure modes, and one of the subcomponents had failed, the other
subcomponent could not also fail. The occurrence of a double failure of subcomponents associated with
a single physical component, however, is normally a relatively improbable event in
comparison to the other system-failure producing events in associated circuits. For this reason,
this dependence effect has been ignored in all the models developed during this study.
B. CLASSIFICATION OF FAILURE EFFECTS
In the initial phase of this study, which is reported in Appendix 4, it was shown that
the ability of dynamic restorers to differentiate between inputs working correctly and those
failed to a steady state could generate failure modes different from those of threshold
decision. There are, specifically, four modes which threaten the operation of dynamic restoring
circuits.
1) Wrong transitions cancelling correct transitions. (A sufficient number leave
a net number of correct signals insufficient to span the set threshold.)
2) Wrong transitions occurring while the correct inputs remain in the same state
(a series of extra "ones" or "zeros"). During this time the nominally correct
inputs have lost their voting power so that, if enough wrong transitions occur
at one time, they will span the threshold and result in a wrong decision.
3) Wrong transitions temporarily simulating steady state failures. Wrong
transitions can combine on adjacent bit times in a manner to produce a steady
state effect.
4) Steady-state failures. Enough steady-state failures would leave insufficient
correct signals to span the threshold.
To illustrate, consider figure 3 where state vectors are used to represent the five
inputs to Transor. Inputs x_1 and x_2 are assumed to have failed and to be capable of error. For
definiteness all inputs at time (t) may be assumed correct. In the following bit times
(proceeding to the right) several failure patterns are possible for each nominally correct input
state. The cancellation mode (1) is clearly shown in the sequence (2) → (5) where extra
"zeros" have appeared at time (t+1). By virtue of the Transor decision rules, an error
will be made at (t+2) unless T_0 = 1 since the net result of the summation over (t+1) and (t+2)
is minus one. Of course, it is also possible for errors to cancel each other as in sequences
(3) → (4) and (3) → (7).
Figure 3. Possible Five-Input Sequences for Two Failures
The second failure mode (2) is shown in sequences (1) → (2) and (1) → (3) and the third
mode (3) by sequence (3) → (6). The result is the same in the third mode whether the errors
are caused by wrong transitions or steady-state errors.
Any output of a binary signal processor can be classified into one of six mutually
exclusive classes over an arbitrary time interval. These are:
1) Correct
2) Continuous Zero-State
3) Continuous One-State
4) Extra "ones" but no "zeros"
5) Extra "zeros" but no "ones"
6) Both extra "ones" and "zeros"
This classification is necessary because the failure modes caused by wrong transitions
have no parallel in the threshold voter. A realistic comparison cannot be made on the basis of
each output simply failing or working. For example, the sixth output mode listed above
results in the cancellation effect (1) mentioned earlier. Likewise, output modes (4), (5), and
(6) result in the second and third failure modes listed in part A.
C. CLASS PROBABILITY MEASURE
Each of the six mutually exclusive classes must be assigned a separate probability
measure. Let these be:
1) p: the probability that the output is correct
2) q_γ0: the probability that the output is a continuous "zero"
3) q_γ1: the probability that the output is a continuous "one"
4) q_α0: the probability that the output generates extra "zeros"
5) q_α1: the probability that the output generates extra "ones"
6) q_α10: the probability that the output generates both extra "ones" and "zeros"
randomly.
The q's above are related to the reliability of the component subsets
through the simple relationship
$$ r_j = 1 - q_j \quad \text{where } j = \gamma_0, \gamma_1, \alpha_0, \alpha_1, \alpha_{10} \tag{7} $$
and
$$ r = r_{\gamma_0} \cdot r_{\gamma_1} \cdot r_{\alpha_1} \cdot r_{\alpha_0} \cdot r_{\alpha_{10}} \tag{8} $$
Thus, the q's refer to the probabilities that one or more failures will occur within a
particular set of subcomponents and cause the related output error.
D. ANALYTICAL MODELS
In a multiple-line redundant system, it is assumed that each input to a restoring circuit
is derived independently, and each input, over an arbitrary time interval, can be defined by
one of six mutually exclusive operational classes. A physical system, defined in this man-
ner, suggests a multinomial distribution as its possible analytical model because the R re-
dundant lines can be considered analogous to R repeated trials of an event with more than
two possible outcomes.
1. The Multinomial Model for a Dynamic Restoring Circuit
Let the number of outputs failed to a particular mode be represented by a random
variable. Specifically, let
γ = the number of outputs failed to the steady state
α_0 = the number of outputs generating extra "zeros"
α_1 = the number of outputs generating extra "ones"
α_10 = the number of outputs generating both extra "ones" and "zeros" randomly
Hence, the number of outputs that are continuously correct is
R - α_10 - α_1 - α_0 - γ
We see that the analytical model for a dynamic voter may be delineated by a subset of points
in a four dimensional sample space. These points correspond to possible operating states
of the system. Associated with each sample point is a probability defined by the density
function
$$ P(\alpha_{10}, \alpha_1, \alpha_0, \gamma) = \binom{R}{R-\alpha_{10}-\alpha_1-\alpha_0-\gamma,\ \alpha_{10},\ \alpha_1,\ \alpha_0,\ \gamma}^{*} \, p^{\,R-\alpha_{10}-\alpha_1-\alpha_0-\gamma} \, (q_{\alpha_{10}})^{\alpha_{10}} (q_{\alpha_1})^{\alpha_1} (q_{\alpha_0})^{\alpha_0} (q_{\gamma})^{\gamma} \tag{9} $$
* The symbol $\binom{N}{x_1, \ldots, x_i, \ldots, x_m}$ represents the mathematical function
$N! / \prod_{i=1}^{m} x_i!$ where $\sum_{i=1}^{m} x_i = N$.
where
$$ p + q_{\alpha_{10}} + q_{\alpha_1} + q_{\alpha_0} + q_{\gamma} = 1 \tag{10} $$
Thus, the reliability of a dynamic restoring circuit will be
$$ R(t) = \sum_{(\alpha_{10}, \alpha_1, \alpha_0, \gamma) \in \Pi} P(\alpha_{10}, \alpha_1, \alpha_0, \gamma) \tag{11} $$
where Π is the subset of sample points whose outcomes result in a continuously correct
decision by the circuit.
2. The Transor Model
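For the Transor, membership in the subset Π may be determined by the
intersection of the following set of linear inequalities derived from the Transor decision rules.
$$ \alpha_{10} + \alpha_1 \le T_1 - 1 $$
$$ \alpha_{10} + \alpha_0 \le T_0 - 1 $$
$$ 2\alpha_{10} + \alpha_1 + \alpha_0 + \gamma \le T' $$
where γ = γ_1 + γ_0 and T' = R - T_0 or R - T_1, whichever is smaller. Thus
$$ R_T(t) = \sum_{\substack{\text{all } (\alpha_{10}, \alpha_1, \alpha_0, \gamma) \\ \text{satisfying the decision rules}}} \binom{R}{R-\alpha_{10}-\alpha_1-\alpha_0-\gamma,\ \alpha_{10},\ \alpha_1,\ \alpha_0,\ \gamma} \, p^{\,R-\alpha_{10}-\alpha_1-\alpha_0-\gamma} \, (q_{\alpha_{10}})^{\alpha_{10}} (q_{\alpha_1})^{\alpha_1} (q_{\alpha_0})^{\alpha_0} (q_{\gamma})^{\gamma} \tag{12} $$
For example, if R = 5, T_0 = 2 and T_1 = 3, then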
202+ qal Y I qao
(13)
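The sum in equation (12) can be evaluated by direct enumeration of the sample points that satisfy the three Transor inequalities. The Python sketch below does this for R = 5, T_0 = 2, T_1 = 3. The function names and the numerical class probabilities are hypothetical and serve only to illustrate the procedure.

```python
from math import factorial


def multinomial(n, parts):
    """N! / (x_1! x_2! ... x_m!) for non-negative parts that sum to n."""
    coeff = factorial(n)
    for k in parts:
        coeff //= factorial(k)
    return coeff


def transor_points(r, t0, t1):
    """Sample points (a10, a1, a0, g) admitted by the Transor inequalities."""
    t_prime = min(r - t0, r - t1)
    span = range(r + 1)
    return [(a10, a1, a0, g)
            for a10 in span for a1 in span for a0 in span for g in span
            if a10 + a1 <= t1 - 1
            and a10 + a0 <= t0 - 1
            and 2 * a10 + a1 + a0 + g <= t_prime]


def transor_reliability(r, t0, t1, q10, q1, q0, qg):
    """Sum the multinomial density of equation (9) over the admitted points."""
    p = 1.0 - (q10 + q1 + q0 + qg)
    total = 0.0
    for a10, a1, a0, g in transor_points(r, t0, t1):
        good = r - a10 - a1 - a0 - g
        total += (multinomial(r, (good, a10, a1, a0, g))
                  * p**good * q10**a10 * q1**a1 * q0**a0 * qg**g)
    return total


points = transor_points(5, 2, 3)
print(len(points), "admissible sample points")
# Placeholder class probabilities, chosen only for illustration.
print(transor_reliability(5, 2, 3, q10=0.01, q1=0.02, q0=0.02, qg=0.05))
```

With these thresholds T' = 2, so a single processor producing both extra "ones" and "zeros" already uses up the whole failure budget.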
3. The Hamming Distance Restoring Circuit Model
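The decision rules for the Hamming Distance Restoring Circuit described earlier
in the report determine the following set of linear inequalities:
$$ \alpha_{10} + \alpha_1 \le T - 1 $$
$$ \alpha_{10} + \alpha_0 \le T - 1 $$
$$ \alpha_{10} + \alpha_1 + \alpha_0 + \gamma \le R - T $$
Removal of the cancellation effect accounts for the absence of the factor of two (2) in the last
inequality, thus making the Hamming circuit less sensitive to failures causing both extra "one"
and "zero" transitions. From these decision rules, the reliability of the circuit can be
written as
$$ R_H(t) = \sum_{\alpha_{10}=0}^{T-1} \ \sum_{\alpha_1=0}^{T-1-\alpha_{10}} \ \sum_{\alpha_0=0}^{T-1-\alpha_{10}} \ \sum_{\gamma=0}^{R-T-\alpha_{10}-\alpha_1-\alpha_0} \binom{R}{R-\alpha_{10}-\alpha_1-\alpha_0-\gamma,\ \alpha_{10},\ \alpha_1,\ \alpha_0,\ \gamma} \, p^{\,R-\alpha_{10}-\alpha_1-\alpha_0-\gamma} \, (q_{\alpha_{10}})^{\alpha_{10}} (q_{\alpha_1})^{\alpha_1} (q_{\alpha_0})^{\alpha_0} (q_{\gamma})^{\gamma} \tag{14} $$
For R = 5 and T = 2
$$ R_H(t) = p^5 + 5p^4(1-p) + 10p^3 q_{\gamma}^2 + 20p^3 q_{\alpha_1} q_{\gamma} + 20p^3 q_{\alpha_0} q_{\gamma} + 20p^3 q_{\alpha_1} q_{\alpha_0} + 20p^3 q_{\alpha_{10}} q_{\gamma} + 10p^2 q_{\gamma}^3 + 30p^2 q_{\alpha_{10}} q_{\gamma}^2 + 30p^2 q_{\alpha_1} q_{\gamma}^2 + 30p^2 q_{\alpha_0} q_{\gamma}^2 + 60p^2 q_{\alpha_1} q_{\alpha_0} q_{\gamma} \tag{15} $$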
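As a cross-check, the expansion for R = 5 and T = 2 can be compared numerically with a direct enumeration of the sample points admitted by the three Hamming Distance inequalities. The Python sketch below is illustrative only: the function names are hypothetical and the class probabilities are placeholder values.

```python
from math import factorial, isclose


def multinomial(n, parts):
    """N! / (x_1! x_2! ... x_m!) for non-negative parts that sum to n."""
    coeff = factorial(n)
    for k in parts:
        coeff //= factorial(k)
    return coeff


def hamming_reliability(r, t, q10, q1, q0, qg):
    """Enumerate the points allowed by the Hamming Distance inequalities."""
    p = 1.0 - (q10 + q1 + q0 + qg)
    total = 0.0
    span = range(r + 1)
    for a10 in span:
        for a1 in span:
            for a0 in span:
                for g in span:
                    if (a10 + a1 <= t - 1 and a10 + a0 <= t - 1
                            and a10 + a1 + a0 + g <= r - t):
                        good = r - a10 - a1 - a0 - g
                        total += (multinomial(r, (good, a10, a1, a0, g))
                                  * p**good * q10**a10 * q1**a1
                                  * q0**a0 * qg**g)
    return total


def closed_form_r5_t2(q10, q1, q0, qg):
    """The twelve-term polynomial written out for R = 5, T = 2."""
    p = 1.0 - (q10 + q1 + q0 + qg)
    return (p**5 + 5*p**4*(1 - p) + 10*p**3*qg**2 + 20*p**3*q1*qg
            + 20*p**3*q0*qg + 20*p**3*q1*q0 + 20*p**3*q10*qg
            + 10*p**2*qg**3 + 30*p**2*q10*qg**2 + 30*p**2*q1*qg**2
            + 30*p**2*q0*qg**2 + 60*p**2*q1*q0*qg)


qs = dict(q10=0.01, q1=0.02, q0=0.03, qg=0.10)   # placeholder probabilities
print(hamming_reliability(5, 2, **qs), closed_form_r5_t2(**qs))
```

The two values agree to rounding error, which supports reading the polynomial as the term-by-term expansion of the sum over the admitted sample points.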
4, The Threshold Restoring Circuit Model
In system reliability analysis using majority threshold voters, it is customary to
assume that the failure of a majority of inputs, regardless of their mode, will result in a
wrong decision. Although this common assumption was used in Special Technical Report No.
4, it is not strictly correct because a threshold voter may tolerate as many as R-1 failed
inputs and still function correctly. A more rigorous approach, using the results of section
HB, can be found by letting:
1) θ_10 be a random variable denoting the number of wrong "ones"
and "zeros"
2) ψ_1 be a random variable denoting the number of wrong "ones" only
3) ψ_0 be a random variable denoting the number of wrong "zeros" only
Thus, we see that the parameters defined for the threshold voter are related to
the dynamic restorer by:
$$ \psi_1 = \alpha_1 + \gamma_1 $$
$$ \psi_0 = \alpha_0 + \gamma_0 $$
$$ \theta_{10} = \alpha_{10} + X $$
where X is a dummy variable which accounts for the case in which a signal processor has
experienced two failures causing opposite steady-state errors. Because it is impossible to
say which of the two failures will control the output for a general case, the worst case
condition is assumed and in the models both are assumed to exist simultaneously. By virtue of
threshold decision rules the subset Π may be defined by
$$ \theta_{10} + \psi_1 \le T - 1 $$
$$ \theta_{10} + \psi_0 \le R - T $$
The reliability of the threshold voter is, then
$$ R_{Th}(t) = \sum_{\theta_{10}=0}^{T''} \ \sum_{\psi_1=0}^{T-1-\theta_{10}} \ \sum_{\psi_0=0}^{R-T-\theta_{10}} \binom{R}{R-\theta_{10}-\psi_1-\psi_0,\ \theta_{10},\ \psi_1,\ \psi_0} \, p^{\,R-\theta_{10}-\psi_1-\psi_0} \, (q_{\theta_{10}})^{\theta_{10}} (q_{\psi_1})^{\psi_1} (q_{\psi_0})^{\psi_0} \tag{16} $$
where T'' = T - 1 or R - T, whichever is smaller.
For example, if R = 5 and T = 3 we have
$$ R_{Th}(t) = p^5 + 5p^4(1-p) + 10p^3(1-p)^2 + 30p^2 (q_{\psi_1})(q_{\psi_0})^2 + 30p^2 (q_{\psi_1})^2 (q_{\psi_0}) + 60p^2 (q_{\theta_{10}})(q_{\psi_1})(q_{\psi_0}) + 30p\,(q_{\psi_1})^2 (q_{\psi_0})^2 \tag{17} $$
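The same enumeration check can be applied to the threshold voter for R = 5 and T = 3. In the Python sketch below the function names and the numerical probabilities are hypothetical, and the three error classes are taken to be those counted by θ_10, ψ_1 and ψ_0.

```python
from math import factorial, isclose


def multinomial(n, parts):
    """N! / (x_1! x_2! ... x_m!) for non-negative parts that sum to n."""
    coeff = factorial(n)
    for k in parts:
        coeff //= factorial(k)
    return coeff


def threshold_reliability(r, t, q_theta, q_psi1, q_psi0):
    """Sum the multinomial terms over points allowed by the two inequalities."""
    p = 1.0 - (q_theta + q_psi1 + q_psi0)
    total = 0.0
    for th in range(r + 1):
        for s1 in range(r + 1):
            for s0 in range(r + 1):
                if (th + s1 + s0 <= r
                        and th + s1 <= t - 1 and th + s0 <= r - t):
                    good = r - th - s1 - s0
                    total += (multinomial(r, (good, th, s1, s0))
                              * p**good * q_theta**th
                              * q_psi1**s1 * q_psi0**s0)
    return total


def closed_form_r5_t3(q_theta, q_psi1, q_psi0):
    """The seven-term polynomial written out for R = 5, T = 3."""
    p = 1.0 - (q_theta + q_psi1 + q_psi0)
    return (p**5 + 5*p**4*(1 - p) + 10*p**3*(1 - p)**2
            + 30*p**2*q_psi1*q_psi0**2 + 30*p**2*q_psi1**2*q_psi0
            + 60*p**2*q_theta*q_psi1*q_psi0 + 30*p*q_psi1**2*q_psi0**2)


qs = dict(q_theta=0.01, q_psi1=0.06, q_psi0=0.08)   # placeholder values
print(threshold_reliability(5, 3, **qs), closed_form_r5_t3(**qs))
```

The term in p to the first power shows the point made in the text: with two wrong "ones" and two wrong "zeros" the voter still decides correctly, so four of five inputs may be failed.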
E. THRESHOLD PARAMETERS AS A BOUND ON DYNAMIC PARAMETERS
It was shown that the terms in the analytical models corresponded to probability
measures associated with specific members of the subset Π within the sample space. Criteria
for membership in Π were determined by the intersection of a set of linear inequalities
determined from a decision rule.
It will now be shown that a dynamic restoring circuit cannot be as effective as the
threshold voter when the optimum threshold T for the threshold voter is (R + 1)/2 and the
optimum threshold for a dynamic voter is ≥ (R + 1)/2. It has been shown that when
q_ψ1 ≈ q_ψ0 (defined earlier) within a certain range, the optimum threshold for a threshold
voter is (R + 1)/2. The decision rules for the threshold voter now become, using the relations
previously described in the threshold restoring circuit model:
$$ \theta_{10} + \alpha_1 + \gamma_1 \le \frac{R-1}{2} \tag{18} $$
$$ \theta_{10} + \alpha_0 + \gamma_0 \le \frac{R-1}{2} \tag{19} $$
Assume also that the ratio q_γ/(q_α10 + q_α1 + q_α0) is such that the optimum
dynamic restoring circuit threshold is also (R + 1)/2; hence, the decision rules for the dynamic
circuit become
$$ \alpha_{10} + \alpha_1 \le \frac{R-1}{2} \tag{20} $$
$$ \alpha_{10} + \alpha_0 \le \frac{R-1}{2} \tag{21} $$
$$ \alpha_{10} + \alpha_1 + \alpha_0 + \gamma_1 + \gamma_0 \le \frac{R-1}{2} \tag{22} $$
where γ = γ_1 + γ_0 and α_10 ⊂ θ_10. Let all the terms generated by inequalities (18)
and (19) form the set Π_Th and those by (20), (21), and (22) the set Π_H. The proof consists
of simply showing that Π_H ⊂ Π_Th. Clearly each random variable considered one at a time
will form the non-empty subsets of the form:
$$ \sum_{i=1}^{(R-1)/2} \binom{R}{R-i,\ i} \, p^{\,R-i} \, (q_k)^i $$
where k = θ_10, α_10, α_1, α_0, γ_1, γ_0. Π_H ⊂ Π_Th by virtue of the fact that α_10 ⊂
θ_10. The proof becomes even more obvious when we consider the non-empty subsets
formed by combinations of random variables taken two at a time. Choosing one variable
from inequality (18) and one from (19) will generate non-empty subsets of the form
$$ \sum_{i=1}^{(R-1)/2} \ \sum_{j=1}^{(R-1)/2} \binom{R}{R-i-j,\ i,\ j} \, p^{\,R-i-j} \, (q_k)^i (q_l)^j \quad \text{for } (k \ne l) \tag{23} $$
where k, l = θ_10, α_10, α_1, α_0, γ_1, γ_0. Choosing two terms from (6) will form
non-empty subsets of the form
$$ \sum_{i+j=2}^{(R-1)/2} \binom{R}{R-i-j,\ i,\ j} \, p^{\,R-i-j} \, (q_k)^i (q_l)^j \tag{24} $$
Now II C ll
H Th
number in (24) is
because the number of terms generated by(23)is ( R-1 )
2
+ 4 + eo. +
R+I
2
R+I2 -_
M=3
M
2
and the
(25)
and for all R -> 5 R-3
2
M=I
M
(26)
Likewise, the same reasoning may be applied to combinations of random variables taken
three at a time. Thus, it has been shown that if the dynamic restorer is to show superior
performance it can only do so when its optimum threshold is reached at values less than
(R + 1)/2.
F. A COMPARISON OF TRANSOR AND THE HAMMING DISTANCE RESTORING CIRCUIT
In previous discussions, it has been noted that the Transor is controlled by two
thresholds as opposed to the single threshold of the Hamming Distance Restoring Circuit. It
might be argued that the utility of two thresholds, not necessarily set at the same level,
would present an added advantage in a highly asymmetrical environment, i.e., one in which
either "one" or "zero" errors are more likely. That this is not the case will be shown in
the following discussion.
In an earlier Westinghouse report¹ it was shown that in an asymmetrical environment,
a great increase in threshold voter performance could be had by using thresholds less than
or greater than (R + 1)/2 according to a criterion developed in that report. Since dynamic
restoring circuits cannot distinguish between outputs failed to a continuous "one" and those failed
to a continuous "zero", they cannot take advantage of the asymmetry in steady state errors.
This leaves for consideration only asymmetrical transitional errors.
The results of the previous section have shown that for a dynamic restoring circuit to
show improvement over a threshold voter, the optimum dynamic restoring circuit threshold
must be reached at a value less than (R + 1)/2.
1. P. A. Jensen, "Decision Making in Redundant Systems", Report No. EE-2599,
December 1961.
If it is assumed that the optimum value of threshold for the Hamming Distance Restoring
Circuit is reached at a value T_opt where T_opt is less than (R + 1)/2 and R = 5, the following
possibilities exist for the Transor thresholds.
1) T_0 = T_1 = T_opt
2) T_0 ≠ T_1 = T_opt
3) T_1 ≠ T_0 = T_opt
The first case is trivial. If all thresholds are equal, then the Π_T formed from the
Transor criteria is clearly a subset of Π_H, i.e., Π_T ⊂ Π_H by virtue of the factor
2α_10 in the Transor inequality.
In case (2) T_0 can either be greater or less than T_1. If T_0 < T_1 then (R - T_1) < (R - T_0)
and is the controlling factor. But since (R - T_1) = (R - T_opt), Π_T ⊂ Π_H. If T_0 > T_1 =
T_opt, for example, T_0 = T_1 + 1 = T_opt + 1, then (R - T_0) is the controlling factor. But (R - T_0)
= (R - T_opt - 1) so that, effectively, while the number of terms containing transient
probabilities has been increased, the number of terms containing steady-state probabilities has
been decreased by the same number and since q_γ >> q_α0 the reliability of the Transor
will never be as good as that of the Hamming Distance restoring circuit. The same
reasoning may be applied to case (3).
IV. SIMULATION PROGRAM
The success of the computer simulation program in evaluating self-repairing systems
encouraged the use of a similar program for use as an analytical tool in this phase of the
failure free systems study. Such a computer program has been written and has provided a
variety of interesting results. Insights into the Transor circuit's most vulnerable areas
were gained through this program. One of the results was the development of the Hamming
Distance Restoring Circuit. The development of the system failure criteria statements for
the program contributed to the development of the general decision rules which have been de-
fined for Transor, Hamming Distance, and Threshold restorers. The program was used to
find the ratio of steady-state to transient error probabilities for which the dynamic restoring
circuits were at least as effective as the Threshold voter in deriving correct system outputs.
Finally, the program provided a check for the analytical models when numerical examples
were considered.
Computer simulation programs are commonly used to analyze the performance of de-
terministic systems which are so large and complex that a mathematical model would be
unwieldy or of probabilistic systems which are difficult to model, or when specialized infor-
mation is desired. The Dynamic Restoring Circuit Evaluator (DRCE) program fell into this
last category.
The computer program which has been written for this study retains all of the basic
philosophy of the program previously developed for the evaluation of self-repairing systems. *
Some portions of the self-repair program were used directly in the DRCE program, but the
sections of this latter program which concerned system operational state (i. e., working or
failed) are much simpler than those of the self-repair program. These simplifications were
possible because of the reduced size and the non-adaptive nature of this simulation problem.
* This program is described in Appendix 6.
In this simulation program, the range of numbers between zero and unity is divided into
intervals, and each interval is assigned to one of the subcomponents of the system. In a
system containing (s) subcomponents, the range is divided into (s) intervals each assigned to
a different subcomponent. This procedure guarantees that all the numbers in the range are
assigned in a manner which uniquely associates every number with only one component and
similarly, all components are assigned intervals in the range. By judiciously specifying
the lengths of the intervals, random numbers from a population uniformly distributed between
zero and unity can be used to simulate naturally occurring random subcomponent failures
within the system. To do this, the length of the component interval is made equal to the conditional
probability of failure of the subcomponent given that a failure exists somewhere within the
system. This probability is given by the expression
$$ \frac{\lambda_i}{R \sum_{i=1}^{M} \lambda_i} $$
where M is the number of subcomponents in a single processor and R is the order of
redundancy (i.e., the number of signal processors in a state). A component failure is simulated
by determining a time to failure* and then locating the subcomponent to be designated failed
by associating a random number with a particular interval of numbers. Having done this,
the type of signal processor output error is automatically specified, and the effect of this error
on system operation can be found.
As the first step, a system is set up with no initial failures. The above process is
begun and continued repetitively until the system under consideration no longer meets one or
more operational criteria. At this point, the total system operating time is computed as the
sum of the times between component failures. This entire procedure is now repeated many
times (usually 100), and data concerning number of failures withstood and system operating
times are recorded. From this data various curves are plotted, and system response to
various failure patterns is observed.
* The method used to determine the time between each succeeding failure is identical to that
used in the self-repairing systems simulation. That method is described on pages 10 and
11 of Appendix 6.
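The procedure described in this section can be sketched in Python as follows. This is not the DRCE program, only an illustration under several stated assumptions: the time between failures is drawn from an exponential distribution with the total system failure rate (the method of Appendix 6 is not reproduced here), each processor is classified by its first failure and later failures in that processor are ignored, and the failure criterion is the set of Hamming Distance inequalities for R = 5, T = 2. All names and failure rates are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def hamming_ok(counts, r=5, t=2):
    """Operational criterion: the three Hamming Distance inequalities."""
    a10, a1, a0, g = (counts[k] for k in ("a10", "a1", "a0", "g"))
    return a10 + a1 <= t - 1 and a10 + a0 <= t - 1 and a10 + a1 + a0 + g <= r - t


def simulate_once(rates, r, system_ok, rng):
    """Run one system to failure; return (operating time, failures withstood)."""
    classes = list(rates)
    total_rate = r * sum(rates.values())
    # One interval per (processor, subcomponent class); each interval has
    # length lambda_i / (R * sum of lambda), so the lengths sum to unity.
    owners = [(proc, c) for proc in range(r) for c in classes]
    edges = list(accumulate(rates[c] / total_rate for _, c in owners))
    first_failure = {}                      # processor -> error class
    counts = dict.fromkeys(classes, 0)
    elapsed, failures = 0.0, 0
    while system_ok(counts):
        elapsed += rng.expovariate(total_rate)   # assumed time-to-failure law
        index = min(bisect_right(edges, rng.random()), len(owners) - 1)
        proc, error_class = owners[index]
        failures += 1
        if proc not in first_failure:       # assumption: first failure wins
            first_failure[proc] = error_class
            counts[error_class] += 1
    return elapsed, failures


# Placeholder subcomponent failure rates per hour for one processor.
RATES = {"a10": 1e-5, "a1": 2e-5, "a0": 2e-5, "g": 7e-5}
rng = random.Random(1)
runs = [simulate_once(RATES, 5, hamming_ok, rng) for _ in range(100)]
print(sum(t for t, _ in runs) / len(runs), "mean operating time (hours)")
```

Changing the relative sizes of the entries in `RATES` while holding their sum fixed corresponds to varying the ratio of steady-state to transient error probabilities at constant total failure rate.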
V. DISCUSSION OF RESULTS
A. SIMULATION RESULTS
Before proceeding with a discussion of the results, a brief description of how
comparative reliability versus time curves were obtained is required. For each system simulation,
the computer print-out includes a number which indicates the total operating time of the
system before the occurrence of a critical failure pattern caused loss of system function. These
numbers are ordered and split into groups so that a histogram of percent of systems failed
versus time can be formed. A typical histogram is shown in figure 4. From this histogram
an approximate reliability vs. time curve can be easily constructed by starting a line at
unity (100%) on the ordinate and zero (0.0) on the abscissa or time axis and proceeding
horizontally to the right until the time corresponding to the first spike on the histogram is
reached. At this point the line is dropped vertically by the arithmetic magnitude of the spike,
then continued to the right again until the next spike is reached. Continued repetition of this
procedure produces a curve such as that shown in figure 5.
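The staircase construction can be sketched in Python. For simplicity the sketch treats every simulated failure time as its own spike instead of grouping times into histogram intervals, which is an assumption and not the procedure used for the figures. The function names and sample times are hypothetical.

```python
def reliability_curve(failure_times):
    """Build the step curve: start at 1.0 and drop 1/n at each failure time."""
    n = len(failure_times)
    curve = [(0.0, 1.0)]
    for k, t in enumerate(sorted(failure_times), start=1):
        curve.append((t, 1.0 - k / n))
    return curve


def fraction_operating(curve, t):
    """Read the step curve at time t (value just after any drop at t)."""
    level = 1.0
    for time, value in curve:
        if time > t:
            break
        level = value
    return level


curve = reliability_curve([30.0, 10.0, 20.0, 40.0])
print(curve)
print(fraction_operating(curve, 25.0))
```

With 100 simulated systems each drop is one percent, which sets the resolution of the resulting reliability curve.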
The question that immediately arises is "How many system simulations must be run in
order for a curve constructed in this manner to be smooth enough to provide a meaningful
approximation to the true system reliability curve?" Because the question of "What is smooth
enough?" cannot be precisely stated without a series of opinionated assumptions, a simpler,
much less rigorous method of evaluation was used. The number of runs was arbitrarily set
at 100 and a curve was plotted for a particular Transor voted system. This was compared to
a series of points computed from the analytical reliability expression for the same system
subject to the same failure rates. The curve and points are shown in figure 6. The
correspondence of the curve and the set of points was close enough that no increase in the
number of simulated runs was considered necessary. This relatively low number of runs
had the distinct advantage of requiring a computer running time of only about 30 seconds,
including compilation time, while producing acceptable results.
One more detail must be pointed out before the curves can be completely understood.
The primary interest in the study was the effect of changes in the ratio of the probabilities of
steady-state to transient errors. For this reason, the total failure rate of the signal
processors was held constant for all simulations. This means that not only the general shape
of the reliability curves can be meaningfully compared, but also their locations relative to
the time axis. Holding the total failure rate constant in no way restricts the generality of
the results because a change in this rate would simply cause a linear shift of the curves along
the time axis.
Figure 4. Typical Histogram (percent of system failures during each interval versus time intervals)
Figure 5. Approximation to Reliability Curve (percent of systems operating versus time)
Figure 6. Transor Order 5 Redundancy (fraction of systems operating at time (t) versus time)
B. CURVES DISCUSSION
The first Transor simulations showed that in the region where Transor was competitive
with the threshold voter, the optimum T_0 and T_1 were both equal to two for an order five
system. The discovery that this relationship held even under highly asymmetric failure probability
conditions stimulated the development of the Hamming Distance Restoring Circuit. It has
since been shown analytically (see Section III) that the Hamming Distance Circuit always
dominates the Transor for order five redundancy applications. This result correlates with the
simulation comparison for the same configuration, subject to the same failure mode
conditions. An example of the simulation results is shown in figure 7.
In comparing the curves for the Hamming Distance Restoring Circuit and those for the
threshold voter, it has been found that the latter tends to produce a more reliable output for
steady-state to transient error probability ratios below approximately seven to one (7:1) and
the Hamming Distance Restoring Circuit is slightly more reliable above that ratio. This ratio
cannot be exactly determined because certain worst case assumptions have been made in
establishing system operational rules for both circuits. These assumptions are slightly more
detrimental to one than the other and may not be precisely realistic in either case. This is
demonstrated by the combination of points and curves shown in figure 8. In this figure, the
Hamming Distance curve appears to be slightly better than the threshold simulation curve in
the high reliability region of the curves and worse in the long life region. For this plot of
threshold curve, the assumption was made that the first steady-state error to occur in any
processor assumed permanent control of the output of the processor and any future transient
or steady-state errors in that processor were ignored. The points in that same figure were
plotted from a theoretical analysis in which it was assumed that the most detrimental steady-
state error which had occurred always controlled the outputs. This worst case assumption
does not affect the Hamming Distance curve but it heavily influences the threshold curve.
Under this assumption, the Hamming Distance Restoring Circuit clearly dominates over a
large section of the curve.
It is interesting to observe the changes which occur in the reliability curves of the re-
storing circuits as the ratio of steady-state to transient error probabilities is increased.
The fact that as this ratio is increased the Hamming Distance curve and the threshold curve
get closer together until they cross, indicates that one or both of the curves are shifting in
response to the change. The first possibility seems to be the case. The points on the thres-
hold curve tend to remain fixed. (NOTE: a slight shift to the right may be observed. This
is caused by a reduction in the Pal0 as the ratio increases). The Hamming Distance curve
Figure 7. Comparison of Transor and Hamming Distance (vertical axis: fraction of systems operating at time (T))
Figure 8. Comparison of Threshold and Hamming Distance
is sensitive to changes in the ratio and shifts rapidly enough to the right to overtake the threshold
curve. At approximately the ratio at which this occurs, the Hamming Distance curve rapidly
becomes less sensitive to changes in the ratio. As the ratio continues to be increased, the
curve stabilizes and finally begins to fall back slowly to the left, indicating that an optimum
ratio exists in the region near 7:1. This phenomenon appears to be caused by the discrete
nature of the threshold which controls the Hamming Distance decision rules. As the seven to
one (7:1) ratio is greatly exceeded, the threshold of the Hamming Distance should be reduced
to (1) if additional improvement in the reliability curve is to be expected. This threshold
reduction, however, would make the circuit vulnerable to single transient errors. Despite
the probable improvement in the overall reliability curve, this sensitivity to single failures is
generally considered undesirable. For this reason, no effort was made to simulate systems
with this threshold.
In figure 9, a comparison is made between an order five Hamming Distance curve and
an order seven threshold curve at a ratio of seven to one (7 : 1). It can be observed that in
the high reliability region, the curves are almost indistinguishable. This implies that under
these ratio conditions, an order five Hamming Distance restorer system might be as useful
as an order seven threshold voter system. This would allow an obvious saving in redundant
equipment.
Figure 9. Comparison of Order 7 Threshold and Order 5 Hamming Distance
VI. CONCLUSIONS
From the results obtained by manipulating the analytical reliability expressions for
the Transor and Hamming Distance Restoring Circuits, it may be concluded that the output of
the Hamming Distance Circuit is more reliable than that of the Transor in order five redundant
systems. This conclusion holds for any ratio of steady-state to transient error probability
or any asymmetry (tendency toward "ones" or "zeros") of error probabilities.
From comparison of the simulation curves, it may be concluded that the threshold cir-
cuit is more reliable than either of the dynamic restoring circuits until the ratio of the pro-
bability of steady-state errors to the probability of transient error exceeds approximately
seven to one. Above this ratio, the dynamic restoring circuit outputs are more reliable.
Further comparison reveals that the difference in the reliability curves tends to stabilize or
slightly decrease as the ratio becomes much larger than 7:1. The stabilizing effect is more
pronounced as the order of redundancy is increased from five to seven.
Finally, it may be concluded that in the short life, high reliability region with approxi-
mately a seven to one probability ratio, an order five system using Hamming Distance Re-
storers may be as reliable as an order seven system using threshold voters.
Appendix 6
SELF REPAIR TECHNIQUES
by
M. R. Cosgrove
C. G. Musters
September 1963
ABSTRACT
This report describes the initial step in the design of an optimal self-repairing
system. The report contains a description of the several classes of "repair" strategies
under consideration and the computer simulation program which is used to determine the
performance of the systems for each strategy.
The computer simulation program determines the performance of a particular strategy
by injecting random failures throughout the system and simulating system reaction according
to the "repair" pattern of the strategy in question. The program prints out system performance
in terms of:
1. total time to failure
2. average time to failure
3. number of failures to system failure
4. number of switches affected.
The results for the two classes of strategies for which curves were drawn show
that with the addition of a minimal amount of self-repair capability, the reliability of the
system can be substantially increased over that of a comparable system using fixed
redundancy alone for failure protection.
TABLE OF CONTENTS
ABSTRACT
I. INTRODUCTION
II. STRATEGY DESCRIPTION
A. Basic Assumptions
B. Basic Strategy Classes Considered to Date
III. THE COMPUTER SIMULATION PROGRAM
A. The Reason a Simulation Program was Used
B. How the Program Works
C. Sample Format
D. Production Format
IV. RESULTS
A. Failures Withstood (as percent of system) vs. Spare Mobility
B. Reliability vs. Time Curves
V. SUMMARY AND CONCLUSIONS
VI. FUTURE STUDIES
VII. APPENDIX
LIST OF ILLUSTRATIONS
Figure
1. Multiple-line Redundant System
2. Multiple-line Redundant System with Self-Repair Capability
3. Probability Distribution of a Component Failure
4. Simulation Matrix
5. Average Number of Failures Withstood (as Percent of Gamma 1 Systems) Versus Number of Moves per Spare
6. Average Number of Failures Withstood (as Percent of Beta Systems) Versus Number of Spares per Block
7. Minimum Number of Failures (as Percent of Gamma 1 Systems) Versus Number of Moves per Spare
8. Minimum Number of Failures (as Percent of Beta Systems) Versus Number of Spares per Block
9. Percent of Systems Operating (Beta Class) Versus Time
10. Percent of Systems Operating (Gamma Class 1) Versus Time
I - INTRODUCTION
In an effort to increase the reliability of complex electronic systems, several methods
have been proposed for using "redundant" equipment to provide failure protection within these
systems. Two of the most useful types of redundancy techniques are multiple-line, majority
voted logic and multiple component grouping schemes. Although both techniques are very
effective, a large percentage of the "redundant" equipment is not efficiently used, i.e., the
system fails with much of the "redundant" equipment still functioning. This undesirable
feature is inherent in systems of this type because random failures do not tend to distribute
evenly throughout the system. Instead, they almost invariably tend to group and cause a
critical failure pattern to occur in one subsystem area before many failures have occurred
in the remainder of the system. The most drastic example of this is the failure of an order
three, multiple-line, majority voted system upon the occurrence of two successive failures
in the same stage with no other failures in the remaining stages.
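The grouping effect described above can be illustrated with a small Monte Carlo sketch. It is not taken from the report's program: it assumes equal failure rates for all blocks, no self-repair, and the hypothetical function names shown. It fails blocks at random until one stage loses its voting majority, then reports the average fraction of the system that had failed at that point.

```python
import random

def failures_to_system_failure(num_stages, order=3, rng=random):
    """Fail blocks uniformly at random until one stage loses its majority."""
    failed_per_stage = [0] * num_stages
    # Equal failure rates are assumed, so a random permutation of all
    # blocks gives the order in which they fail.
    blocks = [stage for stage in range(num_stages) for _ in range(order)]
    rng.shuffle(blocks)
    majority_lost = order // 2 + 1  # two failures defeat a three-way vote
    for count, stage in enumerate(blocks, start=1):
        failed_per_stage[stage] += 1
        if failed_per_stage[stage] >= majority_lost:
            return count
    return len(blocks)

def average_fraction_failed(num_stages, order=3, trials=2000, seed=1):
    """Average fraction of all blocks failed when the system fails."""
    rng = random.Random(seed)
    total = sum(
        failures_to_system_failure(num_stages, order, rng) for _ in range(trials)
    )
    return total / trials / (order * num_stages)

if __name__ == "__main__":
    for stages in (24, 48, 96):
        print(stages, round(average_fraction_failed(stages), 3))
```

In this simplified model the system always fails with well under half of its blocks failed, which is the inefficiency the self-repair strategies are meant to reduce.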
Company A has devised a new solution to the failure protection problem which exploits
most of the desirable features of the multiple-line, majority-voted schemes, but is not as
sensitive to critical failure patterns as the more standard techniques. This solution is in the
form of a set of strategies for allowing the reorganization of the systems in response to
failure patterns which may develop. The systems which employ these strategies are called
self-repairing systems.
The general approach of the self-repair strategies can be described through the use
of an example. Figure 1 shows a block diagram of an order three, multiple-line system.
Figure 2 shows the same system after some self-repair capability has been added. It is
assumed that all blocks in the system are functionally identical such as the multivibrators
in a shift register, and are interconnected by switching and voting circuits. If two blocks in
the same column fail and the blocks on either side of this column are still operating, the
self-repair switching mechanism senses this condition and shifts the required additional
working blocks to the failed column. The failed block can now be eliminated or "voted out."
This procedure decreases the remaining protection provided to the adjacent columns, but it
prevents system failure at a critical point and thus extends the life of the system. As
additional blocks fail, other blocks are switched into the failed columns. The choice of
which block shall be brought in to aid the vulnerable column is determined by the particular
strategy in use.
Figure 1. Multiple-line Redundant System (labels: majority voters, blocks)
Figure 2. Multiple-line Redundant System with Self-repair Capability (label: failure detecting and switching circuits)
The unique feature of these strategies is that the switching circuitry can be completely
distributed rather than "lumped" into a central controller. As a result, most failures in
the switching circuitry are equivalent to signal processor (block) failures and are elimi-
nated in the normal manner. This means that individual failures in the switching circuitry
do not cause the loss of the entire self-repair capability.
Before a "hardware" design of self-repairing systems can begin, the full range of
feasible switching strategies must be examined, and from these an optimum strategy or set
of near optimum strategies must be selected. The majority of this report is concerned with
a description of some of the more promising strategies and with the computer program
which is being used to simulate the failure response of systems which employ these
strategies.
There are a great number of possible strategies which may be investigated, many of
which are quite similar to one another. The strategies being considered are arranged in
groups called classes, the individual members of which are special cases of the general class.
This allows the investigation and programming of a few classes of strategies rather than
many individual strategies. This facilitates comparison of strategies within a class as well
as adding a certain degree of generality to the analysis.
Before proceeding to the description of specific strategies or classes of strategies,
the properties a self-repairing system should have must be noted and the basic assumptions
stated. A short list of the general desirable properties is compiled below.
a. Self-repairing systems should be more reliable than ordinary
redundant systems of identical function capability and cost.
b. The switching strategy used should make optimum use of the
redundant function blocks for a fixed amount of switching
complexity.
c. Instantaneous failure masking must be provided for system
applications which cannot withstand a temporary loss of data.
An example of this is the key-stream generator used in secure
communication channels.
d. The strategy must be suitable for implementation by a distributed
(non-centralized) switching network.
II - STRATEGY DESCRIPTION
A. BASIC ASSUMPTIONS
Almost all large computing and control systems are formed by interconnecting a
relatively small number of different types of basic circuit blocks. As a result, the com-
ponents of these systems can be split up into homogeneous groups of functionally similar or
identical blocks. It is assumed, therefore, that such groups can be formed and that self-
repair strategies can be applied within each group. Note: The members of any group are not
required to be physically or functionally adjacent but may be located in scattered sections of
the overall system.
It is also assumed that at least two blocks must be performing the same nominal function
before a failure can be detected, and at least two correctly operating blocks must be perform-
ing the same function before a third (failed) block can be eliminated from this function.
If at least three blocks are performing a function and one of them fails, the elimination
process is assumed to be instantaneous, and the failure is assumed to be completely masked.
If, however, only two blocks are performing the function and one fails, a third block must be
switched to that location to eliminate the failure. This process is not assumed to be in-
stantaneous and errors appear in the system temporarily. As a result, systems using the
basic order-three redundancy with self-repair (as will be described in the Beta and Gamma
Class strategies of this report) must be capable of withstanding temporary data loss without
mission failure. If this assumption is not true, a higher order of redundancy must be used
as in the Alpha class strategies or higher-order versions of the Beta and Gamma classes.
If, because of particular failure and response patterns, single blocks are left to perform
particular functions, it is assumed that the system continues to operate with one or
more stages existing in the non-redundant state either until one of these blocks fails or until
another critical failure pattern occurs elsewhere in the system.
Finally, it is assumed that a stage shown pictorially at one end of a system is, in
reality, adjacent to the opposite end and enjoys the same repair facilities as stages shown in
the center of the system.
B. BASIC STRATEGY CLASSES CONSIDERED TO DATE
The following few paragraphs will indicate the general principles of each of the three
strategy classes which have been simulated thus far. Detailed examples of each class are
shown in the Appendix, and the reader will probably need to refer to these for detailed
consideration of the following descriptions.
1. Alpha (α) Class
Systems employing the α class strategies are basically multiple-line redundant
(usually order three) systems which are equipped with sets of spares. These spares are
additional function blocks which can be automatically used to replace failed blocks. In
general, spares can not economically be given enough mobility to allow a single spare to be
capable of replacing each operational block in the entire system. Instead, individual spares
are usually given restricted capability and may replace only blocks in a single row* or
portion of a row. A large number of strategies, each belonging to the α class, can be
generated by varying (a) the total number of spares available for a fixed system size, (b)
the mobility of each spare, and (c) the pattern in which the spares' repair capabilities overlap.
If it is assumed that spares will immediately replace failed blocks regardless of
whether it is the first failure in a function column or not, complete failure masking is
achieved. The threshold vote technique will continue to absorb failures after the spares
complement is exhausted until a majority of unrepairable failures have occurred at a
particular function. At this point the system will fail since both the self repair capability
and the network redundancy have been exhausted.
2. Beta (β) Class
Beta Class strategies do not utilize inactive spare blocks as does Class α.
With no failures, the system operates as an ordinary multiple-line redundant system. When
a critical failure, i.e., one which would cause failure of a multiple-line redundant system,
occurs, the failed block is removed from the system and replaced by a properly functioning
block from an immediately adjacent function. The individual strategies in this class differ
from one another primarily in the number of spares which they can draw from the rest of
the system.
Because failures are replaced by function blocks only from the adjacent functions
there is a smaller amount of switching circuitry involved with Class β than with other classes
of self-repair strategies. This advantage is partially offset, however, by the one drawback
inherent in this class of strategies: these systems are more vulnerable to failures
which are grouped in one area of the system than are the more flexible strategies.
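The adjacent-function replacement idea can be sketched as follows. This is only an illustration under assumed rules, not one of the three simulated strategies (those are defined in the Appendix): stages form a ring, the first failure in a stage is voted out, a second failure is critical, and a critical stage may borrow one block from a neighbouring stage that still has all three blocks working. All names are hypothetical.

```python
import random

def failures_withstood(num_stages, allow_repair, rng):
    """Count block failures absorbed before the system fails.

    Assumed rules (illustrative only): order-three stages arranged in a
    ring; a critical stage borrows one block from an immediate neighbour
    that still has three working blocks; otherwise the system fails.
    """
    working = [3] * num_stages  # working blocks currently serving each stage
    absorbed = 0
    while True:
        # The next failure hits a working block chosen uniformly at random.
        stage = rng.choices(range(num_stages), weights=working)[0]
        working[stage] -= 1
        if working[stage] >= 2:
            absorbed += 1  # two good blocks remain: the failure is voted out
            continue
        # Critical failure: only one good block remains in this stage.
        donor = None
        if allow_repair:
            for neighbour in ((stage - 1) % num_stages, (stage + 1) % num_stages):
                if working[neighbour] == 3:
                    donor = neighbour
                    break
        if donor is None:
            return absorbed  # no repair possible: the system fails
        working[donor] -= 1
        working[stage] += 1
        absorbed += 1

def average_withstood(num_stages, allow_repair, trials=1000, seed=0):
    """Average number of failures absorbed over many simulated systems."""
    rng = random.Random(seed)
    total = sum(failures_withstood(num_stages, allow_repair, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    print("no repair:", average_withstood(24, False))
    print("adjacent repair:", average_withstood(24, True))
```

Because a donor must be fully intact, failures that cluster in one region quickly exhaust the eligible neighbours, which is the vulnerability noted above.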
The three strategies of this type which have been simulated are described in the
Appendix. These particular strategies do not usually allow blocks to move a second time
after an initial repair has been made. This restriction has been made for a variety of
reasons, but other strategies are being considered which will release this restriction. In
addition, strategies having increased spare mobility will be considered in future studies.
* For example, the top line or row of signal processors in Figure 1.
3. Gamma (γ) Class
The Gamma (γ) Class of self-repair strategies contains much more variety
than either Class α or Class β. The class is characterized by a shifting of the spare
blocks in one direction to alleviate the critical condition caused by the failed function
blocks. Unlike the strategies of Class β, it is possible for a spare to move several times
in response to failures. When a critical failure occurs, one of the function blocks adjacent
to the failure will replace it, leaving a void. This void, if it creates a vulnerable situation,
i.e., one block per function stage, will be filled by the function block immediately adjacent
to it in the opposite direction from the original failure. The next failure to occur in the same
stage as the original failure causes another shift of the function block now adjacent to the
failure. This may be a function which has already shifted in response to a failure. As long
as spares are available, they will continue to shift laterally to replace failed blocks or to
fill voids.
Since the spare function blocks are allowed much more mobility in this class of
strategies, more failures can be corrected. However, the amount of switching circuitry
necessary to implement the strategies is a monotonically non-decreasing function of the
mobility of the spares. This creates problems of implementation which limit the usefulness
of high spares mobility.
The individual Class γ strategies differ primarily in the amount of
mobility allowed to the function blocks. This, in turn, affects the failure absorption capa-
bilities of the strategies. Again, the individual strategies are described in more detail in
the Appendix.
III - THE COMPUTER SIMULATION PROGRAM
A. THE REASON A SIMULATION PROGRAM WAS USED
Although the reorganization features of self-repairing systems improve the failure
absorption capability of redundant networks, these features drastically affect the analytical
reliability expressions developed for multiple-line, majority-voted systems. Not only does
a slight amount of reorganization capability greatly complicate the expressions, but each
modification of each strategy class appears to require a different solution. Extensive efforts
to model some of the simpler self-repairing systems have been unsuccessful. Because of
this, efforts to write exact reliability expressions have been dropped, and a general computer
simulation program has been written to facilitate a Monte Carlo approach to the reliability
analysis. This program can be used to simulate a broad range of strategies, and it provides
data about the actual switching patterns which tend to occur in a system. This latter information
could not be easily determined from reliability expressions even if they were available.
A plot of reliability versus time can be obtained directly from the program results
with no more additional input information than would be required by calculations made using
analytical expressions.
B. HOW THE PROGRAM WORKS
1. The General Program Philosophy
A redundant system of the desired order of redundancy and number of functions
is set up in matrix form. The strategy class is then selected from a group of sub-programs
and input data which specifies the particular strategy to be tested is read in. Through the
use of a series of random numbers, individual blocks are designated as failed, and the
switching strategy responds to each failure until the system fails to pass the operational
criteria. A second series of exponentially distributed random numbers determines the time
between each simulated failure, and the sum of these is the time to system failure. Once
the system fails, the pertinent data is recorded, and the computer resets and begins to
generate two new sets of random numbers. Continued repetition of this process provides
the compilation of data mentioned in part A of this section. The following paragraphs indi-
cate specifically how the various portions of the program work and the form of the print
set up.
2. The Failure Selection Program
A simple procedure for randomly selecting the failed function blocks has been developed.
Each block is assumed to have an exponentially decaying reliability $R(t) = e^{-\lambda t}$, where
$\lambda$ is a constant failure rate. It has been shown that the conditional probability that a failure
has occurred in the i-th block, given that a failure has occurred in the system, is equal to the
constant
$$\frac{\lambda_i}{\sum_{j=1}^{N} \lambda_j}$$
If the interval between zero and one is split into N subintervals, each proportional
to the associated conditional probability, a set of random numbers uniformly distributed
between zero and one can be used to determine which blocks fail, with the correct conditional
probability of picking any one block. In this particular computer program, the random number
specifies the block to be failed. The system then responds to eliminate the failed block. If
the response is possible, i.e., a spare block is available to make the repair, a new random
number is chosen and the procedure repeats. If no spare is available, the system is judged
as failed.
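The subinterval selection just described can be sketched as below. This is a minimal illustration, not the original program; the function names and the use of a binary search are assumptions made for the example.

```python
import bisect
import itertools
import random

def build_intervals(rates):
    """Return the cumulative upper bounds of the N subintervals of [0, 1).

    Subinterval i has width rates[i] / sum(rates), the conditional
    probability that block i is the one that failed.
    """
    total = sum(rates)
    bounds = list(itertools.accumulate(rate / total for rate in rates))
    bounds[-1] = 1.0  # guard against floating-point round-off
    return bounds

def select_failed_block(bounds, u):
    """Map a uniform random number u in [0, 1) to the index of a block."""
    return bisect.bisect_right(bounds, u)

if __name__ == "__main__":
    rates = [1.0, 1.0, 2.0]  # hypothetical failure rates for three blocks
    bounds = build_intervals(rates)
    rng = random.Random(0)
    picks = [select_failed_block(bounds, rng.random()) for _ in range(10000)]
    print([picks.count(i) / len(picks) for i in range(len(rates))])
```

With the rates shown, the third block should be selected about half the time and each of the other two about a quarter of the time.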
3. Time Determination
For each of the simulated failed blocks selected above, a time to failure for the
block is also determined. A.M. Mood 1 has shown that random numbers taken from the
uniform distribution can be transformed into any desired continuous distribution by letting
$$f(y) = 1, \quad 0 < y < 1$$
$$y = G(x)$$
where G(x) is the cumulative distribution of x.
This relationship is shown graphically in figure 3.
Figure 3. Probability Distribution of a Component Failure (vertical axis: y, uniformly distributed random numbers; horizontal axis: x, random numbers distributed as G(x))
1 Mood, A. M., Introduction to the Theory of Statistics, McGraw-Hill Book Co., Inc., 1950.
y is a single-valued function of x and vice versa. For each y chosen from a uniform
distribution, a unique value of x is determined.
The G(x) function which is of particular interest here is $G(t) = 1 - R(t) = 1 - e^{-\lambda t}$.
This is the distribution function associated with the probability that the first failure has
occurred within a system. This curve is shown in figure 3.
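For the exponential case, the transformation can be written out by solving y = G(t) for t, which gives t = -ln(1 - y)/λ. The sketch below illustrates this under the assumption of a single constant rate; the function names are hypothetical.

```python
import math
import random

def exponential_time(y, rate):
    """Solve y = G(t) = 1 - exp(-rate * t) for t, with 0 <= y < 1."""
    return -math.log(1.0 - y) / rate

def sample_times(rate, count, seed=0):
    """Transform uniform random numbers into exponentially distributed times."""
    rng = random.Random(seed)
    return [exponential_time(rng.random(), rate) for _ in range(count)]

if __name__ == "__main__":
    times = sample_times(rate=2.0, count=20000)
    # The sample mean should be close to 1 / rate.
    print(sum(times) / len(times))
```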
For the first function block failure, a random number is chosen from a uniform
population and transformed to a corresponding number from the exponential distribution.
This latter number is the time from system start to the first failure. To calculate the time
to the second failure, the $\lambda$ associated with the first failed block should be subtracted from
the sum of the $\lambda$'s and the procedure repeated. The new number thus obtained would be the time from
the occurrence of the first failure to the occurrence of the second failure. When the system
fails, the sum of these individual failure times will determine the total system operating
time.
In the present program, the above procedure is slightly modified to make computations
easier. Instead of decreasing the sum of the $\lambda$'s after each failure, this sum is left the
same and blocks are allowed to fail more than once. When a block fails for the second time
no action is taken other than to add the time to this failure to the system operating time.
This modified procedure would not be acceptable if the times between subsystem failures
were of interest, but since total system operating time is the only factor to be considered,
the results are almost identical to those which would be obtained in the more straightforward
approach.
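The two timing procedures can be compared on a simple case. The sketch below assumes an order-three multiple-line system with no self-repair and equal failure rates, which is a simplification of the report's program; all names are hypothetical. One routine removes each failed block's rate from the sum, and the other keeps the sum fixed and lets repeated failures of a block only advance the clock.

```python
import random

ORDER = 3  # order-three redundancy, assumed for this illustration

def time_straightforward(num_stages, rate, rng):
    """Remove each failed block's rate from the sum after it fails."""
    failed = [0] * num_stages
    alive = [stage for stage in range(num_stages) for _ in range(ORDER)]
    clock = 0.0
    while True:
        clock += rng.expovariate(rate * len(alive))
        stage = alive.pop(rng.randrange(len(alive)))
        failed[stage] += 1
        if failed[stage] == 2:  # majority lost in this stage
            return clock

def time_modified(num_stages, rate, rng):
    """Keep the rate sum fixed; repeat failures only advance the clock."""
    num_blocks = num_stages * ORDER
    has_failed = [False] * num_blocks
    failed = [0] * num_stages
    clock = 0.0
    while True:
        clock += rng.expovariate(rate * num_blocks)
        block = rng.randrange(num_blocks)
        if has_failed[block]:
            continue  # second failure of the same block: no other action
        has_failed[block] = True
        stage = block // ORDER
        failed[stage] += 1
        if failed[stage] == 2:
            return clock

def mean_time(simulate, num_stages=8, rate=1.0, trials=10000, seed=0):
    """Average time to system failure over many simulated systems."""
    rng = random.Random(seed)
    return sum(simulate(num_stages, rate, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print("straightforward:", mean_time(time_straightforward, seed=1))
    print("modified:", mean_time(time_modified, seed=2))
```

For constant failure rates the two averages should agree to within sampling error, which is consistent with the statement that the modification does not materially change the total operating time.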
4. The System Reactions
It is obvious that many specific reactions are different for different strategies,
but the general manner in which the program performs the various shifts and the type of
"bookkeeping" involved can be briefly described. Figure 4 schematically illustrates the
form in which the computer "views" the system to be simulated. The height of the "basic array"
is set by the original order of redundancy, the width by the number of stages, and the depth
by the number of data words associated with each block. The "failed block array" is a two-
dimensional array into which the data words for failed blocks are shifted as the failures
occur. The only indication to the computer that a block has failed is the shifting of these
data words into this latter array.
When a set of data words is moved into this array, the computer examines the
remainder of the system and makes any necessary response. This is done by shifting the
data words associated with the appropriate spare blocks from their original locations into
the locations specified by the particular switching strategy being considered.
Figure 4. Simulation Matrix (basic array dimensions: order of redundancy, N stages, K data words per block; a separate failed block array; shading indicates empty locations)
C. SAMPLE FORMAT
A check must be made to determine whether the computer simulation program is
operating correctly, i.e., selecting the correct function block for failure according to the
random number set, responding properly to failures according to the particular strategy,
and failing at the proper time and under the proper conditions. In order to accomplish this,
a sample format has been developed. This sample format prints out the following information:
1. * The function block designations and the random number range
which describes failure of the block.
2. * A list of failures which occur with all the information associated
with the failure such as:
a. The random number which was selected
b. The location of the failed block
c. The amount of time from the previous failure to the time
of failure of the block in question
d. The cumulative time from the beginning of system operation.
3. The average time between failures.
This information is printed out for each failure until the system fails.
When a critical failure of a function block occurs, an operating spare is switched into
the vacant position by assigning random number limits of the spare block to the failure
location. This permits checking of the switching pattern to determine if the simulation
program is working, since an incorrect switching operation will place the random number
limit designation in the wrong position. This event can be detected when the incorrectly
switched function block fails and the position specified by the random number does not
correspond to that printed out in the sample format.
To check a strategy, several runs are made using different random number sequences.
The sample format prints out all the above information for each case. From this information
a determination can be made as to whether the simulation is following the rules for the parti-
cular strategy.
In addition to performing the function of checking the simulation program, the sample
format provides another valuable service. By observing the vicissitudes of the system with
respect to the switching patterns which develop, information can be gained about changes in
the strategy which might profitably be used to implement more efficient system operation or
more economical switching circuitry implementation. This is the manner in which Class γ2
was derived from Class γ1.
D. PRODUCTION FORMAT
A typical production run of the computer program simulates system operation for one
hundred randomly selected failure patterns. Up to the present time, all runs have included
one hundred patterns simply because relatively good estimates of the average system para-
meters such as total time to fail, number of failures withstood, etc. are obtained without
requiring excessive amounts of computer time.
The production format directly provides the following information for each of the one
hundred cases:
1. Average time between function block failures
2. Total time to system failure
3. Total number of function block failures before each system failure
(including multiple failures of the same block)
4. Net number of failed function blocks at time of system failure
5. Total number of switching moves experienced by each system
6. Total number of moves made by each spare function block.
In addition to printing out columns of numbers covering the first five items on the
list above, most of the data is compiled into bar graphs. Each of these graphs reflects the
performance of the set of one hundred runs with respect to a particular parameter. On the
graphs, either discrete points (e.g., net number of failures) or interval terminal points
(for continuous parameters such as time) are plotted on the abscissa. The height of the bar
above each point or interval shows the number of spares or system simulations which are
described by these positions on the abscissa. The program includes a normalization routine
for each graph which is used to compute the average, the variance and the standard deviation
associated with each graph.
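The summary statistics and bar-graph tallies described above can be sketched as follows. The report does not say whether the variance is the population or the sample variance; the population form is assumed here, and the function names and interval width are hypothetical.

```python
import statistics
from collections import Counter

def summarize(values):
    """Average, variance, and standard deviation of one parameter."""
    variance = statistics.pvariance(values)  # population variance (assumed)
    return {
        "average": statistics.mean(values),
        "variance": variance,
        "std_dev": variance ** 0.5,
    }

def bar_counts(values):
    """Bar heights for a discrete parameter such as net number of failures."""
    return sorted(Counter(values).items())

def interval_counts(values, width):
    """Bar heights for a continuous parameter such as time to failure."""
    counts = Counter(int(value // width) for value in values)
    return [(k * width, (k + 1) * width, counts[k]) for k in sorted(counts)]

if __name__ == "__main__":
    net_failures = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up illustrative data
    print(summarize(net_failures))
    print(bar_counts(net_failures))
    print(interval_counts([0.5, 1.5, 1.7, 3.2], width=1.0))
```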
IV. RESULTS
The strategies discussed here (and any new ones which may be invented) must be com-
pared and contrasted to determine their usefulness in increasing the reliability of electronic
systems. The primary goal of this comparison is the determination of which strategy pro-
vides the greatest net increase in system reliability. Because it appears that the switching
circuitry associated with spare blocks increases as the mobility of these blocks increases
and because the failure protection effectiveness of added flexibility is non-linear, it cannot
be simply assumed that the best strategy is the one with the greatest spare block mobility.
The best way to compare these strategies would be to completely design functionally
identical systems using each strategy; get the best available estimates of the failure rates
of all the parts; feed this into the computer program and, in the manner described below,
plot the reliability versus time curves. The comparison would merely require that one
directly observe which strategy has the highest reliability curve. This approach would re-
quire a detailed system design for all strategies. To avoid wasting time on strategies which
can be shown to be inferior to others with much less detailed input data, several less exact
comparisons can be made. These comparisons, which are described below, are the ones
which are being made at this point in the study.
A. FAILURES WITHSTOOD (AS PERCENT OF SYSTEM) VS. SPARE MOBILITY
An important consideration in the comparison of systems is the number of failures
which can be withstood without system failure. In order to compare strategies with one
another where the variable is the number of moves allowed per spare, the number of
failures withstood is an important and meaningful criterion. To further compare systems
of different sizes on a common base the curves plotted for these systems are expressed in
terms of average percent of total system failed versus spare mobility. In figure 5 curves
are plotted for three systems of different sizes (24, 48, and 96 stages) employing strategy γ1.
They are plots of average percent of failures versus number of moves per spare.
These curves provide very useful and interesting results. They are characterized by
a sharp rise, a knee and a rapid leveling off. The knee occurs at a small number of moves
per spare compared to complete (total system) spare mobility. According to this graph, a
great increase in number of failures withstood by a system is effected by increasing spares'
mobility up to a point. The increase, then, is diminished and a point is reached beyond
which little or no increase in number of failures withstood accompanies an increase in
mobility. The characteristic exhibited by these curves illustrates that great increases can
be attained in system performance by the introduction of self-repair Class γ1 with
Figure 5. Average Number of Failures Withstood (as Percent of Gamma 1 Systems) Versus Number of Moves Per Spare (curves for 24, 48, and 96 functions per system)
6-16
relatively little mobility. The addition of more mobility adds little to the effectiveness of
the technique. This indicates that the most gain is attained with a small degree of mobility;
therefore, the most efficient operation of the technique can probably be accomplished with
relatively little switching circuitry.
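The normalization and the "knee" discussed above can be sketched in a few lines of Python. This is only an illustration: the function names, the default of three blocks per stage (taken from the order-three redundancy mentioned elsewhere in this report), and the 95 percent threshold are assumptions, not part of the original simulation program.

```python
def percent_of_system(failures_withstood, stages, blocks_per_stage=3):
    """Express a failure count as a percent of all function blocks.

    blocks_per_stage=3 is an assumption (order-three redundancy).
    """
    total_blocks = stages * blocks_per_stage
    return 100.0 * failures_withstood / total_blocks


def knee_mobility(curve, fraction=0.95):
    """Return the smallest mobility that captures `fraction` of the total gain.

    curve: list of (moves_per_spare, value) pairs sorted by moves_per_spare.
    The gain is measured relative to the first point of the curve.
    """
    base = curve[0][1]
    best = max(value for _, value in curve)
    target = base + fraction * (best - base)
    for moves, value in curve:
        if value >= target:
            return moves
    return curve[-1][0]
```

Applied to a measured curve, `knee_mobility` gives a rough estimate of the mobility beyond which additional switching circuitry buys little; the threshold is a design choice and should be chosen to suit the cost of the switching hardware.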
Plots have also been made of the percent of system failed versus number of spares per
function block for the β class strategies. These plots, shown in figure 6, give the Average
Number of Failures Sustained versus Number of Spares per Function Block. The results show
substantial gains over the multiple-line case for each increase in spare mobility. These
curves are restricted to low mobilities because the Beta class draws spares to replace
failures only from the immediately surrounding area.
Since the worst failure patterns are also an important consideration, the lowest number of
failures sustained before system failure is plotted against mobility for the Gamma
Class strategies (see figure 7). These curves agree very closely with those of figure 5,
thereby substantiating the conclusion even for the worst case.
Figure 8 shows the Minimum Percentage of Failures Sustained versus Number of
Spares per Function Block for the three different length β Class systems. These curves,
like those for class Gamma, show a gain over the multiple-line system for each advance in
mobility.
B. RELIABILITY VS. TIME CURVES
The reliability of a system as a function of time is the probability (P) that the system
will be operating correctly at that time, or, out of a given sample, s, P × s of these will be
operating correctly. From the production run printout of the computer program, it is
possible to plot the percentage of the systems which are operating versus total operating
time. This plot closely approximates the reliability curve associated with a particular
strategy. The plots made here represent one minus the cumulative sum of the bars of the
graph for number of systems failed versus time. For each interval of time in which failures
occur a step function is subtracted from the curve corresponding to the number of systems
which failed in that interval. This process produces a curve which is a series of discrete
steps, starting at 1 and going to 0 as time increases. Smoothing out this curve would result
in a curve which is identical in form to the standard s-shaped reliability versus time curve
which is common to redundant systems.
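The construction of the stepped curve described above can be sketched as follows. This is a minimal Python illustration, not the original program; the function name, the argument names, and the choice of fixed-width time intervals are assumptions.

```python
def reliability_steps(failure_times, interval):
    """Build the stepped 'fraction of systems operating' curve.

    failure_times: time at which each simulated system failed.
    interval: width of the time bins (same units as failure_times).
    Returns a list of (time, fraction_operating) points, starting at
    (0.0, 1.0) and ending once every system has failed.
    """
    if not failure_times:
        raise ValueError("at least one simulated system is required")
    total = len(failure_times)
    horizon = max(failure_times)
    curve = [(0.0, 1.0)]
    k = 1
    while True:
        t = k * interval
        # One minus the cumulative count of systems failed up to time t.
        failed = sum(1 for ft in failure_times if ft <= t)
        curve.append((t, 1.0 - failed / total))
        if t >= horizon:
            break
        k += 1
    return curve
```

Each point is one minus the cumulative fraction of systems that have failed, so the curve starts at 1 and reaches 0 once the last system has failed.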
As was mentioned in the introduction to this section, this type of curve would be an
excellent comparative tool if accurate estimates of the switching circuit failure rates could be
made using completed system designs. Because the designs are not yet available, the
usefulness of these curves is restricted to that of investigating which strategies are best under
certain limiting failure rate conditions. Even under these conditions, the reliability versus
time curves are very useful because they provide a universal means of comparing all
strategies in all classes.
Figure 6. Average Number of Failures Withstood (as Percent of Beta Systems)
Versus Number of Spares per Block
Figure 7. Minimum Number of Failures (As Percent of Gamma 1 Systems)
Versus Number of Moves Per Spare
Figure 8. Minimum Number of Failures (As Percent of Beta Systems)
Versus Number of Spares Per Block.
Examples of these curves for the Beta and Gamma Class strategies are shown in
figures 9 and 10. The following comments indicate some of the significant features of
these curves.
1. Beta Class Reliability Curves
The reliability curves for the three members of the class are shown in figure 9.
The curve for an order-three, multiple-line redundant system is also shown. These curves
show a significant gain in reliability of all three strategies of the Beta Class over the re-
dundant case. The effective gain will not be as great in reality because perfect switching has
been assumed in plotting the curves.
With the limited amount of switching allowed to strategy β₁, an increase in
MTBSF of approximately 100% results. As more switching capability is allowed to the
system the reliability continues to increase, showing that strategy β₃ provides a significant
increase in reliability over either β₁ or β₂ and a very significant increase over the
multiple-line redundant case.
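For orientation, the order-three multiple-line baseline can be approximated in closed form under textbook assumptions that are not stated in this section: independent blocks, a constant block failure rate, perfect voting, and stage failure as soon as two of the three blocks in a stage have failed. The Python sketch below uses hypothetical function names and is only a reference calculation, not the method used to produce the curves of figure 9.

```python
import math


def block_reliability(rate, t):
    """Reliability of one block with constant failure rate (assumption)."""
    return math.exp(-rate * t)


def stage_reliability_order3(r):
    """Probability that at least two of three blocks are working.

    3 r^2 (1 - r) + r^3 = 3 r^2 - 2 r^3
    """
    return 3 * r ** 2 - 2 * r ** 3


def system_reliability_order3(rate, t, stages):
    """Series system of identical order-three stages (assumption)."""
    return stage_reliability_order3(block_reliability(rate, t)) ** stages
```

Any self-repair strategy with perfect switching should lie on or above this baseline, since it behaves identically until the second failure in a stage.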
2. Gamma Class Reliability Curves
Figure 10 illustrates the reliability curves for four Gamma Class strategies.
Illustrated are the limiting cases, 1 move per spare and 23 moves per spare*, as well as a
multiple-line redundant system. Two strategies of intermediate mobility are also shown.
These curves again show that the introduction of a minimal amount of switching
capability, 1 move per spare, causes a significant gain in reliability and operating time over
the redundant system. It is also evident that the first few increases in the mobility
of the spares induce further noticeable gains in reliability over the one move per spare case.
As additional mobility is granted to the system, the reliability gained begins to diminish.
This is illustrated by the fact that as much gain in reliability is attained by increasing
mobility from one to three moves per spare as is gained by going from three to twenty-three
moves per spare. This also reflects the flattening effect observed in the curves of Percent
of Failures Sustained versus Mobility of the System, wherein the additional mobility after a
certain point bought no additional gain in reliability.
* 24 Function System
Figure 9. Percent of Systems Operating (Beta Class) Versus Time
Figure 10. Percent of Systems Operating (Gamma Class 1) Versus Time
V. SUMMARY AND CONCLUSION
Before self-repairing systems can be implemented, many feasible switching strategies
must be considered in an effort to determine the most effective manner to manipulate the
redundant or "spare" blocks. The extreme complexity of the reliability expressions associated
with these strategies has resulted in the use of a computer simulation program for comparing
the effectiveness of the strategies. Rather than proceeding to write separate programs for
each strategy, a more general program has been written which employs a small number of
subroutines, each of which describes an entire class of strategies. Input data determines
which class subroutine is being used and which strategy in a particular class is being simu-
lated. Although this generalized program is a great improvement over the individual pro-
gram for each strategy approach, it still requires additional programming each time a new
class subroutine is added. At this time, the change to a more general program, whose simula-
tion strategy can be completely determined from input data, does not seem to merit the pro-
gramming time which would be required.
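The program organization described above, one subroutine per strategy class selected by input data, can be pictured with the following Python sketch. All names, the input keys, and the placeholder subroutine bodies are hypothetical; the original program is not reproduced here.

```python
def simulate_beta_class(parameters):
    """Placeholder for a Beta class subroutine; returns a description only."""
    return f"beta class, {parameters['spares_per_block']} spare(s) per function block"


def simulate_gamma_class(parameters):
    """Placeholder for a Gamma class subroutine; returns a description only."""
    return f"gamma class, {parameters['moves_per_spare']} move(s) per spare"


# One entry per class subroutine; adding a new class means adding
# one subroutine and one table entry.
CLASS_SUBROUTINES = {
    "beta": simulate_beta_class,
    "gamma": simulate_gamma_class,
}


def run_simulation(input_data):
    """Select the class subroutine and the strategy from input data."""
    class_name = input_data["strategy_class"]
    if class_name not in CLASS_SUBROUTINES:
        raise ValueError(f"no subroutine for class {class_name!r}")
    return CLASS_SUBROUTINES[class_name](input_data["parameters"])
```

The sketch mirrors the trade-off noted in the text: strategies within a class are chosen purely by input parameters, while a new class still requires new code.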
The present program includes subroutines for three classes of switching strategies.
Each class subroutine contains a great deal of flexibility, thereby including many individual
strategies. This method facilitates easy comparison between members of a class. This
comparison allows immediate elimination of many possible strategies as obviously uneconomical.
For example, the flattening out of the Percent of System Failed versus Spare Mobility
curves (figures 5 through 8) indicates that all possible strategies on the flat part of the curves
cannot be optimum strategies.
From the results of the simulation program, curves for Percent of Systems Failed
versus Spares Mobility have been plotted for the Gamma Class strategies. These curves
have been referenced to that of a multiple-line majority voted system because this particular
technique has been the most effective of the passive, failure masking, circuit level redundancy
techniques. In all cases these curves show not only that great gains can be realized over the
multiple-line redundant scheme but that by far the greatest part of these gains is realized
for the first few moves allowed to the spare function blocks. Beyond the range of relatively
limited mobility, little or no gain in the average number of failures absorbed is realized by
the additional mobility allowed to the spares. This is an encouraging result since the great
majority of the gain due to self-repair can be retained without the use of an exorbitant amount
of switching circuitry.
In the β and γ classes of self-repair strategies the degree of failure masking is the
same as that for a multiple-line redundant system of the same order of redundancy. This
is due to the fact that no "repair" is made until an ambiguity is present on the output of a
stage. This event corresponds to redundant system failure which activates the switching
mechanism and the "repair" is effected. However, until the failure is "repaired" no
failure masking is present, and incorrect information may be transmitted to the next stage.
The α class strategies provide additional failure masking because repairs can be
initiated by the first occurrence of a failure in any stage. However, because this class
implies a higher order of redundancy it cannot be compared to order-three multiple-line
redundancy as the β and γ classes have been.
The curves of figures 9 and 10 show a very definite gain in reliability for the self-repair
strategies over multiple-line redundant systems. The curves for the Beta Class strategies
show an increase in reliability for each increase in "repair" capability. Strategy β₃ yields
the highest reliability but even strategy β₁ shows a significant gain over the multiple-line
system. The reliability curves for the Gamma Class show essentially the same result with
respect to the multiple-line case. However, investigation of the curves shows that increasing
the "repair" capability produces gains for the first few increases after which the magnitude
of the gain diminishes. These curves tend to bear out the conclusions drawn from Percent
System Failed versus Spares' Mobility curves which flattened out after a certain mobility
was reached. The gains illustrated here must be considered as ideal because the switching
circuitry for self-repair is here assumed to be perfectly reliable. More realistically, the
gains obtainable will be a function of the switching circuitry complexity and will not be as
great as shown here.
VI. FUTURE STUDIES
All of the computer simulation results discussed in this report have been based on
the assumption that the switching circuitry was perfectly reliable. Efforts are now being
made to determine the range of allowable failure rates which can be associated with each
strategy for it to be of maximum effectiveness. These ranges are to be studied as a function
of the failure rates of the associated signal processor blocks. As a result, before actual
system designs are begun, information specifying the optimum switching strategy
corresponding to a given signal processor failure rate should be available.
From the sample and production simulation run printouts it has become obvious that
many of the spare function blocks do not experience as many switching operations as they
have the capability for. When all spares are assigned a uniform mobility some reach their
limit and, in doing so, substantially extend the life of the system. However, in many cases
when system failure has occurred, there are many spares remaining which have not been
used to any great extent. In order to capitalize on this phenomenon a class of strategies γ₂
is being developed which will assign different mobilities to the spares in a stage. Class γ₂
will be simulated by a new sub-routine which is being written for the computer program.
When data is available comparisons will be made between this and the other classes.
Additional classes will be simulated in a similar manner as they are developed.
None of the strategies considered so far have permitted spares to return to previous
locations. It is possible that removal of this restriction might add to the failure absorption
capability of a system. This area certainly should be explored in this study series.
Although little has been said about the physical switching techniques to be employed,
it has been tacitly assumed that the failure detection and replacement circuitry would be
combined as much as possible. It has been suggested that these two phases of the repair
function might profitably be separated and made almost completely independent from a circuit
viewpoint. This is another area which should be given careful attention.
The Alpha class strategies have not been thoroughly investigated to determine the
optimum degree of spare overlap (i.e., two sets of spares serving some of the same
functional region). The information from this investigation should influence the design of
new strategy classes as well as indicating the optimum strategy for the Alpha class.
VII. APPENDIX
A. CLASS α
Illustrated in figure A-1 is an α class strategy wherein each spare can "repair"
failures in one row and either of two stages. Spare "1" can "repair" stages 1 or 2;
"2" can "repair" 3 or 4, etc. Each spare can repair failures only in its own row. This
can be expanded such that, for example, three spares can each repair function blocks in any
of ten stages or, in general, r spares for n stages. Overlapping of spares capability may
help guard against "lumped" failures.
Many different strategies and system repair capabilities can be developed by simply
varying r and n or by overlapping possible individual spare "repair" ranges.
Figure A-1. Alpha Class Self-Repair
B. CLASS β
There are presently three specific strategies of β Class. The major difference
between these strategies is the number of spare function blocks which can replace a given
failure.
1. Class β₁ (Figure A-2)
Class β₁ allows only one "spare" for a given failure response. For example,
function block "H" is given capability as a spare for stage #4. Figure A-2a shows the
system before failures occur. When one function block, J, in stage #4 fails no switching
results other than the elimination of the failure. (See figure A-2b.) When the second failure,
say K, occurs in stage #4, function block "H" will move into stage #4 (see figure A-2c)
and resolve the ambiguity caused by the failure. After the failed block has been eliminated
block "H" remains in stage #4.
Figure A-2a. Beta Class Self-Repair (system before failure)
Figure A-2b. First Failure (no response)
Figure A-2c. Second Failure Response
It is possible that one function block will remain working alone without system
failure. For example, if function block "G" failed before "K", function block "I" will
carry the load for stage 2 after "H" switches until it fails. (See figure A-3.) System failures
occur when a lone operating function in a stage fails or when no spare is available to resolve
an ambiguity. Failure of this system could occur when function blocks "E" and "G" have failed
and failure of block "H" or "I" occurs (figure A-4), since for this strategy, block "E" is the
only spare capable of "repairing" a failure in stage #3.
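The β₁ rules described above can be replayed for a given failure sequence with a short Python sketch. Several points are assumptions made only for illustration: the function name, the layout of three blocks per stage lettered "A" through "O", the rule that a spare may not leave its home stage if it is the last operating block there, and the omission of spares for stages other than 3 and 4 (the only two assignments named in the text).

```python
def replay_beta1(layout, spare_for, failure_order):
    """Replay block failures under a beta-1 style rule set.

    layout: dict mapping stage number to the block names it starts with.
    spare_for: dict mapping a stage to the one block allowed to move into it.
    failure_order: block names in the order in which they fail.
    Returns (failures_absorbed, system_failed).
    """
    working = {stage: set(blocks) for stage, blocks in layout.items()}
    home = {b: stage for stage, blocks in layout.items() for b in blocks}
    location = dict(home)
    absorbed = 0
    for block in failure_order:
        stage = location[block]
        if block not in working[stage]:
            raise ValueError(f"block {block!r} has already failed")
        operating = len(working[stage])
        if operating == 1:
            return absorbed, True  # a lone operating block has failed
        if operating == 2:
            # Ambiguity on the stage output: a spare must move in.
            spare = spare_for.get(stage)
            usable = (
                spare is not None
                and location[spare] == home[spare]      # has not moved yet
                and spare in working[home[spare]]       # has not failed
                and len(working[home[spare]]) > 1       # assumption: never empty a stage
            )
            if not usable:
                return absorbed, True  # no spare available
            working[home[spare]].remove(spare)
            working[stage].add(spare)
            location[spare] = stage
        # With three operating blocks the failure is simply masked.
        working[stage].remove(block)
        absorbed += 1
    return absorbed, False


# Hypothetical layout: three blocks per stage, lettered in order.
LAYOUT = {1: ["A", "B", "C"], 2: ["D", "E", "F"], 3: ["G", "H", "I"],
          4: ["J", "K", "L"], 5: ["M", "N", "O"]}
SPARE_FOR = {3: "E", 4: "H"}  # the two assignments named in the text
```

Replaying the sequence J, G, K leaves the system operating with block "I" alone in its stage, while the sequence E, G, H ends in system failure at the third failure because block "E" is no longer available, in line with the two examples in the text.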
Figure A-3. Third Failure Response
Figure A-4. Catastrophic Failure Sequence
2. Strategy β₂ (Figure A-5)
Strategy β₂ is similar to β₁, but it allows one additional function block to
replace failures in a given stage. In strategy β₂ function block "M" in addition to "H" is
given the capability of replacing failed blocks in stage #4. Strategies β₁ and β₂ operate
identically through the first two failures. When the third failure in stage #4 occurs block
"M", if still operative, will switch into stage #4 in the same fashion as did function block
"H" in Class β₁. This move is labeled "2nd response" in figure A-5. System failure in
strategy β₂ occurs in the same manner and under the same conditions as in strategy β₁.
Figure A-5. Beta 2 & 3 Strategy
3. Strategy β₃ (Figure A-5)
Strategy β₃ extends the scheme one step further. Here, a third function
block is allowed to move in addition to the two responses allowed to strategy β₂. In this
strategy the ability is imparted to function block "G" in stage 3 to replace failed blocks in
stage #4. This is the 3rd response shown in Figure A-5. Again, failure occurs in the
identical fashion to the other two strategies.
C. GAMMA (γ) CLASS
Gamma Class is divided into two parts: Class γ₁, where all spare function blocks have
the same mobility, and Class γ₂, where one spare in each stage has a greater mobility than
the other.
1. Class γ₁ (Figure A-6)
As in Beta Class strategies, the first failure in a stage of a Gamma Class system
evokes no response from the system. The second failure creates an ambiguity on the output
of the stage. This activates the switching mechanism to switch block "H" into stage 4 thereby
dissolving the ambiguity. (See Figure A-6b.) The second failed block is now identified and
switched out of the system. Block "H" remains in stage 4 to detect subsequent errors. If
another failure occurs in stage 4, for example block "L", block "G" from stage 3 will switch
into stage 4 in the same manner as did block "H". This leaves no error detecting capability
in stage 2. To overcome this, block E from stage 2 switches into stage 3 to fill the void created
by the switch of block "G". (See figure A-6c.)
Figure A-6a. Gamma 1 Strategy - First Failure
Figure A-6b. Second Failure Response
Figure A-6c. Third Failure Response
Figure A-6d. Single Block Operation
Now if a failure should occur in stage 2, say block "D", a spare function block "B",
from stage 1 will switch to stage 2 and the failed block "D" will be switched from the system.
(See figure A-6d. ) As additional failures are sustained this process continues until a limit
is reached. The end to this process can be reached in one of two ways:
1) A limit can be set for the mobility of a particular function block.
In this case, once a function block has reached its limit it can no longer act as a spare for
failures in the stage following it. If a critical failure occurs and all possible spares have
failed or reached their limits the system fails. Voids which cannot be filled due to spares
reaching their limit remain as voids but the system continues to operate until the remaining
function block fails. This limit sequence is illustrated in figure A-7a. Block "A" has a
mobility of 3 and after a given failure pattern the system appears as in Figure A-7a. Block "A"
has reached its limit. Upon the occurrence of a critical failure in stage #4, block "A"
cannot act as a spare for this stage. The ambiguity remains on the output of stage 1 and the
system is considered failed. However, if the critical failure occurred in stage 2 rather than
stage 4, block "M", since it hasn't reached its limit, would switch into stage 2 and resolve
the ambiguity. This leaves a void in stage 1. Function block "G" cannot switch into stage 1,
hence, the void remains and the system works properly as long as the remaining block in
stage 1 does not fail.
Figure A-7a. Function Block Limit
2) Another failure mechanism can exist for class γ. When the system
has sustained a large number of failures such that the number of remaining spares is equal to
the number of stages this second mechanism becomes effective. When an additional
failure occurs, each spare function block will respond once, the initial one will resolve the
ambiguity and others will fill the successive voids which appear in the immediately preceding
stages. Since there is now one less spare than there are stages a void must remain
somewhere in the system. If the next failure is in the stage which contains the void or that stage
for which the void would have been a spare, the system goes down. For example, referring
to Figure A-7b, if function block "G" fails, block "D" will switch into stage #4 to correct for the
failure. Block "A" will fill the void for block "D", block "M" for "A" and block "H" for block
"M". The process stops here. There is a void in stage 5. Now failure in stage 1 or stage
5 will cause system failure. Class γ₁ allows uniform mobility to each spare function
block in the system.
Figure A-7b. Marginal Operation
Many different strategies are contained under the heading of Class γ₁. These
differ primarily in the limit assigned to the mobility of the spare function blocks. A
particular strategy may be identified by specifying "n" in the statement "n moves per spare."
The value of n prescribes where a given function block will reach its limit and therefore
controls the differences between the various strategies of Class γ₁.
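The role of "n" in "n moves per spare" amounts to a per-block move counter. The small Python sketch below illustrates that bookkeeping only; the class and attribute names are hypothetical and the cascade of void filling is not modeled.

```python
from dataclasses import dataclass


@dataclass
class SpareBlock:
    """A spare function block with a mobility limit of n moves."""
    name: str
    move_limit: int      # the "n" in "n moves per spare"
    moves_made: int = 0

    def can_move(self) -> bool:
        """True while the block has not yet reached its limit."""
        return self.moves_made < self.move_limit

    def move(self) -> None:
        """Record one switching operation into the following stage."""
        if not self.can_move():
            raise RuntimeError(f"block {self.name} has reached its limit")
        self.moves_made += 1
```

A block with a mobility of 3, such as block "A" in the limit example above, can be moved three times; after that `can_move()` returns False and the block can no longer act as a spare.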
2. Class γ₂
Unlike the Gamma 1 Class, which assigns the same mobility to all spare function
blocks, Gamma 2 Class allows the two spare function blocks to differ from one another in
mobility. Figure A-8 will assist in the description of the switching processes which occur
for strategy Gamma 2. The members of the top row are assigned a mobility 3, those of the
middle row, a mobility 2.
The first failure in a stage will evoke no response aside from the elimination of
the failed block from the system. Upon failure of the second function block in a stage (stage 4),
the spare will be drawn from the next stage (stage 3). Block "G" which has the greater mobility
will switch from stage 3 to stage 4. (See figure A-8a.) This is the only switch which will
occur. Since there are two function blocks remaining in stage 3 the void created by the
switch will not be filled. The next failure occurring in stage 4 will require another spare
to be switched into the stage. This spare is drawn from the next stage which has a spare with
high mobility and which is within range to supply the need, i.e., block D from stage 2 will
switch into stage 4. (See figure A-8b.) This leaves another void which is not filled and which
need not be filled. In the system described in figure A-8, the next failure in stage 4 cannot
draw a high mobility spare A, because it is out of range for stage 4. In this case the lower
mobility spare from stage 3, spare "H", is used. This leaves a void in stage 2 which must be
filled since there is only one remaining operating function block in that stage. This void
is filled as though it were a failure; if a high mobility spare is available it will be switched,
i.e., function block "A" will switch to stage 3. (See figure A-8c.) This process continues
until either a failure occurs and no spare is available or a lone remaining function block in a
stage fails. System failure occurs at this point.
Figure A-8a. Gamma 2 Strategy - First Failure
Figure A-8b. Gamma 2 Strategy
Figure A-8c. Gamma 2 Strategy - Third Failure Response
NASA-Langley, 1964 CR-105
