Techniques for the realization of ultra-reliable spaceborne computers: Final report, by Goldberg, J. et al.
Final Report - Phase I
TECHNIQUES FOR THE REALIZATION
OF ULTRA-RELIABLE SPACEBORNE COMPUTERS
By: J. GOLDBERG, K. N. LEVITT, R. A. SHORT
Prepared for:
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
ELECTRONICS RESEARCH CENTER
575 TECHNOLOGY SQUARE
CAMBRIDGE, MASSACHUSETTS 02139 CONTRACT NAS 12-33
https://ntrs.nasa.gov/search.jsp?R=19670014623 2020-03-16T18:24:28+00:00Z
September 1966
SRI Project 5580
Approved: D. R. BROWN, MANAGER
COMPUTER TECHNIQUES LABORATORY
J. D. NOE, EXECUTIVE DIRECTOR
ENGINEERING SCIENCES AND INDUSTRIAL DEVELOPMENT
ABSTRACT
This is a report of a study of techniques for the realization of
ultrareliable, high-performance, spaceborne computers. The study included
the evaluation of, and several new contributions to, the most significant
known techniques and the proposal and investigation of several promising
new techniques. The state of the art of existing redundancy techniques
for fault-detecting and fault-masking is assessed, with special emphasis
on multiple-line voting redundancy, error-correcting codes, and redundant-
state schemes for sequential networks. A number of directions for the
improvement of these techniques are described. Significant potential
improvements in reliability are available in designs allowing for a high
degree of reconfigurability in structure and programs, and system schemes
and design techniques needed for such behavior are proposed and investi-
gated. In particular, we discuss the design of minimal test schedules
for fault detection and diagnosis, the design of highly modular processing
networks and of programmable interconnection networks, and the overall
organization of maintenance and computation functions in a computer system.
The application of error control techniques to memory systems and to power
supplies is considered, and the possible use of all-magnetic logic networks
is examined. Included in the report is a critical and selective survey
of the literature that is relevant to the attainment of reliable systems
and networks through the judicious use of redundant structures. Finally,
recommendations are made for further research into the development of
techniques for ultrareliable system design.
FOREWORD
This is a report of a one-year research study of techniques for the
realization of ultrareliable spaceborne computers. This study was con-
ducted in the Computer Techniques Laboratory of Stanford Research Institute,
under the sponsorship of the Electronics Research Center of the National
Aeronautics and Space Administration.
The major objective of the study was to provide guidelines for the
design of computers intended to function reliably under the severe con-
ditions imposed by spaceborne missions. It is clear that the spaceborne
requirement introduces design difficulties which are not attendant to other
applications. For example, the possibility of unprogrammed maintenance and
inspection routines is severely limited; the successful use of a radio link
cannot always be assured; the computations are complex and highly varied;
the performance requirements are very high; and there are special physical
constraints on the construction and operation of the computer. In order
to attain an acceptably reliable system the judicious use of redundant
structures is mandatory. Of course, the observation that redundancy is
required to improve reliability is not unique to this study, and hence in
the course of our investigation we utilized many well-established (at least
in principle) techniques, e.g., fault masking by multiple-line voting and
adaptive replacement of faulty subsystems with standby units. In order to
properly assess the myriad of proposed redundancy measures, a considerable
portion of the report is devoted to commentaries on state-of-the-art
developments. This inclusion of review material enables an engineer who
is related to this area solely as a user to satisfy his requirements with
minimal recourse to other documents.
However, many novel developments are reported herein, and our concern
in this Foreword is to guide the reader--whether he is a research
specialist or one with little prior knowledge of the subject matter--
to those sections which are of most interest to him.†
The report is organized into four chapters and four appendices.
Each major section of the chapters contains conclusions and a detailed
listing of outstanding research problems; the major conclusions of the
research study and recommendations for further research are presented in
Chapter IV. The first chapter serves as an overall introduction to the
report. It contains (1) the statement of the problem--in particular a
detailed discussion of the characteristics of advanced spaceborne
computers; (2) the goals, methods and assumptions of the study; (3) the
criteria of performance, including a discussion of relevant reliability
and cost measures; and (4) the organization of the report.
The second chapter is concerned with those logical design techniques
for fault masking and error detection in which the error control is
passive. In Sec. II-A-1* we present a detailed (historical) review of
the most important known fault-masking techniques. Specific combinational
fault-masking design techniques are given in Sec. II-A-2. Included herein
are reviews of the multiple-line voting approaches [Sec. II-A-2-a-2)]*
as applied to such simple network models as cascades and trees, and also
some embellishments of known techniques [Sec. II-A-2-a-3)] for the
analysis of arbitrary replicated networks. Some new results are presented
on bounds on the reliability of arbitrary replicated networks [Sec. II-
A-2-a-4)] and also on optimum techniques for the realization of multiple-
output networks which are replicated [Sec. II-A-2-a-5)]. In Sec. II-A-
2-b we present some unique digital realizations of the extremely powerful
adaptive-voting scheme, along with a detailed examination of schemes which
combine fault masking and network replacement. Such schemes have hereto-
fore not been reported in the literature. In Sec. II-A-2-c* a review is
presented of the known techniques for the realization of voting networks--
much of the prior work has related to networks which realize the majority
function of 3 or 5 inputs--and also some novel designs (which are
approximately minimal) are given for majority-function networks with an
arbitrary number of inputs, suitable for different technologies. A review
of the known techniques of fault control for sequential networks is
presented in Sec. II-A-3* along with some (apparently) novel fault-detection
schemes relying upon state-parity checking and state-weight checking.
Chapter II concludes with Sec. II-B, which surveys in detail the known
coding techniques which appear to be appropriate for checking computer
operations.
† Sections containing reviews of prior techniques are marked, in this
Foreword, by an asterisk.
Chapter III is concerned with techniques for dynamic error control,
i.e., ways in which the logical interconnections among the components of
the computer may be altered. In Sec. III-A we discuss the particular
system organization features that facilitate dynamic maintenance processes.
Section III-B is concerned with the design of test schedules of minimal
length, for the fault diagnosis of combinational networks. Included here
is a review of the state of the art of diagnosis (Sec. III-B-3)* along
with a discussion of some novel techniques for fixed-schedule and
serial-schedule types of tests based upon reduction of a fault table. In
Sec. III-C we consider the design of networks for a reconfigurable
network--an area which has received little prior attention. The design of
commutation networks--networks whose function is to provide interconnection
between operating modules and to disconnect faulty modules from the system--
is discussed in Sec. III-C-2. Two types of structures are presented--a
unique sequential network and a combinational type of network which is
somewhat suggestive of the central telephone exchange. In Sec. III-C-3
we consider the design of a modular arithmetic processor wherein the
functions of computation, storage, and primitive control are all combined
in an iterated set of replaceable modules. The chapter concludes with
brief descriptions of some novel techniques for realizing programmable
control units. This remains a major area for further research.
Appendix A* emphasizes the practical problems of applying redundancy
to spaceborne memories. Included here is an evaluation and comparison of
several state-of-the-art schemes, e.g., the use of codes to protect the
data channels and the access circuits, in addition to some suggestions
for future research such as a consideration of those reliability
techniques which relate to special memory types.
Appendix B is a detailed discussion of the reliability problems
peculiar to power supplies. Suggestions are given for novel means of
error control, including considerations of weight and volume.
Appendix C is an examination of the possible role of magnetic logic
for attaining ultrareliable operation, with special attention to appli-
cations wherein the low speed of operation attendant to magnetic logic
does not limit overall computation speed.
Appendix D* is a critical and selective survey of the literature
that is relevant to the attainment of reliable systems and networks
through the judicious use of redundant structures. Although several
complete bibliographies of the literature have appeared previously, no
surveys were available which could be used to quickly distinguish those
contributions which are concerned with tactical expositions, applications,
or advanced mathematical theories.
The technical studies reported here are the work of the following
members of the Computer Techniques Laboratory:
Mr. J. A. Baer
Mr. C. B. Clark
Dr. B. Elspas
Mr. J. Goldberg
Dr. W. H. Kautz
Dr. K. N. Levitt
Mr. S. W. Miller
Dr. R. A. Short
Dr. H. S. Stone.
All of these individuals contributed to the writing of the various
sections of the report. The report was organized and edited by
Mr. J. Goldberg, who was Project Leader, Dr. K. N. Levitt, and Dr. R. A. Short.
CONTENTS
ABSTRACT
FOREWORD
LIST OF ILLUSTRATIONS
LIST OF TABLES
I OBJECTIVES AND APPROACH
  A. Statement of the Problem
    1. Basic Characteristics of an Advanced Spaceborne Computer
      a. Special Requirements and Constraints in Computation, Maintenance, and Construction
      b. Design Consequences of the Special Requirements and Constraints
    2. Problems of Design for Reliability
  B. Goals, Methods, and Assumptions of the Study
    1. Goals of the Study
    2. Method of Approach of the Study
      a. Survey of Known Techniques
      b. Conception of New Schemes
      c. Recommendations for Directions of Further Research
    3. Technical Assumptions of the Study; Definition of Terms
      a. Definition of Terms
      b. Scope of Design Techniques
      c. Device Technologies
  C. Criteria of Performance
  D. Organization of the Report
II TECHNIQUES OF LOGICAL DESIGN FOR FAULT MASKING AND ERROR DETECTION
  A. Fault-Masking Techniques for General Logic Functions
    1. Review of Significant Techniques
    2. Combinational Techniques for Fault Masking in General Logic Networks
      a. Techniques for the Design of Multiple-Line Fault-Masking Networks
        1) Introduction and Summary of Prior Work
        2) Techniques for the Analysis of Simple Models
        3) Techniques for the Analysis of Arbitrary Triplicated Networks
        4) Bounds on Network-Failure Probability
        5) Techniques for the Realization of Multiple-Output Networks with Voter Redundancy for Fault Masking
        6) Conclusions and Future Problems for Study
      b. Techniques for the Combination of Fault Masking and Replacement
        1) Introduction
        2) Techniques for the Realization of the Adaptive-Voting Scheme
        3) Description of the Switching-Over-Voting Scheme
        4) Description of the Voting-Over-Switching Scheme
        5) Comparisons and Conclusions
      c. Voting Networks
        1) Introduction
        2) Logical Designs for Simple Majority Networks
        3) Canonical Structures for Multiple-Output Voting Networks
    3. Sequential Networks
      a. Introduction
      b. Classification of Faults
        1) Output-Only Faults
        2) Delay-Element Faults
        3) Memory-Excitation Faults
        4) Output-Plus-Memory Excitation Faults
        5) Overall-Network Faults
      c. Logical-Redundancy Techniques
      d. Schemes for Fault Detection
        1) State-Parity Checking
        2) State-Weight Checking
  B. Use of Codes for Storage and Arithmetic Operations
    1. Introduction
    2. Codes for Checking Storage
      a. Threshold Decoding
      b. Tradeoffs Between Memory Redundant Channels and Error Probability
    3. Codes for Checking Arithmetic Operations
      a. Separable Codes
      b. Nonseparable Codes
      c. Evaluation
III TECHNIQUES FOR DYNAMIC ERROR CONTROL
  A. Problems of System Organization
    1. Basic Behavioral and Structural Characteristics of an Advanced Spaceborne Computer
    2. Organization of Basic Processes
    3. Approaches to System Structure
      a. Introduction
      b. Approaches to Structural Parallelism and Functional Specialization for General Computation
      c. Factors of Module Size and Specialization
      d. A Suggested Model
      e. Approaches to Structural Specialization for Maintenance Computation
      f. Coordination of Information Types
      g. Problems of Subsystem Design
  B. Tests for Diagnosis of Fault Conditions
    1. Introduction
    2. Fault Diagnosis in Combinational Circuits Using Fixed Test Schedules
      a. Introduction
      b. Formulation of the Problem
      c. Formal Solution Using the G-matrix
      d. Simplified Solution Using the #-Matrix for Fault Location
      e. Some Bounds on the Number of Tests Required
      f. Reductions in the Size of the Fault Table
      g. Implementation of the Test Schedule
      h. Tests for Multiple-Output Networks
    3. Fault Diagnosis in Combinational Circuits Using Serial Test Schedules
      a. Introduction
      b. Fault Detection
      c. Fault Location
      d. Fault Location to Within Modules
      e. Bounds
      f. Potential Economies of Serial Test Schedules for Fault Location
    4. Fault Diagnosis in Digital Computers: Present State of the Art
  C. Design of Networks for a Reconfigurable Computer
    1. Introduction
    2. Programmable Interconnection Networks
      a. Introduction
      b. A Sequential Commutation Network
        1) Overall Behavior of the Network
        2) General Description of the Propagation Mode
        3) Description of the Cell Design
        4) Summary
      c. Combinational Commutation Networks--Minimization of Number of Switches
        1) Introduction
        2) Single-Level Order-Preserving Network
        3) Double-Level Order-Preserving Network
        4) Single-Level Non-Order-Preserving Network
        5) Double-Level Non-Order-Preserving Network
      d. Setup and Control Circuits
      e. Failure-Tolerant Interconnection Networks
      f. Conclusions and Problems for Future Study
    3. Programmable Processing Modules
      a. General Structure of a Modular Processor
      b. Module Description
      c. Microprograms for Common Functions
      d. Other Uses of the Module
      e. Problems for Further Study
    4. Programmable Control Units
      a. Uses of Programmability in a Control Unit
      b. Approaches to the Structuring of Modular Programmable Control Units
        1) Control Based Upon a Microprogram Memory Store
        2) Control Using a Programmable Cellular Network
        3) Control Based Upon a Network of "Universal" Logic Modules
IV CONCLUSIONS AND RECOMMENDATIONS FOR FUTURE STUDY
  A. Conclusions
  B. Summary of Needs for Technique Development
  C. Summary of Suggested Problems for Future Research
Appendix A ERROR-CONTROL TECHNIQUES FOR MEMORY SYSTEMS
  1. Introduction
  2. General Discussion of the Problem
  3. Error Protection by Replication of Whole Memories
    a. Triplication with Voting
    b. Duplication with Parity Checking
  4. Error Protection by Redundancy Within a Memory
    a. Redundant Bit Channels
    b. Redundant Words
    c. Accommodation to Access Faults
    d. Addition of Access Redundancy
    e. Redundant Access Circuits
    f. Redundant Material in the Storage Module
    g. Redundant Cycle Control
    h. Redundant Power and Environment Control
  5. Design of a Parallel Encoder/Decoder
  6. Conclusions
Appendix B DISTRIBUTED POWER-SUPPLY SYSTEMS
  1. Introduction
  2. Advantages of Distributed-Power-Supply Systems
  3. Disadvantages of Distributed-Power-Supply Systems
  4. The Interdependence Between Power-Supply and Logic Circuits
    a. Noise Problems
    b. Fault Location, Isolation, and Corrective Action
  5. Examples of Three Possible Designs for Power-Supply Systems
  6. A Possible Configuration for a Power-Control System
  7. Weight and Power Required for a Distributed Power Supply
  8. Conclusions
Appendix C APPLICATION OF MAGNETIC LOGIC
  1. Introduction
  2. Reliability of Magnetics
  3. A Magnetic-Monitor Concept
    a. A Metering Monitor
    b. An Information-Sampling Monitor
  4. Implementation of Magnetic Monitor
  5. Magnetic Switches
    a. Converging Switch
    b. Interconnection Switch
    c. Data-Path Switch
    d. Power Switching
  6. Backup Control
  7. Conclusions and Recommendations
Appendix D A SURVEY OF THE PUBLISHED LITERATURE ON THE ATTAINMENT OF RELIABLE SYSTEMS THROUGH THE USE OF REDUNDANCY
  1. Introduction
  2. Summary of Subject Areas
    a. Overview
    b. Categorization of Subject Areas
  3. Discussions of General Background
    a. On the Need for Reliable Systems
    b. On the Analysis of Reliable Systems--Bibliographies
    c. Optimum Redundancy and Other Considerations
  4. Discussions of Static Redundancy Applications
    a. Fault-Masking Techniques
      1) Nonvoting Schemes
      2) Voting Schemes
    b. Application of Coding Theory
    c. Static Redundancy in Sequential Machines
  5. Discussions of Dynamic Redundancy Applications
    a. Approaches to Fault Diagnosis
    b. Spare-Equipment Considerations
    c. System Organizations that Facilitate Self-Repair
  6. Peripheral Considerations
REFERENCES
ILLUSTRATIONS
Fig. II-A-1 A Classification Tree for Redundancy Techniques
Fig. II-A-2 Stage for Cascade Network
Fig. II-A-3 Simplified Representation of a Restored Network Stage
Fig. II-A-4 Reliability Improvement as a Function of Cost for Cascade Model
Fig. II-A-5 Fan-In Stage for Tree Network
Fig. II-A-6 Stage Partitioning of Uniform Tree
Fig. II-A-7 "Arbitrary" Triplicated Stage
Fig. II-A-8 Network to Illustrate Linking
Fig. II-A-9 Arbitrary Network to be Subdivided into Stages
Fig. II-A-10 Stage Subdivision of Network
Fig. II-A-11 Cascade Network Which Minimizes Failure Probability
Fig. II-A-12 Network Which Maximizes Failure Probability
Fig. II-A-13 Tree Realization of Serial Decoder
Fig. II-A-14 Cascade Realization of Serial Decoder
Fig. II-A-15 Adaptive-Voting Scheme (Using Threshold Logic)
Fig. II-A-16 Switching-Over-Voting Scheme
Fig. II-A-17 Voting-Over-Switching Scheme
Fig. II-A-18 Adaptive-Voting Scheme (Using Digital Elements)
Fig. II-A-19 Combining Network for Digital-Adaptive Scheme
Fig. II-A-20 Logical Structure for Switching-Over-Voting Scheme
Fig. II-A-21 Linear-Input-Logic Majority Gate
Fig. II-A-22 Majority-Element Majority Networks (Amarel, Cooke & Winder)
Fig. II-A-23 AND-OR Majority Networks
Fig. II-A-24 Majority-Element Multiple-Output Voting Network
Fig. II-A-25 AND-OR Multiple-Output Voting Network
Fig. II-A-26 Diffusion-Type Multiple-Output Voting Network
Fig. II-A-27 Model of Sequential Network
Fig. II-A-28 Partitioned Model of Sequential Network
Fig. II-A-29 State-Parity-Checked Sequential Network
Fig. II-A-30 Two-Out-Of-Five Counter
Fig. II-A-31 Sequential Network with State-Weight Checking
Fig. II-B-1 Threshold Decoding for Memory Channels
Fig. II-B-2 A Separable Code System
Fig. II-B-3 A Nonseparable Code System
Fig. III-A-1 Serial Computer (von Neumann)
Fig. III-A-2 Schemes with Local Parallelism
Fig. III-A-3 Schemes with General Parallelism
Fig. III-A-4 Scheme with Small-Module Parallelism
Fig. III-A-5 System with Self-Diagnostic Computer and Master Machine Controller
Fig. III-A-6 Distinct-Maintenance-Center System with Separate Working and Maintenance Computers
Fig. III-A-7 Polymorphic Systems with Floating Maintenance Control
Fig. III-B-1 Path-Sensitizing Tests in Gate Networks
Fig. III-B-2 Diagnoser Structures
Fig. III-B-3 Decision Trees for Fault Location
Fig. III-B-4 Decision Trees for Sequential Tests
Fig. III-B-5 Contact-Tree Analogs of Decision Trees
Fig. III-B-6 Sequential Decision Tree for a Limiting Case
Fig. III-C-1 Interface Between Module Types
Fig. III-C-2 Block Diagram of a Shift-Register Commutating Network
Fig. III-C-3 Typical Symbol States in an Asynchronous Shift-Register Commutator
Fig. III-C-4 Block and State Diagrams for a Speed-Independent Module
Fig. III-C-5 Block and State Diagrams of Full Shift-Register Cell
Fig. III-C-6 Logical Realization of Shift-Register Cell
Fig. III-C-7 Single and Double-Level Interconnection Schemes
Fig. III-C-8 Single-Level Order-Preserving Commutation Network N1 = N2 = 15, M = 5
Fig. III-C-9 Double-Level Order-Preserving Commutation Network N1 = N2 = 15, M = 5
Fig. III-C-10 Single-Level Non-Order-Preserving Commutation Network N1 = N2 = 15, M = 5
Fig. III-C-11 General Double-Level Non-Order-Preserving Network
Fig. III-C-12 Double-Level Non-Order-Preserving Commutation Network N_ = 20 = 6, N2 = 6, M = 5, 72 Switches
Fig. III-C-13 Holland-Type Cellular Commutation Network
Fig. III-C-14 Nonredundant and Redundant Commutation Networks
Fig. III-C-15 A Reconfigurable Parallel Processing Unit
Fig. III-C-16 Module for a Reconfigurable Processor
Fig. III-C-17 A Microprogram Control Unit
Fig. III-C-18 A Programmable Cellular Logic Network
Fig. III-C-19 Details of a Programmable Cell
Fig. III-C-20 Logic Module Based on a Programmable Universal Logic Module
Fig. III-C-21 Reconfigurable Control Unit Based Upon Programmable Universal Logic Modules
Fig. A-1 Functional Connections Between Main Memory Subsystem and Computer System
Fig. A-2 Redundant Bit-Channel Connection
Fig. A-3 Error-Correcting Code in Bit Channels
Fig. A-4 Parity Generators for i and k
Fig. A-5 Syndrome Decoder, Bit Corrector and Data Register Transfer Gates
Fig. B-1 Power Conditioning Systems
Fig. B-2 Possible Power-Control Systems
Fig. B-3 Weight and Power Requirements for 20-Watt Distributed Supply
Fig. C-1 Data-Path Switch
Fig. C-2 Ferroresonant Switch Circuit
TABLES
Table II-A-1 Double Failure Patterns by Linking Procedure
Table II-A-2 Triple Failure Patterns by Linking Procedure
Table II-A-3 Measure of Lower and Upper Bounds on Failure Probability
Table II-A-4 Comparison of Failure Probabilities for Cascade and Tree Realizations
Table II-B-1 Reliabilities and Memory Sizes for Single Error-Correcting Codes
Table II-B-2 Reliabilities and Memory Sizes for Double Error-Correcting Codes
Table III-C-1 State-Transition and Output Logic for Register Cell
Table III-C-2 "Break-Even" Values of N
Table III-C-3 Comparison of Switch Networks
Table III-C-4 Logic Equations for a Processing Module
Table III-C-5 Basic Microoperations for a Modular Processing Unit
Table III-C-6 Microoperation Codes
Table III-C-7 Microprogram for Multiplication
Table III-C-8 Microprogram for Decoding a Binary Vector
I OBJECTIVES AND APPROACH
In this chapter we shall discuss the problem of realizing ultra-
reliable spaceborne computers and explain the method of approach taken
in the study of the problem. We first discuss the basic characteristics
of an advanced spaceborne computer and the problems of design that result
from the novel operational and technological requirements involved. We
then discuss the particular goal of the study, its technical scope, and
the criteria employed.
A. Statement of the Problem
In this part we consider the basic characteristics of a future advanced
spaceborne computer, and the problems of design that arise from the re-
quirements of performance, the constraints on construction and operation,
and the unreliability of components and assembly.
1. Basic Characteristics of an Advanced Spaceborne Computer
a. Special Requirements and Constraints in Computation,
Maintenance, and Construction
The range of computational problems and the rates and capacities
of computation that will be required in the coming generation of space
computers are currently under study by NASA.[114]* Some qualitative
statements may be made at this time:
First, the computations will be complex and highly varied.[45]
They may be expected to include the checkout, monitoring and control of
other spacecraft subsystems; guidance; navigation; maneuvering; the control
of communications and of experiments; the processing of data from experi-
ments and photographs; and display. If the scope of the missions is
extended to operations on the surface of a moon or planet, the list may be
extended to include control of complex stationary or non-stationary
mechanisms. The computations may thus be expected to be of the general
scientific type, possibly including heuristics.
* References are listed alphabetically at the end of this report.
Second, the performance requirements will be very high. For
the foregoing tasks, large memory capacity would be required for storage
of constants and intermediate variables and for storage of the many
complex programs required, both for the objective computations and for
executive and maintenance computations. The input signals for such
applications would range widely in form and rate and might not be synchronized
with the computer's cycle of operation. Thus the computer must be
capable of sustaining a number of active inputs at once, and it must be
interruptible by external command.
Certain tasks, such as guidance in the vicinity of a planet,
require very high computation rates.
Some of the computations, such as pattern processing and co-
ordinate transformations, have functions that may be evaluated with a high
degree of parallelism; hence not only is general high-speed arithmetic
needed, but some kind of bulk-parallel processing may also be very useful.
Third, the computations will have a range of priorities. In
complex missions of long duration it is natural to include as many activities
as are permitted by constraints of weight and power. The activities will no
doubt have widely differing values to the overall mission, and there will
no doubt be a complex set of interdependencies among the activities. Also,
some computations will have a range of acceptable precision, with a corre-
sponding range of values to the mission.
Fourth, there will be special physical constraints on the
construction and operation of the computer. In construction, there will
be severe limitations on weight, volume and allowable power dissipation.
Also, it is likely that physical accessibility to components will be very
restricted. In the operation of the computer, there may be occasional
interruptions in power, either planned or unplanned. Restarting and recovery
of a computation after such interruptions must, of course, be automatic.
The most important physical constraint is that the components
available for construction are not perfectly reliable. In fact, for the
number of components needed and for the length of time of operation of
the missions of interest, not only is the probability of error-free operation
unacceptably low, but it is extremely expensive in time and equipment to
test a computer* so as to estimate its reliability accurately.
Fifth, the amount of available human intervention will be very
limited. The computers of interest to NASA for the present study include
those for manned and for unmanned missions. The main functions that will
be affected by the presence of a man are the executive control of the
various phases of computation, the checking and repairing of the computer,
and the peripheral functions of input and output. Even with a man present,
there will be some limitations on control due to restrictions on time,
accessibility, or technical knowledge. Some radio communication with a
manned maintenance facility should be possible in many missions.
b. Design Consequences of the Special Requirements and Constraints
The requirements and constraints described have special signifi-
cance in the design of a computer system. The complexity and variety of
computations require that the computer be general-purpose programmable.
The high performance requirements call for large memory capacity, high
processing speed, and input-output facilities with elaborate signal-
processing and control features. Thus a large number of components will
be needed for logic and for storage functions. The variability in value
among the computations requires that the computer be capable of altering
the scheduling of tasks to match the available performance capability,
in the event that failures in equipment reduce that capability below
its nominal value.†
* We emphasize the computer as a whole, since the reliability of the
assembly of components is as significant a factor in the reliability
of modern systems as is the reliability of the components themselves.
† Such behavior is known colloquially as "graceful degradation" and
"failing soft."
The limitation on human intervention for executive control
requires that there be some degree of built-in capability for the basic
executive functions, as follows:
• For organizing the equipment and the programs
so as to achieve the highest possible level of
service on the tasks of a mission, according
to the value of the tasks.
• For organizing the equipment and the programs
to avoid errors, by avoiding the use of faulty
equipment or by performing computations redun-
dantly.
• For detecting errors and correcting them by
recomputation.
In addition to this built-in capability, any radio communication that is
available should be exploited to its limit, because the reliability and
depth of analysis in a manned facility will be superior to that available
in an on-board program. However, for deep-space missions, the data rate
and delay time and the reliability of such communications may well be in-
adequate to ensure the reliability of some real-time computations.
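The first executive function listed above, matching the set of active tasks to the capability that remains after failures, can be illustrated by a minimal sketch. The task names, values, and load figures below are hypothetical, and the greedy value-per-load rule is only one simple policy chosen for illustration; the report does not prescribe a scheduling algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str      # hypothetical task label
    value: float   # worth of the task to the mission (arbitrary units)
    load: float    # fraction of nominal computing capability it consumes


def select_tasks(tasks, capability):
    """Greedily pick tasks by value per unit load until capability is used up.

    `capability` is the fraction of nominal capability still available
    (1.0 means no failures). Illustrative policy only; it is not optimal
    in general.
    """
    chosen = []
    remaining = capability
    for task in sorted(tasks, key=lambda t: t.value / t.load, reverse=True):
        # Small tolerance guards against floating-point round-off.
        if task.load <= remaining + 1e-12:
            chosen.append(task.name)
            remaining -= task.load
    return chosen


# Hypothetical task set, invented for this example.
TASKS = [
    Task("guidance", value=100.0, load=0.40),
    Task("telemetry", value=30.0, load=0.20),
    Task("pattern_processing", value=20.0, load=0.30),
    Task("housekeeping", value=5.0, load=0.10),
]

if __name__ == "__main__":
    for capability in (1.0, 0.7, 0.4):
        print(capability, select_tasks(TASKS, capability))
```

As the available capability falls from 1.0 to 0.4 in this example, the low-value tasks are dropped first and the highest-value task is retained, which is the "graceful degradation" behavior described in the text.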
The special physical constraints on construction and operation
have a number of implications for the logical organization of the computer.
In modern semiconductor technology the weight and volume of a computer
is predominantly that of the packaging and the interconnections. These are
strongly influenced by factors of logical organization, such as the degree
of parallelism of logical operations, the size of the basic packages, and
the way in which logic functions are divided among the packages. The
limitations on power dissipation are significant in several ways. First,
there is a complex interaction between system speed and power dissipation;
i.e., use of slow circuits reduces the power cost per circuit, but it
requires use of more circuits, operating with a higher degree of parallelism,
in order to achieve a given computation speed. Second, in a redundant
system, it would be advantageous to be able to remove power from inactive
components, both to reduce power consumption and to increase the life
of components.*
Finally, the fact that it is impossible to ensure perfect
operation with adequate confidence for large computers over the period
of time appropriate to sustained space missions, requires that the
computer have the capability of accommodating failures among its component
parts. Since modern component parts do not have the capability for physical
self-repair, some form of redundancy of components is clearly required.
To summarize, an advanced spaceborne computer for future
deep-space missions must be general-purpose programmable; it must have
high memory capacity, high computation speeds and complex input-output
facilities; it must have some combination of remote and local control
of error-accommodation processes; and it must employ some form of
logical redundancy in its construction.
2. Problems of Design for Reliability
In the foregoing part, it was concluded that it will be necessary
to employ logical redundancy in high-performance, long-duration spaceborne
computers. A number of redundancy schemes have been described and employed
in practice, but a generally-accepted design art for such redundancy does
not exist. Recent advances in device technology are making it possible
to apply redundancy with much greater effectiveness than in the past; in
particular, microelectronic fabrication has lowered the weight, size,
and cost of logic elements, permitting the use of high orders of redundancy;
and continuing refinements in production have increased the inherent
reliability of components, thus making a given order of redundancy more
effective in extending component life.
* Knowledge as to the effect of removing power from modern digital com-
ponents on their life is not well substantiated. Various authorities
estimate an increase in the mean time to failure of from 50% to 300%.
The components and assemblies produced by employing micro-electronic
fabrication have cost and reliability factors that differ from those of
previous fabrications. Several examples may be given.
First, recent reliability reports S4s indicate that failures in
interconnections are as significant as failures in active elements.
Among various kinds of connections, those within a monolithic circuit may
be substantially more reliable than those between the circuit and the
external connection system, although as connections within arrays are made
more complex (e.g., by using two or more layers of connections) this may
not be so.
A second new factor is standardization. If the logic networks of a
conventionally organized computer are simply partitioned and realized as
monolithic arrays, a great number of different kinds of arrays will result--
perhaps as many as there are arrays. The number of different array types
influences the initial reliability of the arrays and the techniques of
diagnosis and replacement in service, as may be seen by the following
considerations.
Using many array types tends to reduce initial reliability, since it
is generally accepted that the reliability of a product increases with the
accumulated experience in producing and using it. Using many array types
requires that many different sets of diagnostic tests be kept available
within the computer memory. Finally, using many array types reduces the
effectiveness of a set of spare parts, since a given spare may be employed
only in a few positions.
The desirability of using large monolithic arrays of semiconductors
for low weight and high reliability (primarily due to the minimal use of
unreliable kinds of connections), is in conflict with the desirability
of standardizing the arrays, and new schemes of logical organization are
clearly needed to achieve a good balance among the various factors.
Other new criteria for the logical design also apply. Thus, networks
should be designed so that they are easy to diagnose, so that a failure at
a given point does not propagate very far in either direction of signal
flow; so that they are as multifunctional as is practical; and in general,
so that they are well suited to various modes of system redundancy such
as error detection, fault masking, and replacement. The conventional
criterion of minimality of the number of active elements is clearly not
a major one in itself.
On the system level, the key new criteria that have not applied with
great strength in previous computers are autonomy and flexibility. It is
not sufficient, as in usual applications of redundancy, to have errors
indicated, but it is necessary to have the capability to accommodate
them incorporated in the system; furthermore, such accommodation should
be accomplished with great flexibility, both in programming and in hard-
ware, so that the redundancy of equipment is employed to realize the ut-
most in performance.
In summary, the new cost criteria and failure characteristics of
advanced devices and the special logical requirements of fault accommoda-
tion present novel problems and opportunities for logical design. These
problems and opportunities apply both in the realization of existing
error-control techniques and in the design and realization of advanced
error-control techniques.
B. Goals, Methods, and Assumptions of the Study
In this section we state the goals of the study, describe the method
of approach taken, and state the assumptions that were made in the study,
including the scope of the systems of interest and the criteria of evaluation.
1. Goals of the Study
In view of the crucial need for powerful techniques of reliability
design for advanced spaceborne computers, the goals of the study have
been as follows:
(1) to survey the state of the art of logical design of
spaceborne computers as it pertains to the enhance-
ment of reliability; in particular, to examine the
various known techniques so as to determine their
adequacy for the expected mission requirements, and
to determine their mutual compatibility when applied
in a computer system
(2) to conceive and evaluate new schemes of system
design and operation that offer promise of
advancing the state of the art
(3) to recommend further directions of research that
will aid in the improvement of present techniques,
the evaluation and realization of the new schemes
conceived, and the conception of further advanced
schemes.
2. Method of Approach of the Study
a. Survey of Known Techniques
The first task of the study was to survey relevant, known techniques.
The approach taken to this task was to distinguish those sections of a
hypothetical spaceborne computer to which distinctive problems of design
apply, to survey the literature for design techniques appropriate to those
sections, and to evaluate the merits of the techniques that are deemed
most useful for the application.
The major sections of a hypothetical spaceborne computer that were
distinguished were the general logic networks, the arithmetic section,
and the memory system. The reliability techniques that were considered
relevant to the study included the encoding of information for transfer,
storage, arithmetic, and control function and for error control; the
logical structuring and testing of networks; and those aspects of circuit
fabrication that bear on logic design. The literature surveyed included
books, professional journals, conference proceedings, and unclassified
research reports; in addition, several conferences were attended which
were in part or in whole devoted to problems of reliable computer design.*
The evaluation of a given technique was concerned both with the
state of the engineering art for its application and with its intrinsic
or potential value for the application. The state of the engineering
art was taken to include the accuracy and convenience of known methods
of analysis of systems employing the technique, and the difficulty of
* A detailed survey of the literature on the application of redundancy
techniques to reliable computer design is presented in Appendix D of
this report.
applying the technique in practical system design. In some cases where
a promising technique appeared to need further development, effort was
made to solve some of the outstanding problems in order to contribute to
the art and to help assess necessary directions of development. The
choice of criteria of evaluation will be considered in Sec. I-B-3.
b. Conception of New Schemes
The second task was to conceive new schemes of system design
and operation that offer promise of advancing the state of the art.
Serious exploration of the basic concepts of redundancy design dates back
at least to the work of von Neumann in 1952; 318 the concept of a highly
reconfigurable modular computer dates back at least to the work of Holland
in 1959. 155 Fundamentally new notions (e.g., Moore and Shannon's scheme
for recursive construction of relay nets 214 and Pierce's scheme for
adaptive voting 240) have been rare. New schemes have generally been
ingenious implementations of known principles, or means for exploiting
some special circuit characteristics such as asymmetries in fault types.
Some new schemes of this type were developed during the study for the
realization of adaptive circuits using only digital switching elements.
Although fundamentally new schemes are to be desired, the study
has revealed that there is a substantial lack of design knowledge appro-
priate to the practical realization of a computer having a high degree of
flexibility and autonomy and employing modern device fabrication. Such
realization calls for the creation of particular schemes for the overall
distribution of functions in such a system, for the realization of
particular functions, and for the integration of various kinds of error-
control processes. A number of such schemes will be described in this
report.
c. Recommendations for Directions of Further Research
In the course of the study a number of significant problems
were uncovered. Some were concerned with the advancement of a known
technique--e.g., improvement of the facility and accuracy of the analysis
and synthesis of "restoring" type redundancy (Sec. II-A-2-a); some with
new kinds of networks--e.g., "commutation networks" (Sec. III-C-2); and
some with basic design problems, such as the incorporation of error-
control criteria (e.g. ease of diagnosis of faults) in general network
synthesis. This report presents the results of some original work on
these problems. Recommendations for future work are also included in
this report.
3. Technical Assumptions of the Study; Definition of Terms
In this part we shall discuss the technical assumptions that guided
the study; in particular, the scope of the design techniques considered,
the choice of device technologies, the level of reliability of interest,
and the criteria for evaluating reliability techniques.
a. Definition of Terms
We shall first define a number of terms that will be used
repeatedly in the report; we have attempted to be consistent with estab-
lished usage. The definitions are as follows:
Fault: A physical condition of a component that pre-
vents the system of which it is a part from
completely performing its specified function.
A system in such a state will be called
faulty; otherwise it will be called perfect.
Error: An incorrect information state (this state can,
of course, appear at the output of a perfect
network as a result of an erroneous input).
Fault Masking: A property of a system such that it is
perfect even though some of its subsystems may
be faulty (see Fault Accommodation).
Fault Detection: Determination as to whether or not a
system is faulty.
Fault Location: Determination as to which subsystem in
a system is faulty.
Fault Characterization: Determination of the subset of
functions of a system that are improperly per-
formed. For each output, the precise characteri-
zation will be that subset of inputs resulting
in erroneous output. A simpler characterization
might be simply a distinction as to which outputs
of a system are perfect.
Fault Diagnosis: Techniques for fault detection, location,
or characterization.
Fault Correction: Alteration of the physical condition
causing a fault so as to restore the system to
perfect operation (this could include changing
operating conditions such as voltages and
frequency).
Fault Accommodation: Essentially the same as fault
masking, but masking usually refers to an
instantaneous process, while accommodation may
also include a sequential process.
Error Detection: Determination that the information
state of a signal or a set of signals is in
error. This determination may be a computation
on the set of signals itself, or in reference
to another set of signals from which the set
is derived.
Error Correction: A computation upon a set of data,
perhaps including other related data, that
corrects an error in the set.*
Error Accommodation: (not a widely used term): Compre-
hends both error correction and alteration of
system behavior so as to achieve some modified
objective.
Error Control: A general term, including error accommo-
dation and fault accommodation.
Static (or Passive) Error Control: Error-control pro-
cesses in a system that do not involve changes
in the functions of its subsystems or their
interconnections.
Dynamic (or Active) Error Control: Error-control pro-
cesses in a system that involve changes in the
functions of its subsystems or their inter-
connections.
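To make the distinction between fault masking and error detection concrete, the following minimal sketch models signals as bits. The three-copy majority vote and the even-parity check are standard textbook devices chosen here for illustration; the function names are hypothetical and are not taken from this report.

```python
def majority(a, b, c):
    """Fault masking: the output stays correct if at most one copy is wrong."""
    return (a & b) | (a & c) | (b & c)


def even_parity_bit(bits):
    """Parity bit that makes the total number of 1s even."""
    return sum(bits) % 2


def parity_error(word):
    """Error detection computed on the set of signals itself.

    `word` is the data bits followed by their even-parity bit. Any odd
    number of flipped bits is detected, but nothing is corrected.
    """
    return sum(word) % 2 != 0


def duplicate_mismatch(signals, reference):
    """Error detection by reference to a second set derived from the same source."""
    return list(signals) != list(reference)


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    word = data + [even_parity_bit(data)]
    print(parity_error(word))          # no bit flipped
    word[1] ^= 1                       # inject a single-bit error
    print(parity_error(word))          # the error is now detected
    print(majority(1, 1, 0))           # one faulty copy is masked
```

In the terms defined above, `majority` is a static (passive) mechanism, since nothing in the structure changes when a copy fails, whereas the two detection functions only report an error and leave any accommodation to some other process.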
b. Scope of Design Techniques
The scope of the error-control techniques studied was centered
about the logical behavior of a computer and its component subsystems.
Questions of good design practices for devices and circuits, on the one
hand, or on the other hand for computational programs, were excluded.
* The distinction between error correction and fault masking is usually
clear, but it is often dependent upon how a system is considered to be
partitioned.
However, the interactions between logical and circuit design, and logic
and program design, that affected the implementation of an error-control
technique were of great interest. Thus it is of interest to determine
what special constraints upon logical design result from limitations on
devices and programs, and also what special device types or program
functions would aid the effectiveness of a logical design scheme.
c. Device Technologies
Because of the requirement for high performance, it was assumed
that the major device technology employed will be that based on modern
high-speed semiconductor devices. Special emphasis was given to the
use of integrated circuits, and in particular it was assumed that the
use of large-array monolithic circuits would be very significant in the
realization of future spaceborne computers.
The study of memory systems was primarily on the logical level,
so that choice of device type was not important. However, it was assumed
that the memory performance is consistent with high-speed bit-parallel
computation. In the report on that study (Appendix A) it is noted that
monolithic semiconductor memory arrays have some attractive features for
the spaceborne computer application.
A study was also made of the possible benefits of using magnetic
logic devices for special functions within a spaceborne computer (Appendix C).
C. Criteria of Performance
The major criteria used in evaluating an error-control technique
were the increases in the measures of reliability, weight, volume, and
power consumption of a functional unit employing that technique, relative
to the measures of those parameters in a unit having the same processing
function but without special error-control features.
There are a number of factors that complicate both the absolute
and relative estimates of these measures. In the case of the reliability
measure, modern logic components have a high reliability but the time-
failure distribution of a given product or assembly method is usually
not known; hence it is extremely costly or in many instances impossible,
to estimate absolute reliability. Since it has been established that
redundancy is essential to the task, and since one of the ultimate goals
of the study is to find the most effective means for employing redundancy,
it is sufficient to be able to compare alternative schemes on a relative
basis. Thus the major reliability criterion used in the study was compari-
son of the reliabilities of alternative networks as analytic functions
of the reliabilities of their components, where reliability is defined,
as usual, as the probability that a component remains perfect for a speci-
fied operating time. In some examples, the familiar exponential failure-
distribution law was assumed, and various published failure-rate values
were employed, in order to give some engineering "feeling" of real time.
The resulting probability and time values should be taken very cautiously.
A comment on time measures of reliability is appropriate at this
point. A common time measurement is the "mean time to failure" (MTF)
which in the literature is almost invariably computed on the assumption
of an exponential failure law. It should be noted that for such a failure
law the reliability of a system for the time base equal to the MTF is 1/e.
This is too small a value for an expensive mission such as a deep-space
probe; hence if MTF is used as a measure, values substantially greater
than the mission life must be considered. Comparisons of schemes on the
bases of their MTF values must consequently deal with time values that
have only weak intuitive significance. Thus, for
$1 - P(t) = 1 - \exp(-T_{\text{mission}}/\text{MTF})$ very small [where $P(t)$
is the probability of perfect system operation at time $t$], MTF is
approximately $T_{\text{mission}}/[1 - P(t)]$. For example, for $P(t) = 0.999$,
$\text{MTF} = 1000 \times T_{\text{mission}}$. A measure having
much greater engineering significance, proposed by Knox-Seith 164 and
Angell s, is the "useful life," defined to be the longest mission time
for which the probability of failure is no greater than A, where A is
usually much less than one. Not only does this measure have greater
heuristic significance for the missions of interest, but it is also
more sensitive to variations in redundancy than the MTF measure. This
measure is discussed further in Sec. II-A-2-a-1).
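The relation between MTF and mission reliability, and the contrast between MTF and useful life, can be checked numerically. The sketch below assumes the exponential failure law used in the text. The triplicated unit with an ideal majority voter, whose reliability is 3R^2 - 2R^3 for unit reliability R, is a standard textbook case introduced here only for comparison; it is an assumption of this example and not a design from the study. All function names are hypothetical.

```python
import math


def simplex_reliability(t, mtf):
    """Reliability of a single unit under the exponential failure law."""
    return math.exp(-t / mtf)


def tmr_reliability(t, mtf):
    """Reliability of three identical units with an ideal majority voter (assumed)."""
    r = simplex_reliability(t, mtf)
    return 3 * r**2 - 2 * r**3


def required_mtf(t_mission, p):
    """Exact MTF a single unit needs to reach reliability p at time t_mission."""
    return -t_mission / math.log(p)


def useful_life(reliability, mtf, max_failure_prob):
    """Longest t with 1 - reliability(t, mtf) <= max_failure_prob, by bisection.

    Assumes reliability falls monotonically with t and that the answer
    lies below 10 * mtf.
    """
    lo, hi = 0.0, 10.0 * mtf
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 1.0 - reliability(mid, mtf) <= max_failure_prob:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    print(simplex_reliability(1.0, 1.0))                  # reliability at t = MTF
    print(required_mtf(1.0, 0.999))                       # MTF in units of mission time
    print(useful_life(simplex_reliability, 1.0, 0.001))   # single unit
    print(useful_life(tmr_reliability, 1.0, 0.001))       # triplicated unit
```

Under these assumptions the triplicated unit has a useful life roughly 18 times that of the single unit for a failure probability of 0.001, even though its MTF is only 5/6 that of the single unit. This illustrates why useful life responds more strongly to redundancy than MTF does.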
Measures of weight and volume are complicated by packaging consider-
ations, since the weight of the active components in future technologies
may be considered to be almost negligible compared to that of the pack-
aging and interconnections. For nonintegrated circuits, packaging costs
are approximately proportional to component count; but for integrated
circuits, packaging costs depend upon the component count and the number
of components that may be incorporated within a package. The latter
number may be expected to increase within a range of two orders of magni-
tude over present integrated-circuit values (which typically provide
the equivalent of one flip-flop per package). Hence comparisons of the
weight and volume of realizations of alternate schemes must consider the
effects of a rapidly developing technology. Thus, if one scheme lends
itself better to larger array realization (by reason, for instance, of
greater modularity), the packaging cost may actually be less than for a
scheme whose count of logic circuits is lower.
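A small arithmetic sketch shows how a scheme with more logic circuits can still need fewer packages. The circuit counts and package capacities below are invented for illustration, and the simple ceiling model ignores interconnection and pin-count limits, which in practice would also constrain the partitioning.

```python
import math


def package_count(circuits, circuits_per_package):
    """Number of packages needed if every package holds up to a fixed number of circuits."""
    return math.ceil(circuits / circuits_per_package)


if __name__ == "__main__":
    # Hypothetical scheme A: fewer circuits, but irregular logic that
    # only allows small packages.
    print(package_count(1000, 10))
    # Hypothetical scheme B: 50% more circuits, but modular enough to
    # be realized in larger arrays.
    print(package_count(1500, 50))
```

With these assumed figures, scheme B uses 30 packages against 100 for scheme A, so a packaging-dominated weight estimate would favor the scheme with the higher circuit count.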
Measures of power are somewhat simpler. The obvious factors of
significance are the number of logic circuits and the fraction of those
that may be in a power-on status. One indirect factor that is sensitive
to logical organization is the degree to which circuit speed may be ex-
changed for number of parallel-acting circuits. Thus if such an exchange
may be made in direct ratio, it would permit a reduction in total power
consumption to be realized by the use of devices having low values of
switching time--power consumption product.* Increases in parallelism
adequate for significant power savings may not be feasible because of
inherent serialism within the computations of interest, but the possibil-
ity of such savings should not be overlooked.
* To illustrate, if in a system with n circuits, each with circuit
switching time t and power consumption p, computation speed s is pro-
portional to t/n, then total power = $pn \propto pt/s$. For a given system
speed, devices having lower values of pt would take less total power.
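The footnote's relation can be evaluated directly. The sketch below takes the stated proportionality at face value, writing n = k t / s with a hypothetical constant k, so that for a fixed s the total power p n is proportional to the product p t. The device figures are invented for illustration and carry no particular units.

```python
def circuits_needed(t, s, k=1.0):
    """Number of circuits n implied by s = k * t / n (relation taken from the footnote)."""
    return k * t / s


def total_power(p, t, s, k=1.0):
    """Total power p * n = k * p * t / s."""
    return p * circuits_needed(t, s, k)


if __name__ == "__main__":
    s = 1.0  # the same system-level figure s for both device types
    fast = {"p": 20.0, "t": 10.0}    # hypothetical fast, power-hungry circuit
    slow = {"p": 1.0, "t": 100.0}    # hypothetical slow, low-power circuit
    for name, dev in (("fast", fast), ("slow", slow)):
        print(name, circuits_needed(dev["t"], s), total_power(dev["p"], dev["t"], s))
```

In this example the slow device needs ten times as many circuits, yet its lower p t product halves the total power. That saving is available only to the extent that the computation can actually be spread over the additional parallel circuits.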
D. Organization of the Report
Chapter I has been a review of the operational and technical problems
of realizing reliable spaceborne computers and an explanation of the method
of approach of the present study. The technical analysis of reliability
techniques is presented in Chapter II, which is concerned with techniques
for fault masking and error detection, and Chapter III, which is concerned
with techniques for automatically controlled fault diagnosis and recon-
figuration.
The techniques of Chapter II are moderately well included in the pres-
ent state of the art of reliability design; but, as is seen, by no means
are they adequately understood. Most of these are forms of passive error
control, but several elementary active schemes are also included. For
example, the use of codes for error detection is a component of dynamic
error control, but it is included in Chapter II because it is a fairly
well-studied technique, and because the coding approach is helpful in
describing certain fault-masking schemes.
The techniques of Chapter III are all components of dynamic error
control, and they are all either new techniques, or practical implementa-
tions of hitherto "ideal" schemes.
In Chapter IV we present the conclusions of the study and recommenda-
tions for a program of further research.
There are four appendices. The first is concerned with the applica-
tion of error-control techniques to memory systems. The second is con-
cerned with the design of modular distributed power supplies in relation
to the modularization of the logic of a computer. The third considers
the possible applications of magnetic-logic devices. Since such devices
are slow, the study concentrated on those functions for which their use
would not substantially slow down a computer's basic cycle of operation.
The fourth appendix is an extensive guide to the literature that is
directly pertinent to the design of reliable spaceborne computers. It
is intended that this guide be directly usable as an introduction to the
literature; hence there is some overlap between it and the comments on the
literature found in the main text.
II TECHNIQUES OF LOGICAL DESIGN FOR FAULT MASKING AND ERROR DETECTION
This chapter is concerned with techniques of logical analysis and
design that are needed for the realization of computer functions in which
the control of errors is static; i.e., in which the error state of the
computer is a subject of concern only to the local areas.
In this chapter we will discuss refinements of basic techniques that
are well known. Our goal is to distinguish those methods which are parti-
cularly suited for the achievement of high reliability for specific
functions, at minimum cost, and also to indicate algorithm-like methods,
wherever possible, for the optimum application of the techniques.
Section II-A is concerned with fault masking as applied to general logic
functions, both combinational and sequential. A comprehensive review is
first presented of the known fault-masking techniques, followed by a
detailed discussion of the voting-type restoration scheme. Here several
techniques are presented concerning the analysis of arbitrary restored
networks; in addition, several novel implementations are presented of the
powerful adaptive-restoration scheme. Several schemes are discussed for
detecting failures in sequential networks.
The chapter concludes with Sec. II-B which discusses the status of
error-correction coding techniques for passive error control, in parti-
cular as applied to arithmetic and storage operations.
In the attempt to distinguish the optimum applications of the
logical design techniques many problems were uncovered, both analytical
and of an engineering nature. In some instances detailed solutions were
examined, while in other cases rough designs were presented with con-
jectures relating to the optimum solution, thus providing a framework
for future research.
A. Fault-Masking Techniques for General Logic Functions
In this section we consider logical techniques for the masking of
faults in networks realizing general logic functions. In the first part
we review the most significant known techniques. In the second part we
examine two of the most important techniques in detail--the multiple-line
voting scheme and the combined fault-masking and replacement scheme--in
order to assess the state of the art of their application. Also in the
second part we examine some realizations of voting networks. The first
two parts are concerned essentially with combinational networks; in the
third part we examine the state of the art of techniques for error control
for sequential networks, with special emphasis on error detection.
1. Review of Significant Techniques
In this part we attempt to assess the applicability to spaceborne
computers of the most significant known techniques for designing logic
networks that can mask internal faults. A number of the most attractive
techniques are distinguished, and reference is made to the sections of
this report that consider the techniques in greater detail.
Many schemes have been described for designing logic networks having
fault-masking capability. The well-known schemes exhibit a great deal of
ingenuity, but only a few have combinations of features that make them
practical for application to present-day digital networks.
The tree of Fig. II-A-1 has been constructed in order to display the
most significant techniques in an orderly perspective. Only those tech-
niques have been included that have been described in sufficient detail
that functional networks could be designed, and for which circuit and
design techniques are currently available. Several other interesting
schemes, for which some circuit or logical action has been hypothesized
but for which no practical designs have been given, will be mentioned
separately.
* Several reviews that may be of interest to the reader are those by
Teoste, 300 Pierce, 243 and Garner et al. s2
[Figure II-A-1 is a tree diagram. Recoverable node labels: redundancy
techniques; circuit, logical, program; fixed logic, variable logic;
masking, switch-over; combinational, sequential; modular, recursive;
simple replication, coded redundancy; fixed restoration, adaptive
restoration; redundant output, non-redundant output; distinct
restoration, integrated restoration. Schemes marked with an asterisk:
Moore-Shannon, Amarel, Urbano, multi-line voting, Tryon interwoven,
von Neumann, Teoste, Liu, dePian, Pierce, Hamming, Peterson, Armstrong.]
FIG. II-A-1 A CLASSIFICATION TREE FOR REDUNDANCY TECHNIQUES
The first (top) level distinguishes these major classes of redundancy
techniques: circuit, logical, and programming. The standard method of
circuit redundancy is the replacement, within a circuit, of an unreliable
circuit component (e.g., a diode or a resistor) by a cluster of components
whose net circuit impedance or transfer function changes within limits that
are acceptable for correct circuit operation when some number, or fewer,
of the components fail. General methods for organizing such networks
have been given by Shannon and Moore. 214 In practice, this approach is
useful for special circuits within a computer, in which a given component
is under heavy electrical stress (e.g., in power supplies or high-current
pulse drivers), or in which the circuit has very generous operating margins.
It has been found that application of the technique to low-level logic
circuits or memory sense amplifiers results in circuits that have sub-
stantially poorer margins with respect to component aging, noise, and
variations in operating voltages and temperature than equivalent non-
redundant circuits; hence there is likely to be a reduction in reliability
if the technique is applied to low-level circuitry. Programming redundancy
is a very important technique for spaceborne missions; it is discussed in
Sec. III-A of this report. The remainder of the tree is concerned with
techniques of redundancy at the logic level.
The second level distinguishes techniques in which the logical
structure of a system is variable or fixed. Provision of a variable
structure in a system that is part of a larger system increases the
flexibility with which the larger system may accommodate faults; thus,
not only may the number of tolerable fault conditions be increased, but
the system may permit tradeoffs in performance that may enable critical
functions to be performed. This technique is discussed at length in
Chapter III of this report.
The third level distinguishes techniques of fixed logic function in
which faults are accommodated by masking or by switchover. Automatic
switchover of spare parts to replace faulty parts is commonplace in
general electrical and electronic practice, and is an established practice
in space vehicles. 265 Although many computer-oriented reliability
analyses have been made of abstract models of this method (e.g., by
Kruus 171 and Muth 225), practical applications to computers have been rare.
Some recent examples are the Bell System ESS-I Central Switching System 39
and the memory system of the Saturn Guidance Computer 61 (anticipated by
Kemp, Is7), both of which use duplex switching. Recently, descriptions of
techniques for the practical logical design of computers, in which switch-
over would be applied at relatively low levels within the computer, are
appearing in the literature (e.g., Terris 303 and Agnew et al. 1). In this
report, techniques relating to this method are discussed in Sec. II-A-2-b
and in Chapter III. All the remaining schemes are forms of fault masking
within a network of fixed structure; i.e., at no time may a part of the
network be blocked from contributing to the network output.
The fourth level distinguishes techniques that deal with networks
essentially as sequential or as combinational networks. Although a com-
puter is ultimately a sequential network, it is helpful to approach the
design of various subsystems with emphasis either on their state-sequence
behavior or on the classes of combinational functions that are realized
in the network. Error-control techniques for sequential networks are
discussed in Part 3 of this section.
The fifth level distinguishes techniques in which the process for
the construction of a fault-masking network is either recursive or
modular. In a recursive process, first employed by Shannon and Moore, a
given subsection of a network is replaced by a cluster of elements in a
way that preserves the structure of the network, and this process is
repeated for all subsubsections of the subsection, until a desired im-
provement in reliability is achieved. Amarel and Brzozowski 6 described
an approach in which the unit of recursion is a single gate (producing
so-called "triangular nets") and Urbano sx2 extended their scheme by
making the unit of recursion a network (producing so-called "iterated
neural nets"). These approaches are of considerable theoretical interest
in relation to the question of limits to achievable reliability, but the
schemes lead to networks that are so large, or that have such drive
requirements that they may be considered impractical for present-day
realizations. Modular schemes proceed by adding functional networks to
one or more nonredundant functional networks, and combining their outputs
in some special manner. The remaining schemes to be discussed here may
all be considered to be modular.
The sixth level distinguishes between simple replication and coded
redundancy. In coded redundancy, the functions provided by the resultant
nets are not identical to those of the original nets, but are related in
ways that may usually be described conveniently in terms of a code. Code
redundancy may be more or less effective than replication, depending upon
the logical functions performed. Coding redundancy has been demonstrated
to be either superior to or competitive with replication redundancy for
special functional areas within a computer that are characterized by a
high degree of uniformity of structure. These are data storage, arith-
metic, and analog-digital conversion. Methods for these functions have
been highly developed, following the early work on the different subjects
by Hamming, 117 Peterson, 23s and Kautz, TM respectively.* Recent
developments are discussed at some length in Sec. II-B of this report. Coding
methods for sets of general, complex logic functions have been described
by Lofgren, TM and presented (more comprehensively) by Armstrong. 10 The
value of these methods for general logic functions has not been demon-
strated, and has been doubted by several authorities (e.g., Pierce, 243
pp. 132-145). The two factors that lead to low efficiency are the need
to apply fault masking to the decoding logic and the possible high cost
of producing the redundant checking functions independently of the non-
redundant functions. Although the approach is not attractive as a general
method, logical designers would be well advised to be aware of its pos-
sible value for particular applications.
* In citing early authorities, we attempt to specify those references
which first developed a concept with some generality; these have often
been preceded by disclosures of particular schemes that embody the
general principles.
The seventh level distinguishes between schemes in which the
restoration of the desired function from the possibly imperfect set of
functions produced is performed according to an adaptive rule or a fixed
rule. The use of an adaptive rule was proposed by Pierce, who
demonstrated a rule which, if implemented reliably, would achieve higher
reliability than a fixed rule of the same order of redundancy. The method
has not been applied, because of the unavailability in modern technology
of circuit elements that would implement the rule reliably. With the
miniaturization of logic elements, it appears to be increasingly feasible
to implement an adaptive rule in an all-digital circuit. Several schemes
for such implementation are described in Sec. II-A-2 of this report.
The remaining schemes are distinguished by the criteria of levels
8 and 9. These are, respectively, whether or not the fault-masked
network produces redundant output functions, and whether the restoration
function and the basic logic functions are accomplished in distinct logic
networks or are integrated in a single network.
The scheme of Liu and Liu lss (with its derivatives) is the only
example found in this study of a general treatment of single-output,
integrated restoration. It employs a network of redundant-input threshold
logic elements. In these elements, an output is produced if the linear
sum of weighted inputs exceeds a threshold, and the weights and the
threshold are chosen so that if an input is in error, the sum is still on
the correct side of the threshold. A serious disadvantage of this scheme
is that the sum of the weights required for the many inputs to an element
is so great, even for simple functions, that presently designable threshold
logic circuits would have unacceptably low margins. The scheme as it
stands must thus be considered impractical. It is clearly possible to
transform the threshold-element networks into networks of simpler
(e.g., NOR) elements, but it is not known how such networks would compare
in size with those realized by nonintegrated schemes. This possibility
has not been investigated during the present study, but some consideration
is recommended. De Pian and Grisamore s9 have noted the possible merits
of this approach, and have given an illustration that is too simple to
permit a general evaluation.
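The action of a redundant-input threshold element can be illustrated with a minimal sketch. The example below is not taken from Liu and Liu; it assumes a two-variable AND function, three copies of each input, unit weights, and a threshold of 4.5, all chosen only for illustration. The function names are hypothetical.

```python
from itertools import product

def threshold_element(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    return int(sum(w * x for w, x in zip(weights, inputs)) > threshold)

def redundant_and(x1_copies, x2_copies):
    """Two-variable AND realized by one threshold element with triplicated inputs.

    The unit weights and the threshold of 4.5 are illustrative assumptions,
    not values taken from the report.
    """
    inputs = list(x1_copies) + list(x2_copies)
    return threshold_element(inputs, [1] * len(inputs), 4.5)

def masks_single_input_error():
    """Check every input pair against every single erroneous input copy."""
    for x1, x2 in product((0, 1), repeat=2):
        clean = [x1] * 3 + [x2] * 3
        for bad in range(6):
            faulty = clean.copy()
            faulty[bad] ^= 1  # flip one of the six input copies
            if redundant_and(faulty[:3], faulty[3:]) != (x1 & x2):
                return False
    return True
```

With these assumed values the correct sum is 6 when both variables are 1 and at most 3 otherwise, so a single erroneous copy leaves the sum on the correct side of 4.5. Even this small case needs six weighted inputs, which hints at the margin problem noted above.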
Networks built according to the schemes of von Neumann ("single-
vote-taker redundancy") 318 and Teoste ("gate-connector" redundancy) 300
have single outputs and distinct sections for corrections and for the
basic logic. The scheme of Teoste may be considered as an adaptation of
the von Neumann scheme for even-order redundancy. Although its
description supposes the availability of three-terminal branch-type
(relay-like) logic for the restoration circuit, there is a simple gate-type
equivalent. In the very well-known von Neumann scheme, the restoration
logic is the majority function. A number of significant, good features
of these schemes may be enumerated as follows.
(1) The scheme is equally effective for both 0 → 1 and 1 → 0 errors.
(2) The correction logic may be realized by the same kind of digital
circuitry as the functional logic; i.e., no special elements are
needed.
(3) The size of the functional module is unlimited; i.e., it may
range from a single gate to a whole computer.
(4) No modifications are required to the method of realizing the
functional logic, either in network structure or in factors of
element usage, such as fan-in or fan-out.
(5) The scheme is extendable to high orders of redundancy, and a
system may employ different orders at various points without
causing any special problems in design.
A significant limitation of the scheme is its sensitivity to faults within
the restoring network. For the ultimate output of a computer, if there is
a single receiver, the reliability of the element producing the output
must affect the reliability of the output; also, there is no benefit in
cascading vote-takers upon vote-takers. Hence the reliability of the
restoring network is an ultimate limit to the reliability of the system.
When the receiver of a system output may itself be replicated, the
limitation due to the restorer can be reduced significantly, as in the
following schemes.
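The single-vote-taker arrangement described above can be sketched as follows. This is only an illustration under an assumed fault model (a faulty replica delivers the complement of the correct output); the names `majority` and `restored_output` are hypothetical.

```python
def majority(bits):
    """Majority function over an odd number of replicated binary signals."""
    if len(bits) % 2 == 0:
        raise ValueError("majority restoration needs an odd number of inputs")
    return int(sum(bits) > len(bits) // 2)

def restored_output(module, x, faulty=frozenset(), copies=3):
    """Run `copies` replicas of `module` on x and vote on their outputs.

    Replicas whose index is in `faulty` are modeled as producing the
    complemented output (an assumed, worst-case fault model).
    """
    outputs = [module(x) ^ (i in faulty) for i in range(copies)]
    return majority(outputs)
```

For example, `restored_output(lambda x: x & 1, 5, faulty={1})` still yields the correct value 1, whereas two faulty replicas out of three defeat the vote. The sketch treats the voter itself as perfect, which is exactly the limitation discussed in the text.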
The multiple-line voting scheme (also called multiple vote taking,
triple modular redundancy, etc.), derived by a number of workers from
von Neumann, 318 and the scheme of Tryon S°s (also called quadded logic or
interwoven logic) employ a redundant encoding on the output of a
functional network. Thus, a number of versions of a logic function are
generated, and all of these are employed as inputs to successive logic
networks (the order of redundancy of output lines is often made the same
as the order of redundancy of the functional modules, but this is not
essential). This approach is beneficial because it reduces the dependency
of the system's reliability upon the reliability of the elements in the
restoring circuit.
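A small sketch may help to show why replicated voters reduce the dependence on any one restoring element. It assumes triple replication, a one-bit inverter as the functional module, and faults that complement the affected signal; all names are hypothetical.

```python
def majority(bits):
    """Majority of an odd number of binary signals."""
    return int(sum(bits) > len(bits) // 2)

def multiple_line_stage(module, lines, bad_modules=(), bad_voters=()):
    """One stage: three voters restore the incoming lines, then three
    replicas of `module` each consume the output of their own voter.

    `lines` holds the three redundant input lines. A faulty voter or module
    (assumed fault model) produces the complement of the correct value.
    """
    voted = [majority(lines) ^ (i in bad_voters) for i in range(3)]
    return [module(voted[i]) ^ (i in bad_modules) for i in range(3)]

def invert(x):
    return 1 - x

def run_cascade(x, faults):
    """Cascade of stages; `faults` lists (bad_modules, bad_voters) per stage."""
    lines = [x, x, x]
    for bad_modules, bad_voters in faults:
        lines = multiple_line_stage(invert, lines, bad_modules, bad_voters)
    return lines
```

In `run_cascade(1, [((), (0,)), ((), ()), ((), ())])` the faulty voter in the first stage corrupts only one of the three lines, and the voters of the second stage restore it. A voter fault and a module fault on different lines of the same stage, however, corrupt two lines and defeat the next vote.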
Tryon's scheme is effective and simple to apply (see Pierce, 243
Chapter V). The following are some of its weaknesses:
(1) Fan-in and fan-out requirements for all logic elements are
increased significantly; e.g., for fourfold redundancy both
fan-in and fan-out are doubled. This tends to decrease the
basic reliability of the elements, and if more elements are
used to handle the increased input and output loading, new
sources of error are introduced. Moreover, for quadding the
redundancy in elements is approximately eightfold.
(2) Cross connections among logic elements (which are essential for
the fault-masking action) must be made at every level of logic.
This tends to increase the number of interconnections and the
weight and volume of the interconnection scheme. Also, it does
not permit variation in redundancy ratio except by changing
whole orders of redundancy.
(3) Reliability analysis of the networks constructed is extremely
complex; hence it is difficult to design for good reliability
improvement under weight and power constraints.
(4) Design of networks for ease of preflight testing* appears to be
difficult; for example, in order to isolate a set of elements
for testing, all untested gate outputs must be set to a constant
1 or a constant 0, depending on the position of a gate within a
network.
Because of these weaknesses, the Tryon scheme has not been employed
extensively in practice. It is an elegant scheme of substantial theoreti-
cal interest, and it is almost practical, but it appears to be generally
inferior to the multiple-line voting scheme.
* Such testing is desirable in order to expose faults that would
otherwise be masked, so that the computer may be started in as perfect an
initial state as possible.
The multiple-line voting scheme has all the advantages enumerated
for the single-line voting scheme, in addition to the reduction of the
sensitivity of the system to vote-taker unreliability. Also, preflight
testing is reasonably straightforward, given independent control of the
power supply for the individual ranks of logic. This scheme appears to
be the most attractive way of accomplishing fault masking for general
logic functions presently known. Furthermore, it is applicable to
networks for special functions such as arithmetic and storage, and it is
in fact quite competitive with schemes based on coding. A number of
important problems of analysis and optimum design remain to be solved.
These are discussed in detail in Sec. II-A of this report.
Several concepts for fault masking that have been described in the
literature have been omitted from the above classification and discussion.
Some of these are the multivalue logic schemes of Lowenschuss Is5 and the
Transor and Quantile schemes by Mann 195. These have been omitted because
they require some hypothetical circuit element or circuit-design scheme,
the feasibility of which has not been evaluated. It is not known that
any of these schemes have been developed since their description.
Inspection of the classification tree of Fig. II-A-1 suggests a
number of possible variations in the approaches of adaptive restoration
and coded redundancy, based on the use of either redundant outputs or
integrated restoration or both. Some of these variations may provide
bases for useful schemes.
It should be reiterated that all the schemes described here
accomplish fault masking locally within a system; i.e., there is no provision
for reallocation of redundant equipment among different functions. It
should be noted that those schemes in which basic logic function and
restoration are distinct permit an easy augmentation in logic to provide,
concurrently with the fault-masked data output, a separate signal that
may indicate a fault condition; and further, to provide indication as to
which replica is faulty. This feature is clearly useful in a
replacement-switching system.
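The augmentation mentioned above, a fault signal produced alongside the masked output, might look like the following sketch. It assumes an odd number of replica outputs and a perfect comparison circuit; the function name and the form of the returned triple are illustrative choices.

```python
def vote_with_diagnosis(replica_outputs):
    """Return (voted_value, fault_detected, suspect_indices).

    A replica is reported as suspect when its output disagrees with the
    majority value. The replica count is assumed to be odd.
    """
    voted = int(sum(replica_outputs) > len(replica_outputs) // 2)
    suspects = [i for i, y in enumerate(replica_outputs) if y != voted]
    return voted, bool(suspects), suspects
```

A replacement-switching controller could use the list of suspect indices to select which replica to switch out, while the voted value continues to serve as the data output.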
2. Combinational Techniques for Fault Masking in General Logic
Networks
In this section we discuss techniques for the design of networks
for realizing general logic functions that can mask internal faults.
The techniques to be considered do not depend on the history of the
network, so although the network may contain sequential elements, it is
convenient to visualize the networks to be discussed in this section as
strictly combinational networks. Fault-masking schemes that are based
on the sequential nature of a network are discussed in Sec. II-A-3. In
part a we review the state of the art for the analysis and design of
multiple-line voting-restoration networks, which is deemed to be the most
important available technique. A number of improvements to existing
methods of application are offered, and a number of suggestions are made
for further points of improvement. In part b we consider several new
schemes for the combination of fault-masking and switch-over redundancy.
Detailed schemes for networks having such features are presented, and a
number of problems that require further development are indicated. In
part c we present a number of particular designs for restoring elements;
included are single-output majority-function nets which are suitable for
passive fault masking, and multiple-output threshold-function nets,
which are useful in schemes for combined fault masking and switch-over.
a. Techniques for the Design of Multiple-Line Fault-Masking
Networks
i) Introduction and Summary of Prior Work
The preceding discussion has served to introduce the
various known techniques of digital logical design for error detection,
fault masking, and repair.
In the present section a detailed description will be
presented of the passive fault-masking technique wherein redundant
circuits are used in connection with selectively placed restoring organs
(voters) throughout the system. Our ultimate aim in this study is to
distinguish those portions of a spaceborne computer which can most benefit
from the application of the restoration technique, and also to provide
computer designers with an indication of the reliability improvement
that can be anticipated and specific rules for designing redundant net-
works.
Since von Neumann's initial paper proposing the restoration
scheme, many subsequent papers have appeared discussing various aspects of
this technique. Each of these studies can be considered as relating to
one or more of the following three categories:
(a) Qualitative description of the advantages of using
redundancy and restoring organs. 318, Is8, 33, 333
These studies were primarily concerned with
establishing the basic theory embodying the restoration
scheme, and also with distinguishing the overall
improvement in reliability for simple digital models
incorporating restoration.
(b) Quantitative discussion of simple models. These
studies were concerned with the extension of previous
results to provide a quantitative evaluation of the
improvement in reliability which can be achieved with
the restoration scheme, and also with methods of
determining for a given system "cost," the optimum
allocation of redundancy, i.e., the evaluation of the
order of replication and the placement of voters so
as to minimize the probability of system failure.
Unfortunately these studies pertained only to simple
models of digital systems--namely, the visualization
of a computer as a cascade of single-input, single-
output blocks, Is4' i_5 or as a tree network of
double-input, single-output blocks, s9
(c) Analyses of complex system models. The recognition
of the shortcomings of the simple cascade or tree
models led to the search for techniques permitting
the analysis of arbitrary restored systems. These
studies culminated in Monte Carlo simulation programs
for the analysis of specific systems vs, 188, 141 and
also in an approximate analytical approach based upon
determining the set of network cut-sets which when
individually faulty will result in system
failure. 140, 141 With either analysis technique, it
is possible to determine the optimum placement of
voters (given a maximum permitted number of voters)
by, in effect, evaluating the probability of system
failure for each possible set of voter locations,
and then choosing that set which minimizes the failure
probability. A dynamic programming approach 272 to
the voter placement problem has been proposed along
with several other approaches TM which are not as
lengthy as the exhaustive procedures, although their
application has not yet been established.
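The exhaustive procedure mentioned in item (c) can be sketched for the simple cascade model. The sketch below is an illustration only: it assumes triple replication, independent failures, a stage that fails when two or more of its three voter-plus-blocks chains fail, and a voter set that always precedes the first block. None of the cited programs is reproduced, and all names are hypothetical.

```python
from itertools import combinations
from math import prod

def stage_failure(q_blocks, q_voter):
    """Triple replication: a stage fails when two or more of its three
    voter-plus-blocks chains fail."""
    x = 1.0 - (1.0 - q_voter) * prod(1.0 - q for q in q_blocks)
    return 3 * x**2 - 2 * x**3

def cascade_failure(q_blocks, voter_positions, q_voter):
    """Failure probability of a cascade whose stages start at voter_positions."""
    cuts = sorted(voter_positions) + [len(q_blocks)]
    ok = 1.0
    for start, end in zip(cuts, cuts[1:]):
        ok *= 1.0 - stage_failure(q_blocks[start:end], q_voter)
    return 1.0 - ok

def best_placement(q_blocks, q_voter, max_voter_sets):
    """Exhaustively try every placement of up to max_voter_sets voter sets.

    Position k means "voters in front of block k"; position 0 is always
    used (a simplifying assumption of this sketch).
    """
    best = None
    interior = range(1, len(q_blocks))
    for extra in range(max_voter_sets):
        for chosen in combinations(interior, extra):
            positions = (0,) + chosen
            q_fail = cascade_failure(q_blocks, positions, q_voter)
            if best is None or q_fail < best[0]:
                best = (q_fail, positions)
    return best
```

For four identical blocks and two permitted voter sets, `best_placement([1e-3] * 4, 1e-4, 2)` selects positions `(0, 2)`, i.e., an even division of the cascade. The number of placements grows combinatorially with network size, which is why shorter procedures are of interest.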
Although this prior work provides an adequate framework
for determining, for certain systems, the expected gain in reliability
from the use of the restoration technique, several important questions
have remained unanswered. These include the following.
(a) For a given logical network, and a maximum permitted
overall redundancy, how can one estimate the expected
reliability, assuming optimum allocation of voters?
This question has remained unanswered in spite of the
abundance of research. For example, a reliability
simulation was conducted 7s of an arithmetic and con-
trol section of an airborne digital computer, which
as a nonredundant unit required 300 4-input, single-
output gates and exhibited a probability of failure
of 0.095. With triple replication and the use of
3 X 246 = 738 voters it was found that the probability
of failure was decreased to 0.004. On the basis of
visualizing the computer section as a cascade of 300
triplicated gates with 3 X 246 voters placed optimally
in the network, it can be shown by the analysis
technique discussed in Ref. 164 that the failure
probability is 0.0001. Thus it is indicated that
extreme care should be exercised in applying results
derived from consideration of the simple model.
(b) What logical design techniques, if any, relating to
the logical dependence of the outputs of a multiple-
output network, yield replicated networks with minimum
probability of failure?
(c) For a given logical network, and a maximum permitted
redundancy, what are "simple" techniques for deter-
mining the optimum allocation of the available
redundancy?
Question (a) is discussed in Sec. II-A-2-a-3), where
several simple techniques are presented for the analysis of given repli-
cated logical networks with arbitrary placement of voters. These analysis
techniques are then used in Sec. II-A-2-a-4) to derive upper and lower
bounds on the redundant network reliability (assuming that the initial
network has a high circuit reliability) as functions of the number of
gates in the network and the maximum fan-in and fan-out. It is shown
that the simple cascade model provides an optimistic estimate of the
probability of failure.
Question (b) pertains to optimum logic design techniques
for multiple-output networks which are replaced by redundant versions of
the network with a restoring organ for each output. The major query is
whether, on the basis of minimum probability of failure, the network
should be realized in a minimal-gate manner--with, of course, a dependency
in the outputs--or whether it should be implemented with a distinct
independent block for each output. In Sec. II-A-2-a-5) it is shown that
for a complete decoding tree the minimal dependent output implementation
is to be preferred, although at this time we have not yet established
that the minimal-gate solution is to be preferred in general.
Question (c) concerns simple techniques for the determina-
tion of the optimum replication order and the optimum placement of voters
given a fixed, but arbitrary, amount of available redundancy. Except for
the simple cascade and tree models this question has remained unanswered,
although we have formulated several conjectures relating to heuristic
programming techniques, which are discussed in Sec. II-A-2-a-6).
Before proceeding to the development of analytical
techniques for answering the three questions, it is convenient to define
seven assumptions concerning circuit failures.
(1) A nonredundant network is composed of many circuit
blocks; the operation of each circuit block is
essential** for overall network operation.
(2) A circuit under consideration either is working
properly or has failed completely.
(3) If a circuit has failed, its output will always be
in error, regardless of the condition of the input
variables. This is tantamount to assuming that the
period of occurrence of those inputs for which the
output is in error due to the failure is low
compared to the mission life.
(4) The failure of a circuit will not affect the input
to the circuit.
(5) If a circuit has failed, it can only be restored to
proper operation by being repaired.
(6) Circuit failures are independent and are due primarily
to random component failures, as contrasted to
predictable component wear.
(7) The probability of circuit failures is small,
throughout the mission time of interest.
** This assumption negates the possibility of inherent logical redundancy
which can effect fault masking.
In the discussion to follow we will determine the perfor-
mance of complex redundant systems as a function of the failure probabili-
ties which are assigned to each of the circuit blocks of the network.
Although it is understood that the overall failure probability of the
network will be a function of time, the exact functional dependency can
be determined only if a temporal function relating the occurrence of
circuit failures is known. It has been shown 6s that for the assumption
of random circuit failures, the number of circuit failures in a given
period of time will have a Poisson distribution if each circuit is
repaired or replaced after it has failed, and also if several other weak
conditions are satisfied. The replacement (or renewal) assumption is
valid for redundant systems with high values of replication, in which
case for the purposes of calculation it can be assumed that the probability
of an individual circuit block operating correctly decreases exponentially
with time.
It is not immediately clear that the exponential failure
law applies to the circuit blocks when low orders of replication (e.g. 3)
are used, but because of the unavailability of accurate data on the
distribution of component life we will assume that the exponential law,
with published parameters for components, can be applied.
In the consideration of an appropriate measure of reli-
ability it is important to weigh the intended application of the system.
Clearly all of the pertinent information concerning system reliability
is contained in the function P(t), which is defined as the probability
that the system is operating correctly at time t. The system behavior
could also be expressed in terms of the function Q(t) = 1 - P(t), the
probability that the system will fail during the time interval (0, t).*
The relative improvement in reliability obtained by using redundancy can
be expressed as
$$\frac{P(t)\ \text{for redundant system}}{P(t)\ \text{for nonredundant system}} = \frac{P_R(t)}{P_0(t)} \qquad \text{(II-1)}$$
In many cases it is difficult to derive explicit functions
P(t), Q(t), and moreover it has been the practice of reliability engineers
to refer to one parameter as a means for describing performance. The
commonly used parameter is the mean time to failure (MTF), defined as
$$\text{MTF} = \int_0^{\infty} P(t)\,dt . \qquad \text{(II-2)}$$
The corresponding improvement factor is expressed as
$$I = \frac{\text{MTF for redundant system}}{\text{MTF for nonredundant system}} \qquad \text{(II-3)}$$
Assuming that the exponential failure law is satisfied for
the system, then the MTF expresses the time at which the probability of
the system operating correctly is 1/e ≈ 0.37. If a single parameter is
used to describe the system performance for such applications as manned
space missions, it is evident that the term MTF is not appropriate since
it is unlikely that a mission would be permitted to progress to the point
where the probability of success is as low as 0.37. Perhaps a more
appropriate single term for describing the performance of a system is the
useful life 164, s $T_A$, which is defined to be the longest mission time for
which the probability of failure is no greater than A, where A is usually
much less than one.
It then follows that
$$Q(T_A) = A , \qquad \text{(II-4)}$$
and the corresponding improvement factor is defined as
$$\frac{T_A\ \text{for redundant system}}{T_A\ \text{for nonredundant system}} . \qquad \text{(II-5)}$$
* In the following sections it will at times be convenient to discard
the argument t in referring to success or failure probabilities. It
is understood, however, that these probabilities indeed are temporal
functions.
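The contrast between the two measures can be made concrete with a short numerical sketch. It assumes an exponential law $P_0(t) = e^{-\lambda t}$ for the nonredundant system and an idealized triplicated system with a perfect voter, $P_R(t) = 3P_0^2 - 2P_0^3$; the rate `LAMBDA`, the integration horizon, and the function names are illustrative assumptions.

```python
import math

LAMBDA = 1.0  # assumed failure rate of the nonredundant system (per unit time)

def p_simplex(t):
    """Exponential failure law for the nonredundant system."""
    return math.exp(-LAMBDA * t)

def p_tmr(t):
    """Triplicated system with a perfect voter (an idealization)."""
    p = p_simplex(t)
    return 3 * p**2 - 2 * p**3

def mtf(p_func, horizon=60.0, steps=60_000):
    """Eq. (II-2) by the trapezoidal rule, truncated at `horizon`."""
    h = horizon / steps
    total = 0.5 * (p_func(0.0) + p_func(horizon))
    total += sum(p_func(k * h) for k in range(1, steps))
    return total * h

def useful_life(p_func, a, hi=60.0):
    """Largest T with Q(T) = 1 - P(T) <= a, found by bisection (Eq. II-4)."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 1.0 - p_func(mid) <= a:
            lo = mid
        else:
            hi = mid
    return lo
```

Under these assumptions the MTF improvement factor of Eq. (II-3) is 5/6, i.e., triplication appears to make matters worse, while the useful-life improvement factor for A = 0.01 is roughly six. This supports the preference for the useful life as the single figure of merit when A is small.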
In the following section a brief review is presented of
the simple models which visualize a complex network as a cascade or tree
of simple logic networks.
2) Techniques for the Analysis of Simple Models*
In Fig. II-A-2 the use of redundant binary circuits
followed by a redundant set of majority vote takers is illustrated.
FIG. II-A-2 STAGE FOR CASCADE NETWORK
If voters are spaced throughout a large redundant logical network it is
not difficult to visualize the network as consisting of sets of redundant
circuit blocks surrounded by redundant voters. It will be convenient to
define a stage in such a redundant network as a portion of the network
including a set of input voters and all of the circuit blocks between
the input and output voters. The stage concept, which is admittedly
ambiguous at this point, will be clarified in Sec. II-A-2-a-3) when
arbitrary networks are discussed. At present it will suffice to refer
to a stage, shown enclosed in the dotted lines of Fig. II-A-2, in a
simple cascade of redundant, single-input, single-output circuit blocks
with spaced voters. (It will at times be convenient to represent the
replicated networks with voters as a nonredundant network with circles
placed at locations where the set of replicated voters would appear.
This simplified representation is illustrated in Fig. II-A-3 for the
cascade stage.)
* Portions of this section incorporate some of the analytical techniques
presented in Refs. 164, 69, and 188.
FIG. II-A-3 SIMPLIFIED REPRESENTATION OF A RESTORED NETWORK STAGE
In this case the order of replication is 2e + 1, e = 1,
2, .... We will characterize the stage (distinguished as the ith stage)
as operating (correctly) if at least e + 1 of the circuit blocks
$S_1^{(i)}, S_2^{(i)}, \ldots, S_{2e+1}^{(i)}$ provide correct outputs,
given that at least e + 1 of the previous circuit blocks (contained in
the (i - 1)th stage) $S_1^{(i-1)}, S_2^{(i-1)}, \ldots, S_{2e+1}^{(i-1)}$
provide correct outputs. Similarly the ith stage
is characterized as not operating if at least e + 1 of the pertinent
circuit blocks do not provide correct outputs. With these definitions
of "stage" and "operating stage," an entire network will be operating
if all of its stages are operating.
Referring to Fig. II-A-2, the probability $q^{(i)}$ that the ith
stage is not operating is
$$q^{(i)} = \sum_{j=e+1}^{2e+1} \binom{2e+1}{j} (p\,p_v)^{2e+1-j} (1 - p\,p_v)^{j} \qquad \text{(II-6)}$$
where
$$\binom{2e+1}{j} = \frac{(2e+1)!}{j!\,(2e+1-j)!}$$
$p$ = probability that a circuit block is operating
$p_v$ = probability that a voter is operating
(II-7)
If Eq. (II-6) is expanded in a series involving powers of
the parameters $q = 1 - p$ and $q_v = 1 - p_v$, and all higher-order terms
are discarded, it is noted that
$$q^{(i)} \approx \binom{2e+1}{e+1} (q + q_v)^{e+1} \quad \text{when } q, q_v \ll 1 . \qquad \text{(II-8)}$$
It is of interest to augment the conditions $q, q_v \ll 1$, for which the
above approximate equation is valid, with more precise conditions. It
can be shown that the term of the expansion of Eq. (II-6) in which $q$, $q_v$
appear with exponents summing to e + 2 is
$$\binom{2e+1}{e+2} [q + q_v]^{e+2} - \binom{2e+1}{e+1} \left[ q^{e+1} q_v + q_v^{e+1} q + e\,q^{e+2} + e\,q_v^{e+2} \right] . \qquad \text{(II-9)}$$
The ratio of (II-9) to Eq. (II-8) is, for $q \approx q_v$, of the order of q.
This indicates that the error introduced by using the approximation
(II-8) is negligible for practical systems. Also, for e = 1,
(II-9) is negative, indicating that (II-8) provides an upper bound on the
stage failure probability.
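The relation between the exact stage failure probability and its leading-term approximation can be checked numerically. The sketch below takes Eq. (II-6) as the sum, over j from e + 1 to 2e + 1, of the binomial probabilities that j of the 2e + 1 voter-plus-block chains fail, and Eq. (II-8) as $\binom{2e+1}{e+1}(q + q_v)^{e+1}$; the function names are hypothetical.

```python
from math import comb

def stage_failure_exact(e, q, qv):
    """Eq. (II-6): at least e+1 of the 2e+1 voter-plus-block chains fail."""
    n = 2 * e + 1
    ok = (1.0 - q) * (1.0 - qv)  # p * p_v, probability that one chain works
    return sum(comb(n, j) * ok ** (n - j) * (1.0 - ok) ** j
               for j in range(e + 1, n + 1))

def stage_failure_approx(e, q, qv):
    """Eq. (II-8): leading term, valid for q, q_v << 1."""
    return comb(2 * e + 1, e + 1) * (q + qv) ** (e + 1)
```

For e = 1, q = 0.001, and q_v = 0.0001 the two values agree to within a fraction of a percent, and the approximation lies above the exact value, as the text states for triple replication.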
Equation (II-8) can be derived in an alternate manner.
Let us say that the stage of Fig. II-A-2 does not operate correctly if
and only if the proper combinations of e + 1 failures occur covering
voters and circuit blocks. (Under this failure condition we ignore the
occurrence of more than e + 1 failures.) As an illustration of the
procedure for the counting of the proper failure combinations, let e - 1
of the total of e + 1 permitted failures occur in the voters, and 2
failures occur in the circuit blocks. There are then $\binom{2e+1}{e-1}$
combinations of e - 1 voter failures, and $\binom{e+2}{2}$ circuit-block*
failures which can result in stage failure. Hence the probability of
stage failure due to e - 1 voter failures and 2 circuit-block failures is
$$\binom{2e+1}{e-1} \binom{e+2}{2} q_v^{e-1} q^{2} p_v^{e+2} p^{2e-1} \approx \binom{2e+1}{e-1} \binom{e+2}{2} q_v^{e-1} q^{2} \quad \text{for } q, q_v \ll 1 .$$
Considering, then, all proper combinations of voter and circuit-block
failures, the expression for the probability of stage failure is
approximated by
$$q^{(i)} \approx \sum_{j=0}^{e+1} \binom{2e+1}{e+1-j} \binom{e+j}{j} q_v^{e+1-j} q^{j} = \binom{2e+1}{e+1} (q_v + q)^{e+1} . \qquad \text{(II-10)}$$
This failure-counting technique will be exploited further in Sec. II-A-
2-a-3) in the consideration of arbitrary networks.
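The equality of the counted sum and the closed form in Eq. (II-10) follows from the identity $\binom{2e+1}{e+1-j}\binom{e+j}{j} = \binom{2e+1}{e+1}\binom{e+1}{j}$ and the binomial theorem. A brief numerical check, with hypothetical function names and arbitrarily chosen test values, is sketched below.

```python
from math import comb, isclose

def counted_failure(e, q, qv):
    """Sum over j circuit-block failures and e+1-j voter failures."""
    return sum(comb(2 * e + 1, e + 1 - j) * comb(e + j, j)
               * qv ** (e + 1 - j) * q ** j
               for j in range(e + 2))

def closed_form(e, q, qv):
    """Binomial coefficient times (q_v + q) raised to the power e+1."""
    return comb(2 * e + 1, e + 1) * (qv + q) ** (e + 1)

def identity_holds(max_e=4, q=1e-3, qv=2e-4):
    """Compare both sides for e = 1 .. max_e at the given (assumed) values."""
    return all(isclose(counted_failure(e, q, qv), closed_form(e, q, qv),
                       rel_tol=1e-12)
               for e in range(1, max_e + 1))
```

The term with j = 2 in `counted_failure` is the case worked out in the text: e - 1 voter failures combined with 2 circuit-block failures.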
When a network is formed by connecting N identical stages
of the type shown in Fig. II-A-2 in a simple cascade, the resultant
expression for the probability of network failure is

$$ Q_R = 1 - \left(1 - q^{(i)}\right)^{N} \tag{II-11} $$
* Once the positions of the voter failures have been established there
are only e + 2 circuit block locations remaining for which combinations
of failures can result in stage failure.
Equation (II-11) can be approximated as
$$ Q_R \approx N \binom{2e+1}{e+1}\,(q_v + q)^{e+1} \tag{II-12} $$
with an error term on the order of the value specified by Eq. (II-9) but
increased by a factor of N.
Equation (II-12) reflects the probability of overall network
failure based upon the assumption that only the proper combination
of e + 1 failures occurring in the N(2e + 1) circuit blocks and voters
can result in failure.
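As a rough numerical illustration of how closely Eq. (II-12) tracks Eq. (II-11), the sketch below evaluates both for a cascade of N stages, taking the stage failure probability from the leading-order expression of Eq. (II-8). The function names and the sample values are assumptions made for illustration.

```python
from math import comb, expm1, log1p

def stage_failure(q, qv, e):
    """Leading-order stage failure probability, Eq. (II-8)."""
    return comb(2 * e + 1, e + 1) * (q + qv)**(e + 1)

def cascade_failure_exact(q, qv, e, n_stages):
    """Eq. (II-11): 1 - (1 - q_stage)^N, evaluated stably for tiny q_stage."""
    return -expm1(n_stages * log1p(-stage_failure(q, qv, e)))

def cascade_failure_approx(q, qv, e, n_stages):
    """Eq. (II-12): N * C(2e+1, e+1) * (qv + q)^(e+1)."""
    return n_stages * stage_failure(q, qv, e)

if __name__ == "__main__":
    exact = cascade_failure_exact(1e-4, 1e-4, 1, 100)
    approx = cascade_failure_approx(1e-4, 1e-4, 1, 100)
    print(f"exact = {exact:.8e}, approx = {approx:.8e}")
```

`expm1` and `log1p` are used only to avoid cancellation when the stage failure probability is very small.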
As a final remark concerning the reliability of simple
cascades, consider the problem of determining the optimum number of
voters (and also, as a trivial question for the cascade, the placement
of the voters) so as to minimize the failure probability. Assume that
in the cascade of N replicated circuit blocks, a set of voters is placed
after every N/N' circuit blocks.* We then find that the expression for
network failure probability Q_R reduces to the following Eq. (II-13),
where it is noted that the network, whose performance is described by
Eq. (II-12), now consists of a cascade of N' stages and the probability
of circuit-block failure in each stage is (N/N')q.
Then

$$ Q_R \approx N' \binom{2e+1}{e+1} \left( q_v + \frac{N}{N'}\, q \right)^{e+1} \tag{II-13} $$

We find that Q_R is minimized (performing the minimization
over the variable N') for

$$ N' q_v = e N q . \tag{II-14} $$

* It is assumed that N' divides N.
For e = 1 (which, it is recalled, is tantamount to triple
replication) the optimum stage division of the network is such that the
failure probability of the voter is equal to the combined (nonreplicated)
failure probability of the circuit blocks in each stage. It is noted
that with the optimum stage division, for triple replication, the amount
of equipment required is on the order of six times the nonredundant
equipment, assuming that circuits exhibiting comparable failure probabili-
ties are of similar complexity. If fewer stage divisions than specified
by Eq. (II-14) are employed in order to reduce the overall redundancy
level, then the optimum placement of the available voters is such that
all stages exhibit equal (or nearly equal) failure probabilities.
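The optimum of Eq. (II-14) can be illustrated by a brute-force search over the admissible stage counts. This is a sketch under stated assumptions: it uses the approximate expression of Eq. (II-13), restricts N' to divisors of N, and the function names and sample values are hypothetical.

```python
from math import comb

def cascade_failure(n_blocks, n_stages, q, qv, e):
    """Eq. (II-13): N' * C(2e+1, e+1) * (qv + (N/N') q)^(e+1)."""
    return (n_stages * comb(2 * e + 1, e + 1)
            * (qv + (n_blocks / n_stages) * q)**(e + 1))

def best_stage_count(n_blocks, q, qv, e):
    """Search all N' that divide N for the smallest failure probability."""
    divisors = [d for d in range(1, n_blocks + 1) if n_blocks % d == 0]
    return min(divisors,
               key=lambda d: cascade_failure(n_blocks, d, q, qv, e))

if __name__ == "__main__":
    n, q, qv = 120, 1e-5, 4e-5
    for e in (1, 2):
        found = best_stage_count(n, q, qv, e)
        predicted = e * n * q / qv          # Eq. (II-14)
        print(f"e = {e}: search gives N' = {found}, "
              f"Eq. (II-14) gives {predicted:g}")
```

For these sample values the search returns the stage counts predicted by N' q_v = e N q.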
It is of interest to determine the tradeoffs between
reliability improvement achieved through the use of redundancy as a
function of the cost of implementation, and the cost of the nonredundant
network. In this instance the cost factor will simply reflect the ratio
of the number of components in the redundant and nonredundant networks.
The analytical procedure, however, can be easily modified to include
other cost factors.
Consider the simple cascade of N circuit blocks each with
failure probability q = 1 - p. The probability of failure for the
resultant restored network, with voters placed after every N/N' circuit
blocks, can be derived in a conventional manner by application of simple
modifications of Eqs. (II-6) and (II-11),* accounting for the existence
of N' stages as opposed to N stages. The following assumptions are
pertinent to the analysis:
* The following discussion related to reliability as a function of cost
was adapted from notes of E. K. Van De Riet. The results reported
herein are an extension of the results presented in Ref. 164 to include
the effect of realistic cost and reliability measures for the voters.
The exact formulas for failure probability rather than the approximate
formula (II-12), were employed because some consideration was given to
circuit-block and voter failure probabilities which were not low
enough to warrant the use of the approximation.
(a) There are an equal number of components in a single
    circuit block and a 3-input voter circuit.
(b) The complexity of a voter circuit, for more than 3
    inputs, is proportional to 3^{e-1}. (This complexity
    assumption is based upon a component count of
    nonminimal realizations of majority gates. The
    majority-gate implementation presented in Sec. II-A-2-c
    indicates that perhaps a more realistic complexity
    factor would be 2e-1.)
(c) We can define the following overall restored-network
    cost ratio C, based upon the above assumptions, and
    also contingent upon the assumption of component
    count as the primary cost factor.

$$ C = (2e+1) + \frac{N'}{N}\,(2e+1)\,3^{e-1} . \tag{II-15} $$

    Also the probability of voter failure can be
    expressed as

$$ q_v = 3^{e-1} q . \tag{II-16} $$
It will be convenient to express all failure probabilities
in terms of the failure probability of the nonredundant network,
Q_0 = 1 - P_0. Thus the probability of a circuit block operating is
given by

$$ p = (1 - Q_0)^{1/N} . $$
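Then the expression for network failure probability becomes

$$ Q_R = 1 - \left\{ 1 - \sum_{j=e+1}^{2e+1} \binom{2e+1}{j} \left[ 1 - (1-Q_0)^{1/N'} \left( 1 - 3^{e-1}\left[ 1 - (1-Q_0)^{1/N} \right] \right) \right]^{j} \left[ (1-Q_0)^{1/N'} \left( 1 - 3^{e-1}\left[ 1 - (1-Q_0)^{1/N} \right] \right) \right]^{2e+1-j} \right\}^{N'} \tag{II-17} $$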
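The cost and reliability measures can be evaluated together in a few lines. The sketch below is an illustration, not the procedure used to produce Fig. II-A-4: it assumes that a line of a stage operates only if its voter and all N/N' of its circuit blocks operate, takes q_v from Eq. (II-16) and C from Eq. (II-15), and uses hypothetical function names.

```python
from math import comb

def cost_ratio(e, n_blocks, n_stages):
    """Eq. (II-15): C = (2e+1) + (N'/N)(2e+1) 3^(e-1)."""
    return (2 * e + 1) * (1 + (n_stages / n_blocks) * 3**(e - 1))

def restored_failure(q0, e, n_blocks, n_stages):
    """Network failure probability for N' stages (ASSUMED line model)."""
    p_block = (1.0 - q0)**(1.0 / n_blocks)       # p = (1 - Q0)^(1/N)
    q_voter = 3**(e - 1) * (1.0 - p_block)       # Eq. (II-16)
    # A line works if its N/N' blocks and its voter all work.
    p_line = (1.0 - q0)**(1.0 / n_stages) * (1.0 - q_voter)
    n = 2 * e + 1
    q_stage = sum(comb(n, j) * (1.0 - p_line)**j * p_line**(n - j)
                  for j in range(e + 1, n + 1))
    return 1.0 - (1.0 - q_stage)**n_stages

if __name__ == "__main__":
    q0, n = 0.01, 100
    for e, n_stages in ((1, 100), (1, 50), (2, 50)):
        ratio = restored_failure(q0, e, n, n_stages) / q0
        print(f"e = {e}, N' = {n_stages}: C = {cost_ratio(e, n, n_stages):.2f}, "
              f"Q_R/Q_0 = {ratio:.3e}")
```

With e = 1 and a voter after every block (N' = N) the cost ratio is 6, in agreement with the sixfold equipment estimate given for triple replication.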
Figure II-A-4 shows plots of the ratio of redundant
failure probability and nonredundant failure probability, Q_R/Q_0, as
a function of the cost ratio and the nonredundant-network failure
probability. The curves provide an indication of the cost of achieving
failure-probability improvements corresponding to Q_R/Q_0 = 10^{-1},
10^{-2}, 10^{-3}, 10^{-4}. The discontinuities in the curves for the
latter three values of Q_R/Q_0 indicate where the particular reliability
improvement can be achieved by changing the order of replication.
FIG. II-A-4 RELIABILITY IMPROVEMENT AS A FUNCTION OF COST FOR CASCADE MODEL
(Horizontal axis: nonredundant failure probability Q_0, from 0.1 to 0.0001.)
The preceding discussion is concerned with the effect of
voting-type redundancy in that model wherein a computer network is
visualized as a cascade of identical single-input, single-output circuit
blocks. The analytical techniques employed therein can be directly
extended to a case where a uniformly converging tree is realized from a
basic primitive fan-in circuit block. A redundant fan-in stage with
f inputs and one output is shown in Fig. II-A-5. As with the cascade
stage of Fig. II-A-2, it is convenient to visualize the set of input
voters as associated with the stage. We find that the probability of
the fan-in stage not operating is approximated by

$$ q^{(i)} \approx \binom{2e+1}{e+1}\,(q + f q_v)^{e+1} \qquad \text{for } q, q_v \ll 1 \tag{II-18} $$

The error term is evaluated from Eq. (II-9) by replacing q_v with f q_v.
A uniform tree is easily formed from the fan-in stages; in this case
voters are placed on f input lines to each circuit block. For example,
a 4-level tree is shown in Fig. II-A-6, where the dotted lines indicate
the division of stages (of which there are 15). The network failure
probability can be approximated by multiplying the number of stages formed
by the value of q^{(i)} derived from Eq. (II-18). If, in the uniform tree,
voters do not appear on the inputs to each circuit block (as shown also
in Fig. II-A-6, where voters appear every other level to form 5 stages,
enclosed in dashed lines), then the application of the
stage failure probability expression, Eq. (II-18), must be altered. This
is because for the case considered here each stage encloses three circuit
blocks* and the fan-in to each stage is increased to 4. Thus for each of
the "dashed" stages of Fig. II-A-6 the probability of failure can be
expressed as

$$ q^{(i)} \approx \binom{2e+1}{e+1}\,(3q + 4q_v)^{e+1} \tag{II-19} $$
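The two partitionings of the 4-level tree can be compared numerically. The sketch below assumes a fan-in of f = 2 per circuit block (inferred from the 15 single-block stages and the fan-in of 4 for a three-block stage) and uses hypothetical function names and sample failure probabilities.

```python
from math import comb, isclose

def stage_failure(q, qv, e, blocks_per_stage, fan_in):
    """Generalizes Eqs. (II-18) and (II-19):
    C(2e+1, e+1) * (blocks * q + fan_in * qv)^(e+1)."""
    return (comb(2 * e + 1, e + 1)
            * (blocks_per_stage * q + fan_in * qv)**(e + 1))

def tree_failure(q, qv, e, n_stages, blocks_per_stage, fan_in):
    """Network failure ~ number of stages times the stage failure."""
    return n_stages * stage_failure(q, qv, e, blocks_per_stage, fan_in)

if __name__ == "__main__":
    q = qv = 1e-4
    every_level = tree_failure(q, qv, 1, n_stages=15, blocks_per_stage=1, fan_in=2)
    every_other = tree_failure(q, qv, 1, n_stages=5, blocks_per_stage=3, fan_in=4)
    print(f"voters at every level:       {every_level:.3e}")
    print(f"voters at every other level: {every_other:.3e}")
```

For these values, restoring at every level gives the lower failure probability, at the price of 30 voter groups instead of 20.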
The relative values of q and q_v which minimize the failure
probability of a tree can be easily determined in a manner similar to
that used for the cascade. Curves for reliability improvement as a
function of cost can also be easily derived for the tree model. The
results for the tree case are not presented here since it is felt that
the corresponding results for the cascade model provide a sufficient
qualitative measure of the utility of the restoration technique for a
given network size. In the following section techniques are discussed
for the analysis of specific arbitrary networks and the utility of the
simple cascade model is discussed.

* Clearly deleted from the stage are the two voters connected to the
output circuit block.

FIG. II-A-5 FAN-IN STAGE FOR TREE NETWORK

FIG. II-A-6 STAGE PARTITIONING OF UNIFORM TREE
3) Techniques for the Analysis of Arbitrary Triplicated
Networks
In the preceding section it was shown that the probability
of failure of a cascade-type restored stage (Fig. II-A-2) could be
approximated by determining the number of sets of e + 1 failures, e + 2
failures, etc., which resulted in stage failure. A similar analysis of
determining the occurrence of the "most-likely" failures will be used
to evaluate the failure probability of arbitrary networks, and the
analytical method will be illustrated with an example. Consider the
"arbitrary" triplicated* stage of Fig. II-A-7, where it is assumed that
the failure probability of each voter is q_v and the failure probability
of each circuit block S_ij, i = 1, ..., 4, j = 1, ..., 3 is q. (This
latter assumption concerning the equality of circuit-block failure
probabilities is based upon our supposition that logical designers will
probably attempt to employ large sets of nearly identical blocks in
designing combinational and sequential nets. If this assumption is not
valid in certain applications, then the analytical method to be discussed
is still applicable in theory, but the "bookkeeping" operations become
somewhat involved.)
* Unless otherwise stated we will, in the remainder of this section,
consider only triple-order replication; but it should be noted that
the results reported herein can be extended to higher-order replication
values.

FIG. II-A-7 "ARBITRARY" TRIPLICATED STAGE

As a first step in the analysis of this stage let us
count the number of occurrences of two failures which result in stage
failure, where the stage (of Fig. II-A-7) is defined as operating correctly
if and only if at least two of the outputs designated A and at least two of
the outputs designated B are correct. For example, failures of the
elements V_11 and S_42 will result in two errors in the output B (and
hence will result in stage failure), while failures of the elements
S_21 and S_42 will not result in stage failure. In the terminology of
Ref. 141 we say that groups V_1 and S_4* are linked† and groups S_2 and
S_4 not linked, where two groups are defined as linked if two failures, one
occurring in each group, can result in stage failure. It is noted that
all groups are trivially linked to themselves. In the consideration of
the linked groups S_i and S_i',‡ i ≠ i', it is easily shown that there are
6 sets of double failures (one failure occurring in each group) which
can result in stage failure. In the consideration of the trivially
linked groups S_i and S_i it is verified that there are 3 sets of double
failures which result in stage failure. If two groups are not linked
then there are no double failures covering these 2 groups which can result
in stage failure.

* Only a single subscript is used if the purpose is to identify a group
of 3 replicated elements (voters or circuit blocks).

† The analytical approach based on linked groups is somewhat related to
previous investigations (Refs. 140, 141), but these prior studies give
accurate results only when the probability of occurrence of more than
two failures is minimal or when the network structure closely
approximates either of the aforementioned simple models. We have modified
the procedure so as to account for the occurrence of more than two failures.
The "bookkeeping" procedure for the double-failure patterns
is facilitated by a table such as Table II-A-1, relating to the stage of
Fig. II-A-7 and reflecting the linked groups in the stage. An entry in
the table is either 3 or 0, according to whether the groups corresponding
to the respective row and column headings are linked or not linked.

                      Table II-A-1
       DOUBLE FAILURE PATTERNS BY LINKING PROCEDURE

              V_1  V_2 |  S_1  S_2  S_3  S_4
       V_1     3    3  |   3    3    3    3
       V_2     3    3  |   3    3    3    3
       ----------------+---------------------
       S_1     3    3  |   3    3    3    3
       S_2     3    3  |   3    3    0    0
       S_3     3    3  |   3    0    3    0
       S_4     3    3  |   3    0    0    3

       Region I: upper left (V rows, V columns).
       Region II: upper right and lower left (one V and one S heading).
       Region III: lower right (S rows, S columns).

‡ We could have equivalently selected the linked groups V_i and V_i', or
the groups S_i and V_i'.

The table is partitioned into three regions, each reflecting the element
types in which the failures have occurred. Region I relates to the
occurrence of failures exclusively in voters; region II relates to the
occurrence of one voter failure and one circuit-block failure; and region
III relates to the occurrence of failures exclusively in the circuit
blocks. The probability of stage failure resulting from the occurrence of
two element failures is then derived from the double-failure-pattern
table as
$$ \begin{aligned} \text{Prob. of stage failure from 2 element failures} ={}& (\text{sum of entries in Region I})\,\bigl[ q_v^{2} (1-q_v)^{N_v^{(r)}-2} (1-q)^{N_s^{(r)}} \bigr] \\ &+ (\text{sum of entries in Region II})\,\bigl[ q_v q\, (1-q_v)^{N_v^{(r)}-1} (1-q)^{N_s^{(r)}-1} \bigr] \\ &+ (\text{sum of entries in Region III})\,\bigl[ q^{2} (1-q_v)^{N_v^{(r)}} (1-q)^{N_s^{(r)}-2} \bigr] \end{aligned} \tag{II-20} $$

where N_v^{(r)} and N_s^{(r)} are the number of voters and the number of
circuit blocks (replicated), respectively, in the stage. It will be
convenient to approximate Eq. (II-20) as

$$ \text{Prob. of stage failure from 2 element failures} \approx f_2(q, q_v) + f_3(q, q_v) \tag{II-21} $$

where f_2(q, q_v) and f_3(q, q_v) are respectively the terms in the expansion
of Eq. (II-20) in which the sum of the exponents of q and q_v are 2 and 3.
We can visualize f2(q,qv) as representing an approximation to the failure
probability and f3(q,qv ) as representing a contribution to the dominant
error term.
Hence for the stage being considered the expression for
the probability of failure becomes

$$ 12 q_v^{2} (1-q_v)^{4} (1-q)^{12} + 24 q q_v (1-q_v)^{5} (1-q)^{11} + 30 q^{2} (1-q_v)^{6} (1-q)^{10} \approx \left( 12 q_v^{2} + 24 q q_v + 30 q^{2} \right) - \left( 48 q_v^{3} + 264 q_v^{2} q + 444 q_v q^{2} + 300 q^{3} \right) . $$
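The second- and third-order terms quoted above can be reproduced by expanding the three products mechanically. The sketch below is illustrative (the helper name is hypothetical); it keeps every monomial q_v^a q^b with a + b ≤ 3 and collects integer coefficients.

```python
from math import comb
from collections import defaultdict

def expand_term(coeff, a, b, m, n, max_degree=3):
    """Expand coeff * qv^a * q^b * (1 - qv)^m * (1 - q)^n and keep the
    monomials of total degree <= max_degree.
    Returns {(power of qv, power of q): coefficient}."""
    out = defaultdict(int)
    for i in range(m + 1):            # power of qv taken from (1 - qv)^m
        for k in range(n + 1):        # power of q taken from (1 - q)^n
            if a + i + b + k <= max_degree:
                out[(a + i, b + k)] += (coeff * comb(m, i) * (-1)**i
                                        * comb(n, k) * (-1)**k)
    return out

# (coefficient, power of qv, power of q, exponent of (1-qv), exponent of (1-q))
TERMS = [(12, 2, 0, 4, 12), (24, 1, 1, 5, 11), (30, 0, 2, 6, 10)]

series = defaultdict(int)
for term in TERMS:
    for key, value in expand_term(*term).items():
        series[key] += value

f2 = {k: v for k, v in series.items() if sum(k) == 2}
f3 = {k: v for k, v in series.items() if sum(k) == 3}

if __name__ == "__main__":
    print("f2 coefficients:", dict(sorted(f2.items())))
    print("f3 coefficients:", dict(sorted(f3.items())))
```

Keys are (power of q_v, power of q), so for example the entry (2, 1) is the coefficient of q_v^2 q.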
It is of interest to consider the effect of error patterns
of weight greater than 2 on the failure probability, to augment the error
term f3(q,qv). We have developed a systematic procedure for determining
the contribution to the failure probability from triple element failures,
and this procedure can be generalized for the consideration of an arbitrary
number of element failures. As might be expected, the complexity of the
bookkeeping operations increases as additional failure combinations are
considered.
In the consideration of triple-element failures, six cases
completely cover all combinations of group linkings of circuit blocks (or
voters). For example, if we are concerned with the number of triple
failure patterns of the groups S_i, S_i', S_i'' (one failure occurring in
each group) which will result in stage failure, then the counting
procedure simply reflects the manner in which the three groups are linked,
and also the equality of i, i', and i''. As a specific case consider (for
the stage of Fig. II-A-7) the 3 failures as occurring in the groups S_1,
S_2, and S_3. In this case one group (S_1) is linked to each of the other
groups, and also i ≠ i', i ≠ i'', i' ≠ i'', where i = 1, i' = 2, i'' = 3.
There are 27 triple error patterns covering the 3 groups, and 24 of
these (all except the error patterns S_1j S_2j S_3j, j = 1, 2, 3) will
result in stage failure. The six cases are summarized below.
Case a: i = i' = i''; 1 failure pattern.
Case b: i = i' ≠ i''; S_i and S_i'' not linked; 9 failure patterns.
Case c: i = i' ≠ i''; S_i' and S_i'' linked; 9 failure patterns.
Case d: i ≠ i', i ≠ i'', i' ≠ i''; all group pairs not linked;
        0 failure patterns.
Case e: i ≠ i', i ≠ i'', i' ≠ i''; S_i and S_i' linked, remaining
        two group pairs not linked; 18 failure patterns.
Case f: i ≠ i', i ≠ i'', i' ≠ i''; S_i linked to S_i' and S_i'';
        24 failure patterns.
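The pattern counts for double and triple failures can be reproduced by enumeration. The sketch below rests on an assumed failure criterion that is consistent with the counts quoted in the text: a pattern causes stage failure when two failed elements lie on different replicas and belong either to the same group or to linked groups. Group names and function names are hypothetical.

```python
from itertools import combinations, product

REPLICAS = (1, 2, 3)

def links(*pairs):
    """Build a set of linked group pairs, e.g. links(("S1", "S2"))."""
    return {frozenset(p) for p in pairs}

def fails(pattern, linked):
    """ASSUMED criterion: two failed elements on different replicas that
    are in the same group or in linked groups cause stage failure."""
    for (g1, r1), (g2, r2) in combinations(pattern, 2):
        if r1 != r2 and (g1 == g2 or frozenset((g1, g2)) in linked):
            return True
    return False

def count_patterns(groups, linked):
    """Count failing patterns with one failed element per entry of
    `groups`; a repeated name means several failures in that group."""
    patterns = set()
    for replicas in product(REPLICAS, repeat=len(groups)):
        pattern = frozenset(zip(groups, replicas))
        if len(pattern) == len(groups):      # failed elements are distinct
            patterns.add(pattern)
    return sum(fails(p, linked) for p in patterns)

if __name__ == "__main__":
    print("double, linked groups :", count_patterns(("S1", "S2"), links(("S1", "S2"))))
    print("double, same group    :", count_patterns(("S1", "S1"), links()))
    print("case a:", count_patterns(("S1", "S1", "S1"), links()))
    print("case b:", count_patterns(("S1", "S1", "S2"), links()))
    print("case c:", count_patterns(("S1", "S1", "S2"), links(("S1", "S2"))))
    print("case d:", count_patterns(("S2", "S3", "S4"), links()))
    print("case e:", count_patterns(("S1", "S2", "S3"), links(("S1", "S2"))))
    print("case f:", count_patterns(("S1", "S2", "S3"),
                                    links(("S1", "S2"), ("S1", "S3"))))
```

Enumeration of this kind scales poorly, but it is a convenient cross-check on hand-compiled failure-pattern tables for small stages.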
As was the case with the double failure patterns, the
failure probability calculation resulting from the consideration of triple
error patterns is greatly facilitated with the use of a failure-pattern
table, as illustrated for the stage of Fig. II-A-7 by Table II-A-2. The
row headings, as in the table for the double failure patterns, are a
listing of the N_v voter groups followed by the N_s circuit-block groups.
However, the column headings are a listing of the
$\binom{N_v + N_s}{2} + N_v + N_s$
combinations of the groups taken two at a time. The entries in the
table relate to the number of failure patterns corresponding to each of
the 6 above cases, where the particular case is distinguished by the
corresponding row-column headings. For example, the entry corresponding
to row S_3 and column S_2 S_4 is defined by Case e, in that S_3 and S_4 are
linked, and the resultant entry is 6.
The table is partitioned into 4 regions, each reflecting
the element types in which the failures have occurred. Then the
contribution to the stage failure probability from the occurrence of triple
failures is given by the following, where only the lowest-order terms
are retained.

$$ \begin{aligned} \text{Prob. of stage failure from triple failures} \approx{}& (\text{sum of entries in Region I})\, q_v^{3} + (\text{sum of entries in Region II})\, q_v^{2} q \\ &+ (\text{sum of entries in Region III})\, q_v q^{2} + (\text{sum of entries in Region IV})\, q^{3} . \end{aligned} \tag{II-22} $$
* The entries corresponding to Cases d, e, and f are 0, 6, and 8
respectively, or one-third of the number of failure patterns listed for
these cases, since there are 3 locations in the table in which entries
appear for S_i, S_i', S_i'', i ≠ i', i ≠ i'', i' ≠ i''. Those entries
corresponding to Cases b and c are 9/2, or one-half of the number of
failure patterns, since there are 2 locations where i = i' ≠ i''. The
entry corresponding to Case a is 1.

[Table II-A-2, the triple-failure-pattern table for the stage of
Fig. II-A-7, is not legible in the source text.]

Combining the results of Eqs. (II-21) and (II-22) we can
then represent the failure probability of a stage by the term f_2(q, q_v)
plus an error term given by the sum of f_3(q, q_v) and the result of
Eq. (II-22) above. It appears that the error term (for triple
replication) is always negative, but no proof of this conjecture has as yet
been formulated. It also appears, for q ≈ q_v, that the ratio of the
error term to f_2(q, q_v) is on the order of the product q N_s.
For the stage we have been considering, it is easily
verified that the error term is

$$ 20 q_v^{3} + 96 q_v^{2} q + 84 q_v q^{2} + 98 q^{3} $$
Prior to this point in the discussion of arbitrary
restored networks we have been concerned with finding the failure
probability of an artefact unit called a "stage." It is noted, by
referring to the stage of Fig. II-A-7, that voters only appear on the
input lines. However, the techniques illustrated for the stage can be
applied to any arbitrary restored network with arbitrary placement of
voters. For arbitrary networks, the failure-pattern tables for double
and triple failures are applicable, but appropriate row and column
headings must be present for each group of circuit blocks or voters. In
the compilation of this table we note that two groups (circuit blocks
or voters) are linked if there exists at least one path (in the direction
of normal signal flow) connecting the output of either group with the
output of the other group which does not include a voter, or if there are
two paths, one emanating from the output of each group, which intersect
at a circuit block without including a voter. For example, referring
to the network of Fig. II-A-8 (in simplified representation), it is seen
that V_3 and S_4 are linked and S_2 and V_4 are linked, but S_1 and S_2 are
not linked.
It is easily verified that this property of linked groups is consistent
with the definition presented previously in this section in terms of
the effect of failures in the respective groups. It is also noted that
a circuit-block group followed by a voter group is not linked to it (in
the absence of feedback) since there is a voter in the path connecting
the outputs.
FIG. II-A-8 NETWORK TO ILLUSTRATE LINKING
The approximation to the failure probability of a network
as the probability of only two or three linked-circuit-block failures is
only valid if the product of the failure probability of a circuit block
and the number of circuit blocks in the nonreplicated network is much
less than unity.* If this assumption is not valid (although for most
networks in a spaceborne computer the assumption indeed appears to be
satisfied) then an accurate analysis can be performed by visualizing the
network as a composition of stages and utilizing the analytical techniques
discussed in order to determine the reliability of each stage.
In Sec. II-A-2-a-2) an ad hoc definition of a stage was
given. At this time we can formulate the definition in more precise
terms relating to linked groups.
Definition: A stage of an arbitrary restored
network is a collection of circuit-block and
voter groups with voters only on external in-
put lines and also with the restriction that
no group within the stage is linked to a group
external to the stage.
The probability of the network operating correctly is
then given by the product of the operating probabilities of the respec-
tive stages, where the expression for the probability of a particular
stage operating is derived by the methods of this section.†
* An illustrative example is discussed in Sec. II-A-2-a-4).
† The circuit-block failure probabilities will probably be low enough to
permit the evaluation of the stage operating probability in terms of the
probability of 2 or 3 element failures.
It has been noted previously 272 that for certain network
topologies there is some difficulty attendant to the problem of revealing
the stages. As an example, consider the restored network of Fig. II-A-9.
FIG. II-A-9 ARBITRARY NETWORK TO BE SUBDIVIDED INTO STAGES
The subdivision of the network into stages is not possible with the
present connection. However, the subdivision can be performed by
considering the voter V_6 as two voters V_6' and V_6'' appearing on the
appropriate input lines of circuit blocks S_11 and S_12 respectively, and
also considering the voter V_4 as two voters V_4' and V_4'' appearing on
the appropriate lines of circuit blocks S_7 and S_10. Three stages, as shown in
Fig. II-A-10, can then be formed; it can be shown that the failure
probability of the original network is slightly lower than the failure
probability of the modified network with the additional voters. Thus, in
the analysis, the error attendant to the modification is in the direction
of a conservative estimate of reliability.
In the present section techniques have been given for the
analysis of arbitrary restored networks. In many applications it is
important to estimate the reliability of a given network with an assumed
number of available voters, without going through a lengthy analysis. A
method for the evaluation of upper and lower bounds on network failure
probability is discussed in the following section.
FIG. II-A-10 STAGE SUBDIVISION OF NETWORK
4) Bounds on Network-Failure Probability
In this section we will present upper and lower bounds on
the failure probability which can be expected from triplicated restored
networks of a given "size" with a given density of voters. Our major
motivations for considering this problem are to discover the relevance of
describing the performance of arbitrary networks in terms of the simple
cascade model, and also to present easily applied methods for evaluating
a rough measure of the reliability of redundant networks. It is shown
below that visualizing a complex network as a simple cascade gives an
optimistic measure of the reliability.
In considering the derivation of upper and lower bounds on
the expected failure probability of networks we have assumed the following
constraints, in addition, of course, to the assumptions discussed in
Sec. II-A-2-a-1).
(a) The (nonredundant) network is composed entirely of
    one primitive element type; in the analysis to follow
    we assume the primitive element to be a 3-input,
    single-output circuit block where all possible
    interconnections are permitted (provided the fan-in and
    fan-out do not exceed 3). The assumption of this
    primitive element type is not restrictive, and the
    results can be readily generalized.
(b) The voters are dispersed throughout the network so
    that the overall failure probability is close to
    minimal. It is not known how to optimally place the
    voters in an arbitrary network, so we have operated
    on the premise that the restored network will be
    close to optimal if the voters are placed so that the
    maximum number of circuit blocks traversed on any path
    between voters is minimized. (For regularly structured
    networks this placement criterion appears to be
    optimal, but it is not difficult to visualize
    topologies for which the failure probability is not
    minimized by such a technique.) In order to facilitate
    the calculations we assume that the maximum
    number of circuit blocks on nonrestored paths could
    assume, for different cases studied, the values
    1, 2, .... This latter assumption enables the
    determination of bounds on the failure probability for
    different numbers of available voters.
(c) Voters are placed on the output lines of circuit
    blocks.
(d) The circuit-block and voter failure probabilities are
    low enough that the analytical techniques of the
    previous section, based upon only double element
    failures resulting in network failure, are applicable.
First we will consider the determination of the greatest
lower bound on the failure probability, Q_R^{(L)}. Assume that we have a
network composed of interconnections of N_s 3-input single-output circuit
blocks, and that voters appear following every N_s/N' blocks.* Hence for
triple replication 3N' voters are required. We recall that the failure
probability is minimized by minimizing the number of linked circuit
blocks and voters. Since there are at least N_s/N' circuit blocks between
voters, the failure probability is minimized if each circuit block is
linked to a total of N_s/N' circuit blocks, each voter is linked to N_s/N'
circuit blocks, and each voter is only linked to itself. The network
which satisfies these linking conditions is the simple cascade shown in
Fig. II-A-11; this cascade is admittedly artificial since all of the
inputs to a circuit block originate at the same source.

FIG. II-A-11 CASCADE NETWORK WHICH MINIMIZES FAILURE PROBABILITY

The failure probability of the network is easily determined as

$$ Q_R^{(L)} = 3N' \left( q_v + \frac{N_s}{N'}\, q \right)^{2} . \tag{II-23} $$

* It will be assumed that N_s and N' are such that N' divides N_s.

Now consider the upper bound on failure probability which
is realized for a topology for which the number of circuit-block and
voter groups which are linked is maximized. Consider a 3-input,
single-output circuit block whose output is connected to a voter. Distinguish
this circuit block as a first-order block. Since there is a voter on the
output of the first-order block, only self-linkings and linkings by means
of the 3 inputs are possible. Assume that the sources of these three in-
puts, called second-order blocks, are all distinct. In turn, each of
these second-order blocks is a sink for 3 third-order blocks, etc.
(Since in the network as visualized the number of circuit blocks between
voters does not exceed N_s/N', the maximum "order" of a block is N_s/N'.)
It is then seen that the maximum number λ_{αβ} of α-order blocks with which
a β-order block is linked is given by

$$ \lambda_{\alpha\beta} = 3^{[\max(\alpha,\beta) - 1]} . \tag{II-24} $$

The maximum number of γ-order blocks which can be linked
to a γ-order block is

$$ \lambda_{\gamma} = 3^{\gamma - 1} . \tag{II-25} $$

In a network with N' voters we can treat the voters as equivalent, on the
basis of the calculation of linkings, to (N_s/N' + 1)-order blocks. A
portion of the network which exhibits the maximum possible linkings among
the various circuit blocks and voters, for N_s/N' = 2, is shown in
Fig. II-A-12.* It is seen that all 9 voters are linked and also that
each voter is linked to each of the 18 circuit blocks. Each second-order
circuit block is linked to 3 second-order circuit blocks (including
itself), and is also linked to 3 first-order blocks. Each first-order
block is linked to itself, and also to 3 second-order blocks.

* The overall network, which is a cascade of N'/3^{N_s/N'} stages of the
type shown in the figure, contains 3^{N_s/N'} inputs and 3^{N_s/N'} outputs.
FIG. II-A-12 NETWORK WHICH MAXIMIZES FAILURE PROBABILITY
(Figure labels: second-order blocks, first-order blocks.)
The expression for the upper bound on the failure probability,
Q_R^{(U)}, for arbitrary N_s and N', is derived as follows.

Consider, first, the contribution to the failure probability
from the linking of circuit blocks. A block in order i is [from
Eq. (II-24)] linked to a maximum of 3^{i-1} blocks in each of the orders
1, 2, ..., i - 1. This block in order i is also linked to 3^{j-1} blocks
contained in orders for which j satisfies j = i, i + 1, ..., N_s/N'.
Thus the total number of circuit blocks to which each block in order i
is linked is given by

$$ (i-1)\,3^{i-1} + \sum_{j=i}^{N_s/N'} 3^{j-1} \tag{II-26} $$

However we note that, assuming an equal number of blocks of each order
in the network, there are a maximum of N' blocks of order i; and also,
from Eq. (II-20), the contribution to the failure probability from each
linking between circuit blocks is 3q^2. Thus the contribution to the
network failure probability from circuit-block linkings is given by

$$ 3N' q^{2} \sum_{i=1}^{N_s/N'} \left[ (i-1)\,3^{i-1} + \sum_{j=i}^{N_s/N'} 3^{j-1} \right] . \tag{II-27} $$

Each voter is linked to a maximum of 3^{N_s/N'} voters, and
since there are N' voters in the overall network, the contribution to
the failure probability from voter linkings is given by

$$ 3N' q_v^{2}\, 3^{N_s/N'} . \tag{II-28} $$

Similarly, each of the N' voters is linked to (N_s/N') 3^{N_s/N'}
circuit blocks (and of course each of the N_s circuit blocks is linked to
3^{N_s/N'} voters), thus indicating that the contribution to the failure
probability from voter-circuit block linkings is given by

$$ 3 \times 2 N_s\, q\, q_v\, 3^{N_s/N'} . \tag{II-29} $$
Simplifying Eq. (II-27) and summing the resultant expression
with Eqs. (II-28) and (II-29) we find that Q_R^{(U)} becomes

$$ Q_R^{(U)} = 3N' q^{2} \left[ \frac{N_s}{N'}\, 3^{N_s/N'} + 1 - 3^{N_s/N'} \right] + 3N' q_v^{2}\, 3^{N_s/N'} + 6 N_s\, q\, q_v\, 3^{N_s/N'} . \tag{II-30} $$

It is of interest to compare the lower and upper bounds as
specified by Eqs. (II-23) and (II-30). We can visualize the two networks
which satisfy the lower and upper bounds as each containing N_s
circuit-block groups and N' and N'/3^{N_s/N'} stages respectively. A
meaningful comparison is achieved by tabulating as a function of N_s/N'
(the number of circuit blocks between voters) the quantities

$$ \frac{Q_R^{(L)}}{3N' q^{2}} \qquad \text{and} \qquad \frac{Q_R^{(U)}}{3N' q^{2}} . $$

These two quantities are a measure of the failure probability per voter
group divided by q^2 (assuming q ≈ q_v). These values are shown in
Table II-A-3; it is not meaningful, in the context of the two networks,
to show smooth plots connecting these points.
                      Table II-A-3
            MEASURE OF LOWER AND UPPER BOUNDS
                 ON FAILURE PROBABILITY

     N_s/N'     Q_R^(L)/3N'q^2     Q_R^(U)/3N'q^2
       1              4                  10
       2              9                  55
       3             16                 244
       4             25                 973
       5             36                3646
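The normalized bounds can be recomputed from Eqs. (II-23) and (II-27) through (II-29). The sketch below is illustrative: it takes the inner sum of Eq. (II-27) from j = i, sets q = q_v, and uses hypothetical function names; it also checks the summation against the closed form appearing in Eq. (II-30).

```python
def block_link_sum(m):
    """Double sum of Eq. (II-27) for m = Ns/N' blocks between voters
    (inner sum taken from j = i, as assumed here)."""
    return sum((i - 1) * 3**(i - 1)
               + sum(3**(j - 1) for j in range(i, m + 1))
               for i in range(1, m + 1))

def lower_bound(n_s, n_prime, q, qv):
    """Eq. (II-23): 3N' (qv + (Ns/N') q)^2."""
    return 3 * n_prime * (qv + (n_s / n_prime) * q)**2

def upper_bound(n_s, n_prime, q, qv):
    """Sum of Eqs. (II-27), (II-28), and (II-29)."""
    m = n_s // n_prime
    return (3 * n_prime * q**2 * block_link_sum(m)
            + 3 * n_prime * qv**2 * 3**m
            + 6 * n_s * q * qv * 3**m)

def normalized_bounds(m, q=1e-4, n_prime=10):
    """Both bounds divided by 3 N' q^2, for q = qv, rounded to integers."""
    n_s = m * n_prime
    scale = 3 * n_prime * q**2
    return (round(lower_bound(n_s, n_prime, q, q) / scale),
            round(upper_bound(n_s, n_prime, q, q) / scale))

if __name__ == "__main__":
    for m in range(1, 6):
        low, high = normalized_bounds(m)
        print(f"Ns/N' = {m}: lower = {low}, upper = {high}")
```

For q = q_v the normalized lower bound is (N_s/N' + 1)^2 and the normalized upper bound is (N_s/N') 3^{N_s/N' + 1} + 1, so the gap between the two grows roughly geometrically with the number of circuit blocks between voters.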
It should be noted that the upper bound is somewhat
pessimistic since it is unlikely that circuit connections of the type
shown in Fig. II-A-12 would occur in practical networks. However, these
studies do indicate that care should be exercised in applying the reli-
ability measures obtained from consideration of the simple cascade model,
since these results tend to be optimistic. If an accurate measure of
the performance of a network is required, then it appears that a complete
analysis must be performed. It is recommended that consideration be given
to the development of computer-aided techniques for the rapid analysis of
arbitrary restored networks; for systems with fairly high initial reli-
ability the simple analysis technique based on linked elements and
described in this report can be applied.
5) Techniques for the Realization of Multiple-Output
Networks with Voter Redundancy for Fault Masking
In this section we briefly discuss a problem which apparently
has not been previously considered. It was noted previously [Sec. II-A-
1-b-1)] that the output functions in a multiple-output network must be
realized independently in order to apply parity-check codes for the
parallel checking of the outputs. This independent realization condition
is required so that the number of outputs in error does not exceed the
number of failed components in the network.
A related question arises in the consideration of multiple-
output networks to which are applied the conventional restoration tech-
niques, discussed in this report. The question is whether, if the outputs
of a multiple network are each protected by voting-type redundancy (where
the voters are placed only at the network inputs) the output functions
should each be realized independently so as to minimize the failure
probability.
* The results of the cascade can be applied if the density of voters is
high--say one voter for every 2 or 3 circuit blocks--in which case the
difference between the upper and lower bounds is probably less than the
inaccuracies attendant to the determination of component reliabilities.
At present, we do not know the answer to this question
for all function types, although we conjecture that the minimum failure
probability is achieved if the function is realized minimally, regardless
of the dependency of the outputs. An example illustrating this conjecture
for triple replication is presented below, wherein the failure probabili-
ties of two realizations of a serial decoding circuit are compared.
A serial decoder for $2^m$ signal lines can be realized as
an m-level tree network, of the type shown in Fig. II-A-13, composed of
simple, single-input, double-output sequential decision blocks. It is
assumed that there are 3 replicas of the entire network, and a single
perfect voter* is employed for each of the $2^m$ outputs.
FIG. II-A-13 TREE REALIZATION OF SERIAL DECODER (serial input; levels 1, 2, ..., m; $2^m$ outputs)
* In order to simplify the analysis we have assumed, with no loss of
generality, that the voters are perfectly reliable.
Another implementation of the decoder is shown in
Fig. II-A-14, in which each of the $2^m$ outputs is realized independently
by a cascade of single-input, single-output sequential decision blocks.
We would expect that the sequential blocks of the tree realization and
the cascade realization would be of comparable complexity, indicating
that equal failure probabilities can be assigned to the blocks of both
networks. Also, in the discussion to follow, we will employ the sim-
plified analytical techniques based upon the assumption that the occurrence
of more than two failures can be ignored.
FIG. II-A-14 CASCADE REALIZATION OF SERIAL DECODER (levels 1, 2, ..., m; $2^m$ outputs)
We easily determine that the failure probability $Q_R^{(C)}$
of the independent cascade connection is given by
$$Q_R^{(C)} = 3 \cdot 2^m (mq)^2 = 3 \cdot 2^m m^2 q^2 . \tag{II-31}$$
The analysis for the tree is as follows. The block in the
first level is linked to all blocks in the network. Each block in level 2
is linked to one-half of the blocks in each of levels 2, 3, ..., m.
Similarly each of the $2^{\alpha-1}$ blocks of level $\alpha$ is linked to $2^{\beta-\alpha}$ blocks in
each of the levels $\beta = \alpha, \alpha + 1, \ldots, m$, and each is also linked to
1 block in each of the levels $1, 2, \ldots, \alpha - 1$.
Thus the expression for the failure probability of the
tree, $Q_R^{(T)}$, can be reduced by application of Eq. (II-20) to
$$Q_R^{(T)} = 3q^2 \sum_{\alpha=1}^{m} 2^{\alpha-1} \left[ (\alpha - 1) + \sum_{\beta=\alpha}^{m} 2^{\beta-\alpha} \right] . \tag{II-32}$$
It can be verified that the above equation reduces to
$$Q_R^{(T)} = 3q^2 \left[ 3 + 4m \, 2^{m-1} - 3 \cdot 2^m \right] . \tag{II-33}$$
For the purpose of comparison the values of $Q_R^{(C)}$ and
$Q_R^{(T)}$ for a few values of m are shown in Table II-A-4.
Table II-A-4
COMPARISON OF FAILURE PROBABILITIES
FOR CASCADE AND TREE REALIZATIONS

    m         $Q_R^{(C)}$              $Q_R^{(T)}$
    2         $48q^2$                  $21q^2$
    3         $216q^2$                 $81q^2$
    4         $768q^2$                 $249q^2$
  m >> 1      $6m^2 2^{m-1} q^2$       $6m 2^m q^2$

It is seen that for large values of m the ratio $Q_R^{(C)}/Q_R^{(T)}$ is of the order
of m/2; but for small values of m (say $m \le 4$) the difference between the
reliability measures is probably negligible.
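The link counts described above can be checked numerically. The following is a minimal sketch, assuming that Eq. (II-20) reduces, for triple replication with perfect voters, to $3q^2$ times the sum over all blocks of the number of blocks linked to each (a block being counted as linked to itself); this reading is an assumption, although it reproduces both Eq. (II-31) and Eq. (II-33). The function names are illustrative only.

```python
def cascade_link_sum(m):
    """Sum over all blocks of the number of linked blocks (cascade realization)."""
    blocks = m * 2**m      # 2^m independent cascades of m blocks each
    return blocks * m      # each block is linked to the m blocks of its own cascade


def tree_link_sum(m):
    """Same sum for the tree, using the level-by-level link counts of the text."""
    total = 0
    for alpha in range(1, m + 1):
        ancestors = alpha - 1  # one block in each of levels 1 .. alpha-1
        subtree = sum(2**(beta - alpha) for beta in range(alpha, m + 1))
        total += 2**(alpha - 1) * (ancestors + subtree)
    return total


def q_cascade(m):
    """Coefficient of q^2 in Eq. (II-31)."""
    return 3 * cascade_link_sum(m)


def q_tree(m):
    """Coefficient of q^2 in Eq. (II-32)."""
    return 3 * tree_link_sum(m)


def q_tree_closed(m):
    """Coefficient of q^2 in the closed form, Eq. (II-33)."""
    return 3 * (3 + 4 * m * 2**(m - 1) - 3 * 2**m)


for m in (2, 3, 4):
    print(m, q_cascade(m), q_tree(m), q_tree_closed(m))
```

For m = 2, 3, 4 the coefficients agree with the entries of Table II-A-4, and the double sum agrees with the closed form.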
At first glance these results appear somewhat paradoxical
since, in the application of parity-check codes for the masking of faults
in multiple-output networks, a basic requirement is that all outputs must
be realized independently. However, this basic independence tenet is not
in fact violated. The code being applied here to each output is equivalent
to the simple single-error correcting code which consists of one informa-
tion digit and two independently generated check digits for each output.
This code as applied to the serial decoder in question (either realization)
will mask single failures, and it is clear that the occurrence of any
single fault cannot result in two errors on the three signal lines
associated with any output.
6) Conclusions and Future Problems for Study
The ultimate aim of this study, as stated in Sec. II-A-
2-a-1), was to indicate to a designer the expected improvement in reli-
ability that could be anticipated from the application of the restoration
technique, and also to provide algorithms for the optimum application of
the technique. At this point in the research the aim has not been entirely
achieved, but with the tools outlined in this report and in the many papers
related to the subject, it is possible for a potential user of the res-
toration method to determine its application in most networks of interest
by following systematic (though possibly lengthy) analytical procedures.
The major results of the study are as follows.
(a) Techniques are presented for the analysis of
arbitrary restored networks, where the appli-
cation of the analysis is subject to the
constraint that a fairly simple failure model
is pertinent. The analysis is facilitated if
the original nonredundant network is reasonably
reliable, although this is not a necessary
requirement.
(b) A method is presented for determining upper
and lower bounds on the failure probability of
restored networks on the basis of the assumptions
that the simple failure model is pertinent, that
the network is composed of interconnections of a
single primitive circuit block, and that the
voters are placed optimally throughout the
network.
(c) It is indicated that when the restoration
technique is applied to multiple-output net-
works, a lower value of failure probability
is probably achieved if each replica of the
network is realized in a minimal manner--even
though the outputs might be quite dependent--
than if the network is realized with independent
outputs.
A number of outstanding problems still remain, primarily
related to the need for techniques by which restored networks which
globally minimize the failure probability can be synthesized, subject to
a maximum available redundancy. Some of the specific areas for recom-
mended future research, related to this problem, are as follows:
(a) The development of techniques for determining
for arbitrary networks the "form" of the res-
toration, so as to minimize failure probability.
For example, if an overall redundancy of 5.5 is
available, the question is whether the replica-
tion order should be 3 with a high density* of
voters, or 5 with a low density of voters. In
addition the expected reliability improvement
realized by the application of several other
forms of restoration should be quantitatively
evaluated. These schemes include:
• Variable redundancy; i.e., replication
order not constant throughout the network.
• Techniques wherein only a subset of the
replicated circuit blocks are connected
as inputs to each of the replicated
voters--previous research in this area 232, 243
has been concerned with qualitative de-
scriptions.
• Generalized interwoven logic 243--it is
felt that the analysis will be extremely
difficult, and at present the technique
appears to be quite costly, especially
when multiple failure correction is re-
quired.
(b) The development of easily applied techniques for
determining the optimum placement of voters--
initially for networks where the replication
order is constant and each voter weighs all
replicated signals equally. The placement
techniques which have been described relate to
the evaluation of failure probability for each
set of voter positions either by simulation 75
(requiring approximately two hours of computer
time for a network of 300 gates) or by systematic
analysis. TM It is felt that a much simpler
technique can be developed, in particular when
only double failures are considered in the
analysis. We have conjectured that the following
dynamic programming systematic procedure will
always converge to the optimum placement.
• Consider an arbitrary initial placement
of voters, and evaluate the failure prob-
ability. Move one voter in turn to each
available position and for each position
calculate the failure probability. If
the failure probability with the voter in
any new position is not lower than the
initial failure probability, return the
voter in question to the original posi-
tion; otherwise place the voter in that
position which provided the minimum
failure probability. Repeat this opera-
tion, perturbing the position of each
voter separately, until the failure
probability is not further reduced. If
this placement technique proves to be
suboptimal, then it is recommended that
consideration be given to determining
for what network topologies a solution
close to optimal is obtained. It is
also of interest to determine, in parti-
cular for large networks, the probability
that a random voter placement yields a
solution close to optimal.
* The results of Knox-Seith 154 which provided a solution to this prob-
lem for the cascade model, are discussed in Sec. II-A-2-a-2).
(c) The development of computer-aided techniques for
the analysis of given restored networks, and also
for the synthesis of optimum restored networks,
possibly on the basis of the conjectured optimum
dynamic programming approach described above.
It appears that some LISP programming techniques
designed for locating loops in a linear graph
might be applicable for specifying the linked
elements in a restored network.
(d) All of the analytical procedures developed for
restored networks have been based upon a simple
failure model* where it is assumed that failure
of a component will not affect the state of any
signals which appear as inputs to the failed
component. This model is only consistent with
networks in which there is some degree of iso-
lation between gates. A more complex model
based upon considering component failures as
producing errors in both outputs and inputs,
which has been considered in simulation studies, 76
can be examined by appropriately modifying the
linking definition of Sec. II-A-2-a-3).
* See Sec. II-A-2-a-1) for a complete discussion of the assumptions
attendant to the model.
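The voter-perturbation procedure conjectured under item (b) can be sketched as follows. This is only an illustration under stated assumptions: the network is taken to be a simple cascade whose last stage is closed by a fixed voter, and the failure probability is taken to be the double-failure estimate $3\sum_i (n_i q + q_v)^2$ over voter-terminated stages of $n_i$ blocks. The function names and this objective are hypothetical stand-ins for whatever analysis is actually used; exact rational arithmetic is used so that ties between placements are compared exactly.

```python
from fractions import Fraction


def cascade_failure(movable, n_blocks, q, qv):
    """Assumed estimate: 3 * sum((n_i*q + qv)**2) over voter-terminated stages."""
    cuts = sorted(movable) + [n_blocks]  # a fixed voter closes the last stage
    total, prev = 0, 0
    for c in cuts:
        total += 3 * ((c - prev) * q + qv) ** 2
        prev = c
    return total


def perturb_voters(n_blocks, start, q, qv):
    """Move one voter at a time to its best free position until nothing improves."""
    voters = list(start)
    best = cascade_failure(voters, n_blocks, q, qv)
    improved = True
    while improved:
        improved = False
        for i in range(len(voters)):
            original = voters[i]
            taken = set(voters) - {original}
            trial_best, trial_pos = best, original
            for p in range(1, n_blocks):
                if p in taken or p == original:
                    continue
                voters[i] = p
                f = cascade_failure(voters, n_blocks, q, qv)
                if f < trial_best:  # accept only a strictly lower probability
                    trial_best, trial_pos = f, p
            voters[i] = trial_pos  # best new position, or back to the original
            if trial_pos != original:
                best, improved = trial_best, True
    return sorted(voters), best


q = Fraction(1, 1000)
positions, q_min = perturb_voters(12, [1, 2, 3], q, q)
print(positions, float(q_min))
```

For this toy cascade of 12 blocks the procedure moves the three movable voters from positions 1, 2, 3 to the evenly spaced positions 3, 6, 9. Whether the procedure converges to the optimum for arbitrary restored networks remains the open question stated in the text.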
In describing the performance of complex networks, we
have distinguished the network as either operating (correctly) or not
operating, and then described means whereby the probability of the net-
work not operating is minimized. However, the criterion of minimum
failure probability is somewhat inconsistent with the tenet that a com-
puter should function, although with possible loss in computation capa-
bility, as failures occur. (This point of view has been considered
briefly before. 15) Hence, for some multiple-output networks it is
meaningful to assign a cost metric to the occurrence of each failure,
and then allocate the redundancy so as to minimize the average "loss"
in capability. In addition, it is meaningful to assign a probability
measure to the occurrence of some members of the set of inputs. The
inclusion of the cost function might alter the conclusion of Sec. II-A-
2-a-5) relating to the optimum realization of multiple-output functions.
b. Techniques for the Combination of Fault Masking
and Replacement
1) Introduction
In this section logical designs will be presented for
three schemes of replacement-type redundancy. All of the schemes are
autonomous in the diagnosis and repair of faults, and they also provide
for a certain degree of fault masking during replacement. The autonomy
and masking are provided by the employment of various forms of voting,
so that the schemes might actually be considered as hybrids of voting
and replacement redundancy.
An idealized model that encompasses all three schemes
has been described and analyzed by Kruus. 171 It is hoped that the
presentation of practical designs will enable designers of future complex
systems to evaluate the proper system level for the application of these
schemes.
The three schemes to be described may be characterized
as follows:
(1) Adaptive voting (after Pierce 240) (Fig. II-A-15)--
A basic functional network is replicated, and
the outputs of the replicated units are combined
in a variable-threshold network to provide a
system output; if a unit dissents from the system
output, it is disconnected, and the threshold is
diminished so as to make the system output equal
to the majority of the outputs of the remaining
units.
FIG. II-A-15 ADAPTIVE-VOTING SCHEME (using threshold logic)
(2) Switching-over-voting (Fig. II-A-16)--Replicated
units are grouped in subsystems and are combined
to provide single subsystem outputs; externally
the outputs are selected by a stepping circuit
that advances to a new subsystem when the con-
nected subsystem fails or when it is undesirably
close to failure; internally, subsystems employ
adaptive voting redundancy, as in scheme (1),
for fault masking and for indication of the
degree of closeness to failure.
FIG. II-A-16 SWITCHING-OVER-VOTING SCHEME
(3) Voting-over-switching (Fig. II-A-17)--A fixed
number of units is selected from a store of
replicas, and their outputs are combined by
majority voting; as a unit dissents from the
system output, it is replaced by a unit in the
store. Several replacement algorithms are
feasible; e.g., the input whose unit was faulty
may be distinguished, and a replacement unit
directed to it, or all inputs may have fresh
assignments, by selecting valid units in a
predetermined order. The latter strategy ap-
pears to be better, because it is less suscep-
tible to the propagation of faulty decisions
among the subsystems.
The major advantage of these schemes over passive fault
masking is the increased tolerable number of faulty subsystems--approxi-
mately N-2 instead of N/2, for N-order redundancy. Another advantage,
of significance for spaceborne applications, is the economy of power
consumption possible in schemes (2) and (3).
FIG. II-A-17 VOTING-OVER-SWITCHING SCHEME
There are several disadvantages. First, to varying degrees
among the schemes, certain multiple failures occurring between replacements
may cause the whole system to collapse. For example, in scheme (3), if a
majority of the presently connected units are faulty, all of the units in
store may be invalidated; in scheme (2), a subsystem might simply become
stuck in a fixed, erroneous state if a majority of units in it become
stuck. A system might be externally programmed so as to recover from such
conditions, but such control adds to the system cost. A second disadvan-
tage is that the number of components--hence the unreliability--of the
control and switching is greater than for passive fault masking. Thus
the minimum size of functional unit to which the schemes may be advan-
tageously applied is greater. A third disadvantage is the difficulty of
design for proper response to noise. Thus, if a unit has a transient
fault, it may be disconnected. It would be desirable to reconnect these
units, but it would be hazardous to do so without testing each one, since
if all were permanently faulty the faulty units might "outvote" the good
units. The solution of this problem requires either a built-in sluggish-
ness of disconnection, a capability for external diagnosis, or provision
for autonomous verification of individual units prior to reconnection.
The latter is feasible if at least two good units remain connected to
serve for testing a candidate unit. The importance of these factors must
be evaluated in the context of a particular system.
The idea of adaptive voting was suggested by Pierce. 240
He and others 155 have proposed implementing the scheme by the use of
special elements, such as variable impedances with memory, or fuses. The
technology for realizing the special impedance elements has not developed
sufficiently to realize devices of adequate reliability for the missions
of interest, and the use of fuses does not allow the reconnection of re-
covered logic units; hence in the discussion of the several schemes to be
described, the use of conventional elements will be assumed.
2) Techniques for the Realization of the
Adaptive-Voting Scheme
In this section, a number of techniques for realizing the
Adaptive Voting scheme will be discussed. The first approach assumes the
use of a 2N-input threshold logic unit, for order-N redundancy. The
second approach assumes that only AND, OR and NOT elements are employed,
and a number of alternative schemes employing such elements are described.
Although the number of such elements is far in excess of the number of
threshold logic weights employed in the first approach, the low cost and
high reliability of such elements realized in microelectronic arrays make
the approach worth consideration in future design.
Scheme using a linear-input threshold logic element:
A fixed-weight realization of Pierce's adaptive logic scheme is shown
in Fig. II-A-15. Associated with each input signal, $x_i$, are a gate
controlled by a "status" flip-flop, and two weights: +2 for the gated
input and -1 for the ON state of the flip-flop. When the status flip-
flop is ON, the net signal contribution to the threshold logic unit is
+1 for $x_i = 1$, or -1 for $x_i = 0$. When the flip-flop is off, the net
contribution is 0. The threshold logic unit is set so that the output is
1 if the sum of all contributions is 1 or more, and 0 otherwise.
The output of the threshold logic unit is fed back to each
channel, and upon any disagreement between the output and the input of a
channel (indicated by a 1 at the output of an Exclusive-OR gate) the
status flip-flop is reset, essentially disconnecting the channel.
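The behavior of this realization can be illustrated by a short behavioral sketch. It models only the logic described above (weights +2 and -1, a threshold of 1, and reset of the status flip-flop on disagreement); the class and method names are hypothetical, and timing, the CLEAR input and circuit-level effects are not modeled.

```python
class AdaptiveVoter:
    """Behavioral model of the fixed-weight adaptive-voting scheme."""

    def __init__(self, n):
        self.status = [1] * n  # status flip-flops, all ON initially

    def step(self, x):
        # Weight +2 on the gated input and -1 on the ON flip-flop, so an ON
        # channel contributes +1 for x_i = 1 and -1 for x_i = 0; OFF gives 0.
        total = sum(2 * s * xi - s for s, xi in zip(self.status, x))
        out = 1 if total >= 1 else 0
        # A channel whose input disagrees with the output is disconnected.
        self.status = [s if xi == out else 0 for s, xi in zip(self.status, x)]
        return out


voter = AdaptiveVoter(5)
print(voter.step([1, 1, 1, 0, 1]), voter.status)  # channel 4 dissents
print(voter.step([0, 0, 1, 1, 0]), voter.status)  # channel 3 dissents
print(voter.step([1, 0, 1, 1, 1]), voter.status)  # channel 2 dissents
```

After three successive single faults two channels remain connected and the output is still correct. Note that in this model a tie between an even number of valid channels gives a sum of 0 and hence an output of 0.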
The disadvantage of this realization is that the reliability
of threshold logic circuits with a range of summed input variables of ten
or more is not very high at present. In the following schemes, only binary
switching elements will be employed.
Scheme using nonlinear elements only: The scheme illus-
trated in Fig. II-A-18 performs the same overall function as the previous
one, but internally it uses binary signals only. The flip-flops perform
the same functions of gating according to status, and their control is
the same. Instead of combining the valid input signals so as to generate
a single signal with a multivalued range about 0, a binary vector is
generated. The elements of the vector, $T_2, T_3, T_4, \ldots$, are the monotonic
symmetric functions of the inputs; that is, the elements are 1 if at least
2, at least 3, at least 4, ..., respectively, of the N inputs are 1. The
proper one of these functions is then selected to be the output, depending
upon the number of input channels that are valid. Thus, $T_2$ is selected if
the number of valid channels is two or three, $T_3$ if the number is four or
five, and in general $T_{j+1}$ is selected if the number of valid input channels
is 2j or 2j+1, j = 1, ..., m; N = 2m + 1.
Thus the combining network may be decomposed into three
parts, as shown in the figure:
(I) A network that realizes the monotonic symmetric
functions
$T_2, T_3, \ldots, T_{m+1}$
(II) A network that realizes the symmetric functions
$S_{2,3}, S_{4,5}, \ldots, S_{2m,2m+1}$
(III) A network that combines the S and T variables
according to the function
$$D = S_{2,3}T_2 + S_{4,5}T_3 + S_{6,7}T_4 + \cdots + S_{2m,2m+1}T_{m+1} .$$
FIG. II-A-18 ADAPTIVE-VOTING SCHEME (using digital elements)
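The claim that the all-digital scheme performs the same overall function as the threshold-logic scheme can be checked exhaustively for small N. The sketch below assumes that the T functions are taken over the gated inputs (input AND status) and that $S_{2j,2j+1}$ is 1 when the number of ON status flip-flops is 2j or 2j+1; the function names are illustrative. The comparison is restricted to two or more valid channels, since D as written contains no term for fewer.

```python
from itertools import product


def threshold(j, bits):
    """Monotonic symmetric function T_j: 1 if at least j of the bits are 1."""
    return int(sum(bits) >= j)


def digital_output(status, x):
    """D = S_{2,3} T_2 + S_{4,5} T_3 + ... for N = 2m + 1 channels."""
    gated = [s & xi for s, xi in zip(status, x)]
    valid = sum(status)
    m = (len(x) - 1) // 2
    d = 0
    for j in range(1, m + 1):
        s_sel = int(valid in (2 * j, 2 * j + 1))  # S_{2j,2j+1}
        d |= s_sel & threshold(j + 1, gated)      # selects T_{j+1}
    return d


def threshold_logic_output(status, x):
    """Output of the linear-input scheme: sum of +1/-1 contributions >= 1."""
    total = sum(2 * s * xi - s for s, xi in zip(status, x))
    return int(total >= 1)


def schemes_agree(n):
    """Compare the two schemes over all status and input patterns."""
    for status in product((0, 1), repeat=n):
        if sum(status) < 2:
            continue
        for x in product((0, 1), repeat=n):
            if digital_output(status, x) != threshold_logic_output(status, x):
                return False
    return True


print(schemes_agree(3), schemes_agree(5), schemes_agree(7))
```

With only one valid channel the threshold-logic scheme still follows that channel, whereas D is 0; whether this case matters depends on how the system treats the last remaining unit.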
The symmetric functions $S_{2,3}$, etc., may be realized very
inexpensively from monotonic symmetric (threshold) functions; hence the
combining network may be obtained as in Fig. II-A-19, using two threshold-
function networks, one slightly augmented, and a simple AND-OR network.
A number of approaches to the economical design of
multiple-output threshold networks, employing simple nonlinear gating
elements, are described in Sec. II-A-2-c.
FIG. II-A-19 COMBINING NETWORK FOR DIGITAL-ADAPTIVE SCHEME
3) Description of the Switching-over-Voting Scheme
This scheme is a simple extension of the adaptive-voting
scheme; thus, each subsystem is adaptive, with the additional feature
that a "warning" indication is given when the number of valid functional
units is such that one more failure could not be masked.
The external control of subsystem replacement is quite
simple: selection of a subsystem is determined by the state of a counter
(shown in Fig. II-A-20) which steps upon receipt of a "warning" signal.
The designs of the component data and control subsystem
are quite straightforward, and will not be described in further detail.
FIG. II-A-20 LOGICAL STRUCTURE FOR SWITCHING-OVER-VOTING SCHEME
4) Description of the Voting-over-Switching Scheme
The voting-over-switching scheme is illustrated in
Fig. II-A-17 for the case of three functional units taken at a given
time. The switching and control functions are the costliest of the three
schemes described, but the potential saving in power consumption is
greatest.
In this system, the status information for each functional
unit must have four values, indicating connection to one of the three
channels (conveying data to the voter) or to none. The sequencer must
implement one of the several possible strategies referred to in the
introduction to this section.
The design of the signals that identify which channels
dissent from the majority is straightforward, at least for low-order
voting. For example, a dissent variable $d_i$ may be defined, with the
value 1 indicating that channel i dissents; then
$$d_i = x_i \oplus \mathrm{Majority}(x_1, \ldots, x_N) ,$$
where $x_i$ is the $i$th input to the voter. Greater economy is no doubt
possible.
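As a small illustration of the dissent variables, the following sketch computes them for one set of voter inputs. The function names are hypothetical and an odd number of channels is assumed.

```python
def majority(bits):
    """Majority of an odd number of binary inputs."""
    return int(2 * sum(bits) > len(bits))


def dissent(bits):
    """d_i = x_i XOR Majority(x_1, ..., x_N) for every channel i."""
    m = majority(bits)
    return [b ^ m for b in bits]


print(dissent([1, 0, 1]))
```

Here the second channel alone differs from the majority, so only its dissent variable is 1.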
5) Comparisons and Conclusions
Some reasonable criteria for comparison of the three
schemes described are
(1) The number of faulty functional units that may be
tolerated
(2) The minimum power requirement
(3) The number of components required for the realization
of decision and switching functions
(4) The number of simultaneous faults that cannot be
tolerated.
The performance of the schemes with respect to these
criteria is summarized in the following table, in which a is the number
of units voting together in systems 2 and 3, b = N/a in system 2, and
power cost is given assuming unit power in a functional unit.

System                   Tolerable Number     Minimum      Cost of   Cost of Switching   Intolerable Number of
                         of Unit Failures     Power Cost   Voting    and Control         Simultaneous Faults
Adaptive Voting          N - 1                N            high      low                 (N + 1)/2
Switching-over-Voting    N - (2b + a + 1)     a            low       low                 (a + 1)/2
Voting-over-Switching    N - 1                a            low       high                (a + 1)/2

Practical designs have been described for realization of
several schemes combining voting and replacement redundancy methods.
The decision and switching logic is more costly than the logic employed
in simple fault masking, but technological developments may reduce these
costs to an acceptable level.
The schemes are advantageous in the number of tolerable
unit failures and in the possible reduction in power consumption, but
they are somewhat more susceptible to collapse under simultaneous failures.
It is important to note that the scheme has merit even if
it is not operated autonomously; that is, if the switching of a channel
is controlled by an external computer rather than locally.
Further development of this approach should consider the
application of redundancy to protect the decision and switching circuits,
and the development of schemes for recovery from multiple simultaneous
permanent or transient failures.
c. Voting Networks
1) Introduction
Although there are many published analyses of voting-type
redundancy of arbitrary order of replication, there are surprisingly few
published logical designs for voting networks, aside from the majority-
of-three function. In the first part of this section, several designs
for majorities of three, five and seven variables will be presented to aid
in the assessment of circuit costs of high-order redundancy schemes and to
serve as starting points for designs appropriate to particular technologies.
The second part of this section is concerned with the
realization of voting-type circuits needed for the all-digital adaptive-
voting schemes of Sec. II-A-2-b. These circuits produce a set of outputs
that are the monotonic symmetric functions of their input variables,
i.e., the functions $T_j$ that are 1 when at least j input variables are 1.
One of these functions is, of course, the majority function, and since
the structures presented are in a canonical form, they provide a simple
means of design for that function for any number of variables. Such a
design, however, will generally be less efficient for that single function
than the designs described in the first part.
Designs appropriate to several kinds of logic gates are
presented, including threshold elements and AND-OR gate combinations. The
AND-OR networks use more gates, but they are more easily produced by
current technology, and considering advances in microminiaturization,
their costs are not prohibitive.
2) Logical Designs for Simple Majority Networks
Use of single high-weight linear-input logic elements:
The majority function is a linearly separable switching function; hence
it may be realized by a single linear-input (or threshold) logic element,
as illustrated in Fig. II-A-21. The notation $M_k$ will be used for the
majority function of k variables. For 2e + 1 variables, the appropriate
input weight for each variable is +1, the appropriate bias is -e, and the
range of the summed inputs is from -e to e + 1. The threshold circuit
must thus be capable of resolving one unit in 2e + 1, and for increasing
values of e the requirements on circuit precision become increasingly
difficult to satisfy in practice (for example, see Coates and Lewis 43).
FIG. II-A-21 LINEAR-INPUT-LOGIC MAJORITY GATE
Furthermore, the circuit technology is not well suited to integration with
non-linear circuits in a single monolithic structure; hence this approach
is probably best used in a system in which all functions are realized by
linear input logic.
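The weights and bias quoted for the single-element realization can be illustrated by a short sketch. It assumes that the element delivers a 1 when the biased sum is 1 or more, which is the convention implied by the stated range of -e to e + 1; the function name is illustrative.

```python
from itertools import product


def linear_majority(bits):
    """Majority of 2e + 1 inputs with unit weights and bias -e."""
    e = (len(bits) - 1) // 2
    s = sum(bits) - e  # summed input, including the bias
    return int(s >= 1), s


# For seven inputs (e = 3) the summed input takes the values -3, ..., 4.
sums = sorted({linear_majority(x)[1] for x in product((0, 1), repeat=7)})
print(sums)
```

The element must distinguish a sum of 0 from a sum of 1 within a total range of 2e + 1 units, which is the precision requirement noted in the text.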
Use of multiple low-weight linear-input logic elements:
Amarel, Cooke, and Winder 7 have described designs for high-order majority-
function networks composed of majority-of-three logic elements. Their
designs for the majority functions of five and seven variables are given
in Fig. II-A-22. These networks permit the use of low-weight, low-
precision, linear-input logic circuits. Thus, they would be expected to
have greater noise immunity and operating range than a single-element
realization. The problem of compatibility in fabrication and operating
levels with other system logic circuits remains.
FIG. II-A-22 MAJORITY-ELEMENT MAJORITY NETWORKS (Amarel, Cooke & Winder): (a) 5-variable net; (b) 7-variable net
Use of AND-OR logic elements: Nonlinear logic elements are
much more widely used than linear elements; hence it is of interest to
investigate the design of majority-function networks composed of such
elements, so that such networks may be physically integrated with the
general logic networks of a system. We will consider here the use of
AND and OR elements.
The majority-of-three function may be represented alge-
braically as
$$M_3 = x_1x_2 + x_1x_3 + x_2x_3 .$$
The well-known network realization is given in Fig. II-A-23(a).* A
realization based on $M_3 = x_1x_2 + x_3(x_1 + x_2)$ employs one less gate input,
at the price of an additional stage of delay.
* In this figure the variables are represented by numbers.
FIG. II-A-23 AND-OR MAJORITY NETWORKS: (a) 3-variable net; (b) 5-variable net; (c) 7-variable net
The majority-of-five function may be represented as
$$M_5 = x_5x_4(x_1 + x_2 + x_3) + x_3x_2(x_1 + x_4 + x_5) + (x_4 + x_5)(x_3 + x_2)x_1 .$$
It is easily verified that the ten combinations of the variables, taken
three at a time, are all represented. By introducing the intermediate
variables
$$a = x_2 + x_3 , \qquad b = x_4 + x_5 ,$$
the function may be expressed as
$$M_5 = x_5x_4(x_1 + a) + x_3x_2(x_1 + b) + x_1ab .$$
A network based on this expression is given in Fig. II-A-23(b). This
realization has the lowest number of gates and the lowest number of gate
inputs known to us.
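The factored expression for the majority-of-five function can be checked against the definition by enumerating all 32 input combinations. The following is a minimal sketch with illustrative function names.

```python
from itertools import product


def m5_factored(x1, x2, x3, x4, x5):
    """M5 = x5 x4 (x1 + a) + x3 x2 (x1 + b) + x1 a b."""
    a = x2 | x3
    b = x4 | x5
    return (x5 & x4 & (x1 | a)) | (x3 & x2 & (x1 | b)) | (x1 & a & b)


def m5_reference(*xs):
    """Majority of five by direct count."""
    return int(sum(xs) >= 3)


mismatches = [xs for xs in product((0, 1), repeat=5)
              if m5_factored(*xs) != m5_reference(*xs)]
print(len(mismatches))
```

No mismatches are found, in agreement with the observation that all ten three-variable products are represented.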
In explaining the design of the $M_7$ function network, it is
convenient to employ as an intermediate function the monotonic symmetric
function $T_k^m$, which has the value 1 when k or more of its m input variables
are 1. Then, grouping the seven variables, we employ the functions
$$A_k = T_k^3(x_1, x_2, x_3), \qquad B_k = T_k^3(x_4, x_5, x_6), \qquad k = 1, 2, 3 ,$$
together with $C_1 = x_7$. Then useful representations of $M_7$ are
(Form 1)
$$M_7 = A_3(B_1 + C_1) + A_2(B_2 + B_1C_1) + A_1(B_3 + B_2C_1) + B_3C_1$$
(Form 2)
$$M_7 = A_3B_1 + A_2B_2 + A_1B_3 + C_1[A_3 + A_2B_1 + A_1B_2 + B_3] .$$
It may be noted that the sum of the subscripts of each product term is
four, signifying that at least four variables must have the value 1 in
order to make that product term true.
A useful set of intermediate variables is
$$y_1 = x_2x_3, \quad y_2 = x_2 + x_3, \quad w_1 = x_4x_5, \quad w_2 = x_4 + x_5 .$$
Then
$$A_3 = x_1y_1, \quad A_2 = (x_1 + y_1)y_2, \quad A_1 = (x_1 + y_1) + y_2 \ \text{or} \ x_1 + y_2$$
and
$$B_3 = x_6w_1, \quad B_2 = (x_6 + w_1)w_2, \quad B_1 = (x_6 + w_1) + w_2 \ \text{or} \ x_6 + w_2 .$$
A network based on Form (2) above, and employing these intermediate
variables, is shown in Fig. II-A-23(c). The five gates marked by an
asterisk may be combined into three gates in an obvious way; they are
shown separated as an aid to following the realization of the terms of
the functional equation.
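Both forms of the majority-of-seven function, built from the intermediate variables given above, can be checked by enumeration. In this sketch the single-variable function $C_1$ is taken to be $x_7$, which is an assumption consistent with the subscript-sum remark; the function names are illustrative.

```python
from itertools import product


def thresholds3(u, v, w):
    """Return (T_1, T_2, T_3) of three variables via the AND/OR pair of v and w."""
    p, s = v & w, v | w
    return u | s, (u | p) & s, u & p


def m7_forms(x1, x2, x3, x4, x5, x6, x7):
    """Evaluate Form 1 and Form 2 of M7 for one input combination."""
    a1, a2, a3 = thresholds3(x1, x2, x3)
    b1, b2, b3 = thresholds3(x6, x4, x5)
    c1 = x7
    form1 = ((a3 & (b1 | c1)) | (a2 & (b2 | (b1 & c1)))
             | (a1 & (b3 | (b2 & c1))) | (b3 & c1))
    form2 = ((a3 & b1) | (a2 & b2) | (a1 & b3)
             | (c1 & (a3 | (a2 & b1) | (a1 & b2) | b3)))
    return form1, form2


ok = all(m7_forms(*xs) == (int(sum(xs) >= 4),) * 2
         for xs in product((0, 1), repeat=7))
print(ok)
```

Both forms agree with a direct count of four or more 1's over all 128 input combinations.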
It may be noted that this network costs half as much as a
direct AND-OR realization of the majority-element network of Fig. II-A-22.
The cost in gates and number of inputs for $M_3$, $M_5$, and $M_7$
may be summarized as follows:

          Number of Gates    Number of Inputs
  M3             4                 8, 9
  M5             8                 20
  M7            18                 44

At least for these cases, the costs approximately double with each increase
in odd-order of redundancy.
3) Canonical Structures for Multiple-Output Voting Networks
Use of majority logic elements: The paper by Amarel, Cooke,
and Winder referred to in the previous section also presents canonical net-
work structures, composed of majority logic elements that realize the com-
plete set of monotonic symmetric functions of the input variables. The
structure is illustrated in Fig. II-A-24, where the notation (a/b) is
equivalent to $T_a^b$ as defined in the previous section.
Their realization is based on the identity
$$T_{a+1}^{b+1}(x_1, x_2, \ldots, x_b, x_{b+1}) = x_{b+1} \, T_a^b(x_1, x_2, \ldots, x_b) + T_{a+1}^b(x_1, x_2, \ldots, x_b) .$$
A logic cell realizing this function could be built using one AND gate
and one OR gate, but since the realization is restricted to majority
gates, a highly redundant logical operation is actually employed, i.e.,
$$T_{a+1}^{b+1} = \mathrm{Majority}(x_{b+1}, T_a^b, T_{a+1}^b) .$$
FIG. II-A-24 MAJORITY-ELEMENT MULTIPLE-OUTPUT VOTING NETWORK
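The recurrence can be sketched as a procedure that adds one input variable at a time, each new output being formed by a three-input majority cell. The boundary constants (a threshold of zero is always met, and more than b of b variables never is) are assumptions made here to close the recursion; the function names are illustrative.

```python
from itertools import product


def majority3(a, b, c):
    """Three-input majority element."""
    return (a & b) | (a & c) | (b & c)


def symmetric_thresholds(xs):
    """Return [T_1, ..., T_b] of the b inputs, adding one variable at a time."""
    t = []  # thresholds of the variables processed so far
    for x in xs:
        prev = [1] + t + [0]  # assumed constants: T_0 = 1, T_{b+1} on b inputs = 0
        # T_{a+1} on b+1 inputs = Majority(x_{b+1}, T_a, T_{a+1}) on b inputs
        t = [majority3(x, prev[a], prev[a + 1]) for a in range(len(t) + 1)]
    return t


print(symmetric_thresholds([1, 0, 1, 1, 0]))
```

Because $T_{a+1}^b$ implies $T_a^b$, the majority cell gives the same value as the AND-OR form of the identity.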
Use of AND and OR gates: One way of using AND-OR gates is
to employ the structure of Fig. II-A-24, in which each cell contains one
AND and one OR gate as described above. This construction requires that
input variable $x_j$ drives j gates. An alternate realization, in which
each input variable drives exactly two gates, is shown for seven variables
in Fig. II-A-25. Extension to more variables is obvious. The validity
of this network is less obvious than that of the previous network, so a
formal proof of its validity will be given.
We wish to prove that the network construction scheme
shown realizes the desired functions for any number of variables. This
will be done inductively by showing that a valid network for n variables
may be augmented to form a valid network for n + 1 variables, and that the
augmented network follows the stated construction scheme.
FIG. II-A-25 AND-OR MULTIPLE-OUTPUT VOTING NETWORK
Let $T^n_m$ be the monotonic symmetric function on n variables;
that is, $T^n_m$ is true if at least m of the variables are true. Let it be
assumed that a network A produces the functions $T^n_n(x_1, \ldots, x_n)$,
$T^n_{n-1}(x_1, \ldots, x_n)$, ..., $T^n_1(x_1, \ldots, x_n)$. We wish to augment the
network A with a network B, with inputs $y_1, y_2, \ldots, y_{n+1}$ and outputs
$U^{n+1}_{n+1} = y_1 y_2 \cdots y_{n+1}$ and $x_n, x_{n-1}, \ldots, x_1$. The $x_i$'s also form
the inputs for the n-variable network A, and are determined as follows:
$x_n = y_{n+1} + (y_n y_{n-1} \cdots y_1)$
$x_{n-1} = y_n + (y_{n-1} y_{n-2} \cdots y_1)$
$\vdots$
$x_2 = y_3 + y_2 y_1$
$x_1 = y_2 + y_1$        (II-34)
Let the outputs of the combined (n + 1)-variable network be
$U^{n+1}_{n+1}, U^{n+1}_n, \ldots, U^{n+1}_1$. We must show that
$U^{n+1}_m = T^{n+1}_m$.
Proof:
(1) By construction of the network B,
    $U^{n+1}_{n+1} = y_{n+1} y_n \cdots y_1 = T^{n+1}_{n+1}$.
(2) Examine $U^{n+1}_m = T^n_m(x_1, \ldots, x_n)$, and suppose m or more
    of the y's are true. Then by Eq. (II-34) it follows that m or
    more of the x's are true; for, in particular, if $y_i = 1$, then
    $x_{i-1} = 1$ also. (In the special case where $y_1 = y_2 = 1$, we
    have m - 1 x's = 1 by the above argument, plus one more for the
    output $x_i = y_{i+1} + y_i \cdots y_1$, where $y_{i+1} = 0$ and all
    $y_j = 1$ for j < i + 1.) Hence $U^{n+1}_m$ is 1 whenever m or more
    of the y's = 1.
(3) To show that $U^{n+1}_m = 0$ whenever m - 1 or fewer y's = 1, let
    at least n - m + 2 $y_i$'s = 0 (corresponding to n - m + 1 $x_i$'s);
    let them be
    $y_{i_1}, y_{i_2}, \ldots, y_{i_{n-m+2}}$
    where
    $i_1 > i_2 > \cdots > i_{n-m+2}$.
    Then at least n - m + 1 $x_i$'s = 0. From Eq. (II-34) they are
    $x_{i_1 - 1} = y_{i_1} + y_{i_1 - 1} y_{i_1 - 2} \cdots y_{i_2} \cdots y_1 = 0$, since $y_{i_1} = y_{i_2} = 0$
    $x_{i_2 - 1} = y_{i_2} + y_{i_2 - 1} y_{i_2 - 2} \cdots y_{i_3} \cdots y_1 = 0$, since $y_{i_2} = y_{i_3} = 0$
    $\vdots$
    $x_{i_{n-m+1} - 1} = y_{i_{n-m+1}} + y_{i_{n-m+1} - 1} \cdots y_{i_{n-m+2}} \cdots y_1 = 0$, since $y_{i_{n-m+1}} = y_{i_{n-m+2}} = 0$.
    But if at least n - m + 1 $x_i$'s = 0, then no more than
    n - (n - m + 1) = m - 1 of them are 1 and $U^{n+1}_m = 0$. Hence
    $U^{n+1}_m = 0$ whenever less than m of the $y_i$'s = 1.
(4) By (2), $U^{n+1}_m = 1$ whenever $T^{n+1}_m = 1$, and by (3)
    $U^{n+1}_m = 0$ whenever $T^{n+1}_m = 0$; hence $U^{n+1}_m = T^{n+1}_m$
    and the inductive step is demonstrated.
(5) Clearly the network construction results in the proper output
    functions for low values of n, say n = 2; hence the desired
    result is demonstrated for all n.
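The inductive construction can also be exercised numerically. The sketch below is an illustration of ours under the assumption that network A is itself built by repeated application of network B; the function names `stage_b` and `voting_outputs` are hypothetical. Each stage forms the outputs of Eq. (II-34) with one OR gate and one AND gate per input line.

```python
from itertools import product

def stage_b(ys):
    """Network B of Eq. (II-34): ys = [y_1, ..., y_{n+1}].

    Returns (u_top, xs), where u_top = y_1 y_2 ... y_{n+1} and
    xs = [x_1, ..., x_n] with x_i = y_{i+1} + (y_i ... y_1).
    """
    xs = []
    prefix = ys[0]                     # running product y_i ... y_1
    for i in range(1, len(ys)):
        xs.append(ys[i] | prefix)      # OR gate forming x_i
        prefix &= ys[i]                # AND gate extending the product
    return prefix, xs

def voting_outputs(ys):
    """Return [T_1, ..., T_n] of the inputs by cascading B stages."""
    ys = list(ys)
    tops = []                          # collects T_n, T_{n-1}, ..., T_2
    while len(ys) > 1:
        top, ys = stage_b(ys)
        tops.append(top)
    tops.append(ys[0])                 # the single remaining line is T_1
    return tops[::-1]

def verify(max_n=7):
    """Exhaustively compare the network with the threshold definition."""
    for n in range(1, max_n + 1):
        for ys in product((0, 1), repeat=n):
            want = [int(sum(ys) >= m) for m in range(1, n + 1)]
            if voting_outputs(ys) != want:
                return False
    return True
```

For example, `voting_outputs((1, 0, 1, 1, 0))` yields `[1, 1, 1, 0, 0]`, i.e., at least one, two, and three inputs are true, but not four or five.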
Use of sequential "diffusion" networks for multiple
threshold detection: Given a set of n binary variables, one way to
realize a threshold function is to sort the 0's and 1's into two strings
and to observe whether or not the string of 1's exceeds a given length.
This behavior may be realized in a special kind of shift register, shown
in Fig. II-A-26(a).
FIG. II-A-26 DIFFUSION-TYPE MULTIPLE-OUTPUT VOTING NETWORK: (a) register, (b) modules i and i + 1, (c) module state diagram (hold and propagation modes)
The register has three modes of operation. In the first
mode it is loaded in parallel by the input vector. In the second mode
the information is allowed to propagate through the stages asynchronously.
After a time determined by the length of the register and by the switching
times of the stages, the register will reach a stable state, and it may
then be interrogated. In the third mode it is cleared by allowing all
strings to shift out serially. Parallel clearing is also possible at
greater expense.
The register is based upon Muller's speed-independent
modules, TM but instead of using a three-state code--0, 1, and ∅ ("null")--
only two states are needed, namely 1 and ∅. Cascades of such stages have
the property that a data symbol (in this case, just the symbol 1) will
propagate by being copied over a ∅ symbol in a forward module unless a
data symbol is resident in the module beyond--in which case, the first
data symbol is held in place. In other words, the ∅ symbol serves to
separate 1 symbols of different origins. This rule results in a forward
diffusion of 1 symbols, until a string of alternating 1 and ∅ symbols is
built up in successive modules. In the course of this diffusion, a given
1 symbol may be momentarily replicated, but all such replications will be
in a contiguous string, bounded by ∅ symbols. In the final resting
configuration, all replicas deriving from a given symbol will coalesce into
a single symbol. Thus, in the resting state, the presence of a 1 at the
2j-th module indicates that there were at least j 1's at the input.
Figure II-A-26(a) is a block diagram of the register, for
the case N = 5. Each stage contains two modules, which are identical
except that one may be loaded by external data. The output taps are
self-explanatory. Clearing is accomplished by allowing the data to
propagate out of the register.
Figure II-A-26(b) shows a pair of modules constituting a
stage. Each module contains a single flip-flop and two or three gates.
Feedback is provided to ensure safe asynchronous operation. The state
diagram for a module in its propagation mode is given in Fig. II-A-26(c).
Some racing may occur in the transition between modes, but
this is easily handled. It may be noted that the cost of this approach
is linear with the number of variables.
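The settling behavior of such a register can be illustrated with a small simulation. The following Python sketch is an abstraction of ours, not the circuit of Fig. II-A-26: the copy and erase rules, the two-modules-per-stage layout, the tap positions (counted from the forward end), and the names `diffuse` and `taps` are all assumptions chosen to mimic the verbal description of the diffusion rule.

```python
from itertools import product

def diffuse(inputs):
    """Settle a two-modules-per-stage register loaded with `inputs`.

    Assumed rules (an abstraction, not the report's circuit):
      copy : a 1 is copied into the next module when that module and
             the one beyond it are both empty, so strings stay separated;
      erase: the rear module of a string of two or more 1's is cleared.
    """
    cells = []
    for bit in inputs:                 # loadable module, then a plain one
        cells += [bit, 0]
    n = len(cells)
    changed = True
    while changed:
        changed = False
        for i in range(n - 1):
            beyond = cells[i + 2] if i + 2 < n else 0
            if cells[i] and not cells[i + 1] and not beyond:
                cells[i + 1] = 1       # forward copy over a null
                changed = True
            rear = i == 0 or not cells[i - 1]
            if cells[i] and cells[i + 1] and rear:
                cells[i] = 0           # replicas coalesce from the rear
                changed = True
    return cells

def taps(inputs):
    """Tap k (k = 1..n) reads the module holding the k-th packed 1."""
    cells = diffuse(inputs)
    last = len(cells) - 1
    return [cells[last - 2 * (k - 1)] for k in range(1, len(inputs) + 1)]
```

In this model the number of separate strings of 1's never changes, so the resting pattern of alternating 1 and null symbols at the forward end encodes the input weight, and the taps read out all thresholds at once.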
Conclusions: A number of designs have been presented for
single and multiple-threshold combinational logic networks, up to order
seven, using AND and OR gates, majority gates, and flip-flops, and for
sequential networks for multiple threshold detection. The latter scheme
permits an exchange between time delay and amount of hardware that may be
advantageous for systems employing high orders of replication. These
designs should facilitate estimates of cost and reliability for fault-
masking systems.
3. Sequential Networks
a. Introduction
The networks discussed in Sec. II-A-2 above are combinational
by virtue of the fact that their outputs are uniquely determined by their
present inputs. That is, their proper operation does not depend in any
essential way on memory, storage or delay. Of course, all physical
devices involve some finite (though small) delay--the distinction between
combinational and sequential networks rests on the point that such delays
are merely incidental to the operation of combinational networks, while
they are an essential part of the behavior of sequential networks. Loosely
speaking, sequential networks--to be discussed in this section--remember
some aspects of their past history (inputs and/or outputs) and make use
of this memory to influence their present (and future) behavior.
Examples of sequential networks used in digital computing sys-
tems are counters, registers, accumulators, sequence generators, sequence
sensors, classifiers, and decoders. Sequential networks are thus capable
of much richer and more varied behavior than are purely combinational nets.
By virtue of this richer behavior, the variety of misbehavior that may
occur in the presence of faults is likewise much richer than with com-
binational nets.
In order to discuss the various possibilities that can arise,
let us consider the standard (Mealy) model for (clocked*) sequential
networks shown in Fig. II-A-27. Here the combinational portion of the
network has been lumped together into one box, labeled "CL," while the
storage functions are embodied in delay units. The vector
$X = (X_1, \ldots, X_m)$ represents external inputs to the net; the vector
$Z = (Z_1, \ldots, Z_p)$ represents its outputs to the external world. The
state vector $S = (S_1, \ldots, S_n)$ of delay-element outputs embodies the
network's memory of past behavior.
* Clocked = synchronous; we shall not discuss asynchronous sequential nets.
FIG. II-A-27 MODEL OF SEQUENTIAL NETWORK
It is seen that the behavior of the net is governed by the
following logical design equations:
$S' = F(X, S)$        (Next-State Equation)
$Z = G(X, S)$         (Output Equation)
where the variables X, Z, S, and S' are all vector Boolean variables.
The variable S' represents the inputs to the delay elements (memory-
excitation function) and it is thus the next state of the network,
assuming that no faults occur. Thus, in terms of the discrete time
variable t we have, as an additional relation,
$S(t + 1) = S'(t)$    (Delay-Element Equation).
If our model had used, say, flip-flops instead of delays as storage
elements, this last equation would of course have to be altered to
describe flip-flop behavior. In other respects an entirely equivalent
model would then result.
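The three equations translate directly into a short simulation loop. The sketch below is illustrative only: `run_mealy` is a hypothetical helper, and the serial binary adder used as the example network (carry bit as the single state variable) is our choice, not one taken from the report.

```python
def run_mealy(F, G, s0, inputs):
    """Clocked Mealy network: Z = G(X, S), S' = F(X, S), S(t+1) = S'(t)."""
    s, outputs = s0, []
    for x in inputs:
        outputs.append(G(x, s))   # output equation
        s = F(x, s)               # next-state equation, then unit delay
    return outputs, s

# Hypothetical example network: a serial binary adder.
def F(x, s):
    a, b = x
    return (a & b) | (s & (a ^ b))   # carry into the next bit time

def G(x, s):
    a, b = x
    return a ^ b ^ s                 # sum bit

# 3 + 1, least-significant bit first: 3 -> 1,1,0 and 1 -> 1,0,0
bits, carry = run_mealy(F, G, 0, [(1, 1), (1, 0), (0, 0)])
```

Here `bits` holds the sum, least-significant bit first, and `carry` is the final state. A fault in G alone would corrupt `bits` without disturbing the state sequence, whereas a fault in F or in the delay would corrupt the state and hence later outputs.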
The three equations given above suggest that it would be con-
venient to partition the sequential network model further, as shown in
Fig. II-A-28. Here, the memory-excitation function and the output
function are separated, conceptually at least, into boxes marked F and G
respectively. It must be recognized, however, that in a given realization
of such a network, there may be a good deal of sharing of components
between these two functions. Thus, in particular, a single component
fault could conceivably result in incorrect outputs from both functions
F and G.
FIG. II-A-28 PARTITIONED MODEL OF SEQUENTIAL NETWORK
b. Classification of Faults
We propose now to classify and analyze the kinds of misbehavior
that can result from faults in various portions of the network. We shall
classify faults according to which of the logic functions they invalidate.
Thus, we have the following types of faults:
(1) Faults affecting only the output function G(X,S) (output-
only faults)
(2) Faults affecting only the storage elements (delay elements)
of the network (delay-element faults)
(3) Faults affecting only the memory-excitation function
F(X,S) (memory-excitation function faults)
(4) Faults affecting both functions G and F (output-plus-
memory-excitation faults)
(5) Faults affecting both the combinational and storage
portions of the net. (These may be combinations of
faults of the other types, or they may be simple faults
involving overall aspects of the network, e.g., faulty
or missing clock pulses, out-of-tolerance supply voltages,
etc.)
Of course, one may group these kinds of faults according to
whether they result in errors in state-transition behavior or output-
signal behavior (or both). Possibly a closer analysis would be worthwhile.
We give next a brief description of some distinguishing features of the
above types of faults, followed by a discussion of applicable fault-
detection techniques.
1) Output-Only Faults
These faults are essentially combinational in nature. The
state-transition behavior of the network is not affected--only the output
signals which communicate this behavior to other portions of the computer
system. Consequently, this problem can be treated entirely in combina-
tional terms, and the methods of Sec. II-A-2 are applicable.
However, the availability of the memory-excitation logic
and its delay-feedback loops can conceivably act as an aid to fault
diagnosis, provided it is known a priori (or can reasonably be assumed) on the
basis of other diagnostic information that there are no state-transition
faults. Even without certainty as to the correct functioning of the
state-transition system, that portion of the network can often be used as
a test input generator for checking the output logic, provided that the
state variables (and the external inputs) can be monitored.
2) Delay-Element Faults
In this case, the memory excitation function F is itself
correctly generated, but the memory units fail to execute the proper
transitions in response to F. It would seem that a highly probable variety
of fault condition in this case is one in which one or more delays (or
flip-flops) are "stuck" at 0 or at 1. Note, however, that in some
technologies the occurrence of transient faults (e.g., occasional failure of
a flip-flop to trigger) is also quite possible. A paper by Rubio 273
considers error-correction techniques for such a situation, where "slow"
flip-flops are used (for reasons of economy).
The distinguishing characteristics of delay-element faults
are that the memory excitation is correctly related to the input X and
present-state variables S, and that the output function is likewise
correct for the stated conditions--yet the required state transitions
are nevertheless improperly executed.
3) Memory-Excitation Faults
Though these are faults occurring in the combinational
logic, they make themselves felt in terms of erroneous state transitions,
much as do delay-element faults. Hence they may be much more difficult
to locate (or even to detect) in some kinds of systems than are
output-logic faults. Also, in all except the simplest kinds of sequential
circuits, memory-excitation faults can lead to much more varied and
complicated kinds of erroneous behavior than can delay-element faults.
However, in general terms, memory-excitation faults and
delay-element faults are alike in that they both lead to state-transition
errors. Much the same techniques are applicable to the detection of either
kind of fault. We will usually be able to discuss these together in the
remainder of this section.
4) Output-Plus-Memory-Excitation Faults
When a fault arises in the logical circuitry common to the
production of the functions F and G (or when several simultaneous faults
are present affecting both F and G), then the diagnostic situation is
complicated by the fact that neither the output Z nor the memory-excitation
function F(X,S) is correct. However, it seems reasonable to expect that
such faults (though difficult to locate) are at least as easy to detect
as faults in F or G alone. This will be the case when independent
checking means are provided on the Z output and on S', for errors that
remain undetected in one case may be detected in the other.
5) Overall-Network Faults
We have not yet discussed faults arising in portions of a
sequential network which are common to (or affect) both the combinational
and delay functions. Examples of such are power-supply or clock-pulse
malfunctions, or shorts occurring between these two portions of the cir-
cuit, perhaps rendering two signals invalid. A good many such faults
will be detectable by essentially nondigital tests: excessive circuit
loading, for example, may result in poor rise time or out-of-tolerance
voltage levels; or they may be detected by direct checking of the clock
source at various points in the network.
It is not intended to minimize either the possibility or
the variety of such fault conditions when we state that this section is
not primarily concerned with them. We merely wish to point out that
other means are available for their detection and diagnosis (see Appen-
dices B and C).
c. Logical-Redundancy Techniques
We turn next to the principal subject of this section, a dis-
cussion of some logical-redundancy techniques for fault detection in
sequential networks. Historically, the first treatment of this problem
was undertaken by Moore. 21s His approach involved the design of a
checking experiment to test a given (presumably nonredundant) sequential
network, rather than the deliberate incorporation of redundancy to facili-
tate checking. Certain assumptions were used by Moore to guarantee the
existence of a finite-length checking experiment. These assumptions,
which involve both the original, unfaulted network and the admissible
faults, are:
• The unfaulted network must be strongly connected--i.e., by
using only signals applied to the external inputs it must
be possible to put the network into an arbitrary state,
regardless of its initial state. (The input sequence that
accomplishes this will in general depend on both initial
and final states.)
• There must be at most a finite number of admissible faults,
and their effects on the state diagram (i.e., on the
functions F and G) must be known or calculable.
Implicit in the second assumption is the requirement that no malfunction
occurs during the course of the experiment.
Under the above assumptions it can be shown$13' los that in
principle at least it is possible to design a checking experiment that
will determine whether the circuit is behaving correctly, or whether some
admissible fault condition has arisen. This experiment also determines
which one of these malfunctions (if any) is present. Thus Moore's test
is not only a fault-detecting experiment, it is also fault-locating.
Unfortunately, the amount of computation involved in designing
a Moore experiment is far beyond the capabilities of even a large
computer, except for quite simple sequential networks, and then only if the
set of admissible faults is small. For example, if all malfunctions
leading to a state diagram with no more than $N = 2^n$ states are to be
treated, and the network has only a single binary input and a single binary
output, then one must consider $(2N)^{2N}$ different state tables. Even if n
is as small as 3, this leads to $2^{64}$ distinct state tables. One cannot even
list all of these on any existing computer.
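The count follows from a simple argument: a table for N states and one binary input has 2N entries, and each entry selects one of N next states and one of two output values, i.e., one of 2N possibilities. A minimal Python sketch of this arithmetic (the function name `state_table_count` is ours) is:

```python
def state_table_count(n):
    """Number of state tables with N = 2**n states, one binary input,
    and one binary output."""
    N = 2 ** n
    entries = 2 * N      # one table entry per (state, input) pair
    choices = 2 * N      # each entry picks a next state and an output bit
    return choices ** entries

# For n = 3: N = 8, so (2N)**(2N) = 16**16 = 2**64.
count_for_three = state_table_count(3)
```

The growth is so rapid that even one additional state variable (n = 4) already gives $32^{32} = 2^{160}$ tables.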
It is possible to make some progress, however, by admitting only
a limited set of the most probable faults. The Moore method when applied
in such cases is still cumbersome, but for reasonably simple machines it
leads to checking experiments with a tractable amount of computation.
One of the difficulties with Moore's approach (and with that of
Poage and McCluskey,247 which is essentially a specialized simplification
of Moore's method) is that it requires too much information. We are
interested in many cases only in detecting a fault condition, without
going into location on the spot. Replacement of the whole unit will lead
to an operable system, and detailed off-line testing can be used later to
locate the source of the trouble if desired. Hennie 127 has developed a
completely different checking procedure which is simply fault-detecting
but which is capable of protecting against large numbers of malfunctions.
This procedure does not require the enumeration of the possible kinds of
faults. Although Hennie's technique does not necessarily lead to the
shortest possible experiments, it does seem to lead in most cases to
reasonably good ones. Thus, it is often possible to design a checking
experiment whose length is proportional to $N^2$ or $N^3$ (where N is the
number of states). However, in the worst cases, the experiment length
may increase exponentially with N.
Hennie's methods are most practical when it can be assumed that
(1) both the correct circuit and the admissible faulted circuit have no
equivalent states (i.e., each has the same number N of states as a
"reduced" machine), and (2) the reduced-state table has a distinguishing
sequence such that the circuit produces N different responses according
to which of the N states it occupied at the start of the experiment.
However, the method is applicable also under slightly more general con-
ditions (see Ref. 127). All of these conditions still require that the
original machine be strongly connected, as in Moore's method.
The testing philosophy and methods introduced by Hennie have
been carried somewhat farther and elaborated by Kime. Is°, isi Kime's
principal contribution seems to have been in the direction of transforming
a given circuit which does not possess a distinguishing sequence into a
circuit which does, but also retains the desired overall sequential
function. This is accomplished by (1) the addition of test points which
make some (but not necessarily all) of the state variables $S_i$ available
as outputs, and (2) the addition of logic and a single input terminal.
It appears at this point in the development of the subject that
the most promising avenues for future research on fault detection for
sequential machines lie in the directions opened by Kime's work, i.e., in
the incorporation of fault-detection features into the design of the
original machines. The more basic work of Moore, Hennie and others is
extremely valuable for its generality and for its indication of the
severe difficulties which beset any really general approach to fault
detection and still more to fault location. Practical solutions to these
problems that are applicable to networks of realistic size and complexity
seem to require a more pragmatic approach.
Indeed, there is a spectrum of regimes whereby fault detection
(or location or masking) can be employed, depending on the extent to which
checking is integrated into "normal" circuit operation. At one extreme
of this spectrum lies the philosophy implicit in Moore's and related
studies, wherein checking is treated as an aspect of the device's opera-
tion entirely separate from its usual functioning. Here the network is
subjected to a distinct sequential experiment whose purpose it is to
determine whether the internal structure is functioning correctly. Since
the length of such an experiment may run to hundreds or even thousands of
input symbols, it is clear that the execution of the experiment may require
appreciable idle periods. The experiment can be "time-shared" with normal
use only in a gross sense. We might call this "off-line" or "intermittent"
checking.
At the other extreme of this spectrum lie techniques--to be
described below--which involve continuous checking (on-line) of the device
on a bit-by-bit basis. Here the checking operations are completely inte-
grated with normal circuit functioning, but at a cost in added circuitry.
On the other hand, intermittent checking per se involves a cost in opera-
ting time (and power consumption) though it may be cheaper in terms of
hardware.
Between these two extremes lie varieties of checking which
interleave test operations with steps in the normal functioning of the
circuit, but at different time scales, not at every clock pulse. For
example, a counter can easily be subjected to checking at every kth input
pulse by providing a recycling mod k counter to quiz the main counter's
state so as to test for divisibility by k. Such intermittent checking,
where applicable, involves a lower order of circuit redundancy than is
the case with continuous on-line checking. It also possesses considerable
flexibility of operation to suit a variety of conditions.
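The counter example can be sketched in a few lines. The Python model below is an illustration under assumptions of ours: the main counter is an unbounded integer, the fault model is a single bit stuck at 0, and the name `run_checked_counter` and its parameters are hypothetical.

```python
def run_checked_counter(pulses, k, stuck_bit=None):
    """Count `pulses` inputs; every k-th pulse, test the main count
    for divisibility by k.

    stuck_bit (hypothetical fault model): index of a main-counter bit
    that is stuck at 0. Returns the pulse number at which a fault is
    first flagged, or None if no check fails.
    """
    main, check = 0, 0
    for t in range(1, pulses + 1):
        main += 1
        if stuck_bit is not None:
            main &= ~(1 << stuck_bit)      # inject the stuck-at-0 fault
        check = (check + 1) % k            # recycling mod-k counter
        if check == 0 and main % k != 0:   # quiz only when checker recycles
            return t
    return None
```

With k = 3, a fault-free run raises no flag, while a stuck-at-0 fault in bit 1 of the main counter is flagged at the first quiz (pulse 3). Faults that happen to leave the count divisible by k at every quiz escape this test, which reflects the lower redundancy of intermittent checking.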
d. Schemes for Fault Detection
We next discuss in some detail several schemes for the incor-
poration of fault-detection capability (in the continuous mode) into the
design of arbitrary synchronous sequential networks, such as are typically
found in the central control portions of digital computers. All of the
techniques to be described involve the introduction (in different ways)
of redundant states into the state diagram of the sequential network.
The general principle used is that of requiring that only a subset of the
network's states (the "valid" states) correspond to correct functioning,
and any transition to an "invalid" state signals a faulty condition.
Alternatively, it can be the state transitions that are checked. Two
classes of redundant-state logic will be considered: state-parity
checking and state-weight checking.
1) State-Parity Checking
This technique (closely related to Armstrong's error-correction
scheme 10) is conceptually the simpler of the two. It draws
heavily on the use of parity-check codes in communications for the pur-
pose of error detection in message transmission. Here we employ sets of
parity checks over the state variables of a sequential network to help
insure the validity of a state, and hence of a state transition.
As a simple example, consider the introduction of a single
extra state variable $P_1$ in addition to the existing state variables
$S_1, \ldots, S_n$, where $P_1$ is defined to be the modulo-2 sum:
$P_1 = S_1 \oplus S_2 \oplus \cdots \oplus S_n$.
Thus, we may form $P_1'$ with combinational logic entirely independent of the
logic which generates the memory-excitation signals $S_i'$. Then, any single
(or any odd-order) error in the vector $(S_1', \ldots, S_n'; P_1')$ will appear as
a parity violation. If we provide $P_1$ with its own delay element, as in
Fig. II-A-29, then on the next clock phase t + 1, the signals $S_1'(t), \ldots,
S_n'(t), P_1'(t)$ will appear at the delay-element outputs. By checking parity
over these n + 1 signals we will be able to detect any odd-order errors,
not only in the combinational logic but also in the delay elements them-
selves. It is evident that by the addition of a single extra delay and
its associated excitation logic we have made half of the possible erroneous
state transitions detectable.
Since parity functions (especially of more than two or
three variables) are awkward to generate in combinational logic, almost
regardless of the particular technology,† it might seem that the parity-
checking technique would lead to considerable hardware costs. The
situation is better than it appears at first sight since the memory-excitation
function $P_1'$ is not actually generated as a parity check over the other
excitations, $S_i'$. (Indeed, if it were so generated the scheme would break
down and check only the delay elements, not the combinational logic.)
For example, if we happened to have
$S_1' = S_2 S_3$
$S_2' = S_3 S_1$
$S_3' = S_1 S_2$
then we see that
$P_1' = S_1 S_2 \oplus S_2 S_3 \oplus S_3 S_1 = M(S_1, S_2, S_3)$
where M denotes the three-input majority function, a function very readily
realized with threshold logic.
† Relay-contact logic, cryotrons, etc., are exceptions.
FIG. II-A-29 STATE-PARITY-CHECKED SEQUENTIAL NETWORK
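A small simulation makes the detection property concrete. The Python sketch below is ours and rests on an assumed reading of the example, namely that the excitations are $S_1' = S_2 S_3$, $S_2' = S_3 S_1$, $S_3' = S_1 S_2$, so that the independently formed check signal is the majority function; all function names are illustrative.

```python
from itertools import product

def majority(a, b, c):
    return int(a + b + c >= 2)

def next_state(s):
    """Assumed example excitations: S1' = S2 S3, S2' = S3 S1, S3' = S1 S2."""
    s1, s2, s3 = s
    return (s2 & s3, s3 & s1, s1 & s2)

def predicted_parity(s):
    """P1' formed independently of the S' logic, here as M(S1, S2, S3)."""
    return majority(*s)

def parity_ok(word):
    """Checker: the n state bits plus P1 must have even overall parity."""
    return sum(word) % 2 == 0

def undetected_errors(max_flips):
    """Return error patterns of weight 1..max_flips that escape the check."""
    escaped = []
    for s in product((0, 1), repeat=3):
        word = next_state(s) + (predicted_parity(s),)
        assert parity_ok(word)                 # fault-free case passes
        for err in product((0, 1), repeat=4):
            if 0 < sum(err) <= max_flips:
                bad = tuple(b ^ e for b, e in zip(word, err))
                if parity_ok(bad):
                    escaped.append((s, err))
    return escaped
```

Calling `undetected_errors(1)` returns an empty list, since every single-bit error in the excitation or delay outputs violates parity; with `max_flips=2` the only patterns that escape are the double errors, in line with the odd-order detection property.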
Of course, the above example is a very special case. In
general, the complexity of the combinational logic involved in generating
a parity-check signal $P_1'$ will be highly dependent on the particular forms
that the other memory-excitation functions happen to have. In order to
increase the detection capability over that provided by a simple parity
check, we need to supply additional check signals, $P_2', \ldots, P_r'$, checking
over independent subsets of the state variables $S_i'$. One scheme for doing
this is provided by the Hamming codes. These are, however, not the most
suitable for our purposes, since each Hamming parity check involves almost
half of the nonredundant state variables, thus leading to a large number
of exclusive-OR gates in the checking circuit. Better schemes are avail-
able in the form of low-density parity-check codes; see Gallager. ss These
codes can be designed to use no more than two or three state variables
for each redundant check signal.
It is clear that if, say, r independent checks are used,
then all but a fraction $2^{-r}$ of the states will be detected as invalid.
The probability of detecting incorrect state transitions then increases
to $1 - 2^{-r}$.*
* Strictly speaking this number $1 - 2^{-r}$ is only the fractional number of
invalid state transitions from any given state; it becomes a probability
only if we assume that all faulty transitions are equally likely.
We come now to an objection which may already have occurred
to the reader. It may be argued that the addition of redundant delay
elements and combinational logic will increase the number of possible
incorrect transitions, and will also increase the probability of such
transitions, so that we are no better off than before we added redundant
states. It is indeed correct that both the number and probability of
incorrect transitions will be increased by the addition of redundancy.
However, adding a single delay element, while it doubles the number of
possible erroneous transitions, does not double the probability of such
an error. That would be the case only if we had to double the complexity
of the whole network every time we added one redundant state variable.
Of course, that is far from the case--typically, adding one delay element
(plus logic) will add one unit of complexity to the network. Thus, it
increases the complexity by about 5 percent if there were 20 state vari-
ables to begin with. We can therefore expect that the overall fault
probability is increased by about 5 percent, but we have cut in half the
probability of its going undetected.
To put the matter more succinctly: the inherent unreli-
ability of the network increases roughly linearly with its complexity (as
measured by the total number of state variables). The probability that
once a fault has occurred it will go undetected, however, decreases
exponentially with r, the number of redundant variables (i.e., as $2^{-r}$)
independently of the number of nonredundant state variables. We are
therefore fighting a very favorable battle, rather than a losing one,
when we increase r.
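The two trends can be tabulated with a short calculation. The sketch below is illustrative: `valid_fraction` enumerates all states of a small network with r parity checks (the particular check subsets are arbitrary choices of ours), and `tradeoff` applies the rough linear-complexity model of the text, which is itself an approximation.

```python
from itertools import product

def valid_fraction(checks, n):
    """Fraction of (n + r)-bit states satisfying every parity check.

    checks: list of index tuples; check j requires
    P_j = XOR of S_i for i in checks[j].
    """
    r = len(checks)
    good = 0
    for bits in product((0, 1), repeat=n + r):
        s, p = bits[:n], bits[n:]
        if all(p[j] == sum(s[i] for i in idx) % 2
               for j, idx in enumerate(checks)):
            good += 1
    return good / 2 ** (n + r)

def tradeoff(n, r):
    """(relative complexity, undetected fraction) under the rough model
    that complexity grows linearly with the number of state variables."""
    return (n + r) / n, 2.0 ** -r
```

For n = 20 and r = 1 this gives a complexity ratio of 1.05 and an undetected fraction of 0.5, the figures quoted above; each further check variable adds about 5 percent to the complexity while halving the undetected fraction again.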
The major remaining questions in this area are:
(1) How best to arrange the parity-check signals
(i.e., which Gallager codes, or others, to employ
for typical sequential networks).
(2) What redundancy ratios, r/n, are most effective.
(3) How best to combine this technique (i.e., in what
proportions) with other, purely combinational,
fault-detection techniques or with intermittent-
checking schemes.
It should be noted also that a number of authors have dis-
cussed the employment of parity checking in sequential networks for auto-
matic error correction, i.e., for sequential fault masking, rather than
for fault detection. Among these are Armstrong, 10 Rubio, 273 and Frank
and Yau. ss While such possibilities should not be overlooked, it is our
feeling that masking should not be applied except for relatively small
networks. A particularly good discussion of this question can be found
in Chapter VII of Pierce 24s (see also Sec. II-A-1).
2) State-Weight Checking
Another checking technique, in some respects analogous to
state-parity checking but in others quite distinct from it, is implemented
by restricting the valid states to lie among specified weight classes.
That is, we arbitrarily restrict our state assignments in the design of a
sequential circuit to those state vectors containing specified numbers of
1's,* say to the weights $w_1, w_2, \ldots, w_c$.† A variant of this technique
is to restrict the weights of the state transition vectors $S \oplus S'$, rather
than the states themselves.§ The simplest scheme of this sort uses a
unit-distance code (Gray code) for transitions (i.e., only one flip-flop
is allowed to change state in each transition); for a discussion of these
codes, see Kautz. tee This code seems especially suitable for counting
circuits. More complicated versions are also possible and useful, wherein
each transition may involve a small but fixed number of state variables
changing.
A representative circuit for the implementation of a
state-weight checking scheme is shown in Fig. II-A-30. For simplicity
of description the network chosen is a scale-of-ten counter employing a
distance-two, constant-weight code (two-out-of-five code) whereby five
state variables are used in accordance with the following state assignment.
* The weight of a vector (in particular, of a state vector) is defined to
be the number of 1's it contains.
† The reader will observe that the simplest case of parity checking (a
simple overall parity check, such as $P_1$ in the preceding paragraphs) is
equivalent to restricting state weights to even values, {0, 2, 4, ...}.
The resemblance does not go much further, however.
§ This alternative technique was not discussed in relation to parity
checking, since parity checking of the transitions $S \oplus S'$ is logically
equivalent to checking the state vectors, S.
[Figure II-A-30: logic diagram of the counter with state variables S1 through S5. Legend: AND gates, and OR gates with delay; $S_1' = (S_1 + S_5)(S_2 + S_4)$, etc.]
FIG. II-A-30 TWO-OUT-OF-FIVE COUNTER
State No.   S1  S2  S3  S4  S5
    1        1   1   0   0   0
    2        1   0   1   0   0
    3        0   1   1   0   0
    4        0   1   0   1   0
    5        0   0   1   1   0
    6        0   0   1   0   1
    7        0   0   0   1   1
    8        1   0   0   1   0
    9        1   0   0   0   1
   10        0   1   0   0   1
It is natural to check the operation of this circuit with
a two-out-of-five validity checker using the symmetric function of
S1, S2, S3, S4, S5 that is 1 exactly when two of the five variables are 1.
Zeros at the output of the validity checker then signal malfunction
through the occurrence of an invalid state.
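The counter and its validity check can be sketched in a few lines of Python. This is only an illustrative model, not the circuit of Fig. II-A-30: the state table is copied from the assignment above, while the function names (`is_valid`, `next_state`, `change_vector`) are hypothetical and the next-state logic is modeled as a table lookup rather than as gates.

```python
# State assignment copied from the table above (state number 1..10 -> S1..S5).
STATES = [
    (1, 1, 0, 0, 0),
    (1, 0, 1, 0, 0),
    (0, 1, 1, 0, 0),
    (0, 1, 0, 1, 0),
    (0, 0, 1, 1, 0),
    (0, 0, 1, 0, 1),
    (0, 0, 0, 1, 1),
    (1, 0, 0, 1, 0),
    (1, 0, 0, 0, 1),
    (0, 1, 0, 0, 1),
]


def is_valid(state):
    """Two-out-of-five validity check: true exactly when two bits are 1."""
    return sum(state) == 2


def next_state(state):
    """Advance the scale-of-ten counter by one step (modeled as a lookup)."""
    i = STATES.index(tuple(state))
    return STATES[(i + 1) % len(STATES)]


def change_vector(state, new_state):
    """Bitwise exclusive-OR of present and next state (the "change" signals)."""
    return tuple(a ^ b for a, b in zip(state, new_state))
```

In this model any single flipped state bit yields a weight of 1 or 3 and is therefore flagged, and every legal transition has a change vector of weight 2.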
As indicated above, an alternative way to check the operation
of this counter is by monitoring the weights of the "change" signals.
If trigger (T) flip-flops are used for memory, then there will be exactly
two active change signals at each step. If set-reset (R-S) flip-flops
are used, there must be exactly one set signal and one reset signal at
each transition. In all cases there are obvious ways to monitor these
changes.
Circuits of the above types are attractive, perhaps,
because of their logical simplicity. They also appear to be fairly
competitive with parity-checked circuits in terms of efficacy in reducing
the number of undetected faulty transitions. The two-out-of-five counter
above uses 10 out of a possible 32 states. Thus, the fraction of
undetected errors is 10/32 = 0.31 approximately. A binary counter would
have required four nonredundant state variables; and with straight parity
checking, one extra state variable would only reduce the undetected errors
by a factor of 1/2 (16 of the 32 five-bit states have even parity). Thus
the constant-weight counter is slightly superior in this case.
A more general scheme for state-weight checking is shown
in Fig. II-A-31. Here the weights of the states are not restricted a
priori. Full flexibility in state assignment to the nonredundant
variables is available to facilitate economical (or otherwise desirable)
synthesis. Instead, a set of check signals $C_1, C_2, \ldots, C_r$ is generated
(as in the parity-check scheme, though these $C_i$ are not parity checks).
The rule used in their generation is that the binary number represented
by $C_r C_{r-1} \cdots C_2 C_1$, i.e., the quantity $\sum_i C_i 2^{i-1}$, is just equal to $w(S')$, the
weight of the state vector $S'$. These check signals, $C_i$, are then applied
(as memory excitation) to a redundant set of r delay elements. Thus,
densely encoded information as to the weight of the next state vector is
fed around the combinational logic and delay loop. At the delay element
output terminals, the weight of the new state vector is checked against
the indication provided by the binary number $C_r \cdots C_1$. Any failure of
these numbers to agree results in a fault-detection signal.
The number r of redundant state variables required with
this scheme is bounded above by
$$r \le \langle \log_2 (n + 1) \rangle$$ *
where n is the number of nonredundant state variables. This relation
follows from the fact that the weight of S can range in general from 0
to n, in the worst case. Of course, if the state assignments are suitably
restricted to employ only a narrow range of weights, then r can be made
smaller in particular cases.
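The check-signal rule and the bound on r can be illustrated with a short Python sketch. The function names are hypothetical, and the comparison circuit is modeled simply as an integer comparison; the sketch assumes the state vector is given as a tuple of 0's and 1's.

```python
def weight(state):
    """Number of 1's in the state vector."""
    return sum(state)


def redundant_variables(n):
    """Smallest integer >= log2(n + 1), computed without floating point.

    For a non-negative integer n this equals n.bit_length().
    """
    return n.bit_length()


def check_signals(state, r):
    """Return (C1, ..., Cr): the weight of the state in binary, C1 least significant."""
    w = weight(state)
    return tuple((w >> i) & 1 for i in range(r))


def fault_detected(state, checks):
    """Compare w(S) with the binary number carried by the check signals."""
    encoded = sum(c << i for i, c in enumerate(checks))
    return weight(state) != encoded
```

For n = 4 nonredundant variables the bound gives r = 3, since the weight can take any of the five values 0 through 4.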
The same principle can be applied to the checking of the
weights of state transition vectors, rather than of the state vectors
themselves. This alternative may be preferable in some cases, particularly
when the state vectors can range over all weights from 0 to n but the
transition vectors are limited to weights of, say, 1 or 2.
* $\langle x \rangle$ = smallest integer greater than or equal to x.
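[Figure II-A-31: block diagram. The inputs X and the state variables S1, ..., Sn feed the combinational logic, which produces the output Z and the next-state signals; redundant logic generates the check signals C1, ..., Cr; a comparison circuit compares w(S) with the number carried by the check signals.]
FIG. II-A-31 SEQUENTIAL NETWORK WITH STATE-WEIGHT CHECKING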
We have suggested a good many schemes for the implementation
of fault detection in sequential networks. Some of these are discussed
only in relation to counting functions, yet they are applicable
much more generally. For example, unit-distance coding can be used in
the design of decoding trees. _s What is still lacking is a clear
understanding of how such schemes compare in cost, in reliability, and in
general efficacy. The problem of efficient redundant state assignment
for fault detection in sequential networks is intimately bound up with
the general question of (nonredundant) state assignment; and the latter
has long been known to be an extremely difficult problem on which progress
has been made only recently.
Another large area ripe for future study is the design of
sequential networks for easy diagnosability. The work of Kime le° mentioned
earlier may be relevant to this problem (though he was concerned with
facilitating detection only).
B. Use of Codes for Storage and Arithmetic Operations
1. Introduction
Almost immediately after the use of error-correcting codes was
proposed as a technique for achieving reliable communication over noisy
channels, studies were conducted to determine the utility of codes for
checking computer operations. Unfortunately, it appears that except for
the checking of memory cells and for the checking of arithmetic opera-
tions, the use of codes is not feasible* for achieving reliable computa-
tion, because of the following factors.
(a) Serial checking, as is commonly employed for checking
transient errors on communication channels, cannot be
applied for checking component failures since generally
permanent failures occur.
(b) Parallel checking is theoretically feasible, but in many
cases single component failures result in errors on many
data lines, dictating either the costly realization of
networks wherein all outputs are independent or the use
of codes with extended error-correction capability.
(c) Means must be provided for ensuring reliable operation
of the encoding and decoding circuits, which are in
many cases as complex as the protected networks.
Two cases where single component failures do not result in multiple
data-line errors are in the memory section and the arithmetic section.
* Kautz 151 has described a class of codes which combines the properties
of error checking, such as single-error detection, and unit distance,
which means that successive code elements of the code sequence differ
in only one bit. These codes are potentially useful either as an
error-detecting code for an analog-to-digital converter (in the sense
that the most likely binary errors are either detected or result in
negligible analog errors) or as an error-detecting code for an
asynchronous counter (in that a failure is detected which causes a
flip-flop to fail to change state when it should or to change state when
it should not). These codes have suffered from the unavailability
of efficient encoding and decoding implementations, but Kautz has pointed
out to the authors that by specifying a code which is somewhat
suboptimal it is possible to describe reasonably efficient encoding and
decoding algorithms. With this advance these codes might prove to
be useful for the indicated applications.
Avizienis 14 has proposed a system wherein the same arithmetic code
is used for the checking of arithmetic operations and those storage
operations which relate to the arithmetic data. The advantages of the
scheme arise from the requirement of only a single encoder in the system,
compared with the two encoders dictated by the use of two distinct codes.
However, if the memory and arithmetic processor are checked separately
then two arithmetic-type decoders must be included, each of which is more
costly than a decoder for independent failures. The scheme also suffers
in that it is difficult to specify different levels of protection for
the memory and processor.
In the following two subsections we then discuss the use of different
types of codes which are particularly appropriate for either arithmetic
or storage. Although the decoders are still complex, some progress has
been realized in the implementation of failure-tolerant decoders.
It should be noted that the following discussion is intended mainly
to survey the prior coding research in order to distinguish cases where
coding is pertinent to the realization of reliable computers, and also
to distinguish future problems for research.
2. Codes for Checking Storage
It was indicated previously that most error-correcting coding
techniques, as applied to computer circuits, are not attractive because
of the need for costly extra circuits so as to provide failure-tolerant
encoders and decoders. One possible exception to this tenet is for the
error-control technique distinguished as threshold decoding, s,352'
wherein the estimated value of a particular information bit is determined
by a majority vote of a function of the output bits from the memory. In
this case the minimal implementations of the encoders and decoders are
such that most patterns of failures are correctable, including failures
in the decoders and encoders--not, of course, exceeding in weight the
capability of the code. In this section we will briefly discuss the
properties of codes which are amenable to threshold decoding, and also
discuss the tradeoffs between error probability and additional memory
locations required, attendant to the use of various error-correcting codes.
a. Threshold Decoding
As an example consider the implementation of a double error-
correcting code for a memory-channel byte of two bits, as shown in
Fig. II-B-l(a). In this case the optimum (least redundant) code requires
6 check digits. It is easily verified that all double failures occurring
in either the encoder, the memory channel, or the exclusive-OR gates of
the decoder are correctable. In addition, some of the failures occurring
in the 5-input majority gates are also correctable if the triangular
realization of Sec. II-A-2c is incorporated.
This feature of protection against encoder-decoder failures is
not realized for implementations (such as those described by Kautz 14s)
wherein a syndrome is first calculated and then utilized to uniquely
distinguish the bit or bits in error. A single error in the determination
of the syndrome will generally result in erroneous decoding.
It is of interest to observe the basis for the threshold
decoding algorithm for this (8, 2) code. The code can be conveniently
described in terms of the 6 × 8 parity check matrix (H matrix) shown
below.
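$$
H = \begin{bmatrix}
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
$$
where the columns correspond to the code digits $x_1', x_2', \ldots, x_8'$.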
[Figure II-B-1: (a) (8,2) double error correcting: encoder, 8 memory channels X1 through X8, and a decoder of exclusive-OR gates and 5-input majority gates, with $a^* = \mathrm{Maj}(X_1, X_3, X_4, X_5 \oplus X_7, X_6 \oplus X_8)$ and a corresponding majority gate for $b^*$; (b) (7,3) single error correcting: encoder, memory channel, and decoder.]
FIG. II-B-1 THRESHOLD DECODING FOR MEMORY CHANNELS
The following set of 6 parity equations then must be satisfied by the
code digits:
$$
\begin{aligned}
x_1' \oplus x_3' &= 0 \\
x_1' \oplus x_4' &= 0 \\
x_1' \oplus x_2' \oplus x_5' &= 0 \\
x_1' \oplus x_2' \oplus x_6' &= 0 \\
x_2' \oplus x_7' &= 0 \\
x_2' \oplus x_8' &= 0
\end{aligned}
$$
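Transforming the above set of equations and appending the trivial identities
$x_1' \oplus x_1' = 0$, $x_2' \oplus x_2' = 0$, we note that we can write a set of 5 equations
for $x_1'$ and a set for $x_2'$, wherein within each set no variable appears more than once
as an independent variable. These equations are:
$$
\begin{aligned}
x_1' &= x_1' & \qquad x_2' &= x_2' \\
x_1' &= x_3' & \qquad x_2' &= x_7' \\
x_1' &= x_4' & \qquad x_2' &= x_8' \\
x_1' &= x_5' \oplus x_7' & \qquad x_2' &= x_3' \oplus x_6' \\
x_1' &= x_6' \oplus x_8' & \qquad x_2' &= x_4' \oplus x_5'
\end{aligned}
$$
If we let the estimates of $x_1'$ and $x_2'$ be the estimates of the information
symbols a and b, then it is evident that any single or double errors will
be masked by the majority decoding rule indicated in the figure: each
received digit enters at most one of the five equations for a given
information symbol, so two errors can corrupt at most two of the five votes.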
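The masking property can be checked exhaustively with a small Python sketch. The encoder below is simply the one implied by the six parity equations (x3 = x4 = a, x5 = x6 = a ⊕ b, x7 = x8 = b); the check sums used for b are one valid orthogonal choice consistent with those equations and are an assumption where the source is illegible. The function names are hypothetical.

```python
from itertools import combinations, product


def encode(a, b):
    """(8,2) encoder implied by the parity equations: returns [x1, ..., x8]."""
    return [a, b, a, a, a ^ b, a ^ b, b, b]


def majority(votes):
    """Output of a majority gate over an odd number of 0/1 inputs."""
    return int(sum(votes) > len(votes) // 2)


def decode(x):
    """Threshold decoder: five estimates per information symbol, then a vote."""
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    a = majority([x1, x3, x4, x5 ^ x7, x6 ^ x8])
    b = majority([x2, x7, x8, x3 ^ x6, x4 ^ x5])  # assumed orthogonal set for b
    return a, b


def corrects_all_errors_up_to(max_errors):
    """Try every message and every error pattern of weight 1..max_errors."""
    for a, b in product((0, 1), repeat=2):
        word = encode(a, b)
        for k in range(1, max_errors + 1):
            for positions in combinations(range(8), k):
                received = list(word)
                for p in positions:
                    received[p] ^= 1
                if decode(received) != (a, b):
                    return False
    return True
```

Only memory-channel errors are injected here; failures inside the encoder or the decoder gates, which the text also discusses, are outside this sketch.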
For many codes, such simple encoder-decoder realizations are
not possible, although it is possible to specify a "pseudothreshold"
implementation in which the estimate of at least one bit is dependent
upon the estimate of other bits. As an example, consider the realization
shown in Fig. II-B-1(b) 5 of the Hamming single error correcting code,
with 4 information bits and 3 check bits. As in the previous example,
most single failures occurring in the encoder and decoder are masked, as
well as, of course, those occurring in the memory channel.
At present it is not known what codes are implemented by the
threshold-decoding technique, nor is it known what codes are implemented
by the pseudothreshold technique as illustrated in Fig. II-B-l(b).*
We have developed methods for specifying parity-check matrices based upon
balanced incomplete block designs, for codes which can be guaranteed to
be decodable by threshold decoding; but the research has not proceeded to
the point where reporting is appropriate. It appears that these codes
are somewhat less efficient than codes derived by other algebraic pro-
cedures--for example, Bose-Chaudhuri Codes.
b. Tradeoffs Between Memory Redundant Channels
and Error Probability
Here we are concerned with the subdivision of the memory
information channels into bytes to which we apply the error correction.
Although the reliability of a code group will increase with
the number of errors that are correctable, the corresponding increase in
complexity of the decoding equipment tends to decrease both the speed of
operation and the reliability of the system. Hence, for the present
examination, only single and double error-correcting codes will be
considered.
Thus, for a given number of information channels, we wish to
determine the reliability and the number of redundant channels resulting
from the subdivision of the channels into code groups (bytes) of various
sizes, for single and double error-correcting codes. It is assumed that
channel failures are random and independent, and that all channels are
equally reliable. Furthermore, we shall consider the reliability of a
channel to be high, because only then is the use of redundancy beneficial
for overall system reliability. In the analysis, the following definitions
apply:
$Q_s$ = probability that system fails
$Q_b$ = probability that byte fails
q = probability that channel fails
p = 1 - q
n = number of bits/byte
k = number of information bits/byte
n - k = number of check bits/byte
b = number of bytes
w = kb = number of information bits/system.
* Massey 352 has shown that codes based upon maximal-length sequences can
be threshold-decoded, and he has also shown that all Hamming codes are
implemented by the pseudothreshold technique.
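For single error-correcting codes,
$$
Q_{b,1} = P(\text{exactly 2 errors}) + P(\text{exactly 3 errors}) + \cdots
= \binom{n}{2} q^2 p^{n-2} + \binom{n}{3} q^3 p^{n-3} + \cdots
$$
For $q \ll 1$, we approximate $Q_{b,1} \approx \binom{n}{2} q^2$.
Then
$$
Q_{s,1} = 1 - (1 - Q_{b,1})^b \approx 1 - (1 - b\,Q_{b,1}) = b \binom{n}{2} q^2 .
$$
For double error-correcting codes,
$$
Q_{b,2} = P(\text{exactly 3 errors}) + P(\text{exactly 4 errors}) + \cdots
= \binom{n}{3} q^3 p^{n-3} + \binom{n}{4} q^4 p^{n-4} + \cdots
$$
For $q \ll 1$, we approximate $Q_{b,2} \approx \binom{n}{3} q^3$.
Then
$$
Q_{s,2} = 1 - (1 - Q_{b,2})^b \approx 1 - (1 - b\,Q_{b,2}) = b \binom{n}{3} q^3 .
$$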
Tabulations of $Q_{s,1}$ and $Q_{s,2}$ are given in Tables II-B-1 and
II-B-2, respectively. The system failure probability $Q_s$, the total memory
size nb, and A, the product of $Q_s$ and memory size (normalized by $w^2 q^2$),
are given for a number of codes
for bytes containing one to four information bits. The reciprocal of
A is a useful measure (although it is clear that A is not universally
applicable) of the effectiveness of a given redundancy scheme.
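The tabulated coefficients follow directly from the approximations above with b = w/k. As a rough check, they can be recomputed with the Python sketch below; the function names are hypothetical, and exact rational arithmetic is used only to avoid rounding in the coefficients.

```python
from fractions import Fraction
from math import comb


def failure_coefficient(n, k, t):
    """Coefficient c in Qs ~ c * w * q**(t + 1) for a t-error-correcting (n, k) code.

    There are b = w/k bytes, each failing with probability ~ C(n, t+1) * q**(t+1).
    """
    return Fraction(comb(n, t + 1), k)


def memory_coefficient(n, k):
    """Total memory size nb in units of w (b = w/k bytes of n bits each)."""
    return Fraction(n, k)


def figure_of_merit(n, k, t):
    """A = Qs * nb / (w**2 * q**2); for t = 2 the value returned multiplies q."""
    return failure_coefficient(n, k, t) * memory_coefficient(n, k)


def exact_system_failure(n, k, t, w, q):
    """Qs without the small-q approximation, for w information bits (k divides w)."""
    p = 1 - q
    byte_ok = sum(comb(n, i) * q**i * p ** (n - i) for i in range(t + 1))
    return 1 - byte_ok ** (w // k)
```

For example, `figure_of_merit(7, 4, 1)` evaluates to 147/16, about 9.2, and `figure_of_merit(12, 4, 2)` to 165, in agreement with the tabulated entries.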
Table II-B-1
RELIABILITIES AND MEMORY SIZES
FOR SINGLE ERROR-CORRECTING CODES
n    k    b      nb       Qs,1        A1 = Qs,1 nb/(w^2 q^2)
3    1    w      3w       3wq^2       9
5    2    w/2    2.5w     5wq^2       12.5
6    3    w/3    2w       5wq^2       10
7    4    w/4    1.75w    5.25wq^2    9.2
8    4    w/4    2w       7wq^2       14
Table II-B-2
RELIABILITIES AND MEMORY SIZES
FOR DOUBLE ERROR-CORRECTING CODES
n    k    b      nb       Qs,2        A2 = Qs,2 nb/(w^2 q^2)
5    1    w      5w       10wq^3      50q
8    2    w/2    4w       28wq^3      112q
10   3    w/3    3.33w    40wq^3      133q
12   4    w/4    3w       55wq^3      165q
For equal memory size (3w each):
(3,1) single ECC,* $Q_{s,1}(3,1) = 3wq^2$
(12,4) double ECC, $Q_{s,2}(12,4) = 55wq^3$
The ratio of probabilities of failure is
$$\frac{Q_{s,1}(3,1)}{Q_{s,2}(12,4)} = \frac{3}{55q}$$
* It is recalled that an (n,k) error-correcting code (ECC) contains
code words of length n with k information bits.
It is clear that for low probabilities of channel failures,
i.e., $q < 10^{-2}$, double error-correcting codes are far superior to single
error-correcting codes, ignoring the costs of encoding and decoding.
For example, considering two schemes having equal total memory size, the
(3,1) single ECC and the (12,4) double ECC, the ratio of system failure
probabilities is 3/(55q).
It was indicated that these studies ignored the complexity of
the encoder and decoder circuitry in establishing a measure of the
effectiveness of various error-correcting codes for memory channels. In
future work measures which reflect these complexity factors should be
established, at least for the codes discussed in this section. Since
no complete theory has been formulated concerning the encoder-decoder
complexity as a function of the code, it will probably be necessary to
establish the logical realizations of the circuits in the process of
comparison. Some initial studies have indicated that the threshold-
decoding scheme, besides providing for the masking of many encoder-decoder
circuit failures, also appears to provide the least costly implementation
(on the basis of a realization in terms of AND-OR gates).
3. Codes for Checking Arithmetic Operations
In this section we review the reliability techniques that apply
specifically to the protection of arithmetic computations. Although
this discussion does not mention the general reliability techniques
that are described elsewhere in this report, it is tacitly understood
that general techniques may be used in addition to or in place of the
techniques discussed here.
Historically, reliable arithmetic computation techniques have
closely paralleled reliable communication techniques. By redundant
coding of operands it is possible to perform a consistency check on
the results of arithmetic operations, much the same as the consistency
checks performed on code words that may have been corrupted in transmission
through a noisy channel. The major efforts, therefore, have been to
discover redundant number systems with good distance properties, to find
easily implemented techniques for encoding, decoding, and performing
consistency checks, and to find ways in which the redundant information
can profitably be put to use.
For several codes Allen 5 has developed logical realizations of the
encoders and decoders.
Activity can be broadly classified into two areas: separable and
nonseparable codes. Separable-code schemes are characterized by the use
of check symbols that are treated as separate entities from the operands
that they check. To detect computation errors that may occur during an
arithmetic operation on a pair of operands, a special module operates
on the corresponding check symbols of the operand pair and predicts the
check symbol of the computation result. At the termination of the
computation, a check symbol for the result is produced and compared to
the predicted result, signalling a detected error if a disagreement is
found. Figure II-B-2 shows a typical system based on a separable code.
Error correction can be implemented, when code distance conditions
permit, by calculating a correction term as a function of the predicted
check symbol and the actual check symbol. This is indicated in Fig. II-B-2
by the dashed module labeled "error corrector."
[Figure II-B-2: block diagram with arithmetic unit, check symbol predictor, check symbol generator, comparator (detected error alarm), and dashed error corrector (correction term).]
FIG. II-B-2 A SEPARABLE CODE SYSTEM
Nonseparable codes are characterized by the coding of operands in a
special form such that the results of correct arithmetic operations are
again of the special form. Faulty arithmetic operations have a high
probability of producing results that are not of the required form. To
detect computational errors, the results of computation must be analyzed
to determine whether they are properly coded. A typical nonseparable-
code-based system is shown in Fig. II-B-3. In this system the analyzer
signals a detected error when a result fails to satisfy the code require-
ments. If code parameters permit, the analyzer can also generate a
correction term to correct the error, as indicated by the dashed line
in the figure.
[Figure II-B-3: block diagram with arithmetic unit and analyzer (detected error alarm), with a dashed correction-term path.]
FIG. II-B-3 A NONSEPARABLE CODE SYSTEM
The remainder of this section describes each of the two areas in
greater detail and concludes with a brief summary of the problems that
remain to be solved to aid the implementation of these techniques.
a. Separable Codes
The theoretical foundation for separable code schemes has been
formulated by Peterson 239 with the proof of the following theorem:
Theorem: Let C(N) be the check symbol associated with the
number N. For the addition operation $N_3 = N_1 + N_2$,
let the check-symbol predictor perform the operation
"*" such that the predicted check symbol
$C(N_1 + N_2)$ satisfies
$$C(N_1 + N_2) = C(N_1) * C(N_2).$$
If there are fewer symbols in the check-symbol
alphabet than there are integers in the permissible
range of integers, then C(N) must be the residue
of N modulo b, where b is the number of symbols
in the check-symbol alphabet, and "*" is addition
modulo b.
The importance of this theorem is that it completely specifies
the error-detection system given in Fig. II-B-2 when the operation is
addition. The check-symbol generator in the figure computes the residue
of the sum modulo b while the check-symbol predictor is a modulo b adder.
Since addition is the elementary arithmetic operation of a
computer, all error codes must check addition. Checks of the other
arithmetic operations can then be designed to make best use of the coding
scheme which checks addition. Consequently, Peterson's theorem has far-
reaching effects with respect to the design of systems to check all
arithmetic operations.
Garner 89 has extended Peterson's work to show that a coding
scheme that checks addition can be used directly to check multiplication.
The predicted check symbol for multiplication is simply the product
modulo b of the operand check symbols. Consequently, each of the
components (except the error corrector) in Fig. II-B-2 is determined
when the operation is multiplication. The results of Peterson and Garner
come directly as a result of the fact that addition and multiplication
of integers in a computer are equivalent to the operations defined on a
mathematical ring. Peterson's proof uses the fact that the check symbols
must lie in a cyclic subgroup of the ring, while Garner notes that the
subgroup must satisfy the stronger requirements of being an ideal of
the ring. This completely determines the form of the operations on the
check symbols that check multiplication and addition in the ring. Since
subtraction in the ring is addition of the additive inverse of the
subtrahend, subtraction is automatically checked if addition is checked.
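A minimal Python sketch of the separable residue check for addition, subtraction, and multiplication is given below. The modulus b = 15 and all function names are illustrative assumptions; the sketch models only the comparison of predicted and generated check symbols, not the hardware of Fig. II-B-2.

```python
B = 15  # check modulus b (illustrative choice)


def check_symbol(n, b=B):
    """Check-symbol generator: the residue of n modulo b."""
    return n % b


def predict(op, c1, c2, b=B):
    """Check-symbol predictor: the same operation performed modulo b."""
    if op == "+":
        return (c1 + c2) % b
    if op == "-":
        return (c1 - c2) % b
    if op == "*":
        return (c1 * c2) % b
    raise ValueError("unsupported operation: " + op)


def passes_check(op, n1, n2, result, b=B):
    """True when the result's check symbol agrees with the predicted one."""
    predicted = predict(op, check_symbol(n1, b), check_symbol(n2, b), b)
    return check_symbol(result, b) == predicted
```

An erroneous result is caught unless the error is itself a multiple of b; for instance `passes_check("+", 1, 1, 17)` returns True because the error of 15 is congruent to 0 modulo 15.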
Division is not, in general, a defined ring operation. Con-
sequently, it is not possible to check division as directly as the other
operations. The most thorough check is to check each addition and
multiplication step of an iterative division. Since checking during
the course of an iterative operation may substantially increase the time
of the operation, an alternate approach is to perform a consistency
check on the combination of divisor, dividend, quotient, and remainder
at the completion of the operation. Since they satisfy the equation
$$n_1 = n_2 \cdot q + r$$
their check symbols must satisfy the same equation modulo b. Although
this appears to be a valid check, it allows some errors to escape
detection. Whenever one member of the pair n2, q is 0 modulo b, the
product n 2 • q is forced to be 0 modulo b, independent of the value of
the nonzero member of the pair. Consequently, errors in n 2 or q cannot
be detected in this scheme when one member of the pair takes on a correct
value congruent to 0 modulo b. The choice of a method to check division
is still an open question. Neither of the schemes described here is
completely satisfactory. In some systems, the consistency check may
suffice if the cost of undetected errors is negligible. In systems
with more stringent reliability requirements, one is faced with the
cost of step-by-step checking, unless a more satisfactory method can be
found.
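The blind spot of the end-of-operation consistency check can be seen in a small numerical sketch. The function name and the choice b = 3 are illustrative assumptions only.

```python
def division_consistent(n1, n2, q, r, b):
    """Check n1 = n2 * q + r using only the residues modulo b."""
    return n1 % b == ((n2 % b) * (q % b) + (r % b)) % b


# 30 = 7 * 4 + 2: neither the divisor nor the quotient is 0 modulo 3,
# so a wrong quotient of 5 is detected.
detected_case = division_consistent(30, 7, 5, 2, 3)

# 26 = 6 * 4 + 2: the divisor 6 is 0 modulo 3, so the product term vanishes
# and the same wrong quotient of 5 escapes detection.
escaped_case = division_consistent(26, 6, 5, 2, 3)
```

Here `detected_case` is False (the error is flagged) while `escaped_case` is True (the error passes the check), which is the situation described above.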
Thus far, we have discussed how arithmetic operations are
checked with separable coding schemes. We have indicated that the
check-symbol predictor is a modulo b adder and multiplier for checking
addition and multiplication, respectively. Since these are commonly
used devices, it is not necessary to consider how they might be imple-
mented in this section. The check-symbol generator, on the other hand,
is a device that lends itself to further study.
Since the check symbol for an integer n is its residue
modulo b, the most direct way of computing the check symbol of n is
to divide n by b, discard the quotient, and keep the remainder. Since
division is usually much more complex than either addition or multiplica-
tion, the generation of a check symbol by division could be substantially
more costly in terms of time or hardware than the operation to be checked.
However, because of an observation by Rothstein 271 and extended by
others 100,14,215,21s it is possible to compute the residue without
actually performing the division. In particular, if the modulus b is
of the form $2^a - 1$ (more generally $p^c - 1$ for base p computer representations)
then
$$2^a \equiv 1 \pmod{b}$$
and therefore
$$(2^a)^r \equiv 1 \pmod{b}.$$
Consequently,
$$k \cdot (2^a)^r \equiv k \pmod{b}$$
so that the extraction of a residue of a number reduces to the sum
modulo b of the coefficients of the radix $2^a$ representation of the
number. (This corresponds to the familiar process of casting out nines
by summing digits in a decimal representation.) Hence, to compute a
residue modulo $b = 2^a - 1$, a binary representation is partitioned into
a-bit bytes, each of which is treated as an integer modulo $2^a - 1$ and
summed modulo $2^a - 1$. Rothstein 271 and Germeroth 100 give other algorithms
for the computation of the residue when b = 3. The JPL-STAR computer 14
actually uses this technique in its design for b = 15. The technique
described above makes the calculation of check symbols practical without
the expense associated with division hardware.
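The byte-summing procedure can be sketched in Python as follows. This is an illustrative software model (the function name is hypothetical) that uses an end-around carry to keep the running sum within a bits; it is not a description of any particular machine's circuitry.

```python
def residue_mod_2a_minus_1(n, a):
    """Residue of a non-negative integer n modulo b = 2**a - 1, without division."""
    b = (1 << a) - 1
    total = 0
    while n:
        total += n & b  # add the next low-order a-bit byte
        n >>= a
        # End-around carry: a carry out of the a-bit sum has weight 2**a,
        # which is congruent to 1 modulo b, so it is added back in.
        total = (total & b) + (total >> a)
    # The all-ones pattern b is the second representation of zero.
    return 0 if total == b else total
```

For a = 4 and the hexadecimal number ABCD, the digit sum 10 + 11 + 12 + 13 = 46 reduces to 1 modulo 15, which is the residue the function returns.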
This completes the description of the devices associated with
error-detection shown in the system in Fig. II-B-2. We now focus our
attention on the correction of errors.
Before discussing the actual implementation of an error
corrector, it is necessary to consider the error mechanism in some
detail. Clearly no system can correct all errors, so it is desirable to
be able to correct the most probable errors. It is possible to define
an "arithmetic distance" such that the most probable errors have a small
distance measure.
The following definitions are due to Peterson. 2Ss
Let the weight of a number n in base p be the least number of
nonzero coefficients required to represent n as the polynomial
$$n = a_r p^r + \cdots + a_1 p + a_0$$
where the coefficients are positive or negative integers with $|a_i| < p$.
Let the arithmetic distance between two integers $n_1$ and $n_2$ be
the weight of $|n_1 - n_2|$.
This definition of distance is somewhat different from the
definition commonly used for transmission codes, but it is purposely
so in order to account for the characteristics of arithmetic operations.
During an addition operation, for example, the result of a single digit
error can account for a burst of digit errors in the sum due to propaga-
tion of an error in the carry process. Such errors will show up as
single errors (errors of weight 1) under the definition given above.
In a parallel adder, the errors that correspond to a single component
failure are single errors. Unfortunately, this is not true of serial
adders in which the multiple use of a single faulty component can cause
a multiple error.
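For base 2 the arithmetic weight can be computed with the non-adjacent signed-digit form, which is known to use the fewest nonzero digits +1 and -1. The Python sketch below takes that route; it covers only p = 2, and the function names are hypothetical.

```python
def arithmetic_weight(n):
    """Least number of terms +/-2**j whose sum is n (base 2 only)."""
    n = abs(n)
    weight = 0
    while n:
        if n & 1:
            # Choose the digit +1 or -1 that leaves a multiple of 4,
            # so that the next digit is guaranteed to be zero.
            digit = 2 - (n & 3)  # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= digit
            weight += 1
        n >>= 1
    return weight


def arithmetic_distance(n1, n2):
    """Arithmetic distance: the weight of the difference."""
    return arithmetic_weight(n1 - n2)
```

As an example of carry propagation, 16 and 15 differ in five bit positions as binary words, yet `arithmetic_distance(16, 15)` is 1, since the two numbers differ by a single power of two.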
The necessary and sufficient conditions for single error
correction have been derived by Brown 31 for nonseparable codes, but
apply to separable codes also. In simple terms, a nonseparable code
for the integers in the range $0 \le i \le 2^m - 1$ can correct single errors if
and only if each of the 2m possible single errors, $\pm 2^j$, $0 \le j < m$, and
the integer 0 have 2m + 1 distinct residues modulo b. To correct single
errors, it is necessary to compute the difference between the predicted
check symbol and the calculated check symbol. This gives the residue
of the error. Since each of the correctable errors gives a unique
residue, it is possible (usually by means of table look-up) to compute
the correction term. Hence, the error corrector contains a modulo b
adder and a table with 2m + 1 entries.
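The table look-up corrector can be sketched as below. The parameters m = 5 and b = 11 are an illustrative choice for which the 2m + 1 = 11 residues happen to be distinct; the function names are hypothetical, and the sketch assumes the predicted check symbol itself is free of error.

```python
def correction_table(m, b):
    """Map the residue of each single error +/-2**j (0 <= j < m) to the error."""
    table = {0: 0}  # residue 0: no error
    for j in range(m):
        for error in (1 << j, -(1 << j)):
            residue = error % b
            if residue in table:
                raise ValueError(
                    "modulus %d cannot correct single errors for m = %d" % (b, m)
                )
            table[residue] = error
    return table


def correct(result, predicted_check, table, b):
    """Subtract the error whose residue equals (actual - predicted) check symbol."""
    syndrome = (result % b - predicted_check) % b
    return result - table[syndrome]
```

For example, with operands 9 and 13 the predicted check symbol of the sum is (9 + 2) mod 11 = 0, and any result of the form 22 ± 2^j with j < 5 is restored to 22. A modulus such as 15 fails the distinctness condition for m = 5, since 16 and 1 have the same residue.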
This completes the description of systems based on separable
codes. Before proceeding to nonseparable code systems, it is worthwhile
to mention some variations on the separable codes that have been described
in the literature.
Although Peterson's theorem completely determines the nature
of the separable-code system, it does not determine the nature of the
number representation in the computer. Hence, the adders and multipliers
need not be conventional. The most competitive choice of number representa-
tion next to radix representations is the residue representation. 91,297
This representation has certain advantages over radix representation with
respect to elimination of carry propagations and the inherent modularity
of the arithmetic-unit logic associated with the representation.
Unfortunately, several operations such as sign detection, overflow
detection, magnitude comparison, and division are much more costly in
residue-based computers than in radix-based systems.29s, 153
Watson 32e,a21 and Moore 216 have investigated error-detecting
and correcting redundant-residue representations which are similar to
the separable codes described here. The redundant-residue systems are
characterized by modules like those shown in Fig. II-B-2 except that the
arithmetic operations are performed by logic characteristic of the
residue operations. Check symbols are residues, as in the radix
representation, except that the check symbols themselves are represented
by residues of moduli smaller than the check modulus b. Hence, the
check-symbol predictor is a residue adder or multiplier rather than a
conventional modulo b adder or multiplier. The process of generating
check symbols from the residue representation is commonly termed "base
extension." The simplest known method of base extension is much more
complex than the residue-extraction method described above for radix
representations (see Moore 216). At best it requires several additions
and a table look-up.
The use of redundant residues can simplify some of the problems
associated with residue operations. The processes of sign detection and
magnitude comparison, for example, are easier to implement with redundant
residues than without. 21s Nevertheless, residue representation appears
to be far less attractive than radix representation in general, while
the high cost of base extension lends further support to the apparent
unsuitability of residue representations for error-detection systems.
For historical reasons, it is pertinent to mention Garner's
generalized parity scheme s° in which the check symbols are parity checks
on the operands. In this scheme the check-symbol predictor for an
addition operation determines the parity of the sum from the parity of
both addends and the generated carries. Because the carry-generation
process itself is not checked by the code, it is not suitable for
error-detection systems.
b. Nonseparable Codes
In view of the material presented above, it is not surprising
that the form of a nonseparable-code error-detecting system is almost
completely determined by the properties of the addition and multiplication
operations. To see this, let F(n) be the coded form for the integer n,
and note that $F(n_1) * F(n_2) = F(n_1 + n_2)$, where "*" is the operation on
code words that corresponds to addition. But this is precisely the
condition that holds for check symbols of a separable code. Using this
condition and others it follows that the structure required to check
addition and multiplication is that the code words must lie in an ideal
of a mathematical ring, 89 which is the same structure required for the
check symbols of separable codes.
A suitable candidate, the AN or "linear residue" code, was
first proposed by Diamond s° and has subsequently been studied by
Brown, Sl Peterson, 2S8 Henderson, 12s'_2s and Garner. 89 For these codes,
the integer n is represented by the integer A • n, where A is a selected
constant. Since $An_1 + An_2 = A(n_1 + n_2)$, the arithmetic sum of two
coded numbers is the code of the sum, so that ordinary addition can be
used to sum two code words. Multiplication and division operations
on code words are more complex, however, than the equivalent operations
on uncoded integers. The product of two code words must be calculated by
ordinary multiplication and reduced by a factor of A because
$An_1 \cdot An_2 = A^2 n_1 n_2$. Division is checked by premultiplying the
dividend by a factor of A and then performing ordinary division. Hence, to
perform either multiplication or division, it is necessary to perform both
operations on the code words. Avizienis 14 has used a clever technique for
implementing the premultiplication by A required for division. When A has
the form $2^a - 1$, the product A • n can be computed by subtracting n from
$n \cdot 2^a$, where the latter product is obtained by shifting n left a bit
positions. This technique can also be used to encode numbers.
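The AN-code operations described above can be illustrated by a minimal sketch. The choice a = 3 (so that A = 7), the function names, and the restriction of the division example to exact quotients are assumptions made for illustration only; they are not taken from the cited papers.

```python
A_BITS = 3                  # hypothetical choice of a
A = (1 << A_BITS) - 1       # A = 2**a - 1 = 7


def encode(n: int) -> int:
    """Form A*n with one shift and one subtraction: n*2**a - n."""
    return (n << A_BITS) - n


def add(x: int, y: int) -> int:
    """Ordinary addition: A*n1 + A*n2 = A*(n1 + n2)."""
    return x + y


def multiply(x: int, y: int) -> int:
    """(A*n1)*(A*n2) = A**2*n1*n2, so one factor of A is divided out."""
    product = x * y
    return product // A


def divide(x: int, y: int) -> int:
    """Premultiply the dividend by A, then divide in the ordinary way.

    For an exact quotient, (A*A*n1) // (A*n2) = A*(n1 // n2),
    which is again a code word.
    """
    return encode(x) // y


if __name__ == "__main__":
    print(encode(5))                         # 35
    print(add(encode(5), encode(6)))         # 77 = 7 * 11
    print(multiply(encode(5), encode(6)))    # 210 = 7 * 30
    print(divide(encode(12), encode(4)))     # 21 = 7 * 3
```

Note that multiplication uses a division by A and division uses a multiplication by A, which is the sense in which both operations are needed for either one.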
The consistency check for these codes is to determine the
residue of the code word modulo A, and signal an error if the residue
is nonzero. Note that the residue extraction can be performed by the
technique described for separable codes if A is of the form $2^a - 1$ for binary
computers. Similarly, a single error can be corrected by a table look-up
or alternate calculation if and only if each possible single error and the
integer 0 have distinct residues modulo A. In this case, as before, the
definition of single error is an error of arithmetic weight one.
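The consistency check and the table look-up correction can be sketched as follows. The constants A = 19 and a 9-bit code word are hypothetical values chosen because every error of arithmetic weight one (plus or minus a power of two within the word) then has a distinct nonzero residue modulo A; the function names are likewise illustrative.

```python
A = 19           # hypothetical check constant
WORD_BITS = 9    # hypothetical code-word length


def syndrome(word: int) -> int:
    """Residue of the word modulo A; zero for every valid code word."""
    return word % A


def build_correction_table() -> dict:
    """Map each residue to the unique single error (+/- 2**j) producing it.

    Raises ValueError if two single errors share a residue or if a single
    error has residue zero, in which case correction is impossible.
    """
    table = {}
    for j in range(WORD_BITS):
        for error in (1 << j, -(1 << j)):
            s = error % A          # Python's % is non-negative for A > 0
            if s == 0 or s in table:
                raise ValueError("single errors are not distinguishable")
            table[s] = error
    return table


CORRECTION = build_correction_table()


def check_and_correct(word: int) -> int:
    """Return the word unchanged if consistent, else remove a single error."""
    s = syndrome(word)
    if s == 0:
        return word
    return word - CORRECTION[s]


if __name__ == "__main__":
    good = A * 13                          # code word for n = 13
    print(check_and_correct(good + 32))    # recovers 247
```

With A of the form 2**a - 1 the same table construction fails for words longer than a bits, since 2**a has residue one; such constants serve for detection rather than correction.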
The implementation of a nonseparable coding scheme must include
provision for coding and decoding numbers. Since these operations are
multiplication and division, respectively, for the AN codes, the form
of the operations is completely determined. There is a special form of
the AN code in which it is unnecessary to perform the division by A in
order to decode a number. This case, the so-called "systematic code,"
is characterized by the fact that a subfield of the coded representation
of a number is the uncoded representation of the number. Hence, decoding
is accomplished by extracting the information subfield from the coded
representation. Garner 89 shows that the format of a systematic code
must be such that an integer n is represented by the left concatenation
of a field c to the representation of n; i.e., such that n is represented
by (c,n), so that decoding requires the extraction of the least significant
bits. This quality also simplifies the coding somewhat in that only the
most significant bits of the product A • n must be calculated in order
to encode n. Several systematic codes have been constructed by Henderson, 125
and the constraints on code parameters for these codes have been described
by Garner. 89
Although we have briefly mentioned how to implement arithmetic
operations with a nonseparable coding scheme, we have not discussed the
problem of computing an additive inverse (negative). In conventional
representations the subtraction of one number from another is usually
implemented by computing the additive inverse and performing addition.
Radix representations allow one to compute the inverse simply by comple-
menting the stored form of the number, either with or without an end-
around carry depending on the representation. The AN code does not, in
general, allow this flexibility. However, it is always possible to find
a constant B such that the coded form of a number is AN + B and the additive
inverse of the number can be determined by complementing its representation.
These codes were first reported by Diamond so with the description of the
AN codes. Notice that arithmetic operations on code words coded in
AN + B form usually require addition or subtraction of a constant B at
one or more points in the arithmetic processes that is otherwise not
required for AN codes. Hence, it is not immediately clear that the
advantage of simplified negation is worth the price of more complex
operations.
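A small numerical sketch may make the AN + B idea concrete. It assumes a ones'-complement convention in which the additive inverse of an uncoded integer n in the range 0 to N_MAX is represented by N_MAX - n, and it picks an 8-bit word, A = 3, and N_MAX = 63 so that B comes out as an integer. All of these values and names are hypothetical; the sketch is not Diamond's construction.

```python
WORD_BITS = 8
C = (1 << WORD_BITS) - 1    # all-ones word; complementing x gives C - x
A = 3                       # hypothetical check constant
N_MAX = 63                  # uncoded integers 0..63; inverse of n is N_MAX - n

# Require C - (A*n + B) = A*(N_MAX - n) + B for every n, i.e.
# 2*B = C - A*N_MAX.
B, remainder = divmod(C - A * N_MAX, 2)
if remainder != 0 or B < 0:
    raise ValueError("no integer B exists for these parameters")


def encode(n: int) -> int:
    return A * n + B


def decode(word: int) -> int:
    quotient, residue = divmod(word - B, A)
    if residue != 0:
        raise ValueError("inconsistent code word")
    return quotient


def complement(word: int) -> int:
    """Bitwise complement of the stored word."""
    return word ^ C


def add(x: int, y: int) -> int:
    """Addition needs one extra subtraction of B to stay in AN + B form."""
    return x + y - B


if __name__ == "__main__":
    print(B)                               # 33
    print(decode(complement(encode(5))))   # 58 = 63 - 5
    print(decode(add(encode(5), encode(6))))
```

The extra `- B` in `add` is the additional step referred to in the text: negation becomes a simple complement, but each addition must compensate for the offset.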
Garner 89 has shown how to construct a systematic AN code with
complement coding of the additive inverse by applying a special set of
weights to the bit positions of the representations. This approach is
particularly promising because it has the advantages of complement
coding without requiring the inconvenient addition and subtraction of B.
In order to implement the code, carries generated in the arithmetic units
must be distributed according to rules derived from the weights of the
bit positions. This does not appear to be as costly as the alternative
of adding and subtracting B during operations. At least one code derived
by Henderson 125 is a systematic AN code with complement coding, which
uses conventional weighting of binary-bit positions. This kind of code
is the most attractive of the many nonseparable codes described in this
section.
It is apparently not feasible to use AN and AN + B codes for
the correction of two or more errors in arithmetic operations for several
reasons, the most important of which is that the hardware required to
correct more than one error is sufficiently complex to decrease the system
reliability unless it can be protected by still more hardware. Peterson 238
gives a good table of single error-correcting AN and AN + B codes, while
Massey's more recent paper 201 is a good up-to-date summary of AN codes.
Garner 89 has the most complete collection of the many conditions on
code parameters that determine separable, nonseparable, and (nonseparable)
systematic codes.
c. Evaluation
To estimate the hardware costs of detecting errors with coded
arithmetic logic, notice that the extra arithmetic units required for
separable codes may be substantially smaller than the units they protect
because operations are performed modulo b, where b is presumably smaller
than the system modulus. Nonseparable codes do not require separate
arithmetic units as such, but require that the principal arithmetic
units be made to accept larger operands than otherwise required for
the system. In both cases, the redundant arithmetic logic accounts for
about a 30-percent hardware increase. Together with consistency-checking,
comparison, and control logic, the amount of hardware required is about
double that of an unprotected system. Imposed on the system is the
additional cost of slower operating speed due to consistency checking
operations.
The most competitive uncoded method of detecting errors is to
use two identical arithmetic units and compare the outputs. The uncoded
scheme requires at least twice the hardware of an unprotected system
and results in a slight time penalty to make the comparison. It is
difficult to compare the time penalties for the uncoded and coded
protection schemes because part or all of the consistency-check and
comparison operations may be concurrent with normal arithmetic-unit
operations. The coded system may require less hardware, and could be
used to pinpoint fault locations, provided that the arithmetic code
has minimum distance three. A doubly redundant uncoded system cannot
tell which of the two copies of an arithmetic unit contains a fault,
although it can be used to indicate that a fault exists in a specific
digit or carry line. On the basis of the estimates made here one cannot
eliminate one scheme in favor of another. The point is that coding
appears to be competitive with other reliability methods so that it
should be considered with the other methods in system design.
With error detection, transient errors can be masked by
repeating faulty operations. If repetition does not eliminate the error,
then a faulty element is indicated. Hence, detection and repetition
constitute a self-diagnosis system which, with replacement capability,
constitutes an attractive approach to achieving system reliability.
Since error detection alone cannot mask faults, it is necessary to
include repetition, replication, or a suitable alternative in order
to achieve high reliability.
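The detect-and-repeat policy described above can be sketched in a few lines. The AN check with A = 7, the retry limit, and the simulated faulty adder are all hypothetical choices made for illustration.

```python
from typing import Callable, Optional, Tuple

A = 7            # hypothetical AN-code check constant
MAX_TRIES = 3    # hypothetical limit on repetitions


def is_consistent(word: int) -> bool:
    """Consistency check: a valid code word is a multiple of A."""
    return word % A == 0


def run_checked(operation: Callable[[], int],
                max_tries: int = MAX_TRIES) -> Tuple[str, Optional[int]]:
    """Repeat an operation until its result passes the consistency check.

    Returns ("ok", result) if the first attempt passes,
    ("transient-masked", result) if a later attempt passes, and
    ("permanent-fault", None) if every attempt fails, which indicates
    a faulty element that should be diagnosed or replaced.
    """
    for attempt in range(1, max_tries + 1):
        result = operation()
        if is_consistent(result):
            return ("ok" if attempt == 1 else "transient-masked", result)
    return ("permanent-fault", None)


def make_adder(x: int, y: int, bad_calls: int) -> Callable[[], int]:
    """Simulated adder whose first `bad_calls` results have bit 0 flipped."""
    state = {"calls": 0}

    def operation() -> int:
        state["calls"] += 1
        total = x + y
        return total ^ 1 if state["calls"] <= bad_calls else total

    return operation


if __name__ == "__main__":
    print(run_checked(make_adder(35, 42, bad_calls=1)))
```

A result of "permanent-fault" is where a replacement or reconfiguration capability would take over; detection and repetition alone cannot mask such a fault.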
Error correction is more difficult to evaluate. With error
correction, it is possible to mask faults, except that the error-correction
hardware itself must be checked. With the necessity to use redundant
error-correction hardware, the attractiveness of the scheme is somewhat
diminished. In order to decide whether or not error correction is
desirable in a particular system, the designer should go through the
exercise of comparing at least two competitive designs, one based on
error correction and the other on replication or repetition with
majority decision logic.
Of the coding schemes described here, the separable codes and
the systematic AN codes are the most attractive. Both schemes offer the
advantage of being able to identify the binary representation of a number
in its coded form. This simplifies masking and shifting operations on
numerical quantities. The principal difference between the two schemes
is how the check symbol is formed. With separable codes, a predicted
check symbol is calculated by a separate module. With systematic AN
codes, the check symbol is integral to the coded form of the number,
and is formed by the arithmetic unit during a normal operation. The
two schemes require about the same amount of hardware to form the check
symbol. Since the consistency check can be performed similarly for
the two systems, they both require roughly the same amount of total
hardware to implement. Separable-code schemes offer more inherent
modularity than the systematic AN codes, which may be an asset in systems
that permit replacement of faulty modules. Again, the method that is
best for a particular system must be selected on the basis of a detailed
comparison of the two coding schemes against other methods of achieving
reliability.
Among the outstanding questions that relate to the use of
arithmetic coding are the following:
(1) Division checking is particularly unwieldy for
separable codes. Are there techniques that can
simplify this process?
(2) Are there ways to simplify the checking of
multiplication and division for the AN codes?
(3) How can faults in serial operations be detected
and/or masked?
(4) How can arithmetic codes be used to check non-
arithmetic computer operations?
(5) Can arithmetic codes be profitably used to
protect memory and input/output modules?
(6) The inherent modularity of residue number systems
is still sufficiently attractive to keep interest
in the idea alive. What techniques can be used
to eliminate the problems of residue interacting
operations?
(7) Several detailed systems designs should be developed
using different methods of achieving reliability.
This should shed light on the performance and cost
of ultrareliable systems.
(8) It appears that the use of arithmetic coding is
attractive for the location of failures in an
arithmetic processor. Of course, error-correcting
codes can be used which locate an error at a bit
level, but the circuitry related to a bit of
computation represents a circuit block which is
not complex enough to conveniently reconfigure. A
class of codes is required which can locate errors
to within one of several consecutive bits.
III TECHNIQUES FOR DYNAMIC ERROR CONTROL
This chapter is concerned with techniques of logical analysis and
design that are needed for the realization of computers in which the
control of errors is dynamic, i.e., in which the logical interconnections
among the components of the computer may be altered. In the case of
autonomous error control, the error state of the computer is a subject
of computation and control by high-level processes within the computer
itself.
In this chapter we shall consider the various problems of analysis
and design that arise in such systems. The first section deals with the
overall design of the computer system, including the design of its structure,
and the coordination of the various maintenance and computational processes.
The second section deals with the design of tests for the detection and
diagnosis of faults within the subsystem networks of the computer. The
third section deals with the design of networks of the special kinds
needed for the composition of the computer systems of interest.
In each section we attempt to characterize the problems of design
in terms of their relevance to the overall objectives of system performance,
in order to determine how well present engineering techniques satisfy the
given design requirements, and to indicate what problems require further
study. In some of the cases we present analytic solutions to several of
the problems that were uncovered, and in others we present rough logical
designs, in order to give concrete illustrations of general design approaches
and to uncover problems of detailed design.
A. Problems of System Organization
This section considers the design problems relative to the structure
and the operating modes of an advanced computer from the overall system
point of view. In the first part, we examine the basic computational
and maintenance processes that are desired and distinguish certain struc-
tural features that follow from these requirements. In the second part,
we examine the major maintenance processes in further detail and consider
the problems of coordinating these processes. In the third part, we
examine the major aspects of system structure and attempt to distinguish
the problems of design of various components of the structure.
1. Basic Behavioral and Structural Characteristics of an
Advanced Spaceborne Computer
In Sec. I-A-1 it was noted that computers for advanced, long-duration
space missions will have to perform computations of great variety and
complexity and with a range of priorities; that many of the computations
will have to be performed at high speed and with large memory capacity;
that the reliability of an electronic system with the required computational
capability and mission time cannot be ensured without some degree of error
control; and that the amount of human error control available will be
very limited. We wish to determine how these characteristics affect the
structural characteristics of a computer.
The requirements of complexity of computation and high performance
clearly indicate the probable need for a high degree of parallelism of
logical operations, although the degree of parallelism needed is not known
at this time. It is appropriate to note the different kinds of parallelism
that may be employed in future spaceborne computers. Some conventional and
feasible domains of parallelism are: (1) the bits of a computer word,
(2) the set of words in a vector, (3) the phases of an instruction cycle,
(4) the set of instructions in a single program segment, and (5) several
program segments belonging to one or more computations. The parallelism
may have several forms: for example, the concurrency of operation may
apply to all the elements of a single entity in the domain, e.g., all the
bits of a single word; or it may apply, say, in an overlapping manner,
to elements of several entities in a domain, e.g., the address calculation
of one instruction and the arithmetic of a second instruction.
The need for a high degree of autonomous error control can have
a substantial influence on computer structure. In Sec. II it was noted
that logical fault masking, either fixed or adaptive, could be employed
locally within a system to increase the reliability of a system without
any need to substantially modify the system's basic logical structure.
However, there are two basic limitations inherent in the exclusive use of
fault masking, both of which may be considered as inefficiencies in the
use of redundancy. It will be seen that substantial modifications in
system organization are needed to achieve error control that overcomes
these limitations.
The first limitation of local fault masking is that it does not
provide for the transfer of redundant equipment between functional
locations; thus a functional location, such as a program-counter subsystem,
may exhaust its fault-masking capability, while another location, such as
a time-counter, may have a surplus of perfect parts. In general, there
are many such functional locations in the central portion of a computer
where failure is catastrophic for the system as a whole.
The second limitation of local fault masking is that it does not
provide for soft failure, i.e., for the reallotment of computational
resources among tasks according to their priority for the mission. It
is well known that the logic of general-purpose computation can be
realized with a much smaller number of logic elements than is found in
a modern computer. It may also be expected that a great range exists
both in the value of the set of computational tasks and in the usable
precision of their computations. It is thus seen that there exists a
wide useful range for the exchange of equipment and performance in a
complex spaceborne computer, and it is submitted that the design of an
advanced computer should attempt to exploit this range to a high degree.
Translated into system terms, the overcoming of these limitations
requires that the structure of the computer be reconfigurable. Thus,
overcoming the first limitation calls for the capability of reassigning
equipment among the functional locations within a given computer structure.
Overcoming the second limitation calls for the capability of reorganizing
the available equipment into a new general-purpose structure, and of
modifying the programs so as to maximize the value of the computations
performed. The key problems of design in achieving such capabilities are
flexibility of structure, simplicity of diagnosis, and reliability of
control.
Flexibility requires that the hardware should be capable of being
logically interconnected in many useful ways in order to accommodate
many fault patterns. This capability is enhanced in turn by a high
level of modularity among functional units and by a high degree of
uniformity in the structure of interconnections among the units.
Modularity, i.e., the use of a small number of different kinds of
functional units or modules, increases the number of locations at which
redundant equipment of a given type may be employed. Modularity is also
consistent with technological considerations of reliable fabrication,
as discussed in Sec. I-A-2. With the advent of complex monolithic
arrays, it may be advantageous to employ a small number of complex module
types that can be programmed by stored information or by external connection
to perform one of a number of different functions in different functional
locations. Uniformity of interconnection structure, e.g., as in cellular
logic, would help to maximize the number of different possible configurations
of functional units.
Simplicity of diagnosis requires that the fault status of the
functional units of the computer should be accurately diagnosable in a
short time and that the size of the program needed for such diagnosis
should be small enough to be compatible with the combined resources of
local memory capacity and the data link to a remote diagnosis facility.
Reducing the number of module types also has the beneficial result of
reducing the total size of the diagnostic program.
Reliability, in this instance, means that the control of such re-
configuration must be either fault-free or fail-safe, and that the
reliability benefits of the reconfiguration scheme exceed the reliability
losses produced by the added equipment. The reliability of switching
and control is crucial to the whole approach of reconfigurability. It is
well known that a reconfigurable system with perfect control and switching
is superior to a fault-masking system, but the potential faults in the
equipment needed for such functions may make the system less reliable
than one in which the same amount of equipment is used in fault masking.
High reliability of switching and control may be achieved by minimizing
the amount of equipment needed for a given complexity of function, and
by the use of static or dynamic error-control techniques. It should also
be noted that static error-control schemes may be useful in increasing
the basic reliability of the modules and their intercommunication paths.
The application of such means to particular control and communication
structures is itself an important design problem.
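The trade-off between fault masking and switched reconfiguration can be illustrated with a deliberately simple reliability model. It assumes three identical units with independent failures and equal unit reliability r, a perfect voter in the masking case, and a single switching-and-control element of reliability r_switch in the reconfigurable case; these assumptions and the function names are ours and are not a model developed in this report.

```python
def masking_reliability(r: float) -> float:
    """Three units with majority voting: at least two of three must work."""
    return 3 * r**2 - 2 * r**3


def switched_reliability(r: float, r_switch: float, units: int = 3) -> float:
    """Any one of `units` must work, and the switch and control must work."""
    return r_switch * (1 - (1 - r) ** units)


def break_even_switch_reliability(r: float, units: int = 3) -> float:
    """Switch reliability at which the two schemes are equally reliable."""
    return masking_reliability(r) / (1 - (1 - r) ** units)


if __name__ == "__main__":
    r = 0.9
    print(masking_reliability(r))            # about 0.972
    print(switched_reliability(r, 1.0))      # about 0.999, perfect switching
    print(switched_reliability(r, 0.95))     # about 0.949, imperfect switching
    print(break_even_switch_reliability(r))  # about 0.973
```

Under these assumptions, with r = 0.9 the switched scheme is better only if the switching and control logic is itself more than about 97 percent reliable, which is why minimizing and protecting that logic matters.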
In summary, it is suggested that in order to achieve the highest
levels of reliable performance, an advanced spaceborne computer will
need the following structural features to a high degree: parallelism
of logical operation; modularity and programmability of functional modules;
regularity and programmability of interconnection; and autonomous capability
for fault diagnosis and control reconfiguration. It is also suggested that
a number of error-control techniques, both static and dynamic, will need to
be employed to enhance the reliability of basic functional units. It is
not clear at this time that the use of redundant equipment in a reconfigur-
able structure will result in a more reliable system than the use of
redundant equipment in a localized fault-masking system, nor is the optimum
degree of reconfiguration for a given technology known. New schemes of
system organization and network design for such reconfigurability are
needed to permit a proper evaluation of this approach.
2. Organization of Basic Processes
In this section we review the basic processes of general
computation and maintenance computation that must be realized in a
spaceborne computer employing dynamic error control.
The basic computational processes in a general-purpose computer may
be grouped as follows:
(G1) Executive: including the stepping of the major phases of
an instruction cycle, control of "interrupt" action, and
communication with maintenance processes (i.e., for alarm,
try-again, roll-back,* and return).
(G2) Instruction: including the determination of and access
to an instruction, and the transformation of the address
portion, e.g., by indexing or table look-up.
(G3) Operation: including the retrieval, computation, and
distribution of operands.
(G4) Input-output: including selection or recognition of an
active terminal, receipt or transmission of information,
buffering, formatting, and preprocessing (e.g., integration).
* "Try-again," as suggested by the name, is an attempt to correct an error
in a computation by repeating it, and "roll-back" is a return to a
program step that preceded an error in order to regenerate (to the degree
possible) information that was lost because of the error.
The basic maintenance processes for a highly reconfigurable computer
may be grouped as in the following list, which proceeds in order of
increasing degree of system modification.
(M1) Passive error control: including localized fault masking and
error correction.
(M2) Fault indication: including detection that an error has
occurred, and the location of the general area of the fault
that produced the error.
(M3) Transient discrimination: including attempts to correct an
error by repetition of the general computational process.
(M4) Fault diagnosis: including the selection of and access to a
subject logic network, presentation of test patterns, and
retrieval and interpretation of responses, in order to detect,
locate, and characterize faults.
(M5) Reconfiguration: including the generation of schemes for the
re-allocation of hardware resources, the assignment of function
for multi-functional modules, and the setting of interconnection
paths.
(M6) Reorganization: including the generation of an alternative
scheme of system organization for realizing general-purpose
computation, the relocation of data in storage, the assignment of
functions and interconnections among modules, and the modification
of program subroutines.
(M7) Alteration of tasks: including the determination of an appropriate
allotment of the available hardware to the set of computational
tasks.
In the order listed, the maintenance processes involve increasing
losses in time, corresponding to the increasing seriousness of the fault
conditions for which they are appropriate. It is therefore sensible to
organize these processes in a hierarchy, so that the capability of
accommodation of a given process may be fully utilized before employing
a process of a higher order.
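The hierarchical ordering suggested above can be sketched as a simple escalation loop. The handler interface (a function that returns True when the fault has been accommodated at that level), the fault labels, and the demonstration handlers are hypothetical; they illustrate only the ordering policy, not the design of the processes themselves.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Maintenance processes in order of increasing system modification.
ORDER = ("M1", "M2", "M3", "M4", "M5", "M6", "M7")

Handler = Callable[[str], bool]


def accommodate(fault: str,
                handlers: Dict[str, Handler]) -> Tuple[Optional[str], List[str]]:
    """Invoke processes in hierarchical order until one accommodates the fault.

    Returns the level that succeeded (or None) and the list of levels tried.
    Levels with no handler installed are skipped.
    """
    tried: List[str] = []
    for level in ORDER:
        handler = handlers.get(level)
        if handler is None:
            continue
        tried.append(level)
        if handler(fault):
            return level, tried
    return None, tried


# Hypothetical handlers for demonstration only.
DEMO_HANDLERS: Dict[str, Handler] = {
    "M1": lambda fault: fault == "masked",
    "M3": lambda fault: fault == "transient",
    "M5": lambda fault: fault == "permanent",
}

if __name__ == "__main__":
    print(accommodate("transient", DEMO_HANDLERS))
    print(accommodate("permanent", DEMO_HANDLERS))
```

Because a higher-order process is reached only after every lower one has been tried, the more costly processes are invoked only for the fault conditions that require them.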
Process M1 is valuable in enhancing the basic reliability of the
functional units of a computer, and it is especially needed for protecting
those circuits that control the execution of the higher-order maintenance
processes. Processes M2, M4, and M5 are essential to dynamic error control,
and process M3, which must follow M2 if it is employed, is probably of
value in space missions, in order to accommodate nonpermanent faults,
such as transient errors in logic networks due to radiation bursts and
power interruption, and data-sensitive errors in memory networks. Process
M6 represents a higher-order capability that is not essential to dynamic
error control, but which provides accommodation for more extreme error
conditions. Its employment inevitably calls for some reduction in
performance, hence process M7 must also be employed to some extent.
Under some circumstances, process M7 may stand alone; for example, if a
particular machine order is inoperable, it may be expedient simply to
reduce performance for some task, rather than reconfigure the machine
structure.
The addition of new processes brings new possibilities for error.
Some policies that may help reduce errors due to failures in the mainten-
ance processes are:
(1) Employ a maintenance process only when it is needed.
(2) Provide for remote human control of at least the initiation
of a maintenance process.
(3) Provide many easy exits from a maintenance process to some
stable (perhaps imperfect) operating configuration.
(4) Subdivide maintenance processes into small steps such that
each one has only a limited effect on the system.
The design and organization of general computational processes is,
of course, a highly developed art; but the design and organization of
the maintenance computational processes is not well developed, especially
for the present case, in which a high degree of autonomy is required.
Further research is recommended to develop techniques for the design of
these processes and the coordination of these processes with general
computation.
3. Approaches to System Structure
a. Introduction
In this section we consider a number of possible approaches to
the design of system structure, i.e., the assignment of functions to sub-
systems and the ordering of communication among subsystems.
The dominant qualities of the computers of interest, from a struc-
tural point of view, are the need for parallelism of computation within a re-
configurable structure and the distinctness of maintenance control. For
the various functions of both general and maintenance computations, there
is a choice as to the extent to which a given function is performed
exclusively in a given network type. We shall consider how this choice
appears in system structuring.
b. Approaches to Structural Parallelism and Functional
Specialization for General Computation
In order to see how parallelism may be employed both for computation
and for error control, it is instructive to examine the known schemes
of parallelism for computation alone.
In the design of the conventional, serial computer (due to
von Neumann), specialization of function is carried out to a high degree.
Thus, as illustrated in Fig. III-A-1, the functions of storage, processing,
input-output, and control are realized in special networks (or subsystems).
Figure III-A-2 illustrates three schemes for increasing the parallelism
of some of these functions that have been realized in machines built
within the past eight years. Figure III-A-2(a) illustrates parallelism
in storage units. An early example of its use was in the LARC computer, 70
in which a number of units operated with overlapping access cycles.
Figure III-A-2(b) illustrates parallelism in processors, and the distribu-
tion of control among processing units; an early example of this scheme
is the Gamma 60 computer. ss Figure III-A-2(c) illustrates parallelism
in a combined storage and processing function. Such a system is often
called an "associative" or "logic-in-memory" processor; Lee 17s conceived
a machine in which the combined storage and processing elements are connected
essentially in a one-dimensional array, and Slotnick 2as conceived a machine
(SOLOMON) using a two-dimensional array. In all of these systems, either
storage or control operates in an essentially serial-by-instruction mode.
All of these schemes, of course, may be realized with different degrees
of parallelism at the bit level.
FIG. III-A-1 SERIAL COMPUTER (von Neumann): input-output, store, control, processor
FIG. III-A-2 SCHEMES WITH LOCAL PARALLELISM: (a) storage-multiplex (e.g., LARC); (b) processor-multiplex (e.g., "Gamma 60"); (c) "associative" processor (one-dimensional: Lee; two-dimensional: "SOLOMON")
FIG. III-A-3 SCHEMES WITH GENERAL PARALLELISM: (a) polymorphic processor ("multiprocessor"); (b) iterative-structure processor (Holland)
Only two schemes of fully parallel processing have been
discussed in the technical literature. One is the "polymorphic" scheme $49
(usually the structure connoted by the term "multiprocessor"), illustrated
in Fig. III-A-3(a), in which the functional specialization of the von Neumann
computer is preserved. The other, illustrated in Fig. III-A-3(b), is the
iterative-structure processor (often called "the Holland machine"), Iss
in which storage, processing, and control functions are combined in a
cell and all cells are identical and regularly connected. In the poly-
morphic scheme, particular functional units that are to be combined may be
chosen freely and the paths connecting them will usually be fixed for a
whole computation, while in the iterative-structure machine the building of
new paths among cells occurs at every instruction in order to retrieve the
operands needed. The polymorphic scheme is superior with respect to
efficient use of hardware, and the iterative-structure scheme is superior
with respect to flexibility of reconfiguration. In both schemes, the
parallelism in structure may be employed either for concurrency of inde-
pendent or redundant computations, for reservation of spare parts, or for
combinations of these functions.
c. Factors of Module Size and Specialization
A critical factor governing system design is the size of the
basic unit of reconfiguration. It would seem to be an unnecessarily
artificial constraint to assume that this unit should be identical to some
traditional whole-function entity, such as a central processor or a register.
Furthermore, there is no need to make such a unit identical to the contents
of a single device package, since, in a multifunction monolithic device of
the size to be expected within the next two years, discarding an entire
module because of a single faulty output would be very wasteful of reserve
logical capability.
d. A Suggested Model
Because of the high storage capacity required for the missions
of interest, it is probably necessary to assume the continued use of
specialized memory arrays, in order to achieve high density and low power
consumption. These benefits apply both to magnetic memories and to
monolithic semiconductor memory arrays. (The latter appear increasingly
attractive for use in systems where volatility of information--i.e.,
loss of information with removal of power--is tolerable). A variety of
memory types is likely to be needed, including variable destructive-read
memories, variable nondestructive-read memories, and fixed-read memories.
In addition, a special processing memory as shown in Fig. III-A-2(c) may
be needed for some missions.
It is not clear to what extent a single structure can cover all
computational functions, but it is clear that certain basic operations
such as storage, counting, and addition occur both in processing and in
control functions. Also, there is usually substantial freedom in the
structuring of the networks that realize control functions, so that there
could be a substantial similarity in the use of basic operations. There
could, therefore, be a substantial sharing of equipment; hence, an a priori
separation of processing and control functions is not justified.
Since input-output functions have very special characteristics
related to selection, formatting, and buffering, a separation of input-
output functions from other system functions appears justifiable.
The above considerations are embodied in a simple model of a
parallel, reconfigurable computer illustrated in Fig. III-A-4. Sets of
storage modules of various types S_1, S_2, ... and, optionally, a processing
store S_P are connected to X_c, a central exchange, by a multiple-channel
switch or directory network. Similarly, sets of logic modules are
connected to X_c by an interconnection network. The sizes of the storage
and logic modules are not specified at this point; thus, for example, a
given memory address range may cover a number of storage modules. Finally,
a number of input-output interface modules are connected by a local
interface exchange to X_c, and by a terminal exchange X_T to the external terminals.
[FIG. III-A-4 SCHEME WITH SMALL-MODULE PARALLELISM (diagram not reproduced; labeled blocks: switch/directory, interface exchange, X_c central exchange, terminals)]
This model is closer to a polymorphic structure than to an
iterative structure. It may be expected that iterative structures will
be advantageous for the realization of the interconnection networks
and for other inherently iterative logical functions, due to their
simplicity of testing and reconfiguration. The model does not yet reflect
consideration of the maintenance computations. We consider these next.
e. Approaches to Structural Specialization for Maintenance
Computation
Investigations have been made (e.g., by Manning) 197 of the
extent to which a conventional computer is capable of diagnosing its
internal faults. It has been found that the fraction of self-diagnosable
faults is high, but that there are some faults that escape diagnosis
either because the machine is blocked by the fault or because a given unit
is logically essential to its own diagnosis. For autonomous operation it
is therefore necessary to provide some equipment for the execution of the
maintenance processes (i.e., diagnosis and reconfiguration) whose operation
is not dependent upon the equipment being maintained.
A straightforward approach studied by Terris SoS is a system
composed of a conventional serial-process computer, employed for general
computation, in combination with a special primitive computer (called by
Terris the "master machine"), employed for diagnosing faults and switching
in spare parts within the first computer. Terris's design may be repre-
sented as in Fig. III-A-5, in which M is the master machine, PG is the
general computer processor and (SG, SM) is a single memory storing both
general program variables and the maintenance program. The general
computer accomplishes much of its own diagnosis, and the master machine
serves to diagnose and remedy failures in a few critical operations in
the general computer. The maintenance process may be considered a form
of bootstrapping.
An interesting variation on the basic idea is described by
Forbes, et al.,S4, 1 in which a single bit-parallel computer can be
partitioned into identical bit-parallel computers, each capable of serving
as a diagnosing computer, and each with its own master machine.
As the complexity of the general computer grows, the size of
the diagnostic and repair program will also grow. Furthermore, it is
desirable to give special protection to the maintenance program, to avoid
both accidental destruction of information and blockage of access to the
information by failures in the general computer. Therefore, it would seem
prudent to provide the maintenance section with its own program store,
perhaps with most of it in the form of a nonvariable memory. Such a
system is shown in Fig. III-A-6. In the figure, the various kinds of
signals exchanged by maintenance logic and general computer logic--i.e.,
error alarm, test data, response, and configuration control--are dis-
tinguished as separate channels.
[FIG. III-A-5 SYSTEM WITH SELF-DIAGNOSTIC COMPUTER AND MASTER MACHINE CONTROLLER]
[FIG. III-A-6 DISTINCT-MAINTENANCE-CENTER SYSTEM WITH SEPARATE WORKING AND MAINTENANCE COMPUTERS]
[FIG. III-A-7 POLYMORPHIC SYSTEMS WITH FLOATING MAINTENANCE CONTROL]
In extending the scheme to polymorphic (multiprocessing)
computers, there is a choice as to whether the maintenance computer should
exist as a distinct entity, or whether the assignment of maintenance
functions to functional units should be subject to change even to the
degree of flexibility that is provided for general computational functions
in modern polymorphic designs. A number of discussions of this idea
have appeared (e.g., Joseph) 14S and a reliability model has been studied by
Welch. S2s A sketch of such a system in terms of the polymorphic model
previously developed is given in Fig. III-A-7. A special subsystem
labeled "Referee" is provided, which has the task of assigning the main-
tenance role to a particular subcomputer. In this scheme the distinction
between storage modules for maintenance and for general computation is
preserved, but free access to all storage modules is provided to all
logic subprocessors. Separate communication between modules for computation
and for maintenance is provided by exchanges XS and XM respectively,
although it is not clear that such separation is essential.
The choice between the two approaches is not an obvious one.
Use of a special machine permits use of special measures to increase the
reliability of the very critical maintenance function, i.e., reduction of
its size (e.g., by making it highly serial) and application of high-order
redundancy fault masking. Allowing the maintenance function to "float"
among a set of identical processors has the advantages that maintenance
computations may be performed with a higher logical power than in a
primitive master machine, that a high order of redundancy is still
available for protection of maintenance control, and that the computer
has a homogeneous structure. Two design problems exist for which the
costs of solution are not presently known; these are (1) the problem of
protection against those faults within a processor that can block the
transfer of maintenance authority, and (2) the provision of intercommuni-
cation among all processors for the special diagnostic and control
information.
Future investigations of system organization should consider
how the merits of the two approaches may be realized in an integrated
structure.
f. Coordination of Information Types
It has been indicated that a self-diagnosing, reconfigurable
computer should have a very uniform structure. At the same time, the
number of information types that must be processed is very large. Some
of these types are: instructions, operands, memory, arithmetic, control,
error indication, diagnostic tests, test responses, and configuration
control.
Special care must be taken in order to avoid a proliferation
of special codes, formats, and data paths within a computer. Avizienis 14
has indicated the benefits of uniformity of coding for transfer of operand
information among the various functional sections of a diagnosable computer.
Further work is needed to achieve uniformity among all the various types.
In addition to the problem of checking the correctness of
information transfer, there is also the problem of ascertaining that a
desired transfer did in fact occur. This may be facilitated by combining
information messages into higher-order strings, perhaps containing a
mixture of information types.
g. Problems of Subsystem Design
The novel structural features described in previous sections,
together with new constraints and freedoms associated with developments in
device technology, bring new problems of subsystem and network design.
In the next sections of this chapter several of these problems will be
explored. It may be expected that further investigations of these and other
subsystem problems will place new requirements on the overall system
structure that cannot be anticipated at this time.
Some of the important questions about subsystem structures are:
(1) Module design: What shall be the sizes and the
functions of the various modules? How shall fault-
masking, error-detection, and fault-diagnosis aids
be incorporated? How shall the reprogrammability
and reconfigurability of modules be accomplished?
(2) Intercommunication-network design: How shall the
network be designed so as to achieve high flexibility
and programmability? How shall fault avoidance and
fault masking be incorporated?
(3) Maintenance-control network design: How can the size
of the network be reduced, while preserving capability
for adequate bootstrapping of a complex computer? What
is the best combination of static and dynamic fault
masking to apply within the network?
B. Tests for Diagnosis of Fault Conditions
1. Introduction
A high degree of equipment reconfigurability implies the availability
of accurate information as to the actual functional capability of the
equipment. In ground-based computers, maintenance procedures may be
conducted by intelligent technicians equipped with catalogs of fault
syndromes, and capable of probing the structure of the computer at a
great number of points. The spaceborne missions of interest to NASA may be
manned or unmanned, and some useful communications with Earth may or may
not exist. In all of these cases some autonomous on-board capability for
the diagnosis of faults is either essential or extremely helpful, because
of limited accessibility to test-points or because of a shortage of time.
Limitations in storage capacity and accessibility in a spaceborne computer
put high requirements on the completeness and efficiency of test schedules.
Unfortunately, the test procedures that can be designed with present
knowledge cannot be considered adequate.
In this section, the problems of designing test schedules for
fault diagnosis will be considered in some detail. In parts III-B-2 and
III-B-3, tests for combinational networks are considered, in which the
choice of successive tests is unconditional or conditional on the responses
to test inputs. A number of new results are included. In part III-B-3,
the present state of the art of diagnosis is assessed and directions for
further development are suggested.
2. Fault Diagnosis in Combinational Circuits Using Fixed Test
Schedules
a. Introduction
To determine whether a network of digital-logic and storage
elements is working properly, one may apply to the network all possible
input combinations and sequences, and compare the resultant outputs with
the corresponding correct outputs--using, for example, a faultless version
of the same network. Any discrepancies indicate the presence of a fault.
Moreover, if the user is armed with a table showing which faults give
rise to which patterns of discrepancies, he can readily distinguish
any fault from the others--at least within a subset whose effects on
the network output are identical. This procedure is perfectly valid for
all types of digital networks--combinational and sequential, single and
multiple output, gate-type and branch-type, binary and nonbinary, etc.--
and all families of faults which have a more or less permanent effect
on the behavior of the network.
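The exhaustive procedure just described can be sketched in a few lines, assuming that both the network under test and a faultless reference are available as functions of n binary inputs. The names `exhaustive_test`, `majority`, and `majority_x1_stuck_at_0` are hypothetical illustrations and do not come from the report.

```python
from itertools import product

def exhaustive_test(network, reference, n_inputs):
    """Apply all 2**n input combinations and return those on which
    the network under test disagrees with the faultless reference."""
    return [x for x in product((0, 1), repeat=n_inputs)
            if network(*x) != reference(*x)]

# Hypothetical example: a 3-input majority gate (the reference) and a
# copy of it whose x1 input is stuck at 0 (the network under test).
def majority(x3, x2, x1):
    return int(x3 + x2 + x1 >= 2)

def majority_x1_stuck_at_0(x3, x2, x1):
    return majority(x3, x2, 0)

discrepancies = exhaustive_test(majority_x1_stuck_at_0, majority, 3)
print(discrepancies)  # input combinations (x3, x2, x1) that expose the fault
```

A non-empty list indicates a fault; the pattern of discrepancies is what a fault table would be indexed by.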
Such exhaustive tests are usually much too long to be
practical, however, and except in a few special cases they are not
at all necessary. It is normally possible to test a network for the
presence, or for the presence and location, of faults by a schedule of tests
which is shorter by one to several orders of magnitude than an exhaustive
test.
In this part we consider the problem of devising economical
test schedules for the diagnosis of fixed (i.e., nontransient) faults in
an arbitrary combinational switching network. By "diagnosis" we mean to
include the three separate cases when (a) any of a prescribed list of
faults is to be merely detected, (b) the particular fault is to be located--
that is, we are to determine which fault has occurred--and (c) the fault
is located, but only to within the module (package or subnetwork) in
which it occurred.
After defining these three minimization problems in mathematical
terms, we proceed to a formal solution of each, for the case where it is
assumed that the test schedule is fixed--that is, when the choice of the
succession of test inputs which are applied to the network does not depend
in any way upon the outcome of the tests. It is then shown (Part 3) that
shorter test schedules can be expected for fault location when this
assumption is not made--that is, when the choice of which test input to
apply at each step in the testing is allowed to depend upon the outcomes
of previous tests. A solution is offered for this case of serial testing.
In describing these solutions, principal attention is given to single-
output, binary networks; extension to the multioutput case is not difficult,
and is described later. Nonbinary networks can also be handled easily.
The fixed-schedule solutions offered here may be considered
to be satisfactory for derivation on a digital computer, for any
combinational network having up to eight or ten inputs, several outputs,
and about 100 faults. While some much larger networks can also be
handled, procedures are presently lacking for generating even reasonably
good test schedules for very large arbitrary networks. It is this
remaining problem, as well as the problem of fault diagnosis in sequential
networks, which may be considered to be the most important subjects for
further research in this area. For a discussion of this problem, see
Sec. II-A-3.
Most of the procedures to be described below are contained
in the literature, but in a context having nothing to do with fault
diagnosis. Consequently, the pertinent parts of them have been collected
here, using a common notation and viewpoint with some original extensions
and evaluations.
b. Formulation of the Problem
Given a single-output combinational network, there is no
conceptual difficulty in imagining that an analysis of it has been con-
ducted, in order to determine the effect on its output of each of various
hypothetical faults. The results of such an analysis may be expressed
in a multi-output table of combinations such as the one below.

    x_n ... x_2 x_1 |  f   f_1  f_2  ...  f_j  ...
    ----------------+-----------------------------
     0  ...  0   0  |  0    1    0   ...   0   ...
     0  ...  0   1  |  1    1    0   ...   0   ...
     0  ...  1   0  |  0    1    0   ...   1   ...
     .       .   .  |  .    .    .
     1  ...  1   1  |  0    0    0   ...

The x_1, x_2, ..., x_n are the input variables to the network; f = f(x_1,
x_2, ..., x_n) is the fault-free (correct) output; and f_1, f_2, ..., f_j, ...
are the erroneous outputs, each corresponding to one of the possible faults
which the desired diagnosis schedule is supposed to check. The left side
of the table simply lists all 2^n possible combinations of the input
variables. Note that no assumptions have (yet) been made about the nature
of the faults--whether they are due to isolated or multiple component
failures, to open or short-circuited devices or conductors, to short
circuits between separate parts of the network, or to either sudden failure
or slow degradation.
It will be convenient to reduce this table somewhat before
stating the problem formally, as follows.
(1) Suppose some column f_j is identical to column f. This
indicates that the jth fault has no effect on the network
output, so that there is no way--and in fact, no
need--to detect its occurrence. The column f_j may
therefore be deleted from the table. This type of
condition can occur either if the network is redundant
or if certain of the faults cause local logical changes
which leave the output the same.
(2) Suppose that two columns f_j and f_k are identical. This
indicates that two different faults have the same effect
on the network output, and for purposes of detection and
location they must be treated together. One of the two
columns may therefore be deleted from the table. It is
easy to imagine how this condition could arise in practice.
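The two table reductions above can be sketched as follows. This is only an illustration under stated assumptions: the function name `reduce_fault_table`, the fault names A to D, and the 2-input sample columns are all hypothetical.

```python
def reduce_fault_table(correct, faulty):
    """Apply simplifications (1) and (2) to a fault table.

    correct -- tuple of output bits of the fault-free network, one per input row
    faulty  -- dict mapping a fault name to its erroneous output column
    Returns (columns, groups, undetectable): the distinct columns (fault-free
    column first), the fault names sharing each column, and the faults whose
    column equals the fault-free column.
    """
    columns = [tuple(correct)]
    groups = [["fault-free"]]
    undetectable = []
    for name, col in faulty.items():
        col = tuple(col)
        if col == columns[0]:
            undetectable.append(name)                 # simplification (1)
        elif col in columns:
            groups[columns.index(col)].append(name)   # simplification (2)
        else:
            columns.append(col)
            groups.append([name])
    return columns, groups, undetectable

# Hypothetical 2-input example; rows are the inputs 00, 01, 10, 11.
correct = (0, 1, 0, 0)
faulty = {
    "A": (1, 1, 1, 0),
    "B": (0, 0, 0, 0),
    "C": (0, 1, 0, 0),   # no effect on the output
    "D": (1, 1, 1, 0),   # same effect as fault A
}
columns, groups, undetectable = reduce_fault_table(correct, faulty)
```

Keeping the groups alongside the columns records which faults must be "treated together" after the merge.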
After any such deletions, all of the columns f = f_0, f_1,
f_2, ..., f_m (say, for m distinguishable faults) will be different. We
may collect these m + 1 columns into a 2^n-row binary fault matrix or
fault table F:

           0  1  0  ...
           1  1  0  ...
    F  =   0  1  0  ...
           .  .  .
           0  0  0  ...

If a fixed schedule of input tests is to be employed to check
a possibly faulty network, we are interested in economizing on the number
of different test inputs--i.e., the length--of such a test. The problem
is therefore one of selecting a minimal subset of rows of the matrix F
that preserves a certain degree of distinguishability among the columns.
More precisely, for the detection of the presence of any of the m faults,
we want to delete from F as many rows as possible, so that:
The first column is different from all other columns.
For the location (as well as the detection) of the m faults, we want to
delete from F as many rows as possible so that:
Every column is different from every other column.
Finally, a third minimization problem of interest arises from
the common situation in which the faults are classed according to the
module, package, or subnetwork in which they occur. Thus, if it is
desired to locate a fault (column of F) only to within its preassigned
module class, we want to delete from F as many rows as possible so that:
Every two columns which fall in different module classes
are different.*
For this condition, the f column should be treated as belonging to a
separate module class.
* If columns f_j and f_k in simplification (2) above fall in separate
module classes, neither should be deleted, but this requirement should
then be relaxed to exclude this particular column pair.
These three problems of fault detection, fault location, and
fault location-to-within-modules correspond conceptually to the problems
of error detection, error correction, and error location, respectively,
in error-checking codes. Unfortunately, this appears to be about as far
as this analogy can be carried.
One assumption must be made about the nature of the faults if a
test schedule is to have meaning: we must assume that any fault which is
to be detected or located has a duration in the network which is no less
than the interval of time over which the pertinent test inputs are applied.
In practice, this means that the diagnostic methods described here are
limited to fixed (i.e., permanent and semipermanent) faults. Some other
means must be employed to protect the network against the effects of
any transient or intermittent faults which are deemed likely to occur.
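The three distinguishability conditions on F stated above can be checked mechanically for any candidate subset of rows. The sketch below assumes F is stored as a list of rows with column 0 the fault-free output; the function name `satisfies`, its parameters, and the four-row sample table (the rows shown in the illustrative fault matrix) are choices made here for illustration only.

```python
from itertools import combinations

def satisfies(F, rows, module_of=None, detect_only=False):
    """Check whether the retained test rows keep the required columns distinct.

    F           -- list of rows, each a sequence of bits; column 0 is fault-free
    rows        -- indices of the rows (tests) that are retained
    detect_only -- if True, only column 0 must differ from every other column
    module_of   -- optional list giving a module class per column; pairs of
                   columns in the same class are then not required to differ
    """
    ncols = len(F[0])
    signature = [tuple(F[r][j] for r in rows) for j in range(ncols)]
    for i, j in combinations(range(ncols), 2):
        if detect_only and i != 0:
            continue
        if module_of is not None and module_of[i] == module_of[j]:
            continue
        if signature[i] == signature[j]:
            return False
    return True

# Columns f_0, f_1, f_2 for four input rows.
F = [(0, 1, 0),
     (1, 1, 0),
     (0, 1, 0),
     (0, 0, 0)]
```

With no options the function tests the location condition; `detect_only=True` gives the detection condition, and `module_of` gives location to within modules (the fault-free column being given a class of its own).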
c. Formal Solution Using the G-matrix
We now show how the fault-detection problem and both fault
location problems may be converted to familiar switching minimization
problems.
Since it is only the distinguishability of certain columns of F
which is at stake, we may conveniently express the distinctness condition
in terms of a matrix G, each of whose columns is the modulo-2 sum of a
different pair of columns of F that are supposed to remain different.
That is, considering the same row of both F and G, a 1 is entered in the
column of G labeled with the pair (i, j) if the digits in the two columns
of F labeled f_i and f_j are different, and 0 otherwise. Under deletion
of corresponding rows of F and G, two columns of F will then remain
distinct if and only if the corresponding single column of G does not
become a column of all 0's. Thus, the three conditions on F stated in
the last section for fault detection, location, and location to within
modules may be expressed as a single condition on the G-matrix; namely:
Condition X: Delete from G as many rows as possible, so that
every column is non-zero.
In the case of fault detection, the G-matrix (G_D, say) has
just m columns, one for each column pair (f_0, f_j) in F (j = 1, 2, ..., m).
For the example used above, we have

             01  02  ...  0m
              1   0  ...
    G_D  =    0   1  ...
              1   0  ...
              .   .
              0   0  ...

For fault location, the matrix G_L has (m + 1 choose 2) = m(m + 1)/2 columns, one for
each column pair (f_i, f_j) in F (i, j = 0, 1, 2, ..., m; i < j):

             01  02  12  ...  m-1,m
              1   0   1  ...
    G_L  =    0   1   1  ...
              1   0   1  ...
              .   .   .
              0   0   0  ...

In similar fashion, the matrix G_M for fault location to within
modules is of intermediate width, having one column for each pair of
columns in F which belong to different module classes.
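The construction of the three G-matrices can be sketched as below. The function name `g_matrix`, the `mode` strings, and the `module_of` argument are assumptions of this sketch; the sample F holds the four rows shown for columns f_0, f_1, f_2 in the illustrative fault matrix.

```python
from itertools import combinations

def g_matrix(F, mode="location", module_of=None):
    """Return (pairs, G): the column pairs of F that must stay distinct and
    the matrix whose columns are the modulo-2 sums of those pairs."""
    ncols = len(F[0])
    if mode == "detection":
        pairs = [(0, j) for j in range(1, ncols)]
    elif mode == "location":
        pairs = list(combinations(range(ncols), 2))
    elif mode == "module":
        pairs = [(i, j) for i, j in combinations(range(ncols), 2)
                 if module_of[i] != module_of[j]]
    else:
        raise ValueError("mode must be 'detection', 'location' or 'module'")
    G = [[row[i] ^ row[j] for i, j in pairs] for row in F]
    return pairs, G

F = [(0, 1, 0),
     (1, 1, 0),
     (0, 1, 0),
     (0, 0, 0)]

pairs_d, G_D = g_matrix(F, "detection")
pairs_l, G_L = g_matrix(F, "location")
```

For these four rows the results agree with the leading entries of G_D and G_L displayed above.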
Condition X expresses precisely the problem of finding a
minimal prime-implicant cover of a given switching function from its
prime-implicant table. 246 Good solutions to this problem are well known,
and have been programmed for execution on computers for quite large
tables. 254, 105, 44 We describe here a version of this procedure which
is adequate for solving only simple problems by hand, but which never-
theless illustrates well the two main steps in all of the programmed
algorithms:
(a) Simplification of the table to delete certain superfluous
rows and columns
(b) Final selection of one or more minimal row subsets from
the residual table. 206
The justification of the following simplifications (a) is
fairly obvious:
(1) Delete any row whose 1's all fall in the same columns as
the 1's in some other row. That is, delete any row which
is covered by, or is the same as, some other row.
(2) Delete any column which has 1's in all of the rows in
which another column has 1's. That is, delete any column
which covers, or is the same as, some other column.
These steps may be applied in any order until neither is
applicable. The resultant matrix (G*, say) has distinct rows and distinct
columns; also, no row covers another row, and no column covers another
column.
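Simplification steps (1) and (2) can be sketched as an iteration over sets, assuming each row of G is stored as the set of columns in which it has a 1. The function name `simplify` and the tie-breaking rule (of two identical rows, the earlier is kept) are choices made for this sketch. The sample data are the rows of the fault-detection G-matrix of the example worked later in this section.

```python
def simplify(rows):
    """rows: dict mapping a row label to the set of columns where it has a 1.
    Repeats steps (1) and (2) until neither applies; returns the reduced dict."""
    rows = {r: set(cs) for r, cs in rows.items()}
    changed = True
    while changed:
        changed = False
        # Step (1): drop a row covered by, or equal to, another remaining row.
        # Reverse order, so that of two identical rows the earlier one is kept.
        for r in reversed(list(rows)):
            if any(s != r and rows[r] <= rows[s] for s in rows):
                del rows[r]
                changed = True
        # Step (2): drop a column that covers, or equals, another column.
        cols = {}
        for r, cs in rows.items():
            for c in cs:
                cols.setdefault(c, set()).add(r)
        for c in sorted(cols):
            if any(d != c and cols[d] <= cols[c] for d in cols):
                del cols[c]
                for cs in rows.values():
                    cs.discard(c)
                changed = True
    return rows

G_D = {"a": {1, 3, 4, 6, 7}, "b": {2, 3, 4, 5, 7}, "c": {1, 3, 5, 7},
       "d": {3, 5, 6, 7},    "e": {1},             "f": {2, 3, 5, 6, 7},
       "g": {4, 5, 6, 7},    "h": {3, 4, 5}}
reduced = simplify(G_D)
```

Because the steps may be applied in any order, intermediate matrices can differ from a hand derivation, but the residual table is equivalent.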
The selection (b) of a minimal row subset S is made by first
labeling the rows of the simplified matrix G* with binary variables a,
b, c, ... each of which is to have the value 1 if its row is to be
included in S, otherwise 0. We now form a Boolean function L(G*) as a
product of sums, one sum per column of G*, such that each sum contains
just those row variables assigned to rows in which the corresponding
column of G* has 1's. The function L(G*) will therefore have the value
1 when and only when a sufficient subset of row variables a, b, c, ...
have the value 1--namely, when every column of G* is represented.
Expansion of this product of sums into a sum of products then
expresses as individual products all of the alternative row subsets which
satisfy the column condition. This allows one to select for S any one of
the products which has the least number of variables.
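The product-of-sums expansion can be sketched as below, with each product held as a set of row labels and absorption applied after every multiplication. The function name `minimal_row_subsets` is hypothetical. The sample data are the six rows of the reduced fault-location matrix from the example worked later in this section, with its eight columns numbered 1 to 8.

```python
def minimal_row_subsets(rows):
    """rows: dict mapping a row label to the set of columns where it has a 1.
    Expands the product of column sums into a sum of products and returns
    every product with the fewest variables, as sorted lists of labels."""
    columns = sorted({c for cs in rows.values() for c in cs})
    products = {frozenset()}
    for c in columns:
        clause = [r for r in rows if c in rows[r]]      # one sum per column
        expanded = {p | {r} for p in products for r in clause}
        # Absorption: discard any product that strictly contains another.
        products = {p for p in expanded if not any(q < p for q in expanded)}
    fewest = min(len(p) for p in products)
    return sorted(sorted(p) for p in products if len(p) == fewest)

G_star = {"a": {1, 4, 7}, "b": {2, 8}, "c": {1, 3, 6, 8},
          "f": {2, 3, 6}, "g": {3, 4, 5}, "h": {5, 6, 7}}
solutions = minimal_row_subsets(G_star)
print(solutions)
```

Any one of the returned subsets may be taken as S; the expansion grows quickly with table size, which is why the simplification step is applied first.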
As an example, consider the following matrix for n = 3, m = 7:

               0 1 2 3 4 5 6 7
               0 1 0 1 1 0 1 1    a
               1 1 0 0 0 0 1 0    b
               0 1 0 1 0 1 0 1    c
    F  =       0 0 0 1 0 1 1 1    d
               1 0 1 1 1 1 1 1    e
               1 1 0 0 1 0 0 0    f
               0 0 0 0 1 1 1 1    g
               0 0 0 1 1 1 0 0    h

For fault detection we list the column sums (modulo 2) for the seven
column pairs (0,1), (0,2), ... (0,7), to get

               1 2 3 4 5 6 7
               1 0 1 1 0 1 1    a
               0 1 1 1 1 0 1    b
               1 0 1 0 1 0 1    c
    G_D  =     0 0 1 0 1 1 1    d
               1 0 0 0 0 0 0    e
               0 1 1 0 1 1 1    f
               0 0 0 1 1 1 1    g
               0 0 1 1 1 0 0    h

The simplification step (1) first allows deletion of rows d and e, which
are covered by rows f and a, respectively; then columns 3, 5, and 7, which
cover column 2, may also be deleted by simplification step (2). This
leaves:

               1 2 4 6
               1 0 1 1    a
               0 1 1 0    b
               1 0 0 0    c
               0 1 0 1    f
               0 0 1 1    g
               0 0 1 0    h

Rows c, g, and h (covered by row a), then the last two columns, may also
be eliminated:

    G*_D  =    1 0    a
               0 1    b = f

With the rows labeled as indicated, we obtain simply

    L(G*_D) = a (b v f)
            = ab v af

One minimal set S therefore consists of rows a and b. The minimal
F-matrix is just

    FDmin  =   0 1 0 1 1 0 1 1
               1 1 0 0 0 0 1 0

and the list of input tests to be applied is simply

         x_3 x_2 x_1
    a     0   0   0
    b     0   0   1 .
For fault location, we sum all possible column pairs of F. The
same example of an F-matrix yields the following G-matrix G_L:

         01 02 03 04 05 06 07 12 13 14 15 16 17 23 24 25 26 27 34 35 36 37 45 46 47 56 57 67
    a    1  0  1  1  0  1  1  1  0  0  1  0  0  1  1  0  1  1  0  1  0  0  1  0  0  1  1  0
    b    0  1  1  1  1  0  1  1  1  1  1  0  1  0  0  0  1  0  0  0  1  0  0  1  0  1  0  1
    c    1  0  1  0  1  0  1  1  0  1  0  1  0  1  0  1  0  1  1  0  1  0  1  0  1  1  0  1
    d    0  0  1  0  1  1  1  0  1  0  1  1  1  1  0  1  1  1  1  0  0  0  1  1  1  0  0  0
    e    1  0  0  0  0  0  0  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    f    0  1  1  0  1  1  1  1  1  0  1  1  1  0  1  0  0  0  1  0  0  0  1  1  1  0  0  0
    g    0  0  0  1  1  1  1  0  0  1  1  1  1  0  1  1  1  1  1  1  1  1  0  0  0  0  0  0
    h    0  0  1  1  1  0  0  0  1  1  1  0  0  1  1  1  0  0  0  0  1  1  0  1  1  1  1  0

Using first the columns of lowest weight, most of the other columns may be
deleted. Then rows d and e may be eliminated to give:

         01 02 34 35 37 47 57 67
    a    1  0  0  1  0  0  1  0
    b    0  1  0  0  0  0  0  1
    c    1  0  1  0  0  1  0  1
    f    0  1  1  0  0  1  0  0
    g    0  0  1  1  1  0  0  0
    h    0  0  0  0  1  1  1  0

Thus,

    L(G*_L) = (a v c)(b v f)(c v f v g)(a v g)(g v h)(c v f v h)(a v h)(b v c)
            = (a v cgh)(b v fc)(c v f v gh)(g v h)
            = abcg v abch v abgh v ...

A minimal subset S of input tests for fault location therefore consists
of rows a, b, c, and g. The minimal F-matrix is therefore:

    0 1 0 1 1 0 1 1
    1 1 0 0 0 0 1 0
    0 1 0 1 0 1 0 1
    0 0 0 0 1 1 1 1

which corresponds to the schedule of test inputs:

         x_3 x_2 x_1
    a     0   0   0
    b     0   0   1
    c     0   1   0
    g     1   1   0 .
Both McCluskey 2°s and Gill *°4 have provided solutions to the
fault-location problem, in the course of solving a seemingly unrelated
problem in pattern recognition. The solution described above follows
that of McCluskey. Gill's procedure is recursive. He shows how to
generate all possible solution subsets of rows of F for the first k
columns only, from a listing of all such subsets for the first k-i
columns. (Nonminimal subsets are included, but any subset which contains
another is not listed.) The list for k = 2 is easily formed by inspection,
and the procedure is repeated successively for k = 3, 4, ... m + 1. The
method involves a tremendous amount of bookkeeping, even for small problems,
and cannot be considered to be very practical for present purposes.
For location of faults to within module classes, suppose that in
the above F-matrix faults 1, 2, and 3 are associated with the same module,
as are faults 4, 5, 6, and 7. The G-matrix for this case (G_M, say) is
therefore the same as G_L, except that certain columns representing pairs
of faults within the same module need not be included: (12), (13), (23),
(45), .. (67). This leaves a narrower G-matrix G_M:

         01 02 03 04 05 06 07 14 15 16 17 24 25 26 27 34 35 36 37
    a    1  0  1  1  0  1  1  0  1  0  0  1  0  1  1  0  1  0  0
    b    0  1  1  1  1  0  1  1  1  0  1  0  0  1  0  0  0  1  0
    c    1  0  1  0  1  0  1  1  0  1  0  0  1  0  1  1  0  1  0
    d    0  0  1  0  1  1  1  0  1  1  1  0  1  1  1  1  0  0  0
    e    1  0  0  0  0  0  0  1  1  1  1  0  0  0  0  0  0  0  0
    f    0  1  1  0  1  1  1  0  1  1  1  1  0  0  0  1  0  0  0
    g    0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
    h    0  0  1  1  1  0  0  1  1  0  0  1  1  0  0  0  0  1  1

After one cycle of column and row deletions, we obtain the matrix

    1 0 0 1 0    a
    1 0 1 0 0    c
    0 1 1 0 0    f
    0 0 1 1 1    g

Subsequent simplifications yield

               1 0 0    a = c
    G*_M  =    0 1 0    f
               0 0 1    g

Hence

    L(G*_M) = (a v c)fg
            = afg v cfg

Thus, a minimal F-matrix is

    FMmin  =   0 1 0 1 1 0 1 1
               1 1 0 0 1 0 0 0
               0 0 0 0 1 1 1 1

and a minimal test schedule is

         x_3 x_2 x_1
    a     0   0   0
    f     1   0   1
    g     1   1   0.
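The three worked examples can be cross-checked by exhaustive search over row subsets. This is not the G-matrix procedure itself, only an independent brute-force check that is practical for a table this small. The names `distinct`, `detects`, `locates`, `locates_module`, and `smallest_schedules` are hypothetical; F is the example matrix with rows a to h, and MODULE encodes the module classes assumed in the text (fault-free column alone, faults 1 to 3, faults 4 to 7).

```python
from itertools import combinations

F = {"a": "01011011", "b": "11000010", "c": "01010101", "d": "00010111",
     "e": "10111111", "f": "11001000", "g": "00001111", "h": "00011100"}
MODULE = [0, 1, 1, 1, 2, 2, 2, 2]   # class of each column 0..7

def distinct(tests, i, j):
    """True if columns i and j of F differ in at least one retained row."""
    return any(F[t][i] != F[t][j] for t in tests)

def detects(tests):
    return all(distinct(tests, 0, j) for j in range(1, 8))

def locates(tests):
    return all(distinct(tests, i, j) for i, j in combinations(range(8), 2))

def locates_module(tests):
    return all(distinct(tests, i, j)
               for i, j in combinations(range(8), 2) if MODULE[i] != MODULE[j])

def smallest_schedules(condition):
    """All row subsets of the smallest size that satisfy the condition."""
    for k in range(1, len(F) + 1):
        hits = [c for c in combinations(F, k) if condition(c)]
        if hits:
            return hits
    return []

detection = smallest_schedules(detects)
location = smallest_schedules(locates)
module = smallest_schedules(locates_module)
```

The search confirms schedule lengths of two, four, and three tests for detection, location, and location to within modules, in agreement with the derivations above.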
It should also be pointed out that the procedure described
above involving the G-matrix can also be applied to a fourth problem in
fault diagnosis--namely, the problem of testing for the presence of one
particular fault (with output f_k, say) which one might suspect to have
occurred, it being already known that some fault has occurred. 41 In
this case it is only necessary to form G from those column pairs (i, k),
i = 1, 2, ..., k - 1, k + 1, ..., m. The matrix G will then have m - 1
columns. This situation is identical to fault detection, except that
attention is focused on column k instead of column 0 of F.
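As an illustration of this fourth problem, the sketch below takes the example F-matrix, supposes that fault 3 is suspected (the choice k = 3 is arbitrary), and finds by brute force the shortest schedules that distinguish column k from every other fault column. The function names are hypothetical, and exhaustive search stands in here for the G-matrix reduction.

```python
from itertools import combinations

F = {"a": "01011011", "b": "11000010", "c": "01010101", "d": "00010111",
     "e": "10111111", "f": "11001000", "g": "00001111", "h": "00011100"}

def confirms_fault(tests, k, m=7):
    """True if the retained rows distinguish column k from each of the
    other fault columns 1..m (column 0 is excluded: a fault is known)."""
    return all(any(F[t][i] != F[t][k] for t in tests)
               for i in range(1, m + 1) if i != k)

def smallest_schedules(k):
    for size in range(1, len(F) + 1):
        hits = [c for c in combinations(F, size) if confirms_fault(c, k)]
        if hits:
            return hits
    return []

schedules = smallest_schedules(3)
print(schedules)
```

Only m - 1 = 6 column pairs are involved, so the schedule can be shorter than the four tests needed for full fault location.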
d. Simplified Solution Using the W-Matrix for Fault Location
The above example seems to be typical of most fault-location
problems, in that considerable simplifications can be made in the matrix
G. This is to say that the distinctness of most of the column pairs in
F is taken care of automatically, if only a certain smaller subset of
column pairs can be guaranteed to be distinct. One way to identify
these critical column pairs is to display the weights (number of 1's)
of the columns of G in the (m + 1)-by-(m + 1) matrix

    W = F^t F,

in which F^t indicates the transpose of F and matrix multiplication is
carried out with exclusive-OR (modulo-2) addition used in place of
element multiplication. Each off-diagonal entry w_ij of W is therefore
the number of 1's in that column of G corresponding to column pair (i,j)
in F. (The matrix is symmetric, so only the upper triangle of entries
need be calculated.)
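A minimal sketch of this computation, assuming F is stored as a list of bit strings (one per row); the function name `w_matrix` is hypothetical. Each entry is formed exactly as described: the element "product" is an exclusive-OR, and the results are summed over the rows. The data are the rows a to h of the example F-matrix.

```python
F = ["01011011",   # a
     "11000010",   # b
     "01010101",   # c
     "00010111",   # d
     "10111111",   # e
     "11001000",   # f
     "00001111",   # g
     "00011100"]   # h

def w_matrix(F):
    """w[i][j] = number of rows in which columns i and j of F differ."""
    ncols = len(F[0])
    return [[sum(row[i] != row[j] for row in F) for j in range(ncols)]
            for i in range(ncols)]

W = w_matrix(F)
for i, row in enumerate(W):
    # Print only the upper triangle, since W is symmetric.
    print(" " * (3 * (i + 1)) + " ".join(f"{w:2d}" for w in row[i + 1:]))
```

Each entry is simply the Hamming distance between two columns of F, so the diagonal is zero.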
For the running example, this W-matrix takes the form

             0  1  2  3  4  5  6  7
        0    -  3  2  6  4  6  4  6
        1       -  5  5  5  7  5  5
        2          -  4  4  4  4  4
    W = 3             -  4  2  4  2
        4                -  4  4  4
        5                   -  4  2
        6                      -  2

The column pairs corresponding to low-weight entries in W are certainly
good candidates for the subset of critical column pairs for G*_L. Inspection
of F now allows many of the noncritical column pairs to be excluded
from further consideration; this exclusion may be conveniently indicated
by simply deleting the corresponding entries in the W-matrix. In this
manner, the formation of the G-matrix may be postponed until its width
has been reduced well below the large value of (m + 1 choose 2) columns.
As a computational convenience, it may be desired to enter
not just the number w_ij of differing digit pairs between columns i and
j, but the labels of the particular rows in which these columns differ.
Selection of an entry as "critical" then allows one to delete immediately
by inspection all other entries which contain the same row labels.
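The labeled matrix and the deletion of entries that contain a critical entry can be sketched as follows. The function names `labeled_w` and `critical_entries` are hypothetical, and the strategy of processing entries in order of increasing length is one simple way to mechanize the inspection described above. Removal of row labels that always accompany another label is not attempted here. The data are the rows a to h of the example F-matrix.

```python
from itertools import combinations

F = {"a": "01011011", "b": "11000010", "c": "01010101", "d": "00010111",
     "e": "10111111", "f": "11001000", "g": "00001111", "h": "00011100"}

def labeled_w(F):
    """Map each column pair (i, j), i < j, to the set of row labels in
    which columns i and j of F differ."""
    ncols = len(next(iter(F.values())))
    return {(i, j): frozenset(r for r in F if F[r][i] != F[r][j])
            for i, j in combinations(range(ncols), 2)}

def critical_entries(entries):
    """Keep an entry only if it contains no entry already kept; entries are
    visited shortest first, so the shortest ones are selected as critical."""
    kept = {}
    for pair, labels in sorted(entries.items(),
                               key=lambda kv: (len(kv[1]), kv[0])):
        if not any(small <= labels for small in kept.values()):
            kept[pair] = labels
    return kept

kept = critical_entries(labeled_w(F))
names = sorted("".join(sorted(s)) for s in kept.values())
print(names)
```

For the example, nine of the twenty-eight entries survive, so the G-matrix that remains to be formed is far narrower than the full one.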
For the example, this labeled W-matrix (Ŵ, let us say) becomes

             0        1        2        3        4        5        6        7
        0    -        ace      bf       abcdfh   abgh     bcdfgh   adfg     abcdfg
        1             -        abcef    bdefh    bcegh    abdefgh  cdefg    bdefg
        2                      -        acdh     afgh     cdgh     abdg     acdg
    Ŵ = 3                               -        cdfg     ag       bcgh     gh
        4                                        -        acdf     bdfh     cdfh
        5                                                 -        abch     ah
        6                                                          -        bc

(The compact listing of row labels in each entry is not meant to have
any significance as a product.) Selection of the shortest entries bf,
ag, gh, ah, and bc as critical now permits numerous deletions, leaving:

             0        1        2        3        4        5        6        7
        0    -        ace      bf       -        -        -        -        -
        1             -        -        -        -        -        cdefg    -
        2                      -        -        -        -        -        -
    Ŵ = 3                               -        cdfg     ag       -        gh
        4                                        -        acdf     -        cdfh
        5                                                 -        -        ah
        6                                                          -        bc

Similarly, entry cdefg may be deleted, since it includes entry cdfg.
A deletable row of G (like d or e) is readily identified in Ŵ as a letter
which always occurs along with some other letter; such letters may be
removed:

             0        1        2        3        4        5        6        7
        0    -        ac       bf       -        -        -        -        -
        1             -        -        -        -        -        -        -
        2                      -        -        -        -        -        -
    Ŵ = 3                               -        cfg      ag       -        gh
        4                                        -        acf      -        cfh
        5                                                 -        -        ah
        6                                                          -        bc

G*_L may now be formed from the remaining entries in Ŵ. Actually,
however, it is easier to write down L(G*_L) directly from Ŵ, bypassing the
formation of G*_L:

    L(G*_L) = (a v c)(b v f)(c v f v g) ... (b v c)
            = abcg v abch v ...

as before.
If faults within the same module class need not be distinguished,
then large blocks of entries in $\hat{W}$ may be deleted (or simply not calculated)
right at the start:

      1        2        3        4        5        6        7
  0   ace      bf       abcdfh   abgh     bcdfgh   adfg     abcdfg
  1            -        -        bcegh    abdefgh  cdefg    bdefg
  2                     -        afgh     cdgh     abdg     acdg
  3                              cdfg     ag       bcgh     gh

Selection of the shortest entries bf, ag, and gh as potentially critical
now yields

      1        2        3        4        5        6        7
  0   ace      bf       -        -        -        -        -
  1            -        -        -        -        cdefg    -
  2                     -        -        -        -        -
  3                              cdfg     ag       -        gh

Entry cdefg still includes entry cdfg and may be deleted. Also row
labels b, d, e, and h may be deleted, since these letters always occur
with letters f, c, a, and g, respectively. This yields

      1        2        3        4        5        6        7
  0   ac       f        -        -        -        -        -
  1            -        -        -        -        -        -
  2                     -        -        -        -        -
  3                              cfg      ag       -        g

Hence

$$\begin{aligned}
L(G_M) &= (a \lor c)\, f\, (c \lor f \lor g)(a \lor g)\, g \\
       &= fg\,(a \lor c) \\
       &= afg \lor cfg.
\end{aligned}$$
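
As a cross-check on the module-class reduction, the following sketch searches directly for the shortest row subsets that separate every pair of columns lying in different module classes. It is an illustration under stated assumptions: the table `F` is the running example as transcribed here, `MODULE_CLASS` encodes the partition (0)(123)(4567), and the helper names are hypothetical. Exhaustive search stands in for the algebraic procedure and is feasible only for small tables.

```python
from itertools import combinations

ROWS = "abcdefgh"
F = {
    "a": [0, 1, 0, 1, 1, 0, 1, 1],
    "b": [1, 1, 0, 0, 0, 0, 1, 0],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
    "d": [0, 0, 0, 1, 0, 1, 1, 1],
    "e": [1, 0, 1, 1, 1, 1, 1, 1],
    "f": [1, 1, 0, 0, 1, 0, 0, 0],
    "g": [0, 0, 0, 0, 1, 1, 1, 1],
    "h": [0, 0, 0, 1, 1, 1, 0, 0],
}
# Column j belongs to class MODULE_CLASS[j]; class 0 is the fault-free column.
MODULE_CLASS = [0, 1, 1, 1, 2, 2, 2, 2]

def separates_classes(rows, table, classes):
    """True if every two columns from different classes differ in some chosen row."""
    for i, j in combinations(range(len(classes)), 2):
        if classes[i] != classes[j] and all(table[r][i] == table[r][j] for r in rows):
            return False
    return True

def shortest_schedules(table, classes, labels=ROWS):
    """All row subsets of minimum size that separate the module classes."""
    for size in range(1, len(labels) + 1):
        found = ["".join(c) for c in combinations(labels, size)
                 if separates_classes(c, table, classes)]
        if found:
            return found
    return []

SCHEDULES = shortest_schedules(F, MODULE_CLASS)
print(SCHEDULES)
```

The search considers all eight rows, so it can return additional three-test schedules besides afg and cfg; those two are the ones that survive when rows b, d, e, and h are discarded as in the text.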
Chang 40 recognized the importance of the problem of fault
location to within modules, and offered a solution which he
claims "tends to give a 'fairly good' set of test patterns." He does
not prove this assertion, however. His method will be sketched in the
next part of this section, since it appears to be more successful when
adapted to serial test schedules than when used for fixed test schedules,
as originally proposed.
e. Some Bounds on the Number of Tests Required
Let $N_D$, $N_L$, and $N_M$ be the number of test inputs in the minimal
solution subset S, for the cases of detection, location, and location-
to-within-module-class, respectively. A tight upper bound 104 on all
three of these quantities is provided by a particular case of the matrix
F--namely, the case when all columns of weight zero and one are present.
Thus, F is just an identity matrix of order m, bordered by a single
column of 0's:
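
$$F = \begin{bmatrix}
0 & 1 & 0 & 0 & \cdots & 0 \\
0 & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1
\end{bmatrix}$$
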
It is easily seen that no rows may be deleted without leaving a resultant
matrix in which some column is the same as the first column. Moreover,
the presence of even one additional new row in the matrix would allow at
least one row to be deleted. Thus we have

$$N_D \le m, \quad N_L \le m, \quad N_M \le m.$$

Tight lower bounds may be obtained as follows. For the case
of fault detection, the possible presence of a row such as
(0 1 1 1 ... 1 1) in F immediately yields

$$N_D \ge 1.$$

For fault location, 104 the most favorable case arises when F contains a
subset of rows which constitute a binary coding of the columns. $N_L$
such rows can generate as many as $2^{N_L}$ different columns, so

$$N_L \ge 1 + [\log_2 m],$$

where the brackets denote the integer part of the quantity within.
Finally, location of a fault to within one of p module classes requires
only that each module class have a distinct column coding; all columns
in one class may be alike. Thus,

$$N_M \ge 1 + [\log_2 p].$$
Our experience with examples of random matrices tends to
indicate that actual minimal test-schedule lengths fall closer to the
lower than the upper bound. Whether or not these samples of matrices
are truly representative of the patterns of faults in typical switching
circuits is another, more difficult question.
f. Reductions in the Size of the Fault Table
In order to be able to handle networks of a practical size,
it would be very desirable to reduce the width and height of the original
fault table (F-matrix) below the values m + 1 and $2^n$, respectively. We
now show how some reductions can be made for the case of fault detection
at a cost whose exact value is not presently known, though felt to be
small.
If both the width and the height of the fault table are to be
reduced appreciably, some sort of analysis of the internal structure of
the network will be necessary. This is in contrast to the point of view
taken so far, in which the network has been viewed only from its terminals.
That is, the method used so far is one of testing the behavior of the
network for every possible input and for every possible fault. It must
be replaced by a method which asks: for which inputs will particular
faults manifest themselves at the output terminal of the network, and
(for fault detection) which other faults will have the same effect as
these faults?
The literature contains a few pertinent contributions on this
matter. 246, 11, 41, 192, 87 Armstrong 11 (following a suggestion of
Muller's) and Maling and Allen 192 propose approaches which, taken
together, suggest that one may check for a fault in a single gate within
the network by (a) applying those network inputs which will "sensitize"
to signal changes the complete path from the gate in question to the
network output (see Fig. III-B-1), and then (b) flexing the remaining input
variables through whatever sequence of combinations of values is
necessary to check the proper operation of the gate. For most common
types of gates the number of such combinations is just one more than
the number $n_g$ of free inputs at a particular gate; with more complex
gate circuitry it may become as large as $2^{n_g}$, the total number of input
possibilities for the gate. 192 The complete test schedule for fault
detection therefore consists of the union of these individual gate tests,
taken over all gates which have one or more inputs driven by network
input variables.* (Purely internal gates are checked automatically.)
Any duplicate test inputs may be deleted, of course.

FIG. III-B-1 PATH-SENSITIZING TESTS IN GATE NETWORKS
[Figure not reproduced. Panel (a): general gate network; panel (b):
size comparison cell with output ab + (a + b)c.]

For example, for the size comparison cell in Figure III-B-1(b),
the following test schedule is obtained:

              a  b  c
  Gate A1     1  1  0
              1  0  0
              0  1  0
  Gate O1     0  0  1
              1  0  1
              0  1  1
  Gate A2     0  1  0
              0  1  1

* Actually, this procedure must be modified somewhat for those gates
whose fanout exceeds unity, in order to cover all of the multiple
paths to the network output. See the Appendix to Armstrong's paper. 11
For the first three tests, the path through gates A1 and O2 is
sensitized by setting c = 0. Gate A1 is tested by applying inputs
a = b = 1, to give an output 1, then a = 1, b = 0, followed by a = 0,
b = 1, which should give an output 0. (In a conventional AND-gate,
there is no need to apply the input a = b = 0, since there is no
reasonable way the circuit could fail so as to make it behave like an
inequivalence gate.) For the second group of three tests, the path
through gates O1, A2, and O2 is sensitized, by setting c = 1 and holding
ab = 0. Gate O1 is then flexed with input combinations a = b = 0;
a = 1, b = 0; and a = 0, b = 1. Finally, the last two tests (which
happen to be duplicates of prior tests) arise from the flexing of gate
A2 by means of its free input c, having set ab = 0 and a + b = 1 to
sensitize the path from A2 to O2. Thus, six different test inputs are
required in the schedule. This may be shown to be the minimum number for
this example, by the procedure described earlier in this memorandum.
Actually, Armstrong assumed an even simpler set of fault
conditions in describing his proposed method--namely, that the only
failures which can occur are those which result in a gate output being
"stuck" at 0 or 1, for all possible gate inputs. For this case, each
gate may be tested by applying only two tests, one of which tends to
make the gate output 0, and the other of which tends to make it 1.
All gates along the sensitized path are then checked automatically;
therefore, the procedure need be applied only to those gates whose
inputs are all network input variables. As a result, the derivation of
a test schedule is simplified considerably.
For the above example, all single "stuck at" faults may be
readily detected with just four test inputs, corresponding to the flexing
of gates A1 and O1, each with its own path to the network output
sensitized:

              a  b  c
  Gate A1     1  1  0
              1  0  0
  Gate O1     0  0  1
              0  1  1

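
The claim that four inputs suffice for all single "stuck at" faults can be checked by direct simulation. The sketch below is an illustration, not Armstrong's procedure: it assumes the cell of Fig. III-B-1(b) is built from gates A1 = ab, O1 = a + b, A2 = O1·c, and O2 = A1 + A2, and it takes the four test inputs as read from the schedule above (an assumption where the scan is unclear). The names `network`, `TESTS`, and `FAULTS` are hypothetical.

```python
def network(a, b, c, stuck=None):
    """Evaluate ab + (a + b)c gate by gate.

    `stuck`, if given, is a (gate_name, value) pair that forces the output
    of that one gate to a constant, modeling a single stuck-at fault.
    """
    def out(name, value):
        return stuck[1] if stuck is not None and stuck[0] == name else value

    a1 = out("A1", a & b)
    o1 = out("O1", a | b)
    a2 = out("A2", o1 & c)
    return out("O2", a1 | a2)

# Four test inputs (a, b, c): two flex gate A1, two flex gate O1.
TESTS = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1)]

# Every gate output stuck at 0 and stuck at 1.
FAULTS = [(gate, value) for gate in ("A1", "O1", "A2", "O2") for value in (0, 1)]

def detected(fault, tests):
    """A fault is detected if some test gives a different network output."""
    return any(network(*t) != network(*t, stuck=fault) for t in tests)

UNDETECTED = [f for f in FAULTS if not detected(f, TESTS)]
print(UNDETECTED)
```

An empty list means every single gate-output stuck-at fault changes the network output for at least one of the four tests, including the faults on the purely internal gates A2 and O2.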
Depending upon the network form, there may be a degree of
arbitrariness in the selection of the input-variable combinations
which are used to sensitize a particular path or to test a particular
gate. The length of the overall test will in general depend upon these
choices made for each path and each gate, since they will determine the
extent of duplication of test inputs in the formation of the schedule.
Armstrong 11 gives an algebraic procedure aimed at increasing the number
of duplicates. At worst, however, one may proceed by listing the
alternative tests which are valid for each path and each gate, and
then making desirable selections from this list when the complete
schedule (with multiple choices) has been compiled.
To summarize, the path sensitization method introduced by
Armstrong provides a means for generating a test schedule for fault
detection directly from the network, thereby bypassing the formation
and manipulation of the fault table itself. The method appears to be
an efficient one, although it is not guaranteed to lead to a minimal
test schedule.
For fault location and fault location to within modules, no
satisfactory procedure exists. Armstrong's procedure might be augmented
to allow one to go back and include additional gate-input combinations
that may be necessary to distinguish otherwise identical faulty outputs.
However, what is most needed for fault location is a condition for the
sufficiency of a trial test schedule, to be certain that all pairs of
possible faults are indeed distinguishable on the basis of their output
patterns. This problem remains unsolved.
Galey, Norby, and Roth 87 have also proposed a method for
deriving a test schedule for the case of fault detection, again based
upon analysis of the network structure. Their algorithm is similar to
but longer than Armstrong's, but is probably better suited for execution
as a computer program. Again, it generates the test schedule directly,
without the necessity of deriving the fault table first, and may be
extendable to fault location.
g. Implementation of the Test Schedule
After a test schedule has been derived by one of the methods
discussed above, it is necessary to arrange for it to be applied on
demand to a network under test. In addition, the results of the test
must be condensed into a form suitable for evaluation and for trans-
mittal either to a human or to an automatic switchover mechanism, in
order that the appropriate repair action can be initiated. These
tasks which precede the actual repair can be assumed to be performed by
a device which we will call a diagnoser, as shown in Fig. III-B-2. The
diagnoser can be realized in the form of either a special digital circuit
or a computer program, and it may be located either physically adjacent
to the network (i.e., in a spacecraft) or remote from the network (e.g.,
on the ground).
When the diagnoser is in the form of a circuit, it is composed
of an autonomous sequence generator, which produces in sequence at its
output each test input in the test schedule. For the case of detection,
it also produces with each test input the corresponding correct network
output. This output is applied to a comparator that checks the actual
network output against the correct output, noting any disagreements.
The sequence of comparator outputs is then merely "OR-ed" together, by
applying it to a flip-flop (for example), the result of which is then
available for repair purposes. For fault location, the sequence of
network outputs must be decoded in accordance with the columns of $F_{\min}$,
in order to determine which fault has occurred. For human repair, and
if time permits, a code book is probably the simplest way of decoding.
Otherwise, a special decoder network must be constructed.

FIG. III-B-2 DIAGNOSER STRUCTURES: (a) FAULT DETECTION, (b) FAULT LOCATION
[Figure not reproduced.]

In any case, the specifications on the special circuitry needed
can be precisely stated:
Sequence generator:
Given: A list of test inputs for the network under test,
and (for fault detection) the corresponding out-
puts. This is just a truth table for the function
f, with rows deleted in correspondence with the
minimization of the matrix F, as described in the
preceding sections. This list has N rows (where
$N = N_D$, $N_L$, or $N_M$) and n or n + 1 columns.
Synthesize: An economical, autonomous sequential network
which produces at its n or n + 1 outputs the rows
of the given list, in any convenient order.*
Decoder:
Given: The minimum F-matrix, $F_{\min}$, with the rows ordered
in accordance with the row permutation used in the
design of the sequence generator.
Synthesize: An economical single-input sequential network
which produces a 1 at a unique one of its
m + 1 outputs, for each N-digit input sequence
which appears as a column of $F_{\min}$. Other possible
input sequences of the same length may be treated
as "don't cares."
Both of these circuits are assumed to start from a unique starting
state, which implies some means of reset to this state, and will probably
require a synchronizing clock when incorporated into the rest of the
system.
In view of the stringent reliability requirements on the
diagnoser, one might seriously consider realizing it with multiaperture
magnetic devices. These devices are also naturally well suited for
sequence generation and decoding operations.
* Goldberg suggests generating first the set of all inputs which correspond
to f = 0, followed by the set of inputs for which f = 1 (for
fault detection).
When the diagnoser is to be realized as a computer program,
the operations of sequence generation and decoding are described by flow
charts, and the design and execution of the program present no difficulty.
The process of sequence generation is no more than the successive
retrieval from memory of N binary words of length n or n + 1. The flow
chart for the decoding is an N-level decision tree, having 2, m + 1, or
p + 1 outputs for fault detection, fault location, and fault location to
within modules, respectively. The decision trees for these three cases
are illustrated in Fig. III-B-3 for the running example.
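
A program realization of the diagnoser can be sketched as follows. This is a minimal illustration under stated assumptions, not the flow chart of the report: it uses the fixed schedule of rows a, b, c, g from the running example (one of the four-test location schedules), takes the corresponding rows of the fault table as transcribed here, and uses a code book (a dictionary) in place of an explicit decision tree. The names `F_MIN`, `CODE_BOOK`, `detect`, and `diagnose` are hypothetical.

```python
# Rows of the reduced fault table for the schedule a, b, c, g.
# Column 0 is the fault-free network; columns 1..7 are the faults.
TEST_ROWS = "abcg"
F_MIN = {
    "a": [0, 1, 0, 1, 1, 0, 1, 1],
    "b": [1, 1, 0, 0, 0, 0, 1, 0],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
    "g": [0, 0, 0, 0, 1, 1, 1, 1],
}

# Code book: observed output sequence -> column index of F_MIN.
CODE_BOOK = {
    tuple(F_MIN[r][j] for r in TEST_ROWS): j
    for j in range(len(F_MIN["a"]))
}

def detect(observed):
    """Fault detection: OR together the comparator outputs of all tests."""
    expected = [F_MIN[r][0] for r in TEST_ROWS]
    error = 0
    for seen, good in zip(observed, expected):
        error |= seen ^ good  # exclusive-OR acts as the comparator
    return bool(error)

def diagnose(observed):
    """Fault location: fault index (0 = no fault), or None for a
    "don't care" sequence that matches no column of F_MIN."""
    return CODE_BOOK.get(tuple(observed))

print(detect([1, 0, 0, 1]), diagnose([1, 0, 0, 1]))
```

Here `observed` is the list of network outputs recorded for tests a, b, c, g in that order; how those outputs are obtained from the network under test is outside the sketch.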
While one could probably not justify a spaceborne computer
solely for the purpose of fault diagnosis, the fact that a computer
may be available anyway, and the flexibility it offers for diagnosing
a large number of different networks, make it an attractive solution.
h. Tests for Multiple-Output Networks
If the network for which a test schedule is being determined
has q (> 1) outputs instead of a single output, the problem-formulation
and solution procedures described above remain essentially the same but
are modified in detail, as follows.
(1) The entries in the $f_j$-columns of the fault table
F, as well as in the entire F-matrix, are q-digit
binary numbers instead of single binary digits.
(2) Two columns of the fault table should be considered
to be identical when and only when all of the
corresponding q-digit entries are exactly the same.

FIG. III-B-3 DECISION TREES FOR FAULT LOCATION, panels (a), (b), (c)
[Figure not reproduced.]

(3) The matrix G should have only single-binary-digit
entries, according to the rule: if two multidigit
entries in the same row of F differ in any of their
digits, then the corresponding entry in G for these
two columns is a 1; otherwise it is a 0. This rule
is a direct reflection of the fact that a fault may
be detected on any one or more of the q network
outputs. The rest of the minimization procedure
is carried out on G just as for the single-output
case. Note that since the matrix G now usually
has a greater number of l's in it, the length N of
the test schedule can be expected to be smaller,
assuming that the other parameters remain the same.
Similarly, in forming $W = F^t F$, element multiplication
is defined to produce a 1 when and only when
the q-digit elements are different in any digit.
(Clearly, this reduces to exclusive-OR when the
entries are single-digit numbers.) W is defined
as previously.
(4) The Armstrong path-sensitizing procedure remains
unchanged, so long as it is kept in mind that a
path to any one or more of the q outputs is
adequate to render a fault conspicuous.
(5) The sequence generator for fault detection now
has n + q instead of n + 1 outputs, the decoder
has q inputs, and the number of exclusive-OR
gates in the comparator is q instead of one.
Similar increases apply to the realization of
the diagnoser as a computer program.
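
Rule (3) can be illustrated with a short sketch. The fault table below is purely hypothetical (three tests, four columns, q = 2 outputs) and is not taken from the report; the names `F_MULTI`, `PAIRS`, and `g_row` are likewise assumptions made for the illustration.

```python
from itertools import combinations

# Hypothetical fault table for a network with q = 2 outputs:
# each entry is the 2-digit output word for that test and that column.
F_MULTI = {
    "a": ["00", "01", "00", "10"],
    "b": ["11", "11", "10", "11"],
    "c": ["01", "01", "01", "00"],
}

# Column pairs in the order (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
PAIRS = list(combinations(range(4), 2))

def g_row(entries, pairs):
    """One row of G: 1 where the two q-digit entries differ in any digit."""
    return [int(entries[i] != entries[j]) for i, j in pairs]

G = {row: g_row(entries, PAIRS) for row, entries in F_MULTI.items()}
for row, bits in G.items():
    print(row, bits)
```

Comparing whole output words is equivalent to OR-ing the digit-by-digit exclusive-ORs, so for q = 1 the rule reduces to the single-output case.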
It should also be pointed out that one possibility for reducing
the length of a test schedule is the extraction of one or more selected
nodes of the network as test points, which are then treated as separate
outputs as far as the diagnosis is concerned. The key problem here is
the selection of those circuit nodes which will result in the greatest
reduction in the length of the test schedule for a given network. This
and other problems worthy of further investigation are listed at the end
of the next part of this section.
3. Fault Diagnosis in Combinational Circuits Using Serial Test
Schedules
a. Introduction
The preceding part of this section presented a statement of the
fault-diagnosis problem for combinational switching networks, and solutions
for the most important cases when the test schedule is fixed.
We offer here corresponding solutions for the same cases when
the test schedule is serial--that is, when the selection of successive
tests depends upon the outcomes of previous tests in the schedule. In
particular, we show that by using a serial test schedule there is nothing
to be gained for fault detection, but for fault location and fault
location to within modules the possible reductions in test-schedule length
are substantial. A solution procedure is given which is easy to carry
out and is reasonably effective, although it does not necessarily lead
to a test whose length is absolutely minimal. Bounds on the length of
the test schedule are derived. Finally, the principal problems remaining
for further research are itemized.
It may be observed that the test schedules derived in the pre-
ceding part of this section are completely independent of the outcome
of the individual tests in the sequence; moreover, the length of the
schedule is independent of the order in which the tests are performed.
It is quite conceivable, however, that after the first test input of
a schedule has been applied and the output noted, the residual test
schedule which is minimal with respect to a 0 output is not the same as
that which is minimal with respect to a 1 output. Similarly, after two
test inputs have been applied, the four partial test schedules which
should follow may be all different in content and length; and so on for
successive test inputs. We consider here the economies to be achieved
by choosing each test input to be applied to the network on the basis
of the outcomes of all previous tests in the schedule.
Solutions to this problem for the three cases of fault
detection, fault location, and fault location to within modules are
best represented in the form of decision graphs, such as were used to
describe the fixed-schedule solutions in the preceding part; see for
example Fig. III-B-3. For serial schedules, however, the row labels
a, b, c, ... which are attached to the nodes of these graphs are no
longer restricted to be identical over all of the nodes in the same
level of the graph. The serial solutions which will be derived below
for this same example are shown in Fig. III-B-4. It may be seen in
Fig. III-B-4(b) that for fault location a shorter schedule results
than was required in Fig. III-B-3(b). However, one now needs to know
in which way the first test (g) turned out before the second test
(f or h) can be applied.
The minimization of decision graphs of this type has been
considered by Lee 17s and by Short. 2s3,2s4 Short showed that the problem
is equivalent to the minimization of an important family of transfer-
contact networks called disjunctive and exhaustive networks. The
transfer-contact networks corresponding to the solutions shown in Fig.
III-B-4 are given in Fig. III-B-5 in the same order. The direct graphical
correspondence can be readily seen. While this analogy is useful
theoretically and in a few specific problems, good general procedures
are unfortunately lacking for obtaining absolutely minimal networks
of this type.
Nevertheless, several methods are known for deriving economical
networks which sometimes turn out to be minimal. All of these methods
consist of a successive selection of the node labels of the decision
graph (row labels of the fault table), working from left to right.
After selection of the left node label on the basis of some criterion,
the O's and l's in the corresponding row of the fault table effectively
separate the table into two subtables, each of which corresponds to one
of the two subgraphs (subtrees) to the right of the leftmost node in the
decision graph. Each of these two subtables may now be attacked indepen-
dently by exactly the same selection and reduction process to generate
four smaller subtables (subtrees), and so on. The only difficult aspect
of this method is the particular criterion employed to select at each
step that row of F which ultimately causes this iterative procedure to
terminate after a minimal (or reasonably small) number of steps.

FIG. III-B-4 DECISION TREES FOR SEQUENTIAL TESTS, panels (a), (b), (c)
[Figure not reproduced.]

FIG. III-B-5 CONTACT-TREE ANALOGS OF DECISION TREES, panels (a), (b), (c)
[Figure not reproduced.]

Further discussion of this method, as well as examples, will
be taken up separately for fault detection, fault location, and fault
location-to-within-modules.
Serial fault location was first proposed by Brule et al., 34
under some rather restrictive assumptions, and without giving any pro-
cedure for deriving a test schedule for a given network.
b. Fault Detection
The problem in fault detection is one of choosing a minimal
subset of rows of F to distinguish the first column from all others.
Thus, each step of the procedure results not in two but in only one
residual subtable (subtree). Consequently, there is no advantage to
be gained by performing the tests in any particular order, and the
fixed-schedule solution having ND tests is also optimal with respect
to the minimal number $\ell_D$ of levels required. In fact, $\ell_D = N_D$.
Thus, serial testing offers no improvement over fixed-schedule
testing for fault detection, and the same decision graph [Fig. III-B-3(a)
and Fig. III-B-4(a)] solves both problems.
c. Fault Location
One way to select the appropriate row label at each step of
the procedure is to try all possible remaining row labels. For even a
small fault table, however, the total number of possible graph labelings
which need to be tried to determine the minimal number of levels is
astronomical. This approach is therefore impractical.
Sindeev 2s7 and Chang 40 employ two different criteria for
selection of successive row labels of the decision graph for the
fixed-schedule case. Their methods can also be applied to the individual
steps in serial testing, however, and in this case yield exactly the
same results. This extension of the Sindeev and Chang methods proceeds
as follows.
Let the numbers of 0's and 1's in row i of a fault table or
subtable be $w_{i0}$ and $w_{i1}$, respectively. Sindeev proposes the selection
of that row i which maximizes the amount of information (in an information-
theoretic sense) which is gained by that row decision regarding the
particular column which was "transmitted":

$$J = -p_0 \log_2 p_0 - p_1 \log_2 p_1,$$

where $p_0 = w_{i0}/w$, $p_1 = w_{i1}/w$, and $w = w_{i0} + w_{i1}$. Manipulation of this
expression for J reveals that its maximization is equivalent to minimizing
the expression

$$(w_{i0})^{w_{i0}} \, (w_{i1})^{w_{i1}}.$$

Chang, on the other hand, suggests the selection of that row i which
maximizes the number of (0, 1) pairs between digits in that row--that is,
which maximizes the expression $w_{i0} w_{i1}$.
These two quantities are simultaneously optimized by
selecting that row which has the most nearly equal distribution of 0's
and 1's; that is, that row (or one of the subset of rows) for which
$|w_{i0} - w_{i1}|$ is minimal.
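
The manipulation referred to above can be written out as follows. This is a reconstruction offered for the reader's convenience, using only the definitions of $p_0$, $p_1$, and $w$ given in the text; it is not quoted from Sindeev's or Chang's papers.

```latex
\begin{aligned}
J &= -\frac{w_{i0}}{w}\log_2\frac{w_{i0}}{w} - \frac{w_{i1}}{w}\log_2\frac{w_{i1}}{w} \\
  &= \frac{w_{i0}+w_{i1}}{w}\,\log_2 w
     - \frac{1}{w}\bigl(w_{i0}\log_2 w_{i0} + w_{i1}\log_2 w_{i1}\bigr) \\
  &= \log_2 w - \frac{1}{w}\,\log_2\!\Bigl[(w_{i0})^{w_{i0}}\,(w_{i1})^{w_{i1}}\Bigr].
\end{aligned}
% w is the number of columns of the (sub)table and is the same for every row,
% so maximizing J over rows is the same as minimizing (w_{i0})^{w_{i0}} (w_{i1})^{w_{i1}}.
% With w_{i1} = w - w_{i0}, the function x log x + (w - x) log (w - x) is convex
% and symmetric about x = w/2, while Chang's product x (w - x) is concave and
% symmetric about the same point; both criteria therefore favor the row whose
% counts of 0's and 1's are most nearly equal.
```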
The use of this criterion appears to work very well for most
problems. Applied to the F-matrix used as a running example in the
first part of this section, namely

  column:   0  1  2  3  4  5  6  7

      F =   0  1  0  1  1  0  1  1    a
            1  1  0  0  0  0  1  0    b
            0  1  0  1  0  1  0  1    c
            0  0  0  1  0  1  1  1    d
            1  0  1  1  1  1  1  1    e
            1  1  0  0  1  0  0  0    f
            0  0  0  0  1  1  1  1    g
            0  0  0  1  1  1  0  0    h

any one of the rows c, d, or g should be chosen first, since these rows
have $w_{i0} - w_{i1} = 0$. Selection of row g yields the two submatrices:

  column:   0  1  2  3               column:   4  5  6  7

   F_0 =    0  1  0  1    a           F_1 =    1  0  1  1    a
            1  1  0  0    b                    0  0  1  0    b
            0  1  0  1    c                    0  1  0  1    c
            0  0  0  1    d                    0  1  1  1    d
            1  0  1  1    e                    1  1  1  1    e
            1  1  0  0    f                    1  0  0  0    f
            0  0  0  1    h                    1  1  0  0    h

For the second step, one of the rows a, b, c or f should be selected
from $F_0$, and one of the rows c or h from $F_1$, since only these rows have
an equal number of 0's and 1's. Choosing the last alternative in each
case gives the following partitions into four smaller submatrices:

  column:   2  3            column:   0  1

  F_00 =    0  1    a       F_01 =    0  1    a
            0  0    b                 1  1    b
            0  1    c                 0  1    c
            0  1    d                 0  0    d
            1  1    e                 1  0    e
            0  1    h                 0  0    h

  column:   6  7            column:   4  5

  F_10 =    1  1    a       F_11 =    1  0    a
            1  0    b                 0  0    b
            0  1    c                 0  1    c
            1  1    d                 0  1    d
            1  1    e                 1  1    e
            0  0    f                 1  0    f

For the third step, there are many possibilities, but choice of row c
serves simultaneously for all four submatrices. The resulting decision
graph is shown in Fig. III-B-4(b), and has $\ell_L = 3$ levels.
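
The successive-selection procedure just illustrated lends itself to a short recursive sketch. The code below is an illustration under stated assumptions rather than the authors' program: the table `F` is the running example as transcribed here, all columns are assumed distinct, and ties between equally balanced rows are broken in favor of the alphabetically first label (the text instead chooses g, then f and h), so the tree found may differ from Fig. III-B-4(b) while having the same number of levels. The names `pick_row`, `build_tree`, `depth`, and `locate` are hypothetical.

```python
ROWS = "abcdefgh"
F = {
    "a": [0, 1, 0, 1, 1, 0, 1, 1],
    "b": [1, 1, 0, 0, 0, 0, 1, 0],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
    "d": [0, 0, 0, 1, 0, 1, 1, 1],
    "e": [1, 0, 1, 1, 1, 1, 1, 1],
    "f": [1, 1, 0, 0, 1, 0, 0, 0],
    "g": [0, 0, 0, 0, 1, 1, 1, 1],
    "h": [0, 0, 0, 1, 1, 1, 0, 0],
}

def pick_row(cols):
    """Nonconstant row with the smallest |w_i0 - w_i1| on the given columns."""
    best = None
    for r in ROWS:
        ones = sum(F[r][j] for j in cols)
        if 0 < ones < len(cols):                 # row must split the columns
            score = abs(len(cols) - 2 * ones)    # equals |w_i0 - w_i1|
            if best is None or score < best[0]:  # strict '<' keeps first label
                best = (score, r)
    return best[1]

def build_tree(cols):
    """Leaf = column index; internal node = (row, subtree for 0, subtree for 1)."""
    if len(cols) == 1:
        return cols[0]
    r = pick_row(cols)
    zeros = [j for j in cols if F[r][j] == 0]
    ones = [j for j in cols if F[r][j] == 1]
    return (r, build_tree(zeros), build_tree(ones))

def depth(tree):
    """Number of levels (tests on the longest path)."""
    return 0 if isinstance(tree, int) else 1 + max(depth(tree[1]), depth(tree[2]))

def locate(tree, column):
    """Walk the tree with the outputs a network having this fault would give."""
    while not isinstance(tree, int):
        row, if_zero, if_one = tree
        tree = if_one if F[row][column] else if_zero
    return tree

TREE = build_tree(list(range(8)))
print(TREE, depth(TREE))
```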
Sindeev gives an example in his paper of a fault matrix with
14 rows and m = 15. His own method (which he claims to yield a minimal
number of levels in a fixed-schedule solution) applied to this example
yields a decision graph having six levels. Application of Chang's method
to the same example yields the same fixed-schedule result, six levels.
The G-matrix method described in the preceding part of this section
yields five levels. Therefore Sindeev's method does not give a minimal
result, contrary to his claim. Moreover, both Chang's and Sindeev's methods
can result in very long fixed schedules if the wrong choices happen to be
made at those points in the procedures when two or more rows are
calculated to be equally desirable choices.
The risk of making such unwise choices appears to be much less
for the serial-test procedure described above. Applied to Sindeev's
example, this procedure gives a decision graph having only four levels.
d. Fault Location to Within Modules
The method described above applies with little change to the
case of fault location to within modules. Following Chang, we now modify
the criterion of row acceptance to count not all (0, 1) pairs, but only
those in each row in which the 0 and the 1 fall in different module
classes. This quantity is most easily calculated by subtracting from the
total number of (0, 1) pairs the sum of the numbers of (0, 1) pairs which
fall entirely within the individual classes, namely

$$R_i = w_{i0} w_{i1} - \sum_{j=1}^{p} w_{ij0} w_{ij1},$$

where $w_{ij0}$ and $w_{ij1}$ are the numbers of 0's and 1's, respectively, in the
j-th module class in row i. The row to be selected is the one with the
largest row count $R_i$.
For the running example, with the previously used column
partition (0) (123) (4567) into module classes, the row counts on the
8 rows of F are:
a: 15 - 2 - 3 = 10
b: 15 - 2 - 3 = 10
c: 16 - 2 - 4 = 10
d: 16 - 2 - 3 = 11
e: 7 - 2 = 5
f: 15 - 2 - 3 = 10
g: 16
h: 15 - 2 - 4 = 9
Clearly, row g should be selected first, and the same two submatrices
$F_0$ and $F_1$ as arose in the previous section result. $F_1$ falls entirely
within (and in fact covers exactly) module class (4567), so it need
not be further decomposed. For the rows of $F_0$, the row-count values
are:
a: 4 - 2 = 2
b: 4 - 2 = 2
c: 4 - 2 = 2
d: 3 - 2 = 1
e: 3 - 2 = 1
f: 4 - 2 = 2
h: 3 - 2 = 1
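Any of a, b, c, or f should be chosen. Selecting f, only $F_{01}$ (columns 0
and 1) need be further decomposed, since $F_{00}$ (columns 2 and 3) falls
entirely within module class (123); any of the rows a, c, or e, which are
nonconstant on columns 0 and 1, may be used. Figure III-B-4(c) displays
the resulting decision graph.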
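
The row-count criterion can be evaluated mechanically from the fault table. The sketch below is illustrative only: the table `F` is the running example as transcribed here, `CLASSES` encodes the partition (0)(123)(4567), and the function name `row_count` is hypothetical. The single-column class (0) contributes nothing to the subtracted sum, so including it is harmless.

```python
ROWS = "abcdefgh"
F = {
    "a": [0, 1, 0, 1, 1, 0, 1, 1],
    "b": [1, 1, 0, 0, 0, 0, 1, 0],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
    "d": [0, 0, 0, 1, 0, 1, 1, 1],
    "e": [1, 0, 1, 1, 1, 1, 1, 1],
    "f": [1, 1, 0, 0, 1, 0, 0, 0],
    "g": [0, 0, 0, 0, 1, 1, 1, 1],
    "h": [0, 0, 0, 1, 1, 1, 0, 0],
}
CLASSES = [[0], [1, 2, 3], [4, 5, 6, 7]]

def row_count(bits, classes):
    """Number of (0, 1) pairs in a row whose two digits lie in different classes."""
    ones = sum(bits)
    total_pairs = ones * (len(bits) - ones)          # w_i0 * w_i1
    within = 0
    for cls in classes:
        cls_ones = sum(bits[j] for j in cls)
        within += cls_ones * (len(cls) - cls_ones)   # w_ij0 * w_ij1
    return total_pairs - within

R = {r: row_count(F[r], CLASSES) for r in ROWS}
BEST = max(ROWS, key=lambda r: R[r])
print(R, BEST)
```

The same function can be applied to a subtable by passing only the surviving columns of each row together with the correspondingly restricted classes.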
Note that the graph of Fig. III-B-4(c) could have been
obtained from the graph of Fig. III-B-4(b) by tying together outputs
1, 2, and 3, then 4, 5, 6, and 7, in accordance with the grouping of
columns into module classes, and then reducing the resulting graph
according to a well-known procedure for the simplification of transfer-
contact networks. 2s4 While this method yields a minimum-level graph in
the present example, it will not do so in general, and the procedure
described above must be used to group together advantageously the out-
puts associated with each module class.
Similarly, a valid fault-detection graph could be obtained from
the graph of Fig. III-B-4(c) by merging the outputs (1, 2, 3) and
(4, 5, 6, 7). However, the resulting graph cannot be further simplified.
The graph of Fig. III-B-4(a) has fewer levels, and is therefore preferable.
e. Bounds
The bounds derived in Sec. III-B-2-e on the minimal number N
of tests in fixed test schedules apply without change of argument or
result to the number $\ell$ of levels in serial test schedules:

$$1 \le \ell_D = N_D \le m$$
$$1 + [\log_2 m] \le \ell_L \le N_L \le m$$
$$1 + [\log_2 p] \le \ell_M \le N_M \le m$$

For most problems, of course, we can expect $\ell_L$ to be much smaller than
$N_L$.
We have assumed throughout this section that the parameter to
be minimized when optimizing a serial test schedule is the number $\ell$ of
levels in the decision graph, since this number is proportional to the
running time of the diagnostic test. If, instead, it is the total length
of the diagnostic program (that is, the amount of memory space required
to store the test schedule) which is of principal interest, then it is
the total number d of decision nodes in the graph which should be
minimized. However, a simple argument shows that for fault location,
this number $d_L$ is fixed for a given F-matrix, and is equal to m, the
total number of faults. To see this, observe that each of the $2d_L$
output arrows from the $d_L$ nodes of the graph terminates either on one
of the m + 1 outputs, or on one of the $(d_L - 1)$ nodes to the right of
the leftmost input node. Thus:

$$2 d_L = m + 1 + (d_L - 1), \quad \text{or} \quad d_L = m.$$

For fault detection, we have $\ell_D = N_D = d_D \le m$. For fault
location to within modules, all we can assert is: $\ell_M \le N_M \le d_M \le m$.
f. Potential Economies of Serial Test Schedules for
Fault Location
The following exemplary family of F-matrices demonstrates that
there exist problems for which the number of levels in a minimal serial
test schedule is approximately equal to the logarithm of the number of
levels in the corresponding minimal fixed test schedule.
Consider first the particular F-matrix

  column:   0  1  2  3  4  5  6  7

      F =   0  0  0  0  1  1  1  1    a
            0  0  1  1  0  0  0  0    b
            0  0  0  0  0  0  1  1    c
            0  1  0  0  0  0  0  0    d
            0  0  0  1  0  0  0  0    e
            0  0  0  0  0  1  0  0    f
            0  0  0  0  0  0  0  1    g

which has m = 7 and seven rows. For a serial test schedule, the decision
tree has m + 1 = 8 outputs, so $\ell_L \ge \log_2 8 = 3$. Three levels are
clearly adequate, as shown in Fig. III-B-6(a). For a fixed test schedule,
observe that all seven rows of F are necessary, since deletion of any
row causes two columns to be identical. Thus, $\ell_L \ge 7$. This decision
tree is shown in Fig. III-B-6(b).

FIG. III-B-6 SEQUENTIAL DECISION TREE FOR A LIMITING CASE
[Figure not reproduced.]

More generally, let m + 1 be any power of two, and let F be
formed according to the same pattern as was used above: the 0's and
1's in the first row separate the set of columns into two equal-sized
groups; the second and third rows similarly divide each of these two
groups in two, leaving four groups; the next four rows divide each of
these four groups in half, leaving eight groups; etc. In each instance
of group splitting by a row, 0's are entered in the columns corresponding
to all other groups. Realization of the decision trees in the same
pattern as shown in Fig. III-B-6 yields immediately:

$$\ell_L = \log_2 (m + 1) \quad \text{for serial tests}$$
$$\ell_L = m \quad \text{for fixed tests.}$$

However, regardless of F, the number of levels in an (m + 1)-output
decision tree must fall somewhere in between these same values (see
Sec. III-B-2-e). Therefore, these are extreme values, and cannot be
further separated.
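
The construction of this limiting family can be sketched as follows. The code is an illustration of the pattern described above, with hypothetical function names; k denotes log2 (m + 1), so the matrix has 2^k columns and 2^k - 1 rows. The check confirms that no row can be deleted from a fixed schedule, whereas a serial schedule needs only k levels because each level halves the remaining group of columns.

```python
def limiting_family(k):
    """Rows of F for m + 1 = 2**k columns, generated level by level."""
    n = 2 ** k
    rows = []
    for level in range(k):
        size = n >> level                      # group size at this level
        for start in range(0, n, size):
            half = size // 2
            # 1's in the upper half of this group, 0's everywhere else
            rows.append([1 if start + half <= j < start + size else 0
                         for j in range(n)])
    return rows

def columns_distinct(rows, n):
    """True if all n columns are pairwise different."""
    cols = [tuple(row[j] for row in rows) for j in range(n)]
    return len(set(cols)) == n

def every_row_needed(rows, n):
    """True if deleting any single row makes two columns identical."""
    return all(not columns_distinct(rows[:i] + rows[i + 1:], n)
               for i in range(len(rows)))

F3 = limiting_family(3)
for row in F3:
    print(row)
```

For k = 3 the generated rows coincide with rows a through g of the matrix given above.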
This family of examples shows that there exist problems in
fault diagnosis (location) for which serial test schedules are vastly
superior to fixed test schedules.
4. Fault Diagnosis in Digital Computers: Present State of the Art
This part summarizes the present status of diagnostics as applied
to digital circuitry and systems of the type anticipated for use in
spaceborne computers. This status is evaluated with respect to the need
for fault-diagnostic techniques, and recommendations are offered for
future research work to satisfy the deficiencies that exist at present.
Diagnostic procedures for the detection and for the location of non-
transient faults in completely arbitrary digital networks, subsystems,
computers, and systems are presently either unavailable or inadequate.
Satisfactory procedures exist for the derivation of diagnostic test
schedules only for the class of combinational networks which (1) are
not too large (8 to 10 inputs, and up to about 4 or 5 outputs) but
otherwise arbitrary, and (2) are subject to a limited number of
definitive, specified faults--not more than a few hundred for fault
detection, and not more than about 100 for fault location. In addition,
procedures are known for the detection of the most common types of
faults in large combinational nets, for the location of faults in some
special cases of large combinational networks, and for a few special
types of small and medium-sized sequential networks. For all other
conditions, however, including many situations of practical interest,
the only techniques which are available cannot be considered to be
satisfactory, either because they lack generality, because they are
too difficult to carry out, or because they lead to test schedules
which are much too long. For some cases, no techniques have even been
proposed.
For the known procedures, it is assumed that a general-purpose
computer is available for carrying out the derivation of a test schedule
for a given network. If this is not so, the maximum size of the network
and the number of faults which can be handled are much reduced. It is
also assumed that the resultant diagnostic test schedule, to be applied
on demand to a network in actual service, can be expressed as a simple
computer program which may be stored in the spacecraft computer or
communicated to the spacecraft. If the testing cannot make use of an
already existing digital memory or communications link for diagnostic
purposes, then a special subsystem must be designed to serve this pur-
pose. Presently known design techniques for such subsystems are
available, but are not completely adequate, since they often lead to
circuitry which is unnecessarily costly or which cannot itself be readily
diagnosed or protected against its own faults.
It should also be noted that there is much to be gained in reducing
the complexity and length of diagnostic test schedules, if one or more
of the following additional techniques are employed:
(1) Extracting selected circuit nodes as test points or
inserting additional control inputs, to facilitate
testing
(2) Using "serial" test schedules--in which the selection
of successive tests in the test schedule is made to
depend upon the results of the tests applied earlier,
in closed-loop fashion
(3) Designing the original network in such a manner as
to make it more readily diagnosable.
Circuits are known for which considerable improvements can be demonstrated,
using each of the above techniques. Unfortunately, however, practically
nothing is known in general about how to actually achieve these improve-
ments. In some cases, one or more of these techniques may be absolutely
necessary in order to obtain acceptable test schedules. For example, it
may be essential to employ a small number of test points when testing
large sequential circuits.
In summary, there are unsolved problems, the solution of which would
contribute substantially to the realization of advanced reliable space-
borne computers; these are*:
(1) Development of techniques for selecting an economically
small number of test points or additional inputs for a
given combinational network, in order to drastically re-
duce the length of the test schedule required for (a)
fault detection and (b) fault location (to within a
replaceable module).
(2) Evolution of a general approach to finding good diagnos-
tic techniques for sequential circuits, and the deriva-
tion of some useful procedures for arbitrary moderate-
sized circuits; also, the identification of those
special classes of large sequential networks for which
these procedures might still be used.
(3) Development of practical procedures for deriving
economical test schedules for very large combinational
networks.
(4) Investigation of the economies to be achieved and the
methods which are appropriate when serial test schedules
are used for combinational networks; also, the identifi-
cation of those cases in which the use of serial test
schedules offers the greatest advantage over fixed test
schedules.
(5) Resolution of the principal questions regarding which
system organizations should be employed to facilitate
both diagnostic testing and subsequent repair (switch-
over to a spare or to a reconfigured system).
* It should also be noted that the first four of these topics are of
considerable importance to the manufacture of integrated circuit
packages for computer use.
(6) Study of the ways in which the original design of
a digital network or subsystem may be modified in
order to make it more susceptible to the diagnostic
procedures which have been developed.
(7) Development of techniques for the design of special-
purpose diagnostic circuitry, including estimates
of the economies to be achieved through its use.
C. Design of Networks for a Reconfigurable Computer
1. Introduction
In Sec. III-A-1 it was pointed out that a reconfigurable computer
should have high degrees of flexibility of structure, simplicity of
diagnosis, and reliability of control; and furthermore, that both flex-
ibility and inherent reliability of manufacture are improved by using
a small number of different kinds of complex logic modules in the compo-
sition of the computer. It was also noted there that in order to realize
the potential increase in reliability of reconfiguration, it is necessary
to accomplish the switching of data and the control of reconfiguration
efficiently, that is, to achieve a large number of configurations with a
small number of switches. New network schemes and design techniques are
needed to realize these unusual requirements.
In the next three sections we consider approaches to the design of
networks that are especially well suited to the functions of inter-
connection, logical processing, and control, respectively. Other net-
work types might be considered, e.g., some based upon combinations of
these functions; but those chosen appear to offer good possibilities
for efficiency of design.
The particular kind of interconnection network described in
Sec. III-C-2 is called a commutation switch. It is intended to be
complementary to the processing network described in Sec. III-C-3.
Several approaches to control networks are discussed in Sec. III-C-4, one
of which is based on the processing module of Sec. III-C-3.
All of the networks to be considered are designed to be multipurpose
and programmable. The criteria of ease of diagnosis and incorporation
of fault masking have not been incorporated, and constitute problems
for further development.
2. Programmable Interconnection Networks
a. Introduction
The preceding discussion has served to describe the overall
features of a system which can diagnose its internal faults and correct
them by reconfiguring its subsystems. Of major importance in such a
reconfigurable system are the circuits whose function is to provide
interconnection between the operating modules and to disconnect faulty
modules from the system. These circuits, which we call commutation
networks, are discussed in this subsection. Our aim is to describe
techniques which a designer can use to develop commutation networks
which provide reliable flexibility of interconnection at low cost.
Initially we have selected a simple model for analysis, based
upon the control of data flow between two parallel linear arrays of
modules--as, for example, between two registers or between a register
and memory. Figure III-C-1(a) shows the conventional interface between
each of the M modules of two arrays, for the parallel transfer of data
where there are no spare units. Figure III-C-1(b) shows the same two
arrays, with the addition of N1 - M spare modules on the input array
and N2 - M spare units on the output array. The function of the
commutation network is to provide connection between a set of M input
modules and a set of M output modules, and also, of course, to disconnect
the paths from (and to) faulty modules. It is assumed that the modules
of the two arrays are diagnosed by an external authority, which in
turn furnishes status signals to the commutation network.
Usually in the parallel flow of data it is of primary impor-
tance to preserve the order of the signals; i.e., the digits contained in
a set of M operating modules of the input array should be transferred in
order to a set of M operating output modules. In some applications it
might be possible to append an identifier to the signals traversing the
arrays, indicating the source of the signal. In this case the commuta-
tion network is not constrained to preserve order. We ascribe primary
importance to the order-preserving case, particularly in describing
implementations, because in conventional computer functions the order
of data is preserved; as, for example, in the transfer of data between
registers and arithmetic units. However, we also briefly consider
the non-order-preserving case because of the reduction in complexity
afforded by such commutation networks.*
FIG. III-C-1 INTERFACE BETWEEN MODULE TYPES: (a) WITHOUT SPARE UNITS, (b) WITH SPARE UNITS
All of the networks considered here provide for total
commutation, i.e., up to some number of faults at the input and some
number of faults at the output, all arrangements of the faults are
accommodated. Consideration of partial or nontotal commutation is an
important problem for future study.
In Sec. III-C-2-b we consider an order-preserving asynchronous
commutator in which the data "diffuses" along a register between the
inputs and outputs. The operating status of the input and output modules
is stored in the register.
A combinational approach, relating to both the order-preserving
and non-order-preserving cases, is initially discussed in Sec. III-C-2-c.
With this approach there is a set of possible connections between the
inputs and outputs, visualized as single-pole single-throw switches, which are
controlled by status signals. The design goal here is to minimize the
number of switches required for complete flexibility.
In Sec. III-C-2-d we briefly discuss the conventional implemen-
tation of these commutation networks as a simple combinational net, and
we also present a cellular implementation which is suggestive of a
Holland-Machine structure. Techniques for achieving commutation switches
which are failure tolerant are discussed in Sec. III-C-2-e.
* It is recognized that similarities exist between the design of efficient
commutation circuits and the design of telephone central-office systems.
(The reader is referred to Benes 24 for an analytic discussion of tele-
phone interconnection networks.) However, there are several important
distinguishing factors; e.g., in the commutation case all pairs of in-
put and output modules are not required to be connectable, and also the
network must function unattended. These factors are sufficient to re-
quire an extensive examination of both the interconnection system it-
self and the control modules by which the decisions are made.
b. A Sequential Commutation Network
This section describes a sequential buffer network that pro-
vides total order-preserving commutation with reasonable economy.
Although the network contains logical delay elements, no clock signals
are needed. If this scheme is used within a synchronous computer,
higher network speed will result than if the system clock were used in an
equivalent clocked network. The network is composed of a cascade
of stages, with parallel input and output facilities, and with serial
propagation within the cascade. In the propagating mode, the network
acts essentially as a speed-independent cascade,* specially modified to
accommodate invalid sources and loads. Another application of this mode
of operation is described in Section II-A-2-C-3.
1) Overall Behavior of the Network
The overall behavior will be explained with reference to
Fig. III-C-2. The network is seen to be a cascade of 2n stages, serving
n data sources and n data sinks.
The use of a double-length register is one of several
possible techniques for allowing a relative positive or negative dis-
placement of the indexes of corresponding source and load channels.
That is, depending upon the distribution of invalid source and data
channels, the index of a source channel may be greater than, less than,
or equal to the index of the corresponding load channel. Another approach
might be to use a register of n cells, in which propagation may be forced
in either direction.
Stages 1 through n receive information in parallel from
source channels s1 through sn respectively, and stages n + 1 through 2n
deliver information in parallel to load channels ℓ1 through ℓn. It is
assumed that some diagnostic process distinguishes the validity or
invalidity of the data channels. Information as to which load channel
shall receive data is stored within the corresponding cell of the
* Conceived by D. Muller. TM
cascade. In the identification of the validity of data source channels,
there are two design choices:
(a) The s signals may have a three-valued encoding,
e.g., 0, 1, and a third symbol, in which the third
symbol indicates an invalid source.
(b) Information as to the validity of a channel may
be stored within the network, at the correspond-
ing cell.
Choice b will be assumed here, because it leads to the use of the same
cells throughout the cascade.
Information as to the validity or invalidity of each data
channel, then, is stored at the corresponding cell. This storage is
assumed to be accomplished during a setup mode. The processing mode
has three phases: write-in, propagation, and read-out. The first and
last are parallel processes on the s and ℓ channels respectively. In
the second, information propagates toward stages of higher index, and
comes to rest in a stable configuration in such a way that symbols
obtained from valid source channels are collected in a contiguous string,
starting at the rightmost cell but skipping cells corresponding to
invalid load channels.
In the event that the number of valid symbols exceeds the
number of valid load channels, the leftmost symbols will be lost. If
the opposite is true, the leftmost load channels will be unused. The
design to be described provides that a special symbol indicating the
absence of valid data is available for entry to such channels.
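The end-to-end behavior just described (write-in, propagation, read-out) can be summarized by a purely functional model, independent of the cell design. The sketch below is an illustration under assumptions: the function name commutate and the use of None for the "absence of valid data" symbol are hypothetical choices, and invalid load channels are also reported as None.

```python
def commutate(symbols, source_ok, load_ok, absent=None):
    """Order-preserving transfer from valid sources to valid loads.

    Valid symbols are packed toward the highest-index valid load channels.
    Surplus symbols on the left are lost; unused loads receive `absent`.
    """
    data = [s for s, ok in zip(symbols, source_ok) if ok]
    out = [absent] * len(load_ok)
    for j in reversed(range(len(load_ok))):
        if load_ok[j] and data:
            out[j] = data.pop()       # rightmost remaining symbol
    return out


# Source 2 and load 3 are invalid; the symbols from sources 1, 3, 4 reach loads 1, 2, 4.
print(commutate([1, "x", 0, 1],
                [True, False, True, True],
                [True, True, False, True]))
```

With more valid symbols than valid loads the leftmost symbol is dropped, and with fewer the leftmost loads are left unused, as stated in the text.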
2) General Description of the Propagation Mode
Before describing the detailed design of the cells, the
overall behavior of the propagation mode will be described, with the aid
of Fig. III-C-3.
Each cell is composed of two stages, which may be con-
sidered to be identical for simplicity of description. The figure
illustrates a case in which an incident pattern 1, 0, 1 appears on
channels 1, 3, and 4 respectively and the pattern 1, 1, 0 is read out
to channels 5, 6, and 8, respectively. The special symbol stored in both
stages of cells 2 and 7 marks invalid channels. The φ symbols shown serve
two functions: they serve to indicate the absence of valid data, and
they serve to separate independent valid symbols; i.e., the network
is designed so that two valid symbols may never come to rest in adjacent
stages.*
FIG. III-C-2 BLOCK DIAGRAM OF A SHIFT-REGISTER COMMUTATING NETWORK
FIG. III-C-3 TYPICAL SYMBOL STATES IN AN ASYNCHRONOUS SHIFT-REGISTER COMMUTATOR
FIG. III-C-4 BLOCK AND STATE DIAGRAMS FOR A SPEED-INDEPENDENT MODULE
In the propagation mode, then, each stage acts as a
repeater of incident information, with temporary data storage. The
effect of storing the invalid-channel symbol in a stage is to remove its
temporary data storage capacity.
3) Description of the Cell Design
Since the network is derived from Muller's Speed-
Independent module, a brief review of that module will be given. Following
that, a state diagram will be derived, and then a logical realization.
Review of Muller's SI stage: Muller's cascade is illustrated
in Fig. III-C-4(a). The stages are indexed i - 1, i, i + 1. The inputs
to stage i are $w_f^i$ and $w_r^i$, which are obtained from $z_f^{i-1}$ and
$z_r^{i+1}$ respectively. All paths carry one of the three symbols 0, 1, φ,
and in the basic design $z_f^i = z_r^i$.
The state diagram for stage i is shown in Fig. III-C-4(b).
On the branches, symbols with index i - 1 appear at the $w_f$ input, and
symbols with index i + 1 appear at the $w_r$ input. The symbols in the
state circles are those presented on the $z_r$, $z_f$ outputs.
The state behavior provides for the temporary storage and
propagation of information, at a rate dependent on the time-response
characteristics of the stages. The feedback of information in the
* In the transient mode, a given symbol may be temporarily replicated
in two or more adjacent stages, but strings of replicas that are de-
rived from different sources will be separated by one or more φ sym-
bols. For example, the sequence ... 0 1 1 0 1 ... may appear momen-
tarily in the cascade as the string ... φ 0 φ 1 1 1 φ 1 φ 0 0 φ φ 1
..., and will come to rest in the form ... φ 0 φ 1 φ 1 φ 0 φ 1 ....
direction opposite to that of propagation serves to prevent accidental
elimination of data. This feature may be exploited to permit the block-
ing of the flow of information by forcing the feedback path to some
stage to a 0 or 1 status. The data will then accumulate in a contiguous
string of alternating spacer (φ) and data (0 or 1) symbols.
State behavior of an augmented cell: As described in
the general description of the propagation mode of the buffer network, the
effect of labeling a cell as invalid is to deprive it of its temporary
data-storage capability. This is accomplished by adding a fourth state,
the invalid state, to a basic Muller stage, in which the forward-feeding
output signal $z_f^i$ repeats the forward-incident $w_f^i$ signal and the
feedback output signal $z_r^i$ repeats the incident feedback $w_r^i$ signal,
each without delay. The
output functions in the other three states are exactly as in the Muller
stage.
There are several design possibilities for the transitions
into and out of the invalid state. Perhaps the simplest (at least for purposes
of describing a basic design) is to assume one special input signal $m_s$,
capable of forcing a transition into the invalid state from any other state,
and a second special input signal $m_r$, capable of setting the stage to some
other state, most naturally the φ state.*
The block diagram is shown in Fig. III-C-5(a) and the
state diagram is shown in Fig. III-C-5(b). The conventions of Fig.
III-C-4 apply to the inputs (with the addition of the $m^i$ input, whose
values are $m_s$, $m_r$, and $m_\phi$), but the outputs $z_f$ and $z_r$ are indicated
explicitly at the state circles. The dotted lines indicate possible
parallel data inputs and outputs.
For simplicity, two functions have not been shown in the
state diagram; these are the writing of data into a cell from an ex-
ternal source, and the permitting or inhibiting of propagation within
the cascade. The first is trivial, given the ability to accomplish the
second. One satisfactory way to inhibit propagation is to provide a
common signal to the second stage of all cells, that injects a 1 signal
into the $w_r$ input. This will serve to keep all such cells in the φ
state (at which state the cells naturally arrive following all complete
propagations). To permit propagation, these 1 signals are removed.
Even if the signals are not removed simultaneously, the presence of a φ
symbol within the driven cell prevents the loss of information.
* A more elegant means (requiring fewer special input lines) might be
to provide that a string of 0's and 1's be entered serially (repre-
senting invalid and valid cells), and then "frozen" into invalid and
non-invalid states upon special command.
FIG. III-C-5 BLOCK AND STATE DIAGRAMS OF FULL SHIFT-REGISTER CELL
By applying such a signal only at the last stage in the
cascade, information may be caused to "pile up" in a stationary pattern
at the end of the cascade. Removal of the signal will permit the discharge
of the information from the cascade.
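The propagation mode can be illustrated with a small asynchronous simulation. This is a sketch under stated assumptions, not the circuit of the report: the names PHI, INV, build, propagate, and read_out are hypothetical; data is assumed to be written into the first stage of each valid source cell; the left boundary presents the spacer; and a blocking 1 signal is held at the feedback input of the last stage so that data piles up at the end of the cascade. A stage accepts a data symbol when its predecessor holds data and its successor holds the spacer, and returns to the spacer when its predecessor holds the spacer and its successor holds data. Invalid stages are treated as transparent.

```python
PHI, INV = "phi", "inv"   # spacer symbol and invalid (transparent) stage


def build(symbols, source_ok, load_ok):
    """Two stages per cell: n source cells followed by n load cells."""
    stages = []
    for s, ok in zip(symbols, source_ok):
        stages += [s, PHI] if ok else [INV, INV]
    for ok in load_ok:
        stages += [PHI, PHI] if ok else [INV, INV]
    return stages


def neighbour(stages, i, step, boundary):
    """Symbol seen from stage i, looking through any invalid stages."""
    j = i + step
    while 0 <= j < len(stages):
        if stages[j] != INV:
            return stages[j]
        j += step
    return boundary


def propagate(stages, block=1):
    """Fire enabled stages, in sweep order, until the cascade is at rest."""
    stages = list(stages)
    changed = True
    while changed:
        changed = False
        for i, s in enumerate(stages):
            if s == INV:
                continue
            wf = neighbour(stages, i, -1, PHI)     # forward input
            wr = neighbour(stages, i, +1, block)   # feedback input
            if s == PHI and wf in (0, 1) and wr == PHI:
                stages[i] = wf                     # accept data from the left
                changed = True
            elif s in (0, 1) and wf == PHI and wr in (0, 1):
                stages[i] = PHI                    # data copied onward; release
                changed = True
    return stages


def read_out(stages, n):
    """Symbol held in the first stage of each load cell (None if no data)."""
    return [s if s in (0, 1) else None for s in stages[2 * n::2]]


rest = propagate(build([0, 0, 1, 1],
                       [True, False, True, True],
                       [True, True, False, True]))
print(read_out(rest, 4))
```

In this run source 2 and load 3 are marked invalid, and the three valid symbols come to rest, in order, in the cells of loads 1, 2, and 4. The sweep order used here is arbitrary; under the rules above a stage only accepts data when a spacer lies ahead of it, so symbols from different sources stay separated whatever the firing order.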
A logical realization: In the following illustrative
realization, the w and z signal variables and the x state variable are
encoded as pairs of binary variables, as follows:
$w_f = (w_{f1}, w_{f2})$, $w_r = (w_{r1}, w_{r2})$, $z_f = (z_{f1}, z_{f2})$,
$z_r = (z_{r1}, z_{r2})$, and $x = (x_1, x_2)$. The values corresponding to
the symbols of the state diagram are
    0       : 0 1
    1       : 1 1
    φ       : 0 0
    invalid : 1 0
The m variable is likewise encoded as a pair, $(m_s, m_r)$, and it is assumed
that $m_s \cdot m_r = 0$.
Table III-C-1 lists the symbolic and logical functions
governing entry to the four states, together with the output functions
of the states. The conditions for external entry of data are also
included, with $s_i$ = (0, 1) the external data, "write" = (0, 1) the
sampling function, and p = (0, 1) the "propagate release" function.
Table III-C-1
STATE-TRANSITION AND OUTPUT LOGIC FOR REGISTER CELL

State    Code    Entry function (logical)                      Entry function (Boolean)                                   Outputs
x        x1x2                                                                                                             zf1  zf2  zr1  zr2
invalid  10      m_s                                           $m_s$                                                      wf1  wf2  wr1  wr2
φ        00      m_r or (φ_{i-1} & (0_{i+1} or 1_{i+1}))       $m_r + \bar{w}_{f2} w_{r2} p$                              0    0    0    0
0        01      (0_{i-1} & φ_{i+1}) or (\bar{s}_i & write)    $w_{f2} \bar{w}_{f1} \bar{w}_{r2} p + \bar{s}_i \cdot write$   0    1    0    1
1        11      (1_{i-1} & φ_{i+1}) or (s_i & write)          $w_{f2} w_{f1} \bar{w}_{r2} p + s_i \cdot write$           1    1    1    1
FIG. III-C-6 LOGICAL REALIZATION OF SHIFT-REGISTER CELL
Figure III-C-6 shows a logical realization using flip-flops,
AND gates, OR gates, and inverters.
4) Summary
The scheme described is one of several possibilities.
Other asynchronous schemes should be investigated, as well as the
synchronous-sequential and combinational approaches. Some merits of the
present scheme are (1) its complete flexibility within the limits of
redundancy, (2) the linear growth in size with number of channels, and
(3) the absence of the need for a clock. These merits are achieved at
the cost of some control logic (not shown here) for the sequencing of the
various phases of operation, and of time (not calculated here). An
additional feature which may be useful in applications such as parallel
arithmetic is that the order of information among the input channels is
preserved at the output.
c. Combinational Commutation Networks--Minimization of Number
of Switches
1) Introduction
In this section and the two succeeding sections we present
a detailed examination of the combinational commutation networks. We
distinguish these networks as combinational because in estimating the
expected complexity, for given values of N1, N2, M it is convenient
to visualize the commutation network as a multiple-input multiple-output
combinational network where the inputs are the data lines from the input
arrays along with the status lines, and the outputs are the data lines to
the output array. Indeed, these commutation networks can be implemented
strictly as combinational circuits, working from the truth table; or for
some applications it might be convenient to incorporate some sequential
blocks.
At present we seek techniques for minimizing the complexity
of the commutation networks. In order to facilitate the synthesis we
propose to initially consider the commutation network as a switch net-
work, i.e. a net of single-pole single-throw switches. We can then state
the following problem relating to the synthesis of efficient commutation
networks:
For a given number M of operating modules in both the input
and output arrays, and a set of N1 - M spare input modules
and N2 - M spare output modules, find the switch network,
denoted as the minimal switch network, with the least num-
ber of switches such that all sets of M inputs and M out-
puts are connectable both when the ordering of the input
data is preserved at the output and when disordering is
allowed.
It should be noted that the minimization procedure based
upon this switch model is not necessarily an optimum approach for all
applications, but it is felt that efficient designs will generally result
from the consideration of the minimal switch networks as a point of em-
barkation.
We now distinguish types of switch networks according to
the number of intermediate levels present between the input and output
arrays. To be more specific, first consider the single-level switch
network of Fig. III-C-7(a). In this case the lines between the N1 inputs
and the N2 outputs represent switches that are either open or closed. A
double-level switch network is shown in Fig. III-C-7(b). In this case
there is a set of switches connecting the N1 inputs with the N' inter-
mediate collection points, and also a set of switches connecting the N2
outputs with the N' intermediate points. The extension to general multi-
level switch networks is straightforward.
FIG. III-C-7 SINGLE AND DOUBLE-LEVEL INTERCONNECTION SCHEMES: (a) SINGLE LEVEL SWITCH, (b) DOUBLE LEVEL SWITCH
One immediately observes that the required commutation
can be accomplished if the single-level switch network contains N1 N2
switches, but we will show that significantly fewer switches will suffice.
In addition the commutation can be accomplished by a double-level switch
containing N' = M intermediate collection points and M(N1 + N2) switches,
but once again this connection is not minimal.
We will demonstrate the techniques for the realization of
minimal single-level switch networks, for both the order-preserving and
non-order-preserving cases. For most values of the parameters M, N1, and
N2, double-level switch networks can be found which are less complex than
the corresponding minimal single-level network. Unfortunately, we do not
know the techniques for realizing precisely minimal double-level networks,
nor do we indeed know the optimum number of intermediate levels, but a
technique is presented for realizing double-level networks which appear
to be "near" minimal. Examples of networks which are less costly are
discussed, so as to provide a basis for future research.
First we will consider the case where the ordering of the
input signals is preserved at the output. Primarily intuitive methods
will be employed to establish that particular switch networks can perform
the required reconfiguration.
2) Single-Level Order-Preserving Network
The number of switches required for the single-level switch
network may be counted with the aid of Fig. III-C-8, in which the horizon-
tal lines represent input buses, the vertical lines represent output buses,
and a heavy dot represents a switch connecting an input and output bus.
For ease of visualization, the case N1 = 15, N2 = 15, M = 5 is illustrated.
(Although the example indicates a case where N1 = N2, the theory is applica-
ble for cases where there are an arbitrary number of spare modules for the
input and output arrays; these general results may be useful when the in-
put and output arrays are of different reliability.) Input 1 need be
connected to only N2 - M + 1 outputs to assure it of access to one of the
M valid outputs. In order to preserve the order of input and output
channels, input 2 must be able to cover the outputs of input 1 (since 1
may be inactive) plus one more (in case input 1 is active). Successive
inputs must thus cover the outputs of the preceding inputs, plus one
more, until the (N1 - M + 2)th input. That input need not cover the first
output, since if the first output is active it will be supplied a signal
by one of the preceding N1 - M + 1 inputs, one of which must surely carry
a valid signal. Successive inputs, then, need cover one output terminal
less.
The number of required switches, $S_{SL}^{(OP)}$, may be seen,
by inspection of Fig. III-C-8, to be
$$S_{SL}^{(OP)} = N_1 N_2 - 2\sum_{i=1}^{M-1} i = N_1 N_2 - M(M-1).$$
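The counting argument above can be checked mechanically. The sketch below, with hypothetical function names, places a switch between input i and output j exactly where the argument requires one (input i covers outputs max(1, i - (N1 - M)) through min(N2, N2 - M + i)), counts the switches, and verifies by brute force on a small case that every choice of M valid inputs and M valid outputs can be connected in order.

```python
from itertools import combinations


def single_level_switches(n1, n2, m):
    """Set of (input, output) pairs, 1-indexed, that carry a switch."""
    return {
        (i, j)
        for i in range(1, n1 + 1)
        for j in range(max(1, i - (n1 - m)), min(n2, n2 - m + i) + 1)
    }


def commutes_in_order(switches, n1, n2, m):
    """True if the k-th valid input always reaches the k-th valid output."""
    return all(
        all((a, b) in switches for a, b in zip(ins, outs))
        for ins in combinations(range(1, n1 + 1), m)
        for outs in combinations(range(1, n2 + 1), m)
    )


print(len(single_level_switches(15, 15, 5)))      # case drawn in Fig. III-C-8
small = single_level_switches(6, 7, 3)
print(len(small), commutes_in_order(small, 6, 7, 3))
```

For N1 = N2 = 15 and M = 5 the count is 225 - 20 = 205, in agreement with the formula.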
FIG. III-C-8 SINGLE-LEVEL ORDER-PRESERVING COMMUTATION NETWORK, N1 = N2 = 15, M = 5
3) Double-Level Order-Preserving Network
Under some conditions, a double-level network may be
advantageous. The structure of a possible network is shown in
Fig. III-C-9, in which the horizontal lines represent input and output
buses, and the vertical lines represent intermediate buses.
Considering the intermediate buses, labeled in numerical
order, 1, 2, 3, ..., we may note that each bus must be connected to an
active input. The first intermediate bus, 1, need be connectable only
to the first N1 - M + 1 inputs in order to serve the first valid
input; the second intermediate bus, 2, need be connectable only to inputs
2 through N1 - M + 2 in order to serve the second valid input, since, at
worst, if the first valid input is at input N1 - M + 1 (11, here), the
second valid input must occur at input N1 - M + 2, the third at input
N1 - M + 3, and so on, with the Mth at input N1 - M + M = N1.
FIG. III-C-9 DOUBLE-LEVEL ORDER-PRESERVING COMMUTATION NETWORK, N1 = N2 = 15, M = 5
The second set of switches, connecting the intermediate
buses to the outputs, follows the same construction. The number of
switches $S_{DL}^{(OP)}$ required for this two-level switch may be seen, by
inspection of the figure, to be
$$S_{DL}^{(OP)} = M(N_1 - M + 1) + M(N_2 - M + 1) = M(N_1 + N_2) - 2M(M - 1).$$
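A corresponding sketch for the double-level construction follows. The function names are hypothetical; intermediate bus k is given switches to inputs k through N1 - M + k and to outputs k through N2 - M + k, as described above, and the k-th valid input is routed through bus k to the k-th valid output.

```python
from itertools import combinations


def double_level_switches(n1, n2, m):
    """Switch sets (input, bus) and (bus, output), all 1-indexed."""
    in_side = {(i, k) for k in range(1, m + 1)
               for i in range(k, n1 - m + k + 1)}
    out_side = {(k, j) for k in range(1, m + 1)
                for j in range(k, n2 - m + k + 1)}
    return in_side, out_side


def route(valid_inputs, valid_outputs, in_side, out_side):
    """Paths (input, bus, output) in order, or None if a switch is missing."""
    paths = []
    pairs = zip(sorted(valid_inputs), sorted(valid_outputs))
    for k, (a, b) in enumerate(pairs, start=1):
        if (a, k) not in in_side or (k, b) not in out_side:
            return None
        paths.append((a, k, b))
    return paths


in_side, out_side = double_level_switches(15, 15, 5)
print(len(in_side) + len(out_side))               # case drawn in Fig. III-C-9
print(route({2, 5, 11, 12, 15}, {1, 3, 4, 9, 15}, in_side, out_side))
```

For N1 = N2 = 15 and M = 5 this gives 55 + 55 = 110 switches, compared with 205 for the single-level network.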
It is instructive to compare the two networks with respect
to the number of switches required. To be specific, we seek the relation-
ship among the variables N1, N2, and M such that $S_{SL}^{(OP)} > S_{DL}^{(OP)}$.
In order to simplify the analysis we will consider the case N1 = N2 = N.
Then in order that $S_{SL}^{(OP)} > S_{DL}^{(OP)}$ it is necessary that
$$N^2 - M^2 + M > 2MN - 2M^2 + 2M$$
which is equivalent to the condition that
$$N^2 - 2MN + M^2 > M$$
or
$$(N - M)^2 > M$$
or
$$N > M + \sqrt{M}.$$
The "break-even" values of N for various M's are given
in Table III-C-2; for values of N equal to and above those given, single-
level switching is more costly than double-level switching, at least
for the double-level switch described.
Table III-C-2
"BREAK-EVEN" VALUES OF N
    M       1   2   3   4   5   10   20   50
    N_BE    3   4   5   7   8   14   25   58
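The tabulated values follow directly from the inequality (N - M)^2 > M. As a hedged check, the sketch below (function names are illustrative) computes the smallest integer N satisfying the inequality and compares the two switch counts at that value for the case N1 = N2 = N.

```python
import math


def single_level(n, m):
    """Switch count of the single-level network for N1 = N2 = n."""
    return n * n - m * (m - 1)


def double_level(n, m):
    """Switch count of the double-level network for N1 = N2 = n."""
    return 2 * m * n - 2 * m * (m - 1)


def break_even(m):
    """Smallest integer N with (N - M)^2 > M."""
    return m + math.isqrt(m) + 1


for m in (1, 2, 3, 4, 5, 10, 20, 50):
    n = break_even(m)
    print(m, n, single_level(n, m), double_level(n, m))
```

Running this reproduces the row of break-even values in Table III-C-2; at each listed N the single-level count exceeds the double-level count, and at N - 1 it does not.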
It is conjectured that this double-level switch network
is the minimal double-level switch network (and in general the minimal
multilevel switch network) when the order is preserved not only on the
output buses, but also on all intermediate level buses. However, it is
clear that the order need not be preserved at the intermediate level,
and in fact we have found some examples which illustrate cases where
less costly order-preserving networks result when disordering at the
intermediate buses is allowed.* At present we have not been able to
adequately generalize these examples to provide large classes of economi-
cal networks, but it appears that only a small reduction in cost is
achieved when compared with the simple double-level realizations of
Fig. III-C-9.
We now proceed to investigate non-order-preserving switch
networks. Our motivation for studying such networks is based upon the
fact that the realizations appear to be less costly than the order-
preserving realizations,† and also that we envision future computers as
being composed of large reconfigurable arrays of identical programmable
modules in which case it might be feasible to consider an identifier
appended to the data on each signal line.
4) Single-Level Non-Order-Preserving Network
The switch network schematic as illustrated in Fig.
III-C-7(a) is suggestive of a graph called a bipartite graph (Ref. 2Ss).§ The
set of input modules is represented by a set of $N_1$ vertices, the output
modules by a set of $N_2$ vertices, and the switches by lines connecting
the two sets of vertices.

* The realization of sample networks is discussed further on in this
section when the techniques of non-order-preserving networks are described.
† It appears that the non-order-preserving networks offer significant
economy only if $N_1$ and $N_2$ are quite unequal.
§ The method for establishing that particular non-order-preserving switch
networks perform the required reconfiguration was formulated by B. Elspas
on the basis of the properties of bipartite graphs. Briefly, a graph
consists of certain points, called its vertices, and certain line segments
connecting vertices, called the edges of the graph. A graph where the
set of vertices is decomposed into two separate parts such that there are
edges only between these parts is called a bipartite graph. This type of
graph is commonly used to match a set of available jobs with a set of men,
when each man is qualified for certain of the jobs.
It is indicated in the footnote that the theory of
bipartite graphs can be applied to the problem of matching men and jobs.
A very powerful condition, known as the diversity condition (Ref. 233), has
been derived in order to establish whether a suitable job exists for each man:
    Suppose there are $\alpha$ men applying for positions.
    Then each man can be assigned if and only if
    for each group of $\delta$ men, for all $\delta$ = 1, 2,
    ..., $\alpha$, there are at least $\delta$ jobs for which
    they are collectively qualified.
The diversity condition can be used to determine whether
a particular single-level switch network can reconfigure M input and
output modules despite the occurrence of $N_1 - M$ or fewer failures in the
input array and $N_2 - M$ or fewer failures in the output array, as follows:
    The non-order-preserving switch network can
    perform the required reconfiguration if and
    only if each group of $\gamma$ input modules, for
    all $\gamma$ = 1, 2, ..., M, is connectable
    collectively to at least $\gamma + N_2 - M$ output modules,
    and if and only if each group of $\delta$ output
    modules, for all $\delta$ = 1, 2, ..., M, is
    connectable collectively to at least $\delta + N_1 - M$
    input modules.
It is convenient, in performing analyses utilizing the
diversity condition, to describe the switch network in terms of an
$N_1 \times N_2$ Boolean matrix B, where the entry $b_{ij}$ is 1 if the i-th input
module is connectable to the j-th output module, and 0 otherwise. Then
the reconfigurability of a particular network can be stated in terms of
the matrix B as follows:
    The switch network can perform the required
    reconfiguration if for the matrix B, a vector
    which is the Boolean sum of any $\gamma$ rows,
    $\gamma$ = 1, 2, ..., M, has weight (number of
    entries which are one) at least $\gamma + N_2 - M$,
    and if the vector which is the Boolean sum
    of any $\delta$ columns, $\delta$ = 1, 2, ..., M, has
    weight at least $\delta + N_1 - M$.
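The matrix form of the diversity condition lends itself to a direct, if exhaustive, check. The sketch below is illustrative only: the function names and the small 3 x 3 test matrices are assumptions introduced here, and the enumeration over all row and column groups is practical only for small arrays.

```python
from itertools import combinations

def boolean_sum_weight(vectors):
    """Weight (number of ones) of the element-wise OR of a group of 0/1 vectors."""
    return sum(1 for column in zip(*vectors) if any(column))

def can_reconfigure(B, M):
    """Check the row and column weight conditions stated above for matrix B."""
    n1, n2 = len(B), len(B[0])
    columns = [list(col) for col in zip(*B)]
    for g in range(1, M + 1):
        # Any g rows must reach at least g + N2 - M outputs.
        if any(boolean_sum_weight(rows) < g + n2 - M for rows in combinations(B, g)):
            return False
        # Any g columns must reach at least g + N1 - M inputs.
        if any(boolean_sum_weight(cols) < g + n1 - M for cols in combinations(columns, g)):
            return False
    return True

# Hypothetical toy case: N1 = N2 = 3, M = 2.
cyclic = [[1, 1, 0],
          [0, 1, 1],
          [1, 0, 1]]
identity = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]

print(can_reconfigure(cyclic, 2))    # each row has N2 - M + 1 = 2 ones
print(can_reconfigure(identity, 2))  # one switch per module leaves no spare path
```

In the toy case the cyclic matrix passes and the identity matrix fails, since a single switch per module cannot tolerate the loss of the one module it connects to.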
The above condition will now be utilized in connection
with a specific case. Consider the switch network shown in Fig. III-C-10
and its matrix B below.
FIG. III-C-10 SINGLE-LEVEL NON-ORDER-PRESERVING COMMUTATION NETWORK, $N_1 = N_2 = 15$, M = 5 (inputs 1 to $N_1$, outputs 1 to 15; drawing TA-5580-17)
            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
       1    1  1  1  1  1  1  1  1  1  1  1  0  0  0  0
       2    0  1  1  1  1  1  1  1  1  1  1  1  0  0  0
       3    0  0  1  1  1  1  1  1  1  1  1  1  1  0  0
       4    0  0  0  1  1  1  1  1  1  1  1  1  1  1  0
       5    0  0  0  0  1  1  1  1  1  1  1  1  1  1  1
       6    1  0  0  0  0  1  1  1  1  1  1  1  1  1  1
       7    1  1  0  0  0  0  1  1  1  1  1  1  1  1  1
B =    8    1  1  1  0  0  0  0  1  1  1  1  1  1  1  1
       9    1  1  1  1  0  0  0  0  1  1  1  1  1  1  1
      10    1  1  1  1  1  0  0  0  0  1  1  1  1  1  1
      11    1  1  1  1  1  1  0  0  0  0  1  1  1  1  1
      12    1  1  1  1  1  1  1  0  0  0  0  1  1  1  1
      13    1  1  1  1  1  1  1  1  0  0  0  0  1  1  1
      14    1  1  1  1  1  1  1  1  1  0  0  0  0  1  1
      15    1  1  1  1  1  1  1  1  1  1  0  0  0  0  1
Clearly no network is less costly since each row in the
above matrix contains the minimum allotment of $N_2 - M + 1 = 11$ ones.
It now remains to demonstrate that the Boolean sums of rows (and
equivalently columns, since the matrix is symmetric except for cyclic
permutations) satisfy the weight conditions. Form a vector V given by
$$V = \sum_{\ell=1}^{m} V_{k_\ell}, \qquad m \le M = 5,$$
where $V_k$ is the k-th row vector of B and $\sum$ represents the Boolean sum
of the vectors.
It is easy to verify that the minimum weight of V, $W_{min}(V)$,
is given by $W_{min}(V) = N_2 - M + m$, which is the weight when
$k_1, k_2, \ldots, k_m$ are successive integers, where 16 is taken to be
equivalent to 1. (A single row has weight $N_2 - M + 1$, and each further
successive row contributes one new entry.)
Thus the diversity condition is satisfied and the indicated switch
network can perform the required reconfiguration. Although for the example
shown it was specified that $N_1 = N_2$, a minimal single-level
non-order-preserving switch network with, for example, $N_1 > N_2$ can be
synthesized from a Boolean matrix whose first column is a vector
consisting of $N_1 - M + 1$ ones followed by M - 1 zeros, and whose succeeding
columns are any set of distinct cyclic shifts of this column. For this
network the number of switches $S_{SL}^{(nop)}$ is given by
$$S_{SL}^{(nop)} = [\min(N_1, N_2)]\,[\max(N_1, N_2) - M + 1].$$
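The weight argument for the cyclic matrix can be checked by brute force. The sketch below rebuilds the 15 x 15 matrix of the example from its first row and enumerates all groups of up to M rows and columns; the helper names are illustrative assumptions.

```python
from itertools import combinations

N1 = N2 = 15
M = 5

# First row: N2 - M + 1 ones followed by M - 1 zeros; row i is the first row
# shifted cyclically to the right by i places.
first_row = [1] * (N2 - M + 1) + [0] * (M - 1)
B = [first_row[-i:] + first_row[:-i] for i in range(N1)]

def weight_of_boolean_sum(vectors):
    """Number of positions in which at least one of the vectors has a one."""
    return sum(1 for column in zip(*vectors) if any(column))

def min_weight(vectors, m):
    """Smallest Boolean-sum weight over all groups of m vectors."""
    return min(weight_of_boolean_sum(group) for group in combinations(vectors, m))

columns = [list(c) for c in zip(*B)]
row_minima = [min_weight(B, m) for m in range(1, M + 1)]
col_minima = [min_weight(columns, m) for m in range(1, M + 1)]
switches = sum(map(sum, B))

print(row_minima, col_minima, switches)
```

For this example the minima over rows and over columns come out as 11, 12, 13, 14, 15 for m = 1, ..., 5, which meets the required $m + N_2 - M$ with equality, and the switch count agrees with $\min(N_1, N_2)[\max(N_1, N_2) - M + 1] = 165$.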
5) Double-Level Non-Order-Preserving Network
Here we are concerned with synthesizing double-level
non-order-preserving switch networks which are approximately minimal.
Techniques are presented from which double-level networks can be
synthesized, although at present we do not know how close they are to minimal.
The general structure of the double-level network to be
investigated is illustrated in Fig. III-C-11. It consists of a
distributing network whose function is to distribute the information from
any arbitrary set of M input lines onto any M of N' intermediate lines,
followed by a collecting network whose function is to collect the signals
appearing on any M intermediate lines for delivery to any M output lines.*
In deriving the number of switches required by the collecting network we
note that it performs the same function as a single-level
non-order-preserving network with N' input channels. Hence the required
number of collecting switches $S_C$ is
$$S_C = [\min(N_2, N')]\,[\max(N_2, N') - M + 1].$$
The number of switches required by the distributing
network is considered below. It is not clear that this method of arbitrary
distribution followed by collection is indeed the most efficient
commutation technique, but since networks which are significantly more
economical than single-level realizations did result in example cases, we
will briefly pursue some procedures for synthesis.

* The double-level network as shown, with the distributing network
associated with the inputs, is based upon the assumption that $N_1 > N_2$.
If $N_2 > N_1$ then the distributing network should be associated with the
outputs for optimum economy.
FIG. III-C-11 GENERAL DOUBLE-LEVEL NON-ORDER-PRESERVING NETWORK (inputs 1 to $N_1$, distributing network, intermediate lines 1 to N', collecting network, outputs 1 to $N_2$; number of switches in the collecting network = $[\min(N_2, N')][\max(N_2, N') - M + 1]$; drawing TA-5500-20)
Consider first the case where the number of intermediate
lines assumes a minimal value, namely N' = M. Then the function of the
distributing network is identical to that of the single-level network
with N' = M output channels, and the total number of switches required,
$S_{DL}^{(nop)}$, is
$$S_{DL}^{(nop)} = M[N_1 - M + 1] + M[N_2 - M + 1] = M[N_1 + N_2 - 2M + 2].$$
It is noted that this result is identical to the number of switches
required for the order-preserving case.
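The equality with the order-preserving count can be seen by expanding the bracket. The short derivation below uses only the two formulas stated in this section; the numerical line takes the parameters $N_1 = N_2 = 15$, M = 5 of the running example.

```latex
\begin{align*}
S_{DL}^{(nop)} &= M\,[N_1 + N_2 - 2M + 2] \\
               &= M(N_1 + N_2) - 2M^2 + 2M \\
               &= M(N_1 + N_2) - 2M(M - 1) \;=\; S_{DL}^{(op)}, \\[4pt]
\text{e.g. } N_1 = N_2 = 15,\ M = 5:\quad
S_{DL}^{(nop)} &= 5\,(15 + 15 - 10 + 2) = 110 .
\end{align*}
```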
However, specifying N' = M does not generally result in
the most economical double-level networks. For example consider a
distributing network, with parameters $N_1 = 15$, N' = 6, M = 5 ($N_2 = 15$),
described by the following matrix, B'.

[Matrix B': 15 rows (input channels 1 to 15) by 6 columns (intermediate
lines 1 to 6 = N'). As far as is legible in the source, ten rows carry three
ones in columns 1 to 5, and five rows carry two ones in columns 1 to 5
together with a one in column 6; the column positions of the individual
ones are not recoverable.]
We will show that the double-level switch network
containing a distributing network related to the above matrix B', with the
addition of the appropriate collecting matrix, is more economical than
the network illustrated in Fig. III-C-10. First it is necessary to
prove that for the indicated distributing network each set of M input
channels is connectable to M distinct intermediate lines.
This will be accomplished by using the diversity condition
as related to the matrix B'; i.e., it is necessary to show that the
weight of the vector formed as the Boolean sum of any j rows, j = 1, ...,
M, is at least j. Clearly, the condition is satisfied for j = 1, 2, 3
since the weight of each row is 3, and the condition is also satisfied
for j = 4 since all row pairs differ in at least 2 places (i.e., maximum
overlap* is 2), so that any two rows already cover 4 columns. However, we
note that the number of vectors of weight 3 with overlap exactly 2 is
limited to 4; hence if we consider 5 vectors there must be at least one
row pair with overlap not exceeding 1. Then the Boolean sum of any 5 rows
yields a vector of minimum weight 5, thus establishing the diversity
condition.
It can be shown that the 4 underlined entries in the B'
matrix can be discarded without affecting the diversity condition. Hence
the distributing network contains 3(15) - 4 = 41 switches, which when
added to the 6(15 - 5 + 1) = 66 switches required for the collecting
array indicates a total of 107 switches as compared with the 110 switches
for the double-level network with N' = M. Admittedly the relative
improvement in economy is not startling, and indeed it appears that for
$N_1 \approx N_2$, the double-level network with N' = M is quite economical; but
if $N_1$ and $N_2$ are somewhat unequal, significant economy is afforded by
allowing N' > M.
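The argument can be illustrated numerically. Because the entries of B' are not legible in the source, the sketch below uses a stand-in matrix with the same shape and row weight: ten rows formed from all triples of lines 1 to 5, and five rows formed from a pair of lines 1 to 5 plus line 6. That choice of rows is an assumption made here for illustration, and the four discardable entries are taken from the text rather than modelled.

```python
from itertools import combinations

N1, N2, N_PRIME, M = 15, 15, 6, 5

# Stand-in for B' (assumed, not the report's matrix): each row is the set of
# intermediate lines reachable from one input channel.
rows = [set(c) for c in combinations(range(1, 6), 3)]   # 10 rows within lines 1..5
rows += [{i, i % 5 + 1, 6} for i in range(1, 6)]        # {1,2,6}, {2,3,6}, ..., {5,1,6}

# Diversity condition for the distributing network: any j rows, j <= M,
# must reach at least j distinct intermediate lines.
diversity_ok = all(
    len(set().union(*group)) >= j
    for j in range(1, M + 1)
    for group in combinations(rows, j)
)

# Switch counts quoted in the text.
distributing = 3 * N1 - 4                                           # 4 entries discarded
collecting = min(N2, N_PRIME) * (max(N2, N_PRIME) - M + 1)
total = distributing + collecting
baseline = M * (N1 + N2 - 2 * M + 2)                                # case N' = M

print(diversity_ok, distributing, collecting, total, baseline)
```

With these assumptions the check passes and the totals agree with the 107 and 110 switches quoted above.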
Unfortunately we have not formulated deterministic
techniques for deriving the optimum value of N', nor do we know techniques
for synthesizing distributing networks for all values of the parameters
$N_1$, $N_2$, M; but by a cut-and-try process it is generally possible to
synthesize reasonably economical networks. The most efficient distributing
networks we have found are based upon balanced incomplete block designs.
* The overlap of two vectors is defined as the weight of the dot product.
Before proceeding to a discussion of the distributing
networks based upon block designs it is convenient to consider the
problem of determining a lower bound on the weight of the Boolean sum
of vectors, subject to certain constraints on the structure of the
vectors. Consider a set of vectors of weight W and maximum overlap $\lambda_{max}$.
It can be shown from a result presented in connection with another
problem (Ref. *s2) that the weight of the Boolean sum of j vectors, each of
weight W with a maximum overlap $\lambda_{max}$, is greater than j provided the
inequality
j>
w(max +l )
is satisfied.
It is not difficult to show that if the inequality is
satisfied for some integer j, then it is satisfied for all integers less
than j. Hence if the matrix for a distributing network consists of rows
of weight W with a maximum overlap $\lambda_{max}$, then all sets of m inputs are
connectable if the inequality
(m )k + 1
max
m
is satisfied. We will now demonstrate the technique of forming distribu-
ting networks from block designs.
Briefly, a block design* constitutes a multiparameter
arrangement of objects, which for present purposes may conveniently be
represented as a matrix of zeros and ones. The incidence matrix B" of a
so-called balanced incomplete block design (BIBD) with parameters
(v, k, b, r, $\lambda$) has b rows, v columns, k ones per row, and r ones per
column, and is such that the dot product of every pair of columns is just
$\lambda$. The well-known identities
$$vr = bk$$
$$r(k - 1) = \lambda(v - 1)$$
must be satisfied.

* For a complete discussion of block designs the reader is referred to
Ref. 121.
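The two identities give a quick consistency check on any proposed parameter set. The sketch below solves the second identity for $\lambda$ and applies it to the three designs whose parameters (v, k, b, r) are quoted in this section; the function name is illustrative, and the identities are necessary conditions only, so passing the check does not by itself establish that a design exists.

```python
def bibd_lambda(v, k, b, r):
    """Return the lambda implied by vr = bk and r(k - 1) = lambda(v - 1),
    or None if the parameters violate either identity."""
    if v * r != b * k:
        return None
    lam, remainder = divmod(r * (k - 1), v - 1)
    return lam if remainder == 0 else None

# Parameter sets (v, k, b, r) quoted for the designs used in this section.
designs = {
    "all triples of 6 lines": (6, 3, 20, 10),
    "63-row design": (28, 4, 63, 9),
    "45-row design": (10, 2, 45, 9),
}

implied = {name: bibd_lambda(*params) for name, params in designs.items()}
print(implied)
```

Note that for the design formed from all triples of six lines the identities imply $\lambda = 4$, whereas the largest overlap between two rows of that design is 2; the column parameter $\lambda$ and the row overlap $\lambda_{max}$ are distinct quantities.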
Generally we will identify the rows of B" with the switch
connections from the input channels (i.e., $N_1 = b$, N' = v), since for a
BIBD $v \le b$. (Otherwise a distributing network with $N' > N_1$ would result,
yielding an inefficient network.) However, in order to determine whether
the diversity condition is satisfied for the case of interest it is
necessary to know the maximum value of the dot product $\lambda_{max}$ of any pair
of rows of B".
For the case where the rows of B" consist of all of the
$\binom{v}{k}$ combinations of vectors of length v each with k ones,
$\lambda_{max} = k - 1$. For example, the matrix $B''_1$ for the block design of
this type with v = 6, r = 10, b = 20, k = 3, $\lambda = 4$ (so that
$\lambda_{max} = 2$) is shown below.
[Matrix $B''_1$: 20 rows (input channels 1 to 20) by 6 columns (intermediate
lines 1 to 6); the rows are the 20 distinct vectors of length 6 containing
exactly three ones. The column positions of the ones are not legible in
the source.]
Applying the inequality concerning the weight of the
Boolean sum of vectors, we note that the matrix $B''_1$ describes a valid
distributing network for all m satisfying m < 5.
Another case where $\lambda_{max}$ is known for a BIBD is for $\lambda = 1$,
in which case $\lambda_{max} = 1$. Many examples of BIBD's with $\lambda_{max} = 1$ have
been tabulated (Refs. 121, 4s).
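The design built from all triples of six lines is small enough to examine exhaustively. The sketch below generates its rows, measures the largest overlap, and compares the true minimum Boolean-sum weights with a standard counting bound, weight >= jW^2 / (W + (j - 1)$\lambda_{max}$), which guarantees a weight of at least j whenever W(W - 1) >= (j - 1)$\lambda_{max}$. This bound is an assumption introduced here for illustration; it may differ in form from the inequality of the report, which is not legible in the source.

```python
from itertools import combinations

V, K = 6, 3

# Rows of the incidence matrix: every set of K lines out of V.
rows = [frozenset(c) for c in combinations(range(1, V + 1), K)]

# Largest overlap (dot product) between two distinct rows.
lam_max = max(len(a & b) for a, b in combinations(rows, 2))

def min_union(rows, j):
    """Smallest number of lines reached by any group of j rows."""
    return min(len(frozenset().union(*group)) for group in combinations(rows, j))

def largest_guaranteed_j(w, lam_max):
    """Largest j with W(W - 1) >= (j - 1) * lam_max (assumed counting bound)."""
    return 1 + (w * (w - 1)) // lam_max

minima = [min_union(rows, j) for j in range(1, 6)]
print(len(rows), lam_max, minima)
print(largest_guaranteed_j(3, 2), largest_guaranteed_j(4, 1))
```

Under this assumed bound the guarantee extends to j = 4 for W = 3, $\lambda_{max}$ = 2 and to j = 13 for W = 4, $\lambda_{max}$ = 1. The exhaustive minima show that groups of five rows of the triple design still reach five lines, even though the bound alone does not guarantee it.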
It is of interest to compare the number of switches in the
various types of networks (minimal or approximately minimal) distinguished
as single-level order-preserving, double-level order-preserving,
single-level non-order-preserving, double-level non-order-preserving (N' = M),
and double-level non-order-preserving (N' > M). The results for two examples
are tabulated in Table III-C-3 for commutation networks with parameters
$N_1 = 20$, $N_2 = 6$, M = 5 and $N_1 = 63$, $N_2 = 28$, M = 13. The appropriate
distributing network for the former example (illustrated with the
appropriate collecting network in Fig. III-C-12) is based upon the matrix
$B''_1$ with N' = 6, and the distributing network for the latter case is based
upon the block design with parameters b = 63, v = 28, k = 4, r = 9,
$\lambda = 1$, in which case N' = 28. (It is easily verified that the "pertinent
inequality" is satisfied for m < 13 with these parameters.)
Table III-C-3
COMPARISON OF SWITCH NETWORKS

                                              N_1 = 20, N_2 = 6,    N_1 = 63, N_2 = 28,
                                                    M = 5                 M = 13
Single-level order-preserving                        100                   1608
Double-level order-preserving                         90                    871
Single-level non-order-preserving                     96                   1428
Double-level non-order-preserving (N' = M)            90                    871
Double-level non-order-preserving (N' > M)            72                    644
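Most entries of Table III-C-3 follow from the closed-form counts given in this section. The sketch below recomputes the first four rows for both parameter sets and the last row for the first example only. The single-level order-preserving formula $N_1 N_2 - M^2 + M$ for unequal $N_1$ and $N_2$ is an assumption: it is the natural extension of the $N_1 = N_2$ expression used in the break-even analysis and it reproduces both tabulated values. The last-row entry for the second example depends on details of the distributing network that are not given here, so it is not recomputed.

```python
def single_level_op(n1, n2, m):
    # Assumed extension of N^2 - M^2 + M to unequal array sizes.
    return n1 * n2 - m * (m - 1)

def double_level_op(n1, n2, m):
    return m * (n1 + n2) - 2 * m * (m - 1)

def single_level_nop(n1, n2, m):
    return min(n1, n2) * (max(n1, n2) - m + 1)

def double_level_nop_minimal(n1, n2, m):
    # Case N' = M.
    return m * (n1 + n2 - 2 * m + 2)

def collecting(n2, n_prime, m):
    return min(n2, n_prime) * (max(n2, n_prime) - m + 1)

def first_four_rows(n1, n2, m):
    return [f(n1, n2, m) for f in (single_level_op, double_level_op,
                                   single_level_nop, double_level_nop_minimal)]

example_1 = first_four_rows(20, 6, 5)
example_2 = first_four_rows(63, 28, 13)

# Last row, first example: 20 inputs with 3 switches each, plus the collecting network.
example_1_block_design = 20 * 3 + collecting(6, 6, 5)

print(example_1, example_1_block_design)
print(example_2)
```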
FIG. III-C-12 DOUBLE-LEVEL NON-ORDER-PRESERVING COMMUTATION NETWORK, $N_1 = 20$, $N_2 = 6$, N' = 6, M = 5, 72 SWITCHES (inputs 1 to 20, intermediate lines 1 to 6, outputs 1 to 6; drawing TA-5580-16)
It is seen that the double-level non-order-preserving
switch networks with N' > M are clearly the most economical. Admittedly
these examples are somewhat artificial since the parameters were chosen
so as to be consistent with the parameters of known balanced incomplete
block designs. We cannot state a formal procedure for synthesizing
economical double-level networks for all values of parameters $N_1$, $N_2$,
M, but the following method appears to result in "fairly good" networks.
Choose a BIBD with parameter $\lambda = 1$; parameter k a minimum
such that the inequality for the number of active channels M is satisfied;
parameter $b = N_1$;* and parameter v > M. If several designs satisfying
these properties exist, then a cut-and-try procedure is employed in order
to select the design which results in the most economical network.
Before concluding this section we can present an order-
preserving network which is more economical than a double-level network
of the type shown in Fig. III-C-9. The network for parameters $N_1 = 45$,
$N_2 = 4$, M = 3 consists of a distributing network based upon the BIBD
with parameters b = 45, k = 2, v = 10, r = 9, $\lambda = 1$, and a special
collecting network consisting of $vN_2 = N'N_2$ switches.† Hence the total
number of switches required in the double-level network is
2(45) + 10(4) = 130. The double-level switch network based upon the
structure of Fig. III-C-9 requires 3(45 + 4) - 2(3)(2) = 135 switches.
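This example can be checked directly, since a design with k = 2 and $\lambda = 1$ on 10 lines necessarily consists of all 45 pairs of lines. The sketch below builds those rows, confirms the design parameters and the diversity condition for M = 3, and repeats the two switch counts; the variable names are illustrative.

```python
from itertools import combinations

N1, N2, M, V = 45, 4, 3, 10

# Each input channel connects to one distinct pair of the 10 intermediate lines.
rows = [frozenset(pair) for pair in combinations(range(1, V + 1), 2)]

# Design parameters: r = rows per line, largest overlap between two rows.
r_values = {line: sum(line in row for row in rows) for line in range(1, V + 1)}
lam_max = max(len(a & b) for a, b in combinations(rows, 2))

# Any j <= M operating inputs must reach at least j distinct intermediate lines.
diversity_ok = all(
    len(frozenset().union(*group)) >= j
    for j in range(1, M + 1)
    for group in combinations(rows, j)
)

distributing = sum(len(row) for row in rows)             # 2 switches per input
collecting = V * N2                                      # special collecting network
total = distributing + collecting
order_preserving = M * (N1 + N2) - 2 * M * (M - 1)       # structure of Fig. III-C-9

print(len(rows), lam_max, diversity_ok, total, order_preserving)
```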
* If no design with $b = N_1$ exists then a design with $b \approx N_1$ can be
altered by adding or deleting a few rows, so as to form an incidence
matrix containing $N_1$ rows.
† Clearly this network is order-preserving since the special collecting
network can transfer any disordered pattern on the intermediate lines
onto the output lines so that the order at the input is preserved.

Our primary purpose in presenting this example (which is
somewhat pathological) is to indicate that many theoretical questions
remain unanswered concerning commutation networks which are minimal in
the context of the switch-network models. We have presented methods
for synthesizing minimal single-level networks, both order-preserving
and non-order-preserving, for all values of the parameters $N_1$, $N_2$, M.
It was shown that significant economy is realized if double-level
realizations are considered. Techniques were presented for synthesizing
double-level switches which, although shown to be nonminimal, represent
an adequate engineering solution. For the practical case of an order-
preserving network with $N_1 \approx N_2$, the indicated network appears to be
close to minimal. For future work in this area the following studies
are recommended.
(1) Consider the derivation of lower bounds on the number of
    required switches, with an arbitrary number of intermediate
    levels, in order to indicate how close to minimal are the
    networks presented in this report.
(2) For the order-preserving case with multiple levels, consider
    additional examples where disordering of the signals on the
    intermediate levels is allowed.
(3) Consider other tools besides balanced incomplete block designs
    for the specification of distributing networks. Partially
    balanced incomplete block designs have been studied extensively
    and many more examples are tabulated of these than of BIBD's.
    In addition the structures of matrices used to specify
    error-correcting codes appear to be applicable here.
(4) Consider the design of partial commutation switches (defined
    in Sec. III-C-2-a).
d. Setup and Control Circuits
In the previous section we considered a model which visualized
the commutation circuits as a network of single-pole single-throw
switches. Our motivation for studying that model was based upon our
belief that logic designers in implementing commutation circuits might
prefer to employ realizations in which the commutation function is
executed by combinational means. In this case the switch network
is programmed by an external executive when reconfiguration is required.
The switch-network models which were presented should provide an
adequate basis for design.
In addition it is possible to synthesize the commutation
network by combinational techniques exclusively. The circuit, as
relating to the single-level switch network, can consist of a net
containing $N_1 + N_2$ inputs indicating the status of each of the modules
of the input and output arrays, $N_1$ signal inputs, and $N_2$ signal outputs.
Indeed, the resultant net as synthesized from a truth table
will be quite large, although some simplification is possible by the
consideration of "don't cares." The circuit as relating to the double-
level network will consist of two combinational parts--the first part
containing the $N_1 + N_2$ status inputs and the $N_1$ signal inputs but only
N' outputs, and the second part containing N' signal inputs, $N_1 + N_2$
status inputs, and $N_2$ outputs. We have not yet considered the estimation
of the complexity of these realizations.
It is not immediately clear that the most economical implemen-
tation is afforded by incorporating exclusively combinational logic for
the control. Moreover, some simplification could be realized by noting
that all connections need not be established anew after each failure;
rather, the circuit could "translate" to different connections with only
slight modification in organization. Comprehensive study will be required
to specify synthesis techniques which will minimize the overall cost of
the combinational commutation switch, and indeed also to ascertain the
specific advantages of the combinational approach as compared with, for
example, the diffusion circuit discussed in Sec. III-C-2-b.
Of major importance in the consideration of the implementations
are methods for ensuring that the resultant circuits are failure-tolerant.
This question is considered briefly in Sec. III-C-2-e.
As in our other studies, our aim with respect to the design of
commutation circuits is to present realizations which achieve reliability
at low cost and which are compatible with the anticipated future tech-
nology. In accordance with this goal we chose to consider cellular
implementations of the commutation networks. Although the circuits
based upon this approach will be costly compared with the combinational
types, considering component count exclusively, they do offer the
following advantages:
(1) The network is composed of a rectangular array of
identical cells.
(2) The control operation is straightforward in that
the paths between the input and output channels
are formed automatically.
(3) The networks can be diagnosed for faults in a
straightforward manner.
(4) It appears that low-cost redundancy is easily
incorporated into the design, permitting the by-
passing of faulty cells.
One type of cellular order-preserving commutation switch
which has been conceived, shown in Fig. III-C-13, is based upon the
structure of the Holland Machine (Ref. IS3). The purpose here is to first
delineate distinct nonintersecting paths between each of the M operating
input and output channels. (It is assumed that the paths are formed
essentially anew upon the detection of each failed input or output
module.) Once the appropriate paths have been defined, the network
returns to the operating state in which data is transferred between
input and output modules, essentially through combinational logic
attendant to the cells.
FIG. III-C-13 HOLLAND-TYPE CELLULAR COMMUTATION NETWORK (panels (a), (b), and (c); inputs 1 to $N_1$, outputs 1 to $N_2$, columns 1 to M; drawing TA-5580-20)

The path-defining operation can be described as follows
[Fig. III-C-13(a)]. The network consists in general of a rectangular
array of $\max(N_1, N_2) \times M$ cells. The input and output modules which
are operating are distinguished, and a path commencing at the first
operating input seeks the first operating output by the following
strategies, executed in preferred order by the cells.
(1) An attempt is first made to establish a path
    between the cell in question and the cell
    immediately above.
(2) If a path has already been defined including the cell
    above, or if the cell in question is on the upper
    boundary of the array, or if the cell "northeast"
    has been included in a path,* then an attempt is made
    to establish a path to the adjacent cell on the right.
(3) If connection with the adjacent right cell is not
    possible, then an attempt is made to establish a
    path to the cell immediately below.
The path is complete when it reaches the cell adjacent to an
operating output module. The inputs and outputs required of such cells
for the path formation are shown in Fig. III-C-13(c). Since each cell
is to deliver a signal to one of three adjacent cells, two flip-flops
are required for each cell.
Other cellular realizations have been conceived, relating to
the minimal switch-network designs of the previous section. At present
we have not formulated well-defined measures for the comparison of the
various commutation-circuit realizations, but it is felt that a major
factor will be the cost of ensuring failure-tolerant operation. In
the following section we discuss techniques for synthesizing redundant
commutation switches.
* This last restriction is necessary to avoid the creation of a dead-
end path, as would result if, in defining the path between input 3 and
output 3 of Fig. III-C-13(b), connection was permitted to the cell
in row 2, column 2.

e. Failure-Tolerant Interconnection Networks
Here we are concerned with establishing techniques whereby the
reliability of commutation circuits can be improved. Before proceeding
to synthesis studies it is appropriate to question the need for reliable
commutation circuits. Most studies concerned with the estimation of
the reliability of reconfigurable computers have either neglected the
reliability of the interconnection networks, or have considered the
commutation circuits as part of the hard core. In either case it has
been assumed that the amount of equipment allocated for the commutation
networks is a negligible portion of the overall computer. Our studies,
as well as others (Ref. 3°2), have indicated that the reconfiguration method is
feasible only if the blocks which are replaced are sufficiently small.
Admittedly this is imprecise, and indeed it is recommended that
consideration be given to quantitative determinations* of the optimum-
size replaceable network block, but at present it will be sufficient to
note that the self-repair method should probably be applied at present
to such regularly structured units as a block of memory, or a portion
of a register or adder. In addition, such blocks as the programmable
control units to be described in Sec. III-C-4 appear to be of approxi-
mately the proper complexity for replacement.
We have made a brief qualitative assessment of the properties
of the fault-masking techniques discussed in Sec. II-A-I as applied to
both the diffusion register and the combinational and cellular commuta-
tion networks. We note that a commutation network is essentially a
block with many outputs which requires, on the average, little equipment
per output, although it requires a large overall amount of equipment.
The voting type of redundancy is of little value here since the voters
will form a significant portion of the network, and unlike the cases
discussed in Sec. II-A-2-a, the voters cannot be replicated conveniently.
Thus the reliability of a large set of nonredundant voters will essen-
tially specify the reliability of the entire net. Particular timing
problems are encountered in applying the voting technique to the
diffusion register of Sec. III-C-2-b, in that each copy of a replicated
register operates asynchronously. It is possible to consider additional
redundant channels associated with the commutation circuits, and then
employ the error-correction techniques discussed in Sec. II-C-2. However,
it is felt that the additional equipment required for the decoding might
result in too costly a system.
* On a qualitative basis we note that if complex blocks are replaced
then a large number of spare blocks must be available, and if small
blocks are replaced the attendant commutation circuits tend to be
more complex than the replaceable blocks. Of course various policies
are possible whereby large blocks are replaced, but repaired off-line.
Since the commutation networks can be modelled (as in Sec.
III-C-2-c) as a net of switches, it appears feasible in the context
of this model to specify additional switches beyond the minimal number
required, so that modules are still connectable in the presence of
switch failures. We have considered the problem of synthesizing such
failure-tolerant switch networks considering two types of switch
failures--permanent shorts and permanent opens--and the resultant in-
crease in cost is not severe.
For example, consider the nonredundant double-level order-
preserving switch network of Fig. III-C-14(a) with parameters $N_1 = 4$,
$N_2 = 4$, M = 2. A network containing only 8 additional switches which
is single switch-failure-tolerant is shown in Fig. III-C-14(b). It
is clear that an additional channel is required on the intermediate
level since the shorting of a switch connecting to a faulty input or
output module will cause the pertinent intermediate line to be
inoperative. On an intuitive basis we note that there exists a Moore-Shannon
(Ref. 214) type two-level single fault-masking hammock net between each
pair of input and output modules which are connectable.
A double switch-failure-tolerant network, requiring 32
redundant switches, is shown in Fig. III-C-14(c). The synthesis techniques
illustrated here can be extended to specify switch networks with arbitrary
failure tolerance, by noting that for each additional failure an
additional level, and an additional channel for each intermediate level,
are required.
There are several unanswered practical questions concerning
the applicability of the spare-path technique. Means must be provided
for the diagnosis of the commutation network in order to locate sus-
pected faulty switches. It is not clear at present what is the optimum
diagnosis policy, but it appears that the exhaustive checking of all
connection combinations in a large redundant commutation network will
result in lengthy tests. In this instance a good policy might involve
waiting for a switch failure, and then using the information thus de-
rived, concerning the location of the faulty output channel, to form
a shorter test.
FIG. III-C-14 NONREDUNDANT AND REDUNDANT COMMUTATION NETWORKS: (a) double-level, nonredundant; (b) single failure tolerant; (c) double failure tolerant (drawing TA-5580-9)
Moreover in the implementation of a redundant switch network,
possibly as a combinational net, care must be taken so that single
component failures do not result in multiple equivalent switch failures.
The problem here, however, is not as formidable as in the case of most
multiple-output networks, in that there is little dependence between
the outputs of the commutation network. An important question is, for
example, "In a network which can tolerate at least single switch fail-
ures, what multiple failures can be tolerated?" The answer to this
question would lead to designs whereby selected portions of the net-
work could operate dependently.
We have not yet considered ways to incorporate redundancy
into the cellular commutation networks considered previously. However,
it appears that provision can be specified for bypassing faulty cells
in the path-delineation phase. It is not presently clear how the by-
passing instruction can be reliably implemented.
f. Conclusions and Problems for Future Study
In this Sec. III-C-2 we have presented the results of an
initial study concerned with the synthesis of the commutation circuits
required to effect the reconfiguration of modules upon the location of
faults. This exploratory examination was primarily concerned with a
simple model where each module of a given array (called the input array)
is required to transfer data to a module of an output array, and where
spare modules are present on both arrays.
One approach to the commutation-circuit design problem is
based upon a diffusion register of speed-independent cells. This net-
work, although it is attractive because of an economical design, exhibits
several drawbacks. The operation is slow because the data is transferred
through many sequential cells, and also it appears to be difficult to
incorporate low-cost redundancy into the design.
A somewhat different approach to the problem was considered,
based upon a multiple-input, multiple-output combinational net, which
could be modelled as a network of single-pole single-throw switches,
somewhat suggestive of a telephone exchange. The problem of deriving
networks of this kind which contain a minimal number of switches was
considered; although many networks which appear to be close to minimal
were synthesized, the global minimum has not been achieved. The
solutions obtained appear to be adequate in the engineering sense,
however. Techniques were discussed for the practical implementation
of such networks and some attention was directed towards the realization
of failure-tolerant implementations. In addition, cellular implementations,
which afford simple control operation, were briefly discussed.
Many questions have been uncovered in this brief examination.
The problem still remains, for the simple model of the two reconfigurable
arrays, of determining the globally minimal switch networks for both
total and partial commutation networks. There remains also the problem
of synthesizing improved economical and reliable implementations.
3. Programmable Processing Modules
In this section we describe the detailed design features of a
modular arithmetic processor where the functions of computation, storage
and primitive control are all combined in an iterated set of replaceable
modules.
a. General Structure of a Modular Processor
The module to be described was designed to be suitable for the
composition of a reconfigurable parallel processor for arithmetic and
Boolean operations. A number of the design features of the module make
it attractive for other system uses. These are described following the
explanation of its functioning within a parallel processing unit.
Many of the logical processes in the central processor of
a bit-parallel computer are naturally realized in iterative structures
such as registers, counters, and accumulators. In collecting the logic
elements that process the various words into modules, there is a choice
to be made as to whether the collection is bit-oriented or word-oriented;
i.e., whether a module will process corresponding bits of a number of
words, or whether it will process all the bits of a single word.
Thus, if a processor serves w words of b bits each, if up to
t words are active at a given time, and if each bit has a total of s
input and output signals associated with it, the following dimensions
result from the two approaches:
Bit-oriented:
    b modules
    st leads per module for signals
    $t \log_2 w$ leads maximum@ per module for selection of active
    elements.
Word-oriented:
    w modules
    sb leads per module for signals
    1 lead per module for selection of the module.
For the word lengths and numbers of active words common in
general-purpose computers, sb is substantially greater than $t(s + \log_2 w)$;
hence the bit-oriented approach would result in modules with substantially
fewer leads. As mentioned previously, this tends to increase the
reliability of the module; it also reduces the number of switch points
in a reconfiguration network.
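The lead-count comparison above can be checked numerically. The following Python sketch is illustrative only: the function names and the sample values (w = 16, b = 32, t = 2, s = 3) are assumptions chosen for the example, not figures from the study.

```python
import math

def bit_oriented_leads(s, t, w):
    """Leads per bit-oriented module: s*t signal leads plus at most
    t*log2(w) leads for selecting the active elements."""
    return s * t + t * math.ceil(math.log2(w))

def word_oriented_leads(s, b):
    """Leads per word-oriented module: s*b signal leads plus 1 select lead."""
    return s * b + 1

def exact_selection_bits(t, w):
    """Bits needed to name any set of at most t of the w words."""
    subsets = sum(math.comb(w, i) for i in range(t + 1))
    return math.ceil(math.log2(subsets))

# Hypothetical machine: 16 words of 32 bits, 2 active words, 3 signals per bit.
print(bit_oriented_leads(s=3, t=2, w=16))   # 3*2 signal leads + 2*4 selection leads
print(word_oriented_leads(s=3, b=32))       # 3*32 signal leads + 1 selection lead
print(exact_selection_bits(t=2, w=16))      # compare with t*log2(w) = 8
```

For these assumed values the bit-oriented module needs far fewer leads than the word-oriented one, which is the point made in the text.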
These arguments are the motivation for the design example of
a reconfigurable parallel processing unit, illustrated in Fig. III-C-15.
The processor is composed of n identical functional modules, where n
may be greater than the number of digits of the words to be processed.
Adjacent modules exchange data bidirectionally, with outputs $B_{OL}$ and
$C_{OL}$ taken to the left and $B_{OR}$ taken to the right. The set of outputs
$(B_{OL}^{(1)}, B_{OL}^{(2)}, \ldots, B_{OL}^{(n)})$ and the set of inputs $(B_I^{(1)}, B_I^{(2)}, \ldots, B_I^{(n)})$
are taken in parallel, and are joined to the external system by a
commutation switch of the kind described in Sec. III-C-2. The set of
modules also communicates with an external control as follows: (1) the
$B_{OL}$ outputs of the first and last modules provide overflow and underflow
@ This result is seen as follows. In order to specify sets drawn from the w words,
each containing 0, 1, 2, ..., t words,
$\log_2\left[\binom{w}{0} + \binom{w}{1} + \cdots + \binom{w}{t}\right] \approx t \log_2 w$ bits are required.
FIG. III-C-15 A RECONFIGURABLE PARALLEL PROCESSING UNIT
status information to the control in arithmetic operations, and (2) the
control issues coded microcommands on a set of buses M, and coded register
address information on a set of buses X. Both sets go to all modules.
Within each module the X information specifies the selection of a bit
of stored information, and the M information specifies the logical
operation to be performed.
A key feature of the system is the method of reconfiguration.
If a given stage fails, means are provided for shifting its function
and the functions of all stages of higher order to the corresponding
next stages of higher order. For the data that flows between adjacent
stages, this shift is accomplished simply by logically short-circuiting
the faulty stage, under the individual command of the signals $S_1, S_2,
\ldots, S_n$. These signals are assumed to be derived from an external
maintenance control source. For parallel-access data, the commutation switch
must be designed so as to displace all bits of order higher than a given
index, for any index, preserving order in the displacement. The design
of such switches is discussed in Sec. III-C-2. This mode of reconfigura-
tion makes possible the use of all spare units, if needed, no matter
what the configuration of faulty stages, without the necessity of
providing that a given spare be capable of being switched into any
faulty position.
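The shifting mode of reconfiguration described above can be sketched as follows. This is a minimal Python illustration, not the report's design: the function name `assign_stages` and the list-of-flags representation of the shunt signals are assumptions made for the example.

```python
def assign_stages(word_length, shunted):
    """Map logical bit positions to physical stages, preserving order.

    shunted[i] is True when stage i is short-circuited by its shunt signal.
    Every function above a faulty stage moves to the next healthy stage,
    so any spare can be used whatever the pattern of faults.
    """
    healthy = [stage for stage, is_shunted in enumerate(shunted) if not is_shunted]
    if len(healthy) < word_length:
        raise ValueError("not enough healthy stages for the word length")
    return healthy[:word_length]

# Hypothetical processor: 4-bit words on 6 stages (2 spares).
print(assign_stages(4, [False, False, False, False, False, False]))
print(assign_stages(4, [False, True, False, False, False, False]))
print(assign_stages(4, [False, True, False, True, False, False]))
```

Because order is preserved, the commutation switch only has to displace all bits above a given index, rather than connect an arbitrary spare to an arbitrary position.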
In the next part of this section, the interior design of the
module will be described; and in the following part, illustrative micro-
programs will be given for several familiar logical operations.
b. Module Description
The module will be explained with reference to Fig. III-C-16.
The main data operation is accomplished by a full-adder network, labeled
$\Sigma$, with inputs a, b, c and outputs s (sum) and $c_O$ (carry). The accumulator
flip-flop A may record either s or $c_O$. The b data is obtained from one
of a set of storage elements $\beta_1, \ldots, \beta_k$, belonging to system registers
1 to k respectively, or from one of three external inputs $B_{IL}$, $B_{IR}$, and
$B_{IE}$, corresponding to the b outputs of the left and right cells and the
external system input, respectively. The k internal sources of b data
are selected by a decoder, with external inputs X and outputs $X_1$, $X_2$,
FIG. III-C-16 MODULE FOR A RECONFIGURABLE PROCESSOR
..., $X_k$.* The external inputs are selected under the commands $Y_L$, $Y_P$,
$Y_R$. Shifting operations may be accomplished by a combination of commands;
thus if $X_j$ and $Y_L$ are energized, all the bits of the j-th word will be
presented to the b inputs of the adder networks of the modules one place
to the right.
The value of the A element may be read out as an input to the
adder, a, and as the data input to the set of $\beta$ storage elements. The
c input is equal to the external $C_{IR}$ input in arithmetic operations, and
zero in vector operations. The various commands $W_B$, $W_A$, $R_A$, V, $S_c$, Y,
$Y_L$, $Y_P$ and $Y_R$ are derived from external command signals M. Finally, if
the stage is allowed to operate normally (indicated by the signal $S_h$
FALSE) the outputs $B_{OR}$, $B_{OL}$, and $C_{OL}$ are equal to B, B, and $c_O$ respectively,
while if the stage is to be shunted out these outputs are equal to $B_{IL}$,
$B_{IR}$, and $C_{IR}$ respectively. The details of the logic are given in Table
III-C-4.
Using the data and control logic described, it is possible
to construct a number of useful operations that can be programmed to
accomplish a variety of useful functions. Table III-C-5 contains a
list of such basic operations, conventionally called microoperations.
The following notation is used:
(1) A refers to the set of A-elements.
(2) $B_y(x)$ refers to a set of B-type data elements selected
according to the address index x, where x = 1, 2,
..., k for the k internal sources, x = X for the null
input, and according to the shift index y = I, L, P, R, for Internal,
Left-, Parallel- and Right-external sources. If no
y is specified, y = I is understood. Thus, for example,
$B_L(3)$ refers to the presence at each adder b-input of
the stored B bit of word 3 of the module to the left.
(3) $B_O$ refers to the set of module outputs; in active
modules $B_O = B_{OL} = B_{OR}$.
* Providing an identical decoder to all modules causes the total parts
count to be high, but the separate decoding increases reliability by
decreasing the number of module terminals, and by reducing the damage
due to a fault in a decoder. Also, it may now be noted that another
means is available for accommodating faults within single storage
elements, aside from shunting out the entire stage; that is, to
reassign the function of the entire register to which the faulty element
belongs. This may be done by changing the address code externally,
at the central control unit.
Table III-C-4
LOGIC EQUATIONS FOR A PROCESSING MODULE

Storage elements: $\beta = (\beta_1, \beta_2, \ldots, \beta_k)$; A
Data inputs: $B_{IL}$, $B_{IR}$, $B_{IP}$, $C_{IR}$
Control inputs: $X = (x_1, x_2, \ldots, x_s)$; $M = (m_1, m_2, \ldots, m_e)$
Maintenance input (for shunting stage): $S_h$
Storage selection variables decoded from X: $X_1, X_2, \ldots, X_k$

Control variables derived from M:
    $W_B$  (Write B)               Y      (Read External B)
    $W_A$  (Write A)               $Y_L$  (Read $B_L$)
    $R_A$  (Read A)                $Y_P$  (Read $B_P$)
    V      (Vector Operation)      $Y_R$  (Read $B_R$)
    $S_c$  (Sum-Carry Choice)

Intermediate variables:
    Selected internal $\beta$:  $\beta_I = X_1\beta_1 + X_2\beta_2 + \cdots + X_k\beta_k$
    Selected external $\beta$:  $\beta_E = Y_L B_{IL} + Y_P B_{IP} + Y_R B_{IR}$

Composed inputs to adder:
    $a = A R_A$
    $b = Y'\beta_I + Y\beta_E$
    $c = V' C_{IR}$

Adder outputs:
    s = Parity (a, b, c)
    $c_O$ = Majority (a, b, c)

Data inputs to storage elements:
    Element                      Change Condition    New State
    A                            $W_A$               $S_c s + S_c' c_O$
    $\beta_i$, $1 \le i \le k$   $W_B X_i$           a

Data outputs:
    $B_{OR} = S_h'\beta_I + S_h B_{IL}$
    $B_{OL} = S_h'\beta_I + S_h B_{IR}$
    $C_{OL} = S_h' c_O + S_h C_{IR}$
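The logic equations of Table III-C-4 can be exercised with a small simulation of one stage. The sketch below is a reading of the table, not the authors' implementation: the placement of complements (for example c = 0 when V is true, and the shunt path passing $B_{IL}$ to $B_{OR}$) follows the prose description of the module, and the function names, the dictionary of control signals, and the three control settings at the end are assumptions introduced for illustration.

```python
def parity(a, b, c):
    return a ^ b ^ c

def majority(a, b, c):
    return (a & b) | (a & c) | (b & c)

def module_step(A, beta, x, ctl, B_IL=0, B_IP=0, B_IR=0, C_IR=0, S_h=0):
    """One step of a single stage.

    A    : accumulator bit
    beta : list of stored bits, one per register
    x    : index of the register selected by the decoder, or None (null input)
    ctl  : dict of 0/1 control variables W_B, W_A, R_A, V, S_c, Y, Y_L, Y_P, Y_R
    Returns (new_A, new_beta, outputs).
    """
    beta_I = beta[x] if x is not None else 0
    beta_E = (ctl["Y_L"] & B_IL) | (ctl["Y_P"] & B_IP) | (ctl["Y_R"] & B_IR)
    a = A & ctl["R_A"]
    b = beta_E if ctl["Y"] else beta_I
    c = 0 if ctl["V"] else C_IR            # carry input suppressed in vector operations
    s, c_O = parity(a, b, c), majority(a, b, c)

    new_A = (s if ctl["S_c"] else c_O) if ctl["W_A"] else A
    new_beta = list(beta)
    if ctl["W_B"] and x is not None:
        new_beta[x] = a

    if S_h:                                 # stage shunted: pass neighbours straight through
        outputs = {"B_OR": B_IL, "B_OL": B_IR, "C_OL": C_IR}
    else:
        outputs = {"B_OR": beta_I, "B_OL": beta_I, "C_OL": c_O}
    return new_A, new_beta, outputs

def controls(**on):
    """Build a control dict with every signal 0 except those named."""
    ctl = dict.fromkeys(["W_B", "W_A", "R_A", "V", "S_c", "Y", "Y_L", "Y_P", "Y_R"], 0)
    ctl.update(on)
    return ctl

# Assumed control settings for three microoperations (derived from the equations).
LOAD = controls(W_A=1, V=1, S_c=1)           # A <- B(x)
AND = controls(W_A=1, R_A=1, V=1)            # A <- A . B(x), carry output selected
XOR = controls(W_A=1, R_A=1, V=1, S_c=1)     # A <- A (+) B(x), sum output selected

print(module_step(0, [1, 0], 0, LOAD)[0])
```

Selecting the carry output with the carry input forced to zero gives the logical product, and selecting the sum output gives the mod-2 sum, which is how one full adder serves both arithmetic and vector operations.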
Table III-C-5
BASIC MICROOPERATIONS FOR A MODULAR PROCESSING UNIT

Microoperation   Symbolic
Code             Operation                          Description
M1               $A \leftarrow 0$                   Clear A
M2               $A \leftarrow B_y(x)$              Load A from a specified B
M3               $A \leftarrow A \cdot B_y(x)$      Accumulate logical product of A and a specified B
M4               $A \leftarrow A \,\Sigma\, B_y(x),\ c_O$   Accumulate sum of A and a specified B, with initial carry
M5               $A \leftarrow A \oplus B_y(x)$     Accumulate mod-2 sum of A and a specified B
M6               $B(x) \leftarrow 0$                Clear a specified internal B
M7               $B(x) \leftarrow A$                Copy A into a specified internal B
M8               $B_O = B(x)$                       Read a specified B at the $B_O$ output
Table III-C-6
MICROOPERATION CODES

        B Spec               B Control   A Control
Code    Shift Index Y   X    Write B     Read A   Write A
M1      0               1    0           0        1
M2      y               x    0           0        1
M3      y               x    0           1        1
M4      y               x    0           1        1
M5      y               x    0           1        1
M6      0               x    1           0        0
M7      0               x    1           1        0
M8      [illegible]          1           0        0

Logic (Vector Operation; Sum/Carry, 1/0): [entries illegible in the source]
Table III-C-6 gives the appropriate excitations of the
internal control signals required to realize the given microoperations.
In the next part, microprograms are presented for several
familiar processes.
c. Microprograms for Common Functions
The basic microoperations given for the processing module may
be applied to a cascade of modules in obvious ways to realize the behavior
of familiar computer functional units, such as a bidirectional shift
register, a counter, and an adder. Subtraction may be accomplished by
adding the complement of the subtrahend (obtained as its modulo-2 sum with
an all-ones vector, by applying microcode M5) to the minuend, with an
injected carry, $c_O$.
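The subtraction recipe can be illustrated in a few lines of Python. This is a sketch under stated assumptions: words are modelled as n-bit integers, the complement is formed by a mod-2 sum with an all-ones constant (as in microoperation M5), and the function name `subtract` is hypothetical.

```python
def subtract(minuend, subtrahend, n):
    """n-bit subtraction by adding the complemented subtrahend plus an injected carry."""
    mask = (1 << n) - 1                      # all-ones vector of n bits
    complemented = subtrahend ^ mask         # mod-2 sum with all ones (M5)
    return (minuend + complemented + 1) & mask   # add with injected carry of 1 (M4)

print(subtract(9, 4, 8))
print(subtract(3, 5, 8))   # wraps modulo 2**8
```

The second call shows the usual two's-complement wraparound when the subtrahend exceeds the minuend.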
The structure is also obviously well suited to serve as a set
of index registers, with built-in adder for index arithmetic.
A program for arithmetic multiplication is given in Table
III-C-7. All the components of this process are held in the B registers
of a single processor, including Multiplier, Multiplicand, Product, and
Cycle Counter; hence these components must be processed serially. The
program loop has ten significant time steps. In conventional practice,
independent structures are used at least for the multiplier and the
cycle counter, with the advantage of greater parallelism, and at the
cost of many more data paths.
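The arithmetic carried out by the multiplication microprogram is the conventional shift-and-add scheme. The Python sketch below shows that scheme on n-bit integers; it is not a step-by-step transcription of Table III-C-7, and the function name and the integer representation of the registers are assumptions made for the example.

```python
def multiply(multiplier, multiplicand, n):
    """Shift-and-add multiplication of two n-bit unsigned numbers.

    The product register and the multiplier register are shifted right
    together, so the low product bits move into the vacated multiplier bits.
    """
    product = 0
    for _ in range(n):                       # one pass per cycle-counter position
        if multiplier & 1:                   # test LSD of multiplier
            product += multiplicand          # accumulate multiplicand into product
        # shift the (product, multiplier) pair one place right
        multiplier = (multiplier >> 1) | ((product & 1) << (n - 1))
        product >>= 1
    return (product << n) | multiplier       # double-length result

print(multiply(3, 5, 4))
```

In the modular processor each of these register operations is itself a short sequence of microoperations on B registers, which is why the loop takes about ten time steps.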
A program for decoding a binary vector is given in Table III-C-8.
The object of the program is to yield a 1 at the output of the j-th stage
of the processor, where j is the binary-number equivalent of the binary
vector. The calculation employs a set of vector constants c(1), c(2),
etc., stored in the B registers, as shown in the table. The program is
the sequential equivalent of $2^m - 1$ combinational functions; for example,
the function at stage i is
$$P_i = c_i(1)^{(x_1)} \cdot c_i(2)^{(x_2)} \cdots c_i(m)^{(x_m)}, \quad 1 \le i \le 2^m - 1,$$
where a given c is taken directly if its exponent is true, or complemented
if its exponent is false. For the vectors indicated, this process results
in $P_i = 1$ when $(x_1, x_2, \ldots, x_m)$ is the binary equivalent of i, as
desired.
Table III-C-7
MICROPROGRAM FOR MULTIPLICATION

Register Use
    B(1)   Multiplier*
    B(2)   Multiplicand*
    B(3)   Product
    B(4)   Cycle Counter

Program
Step    Condition†    Operation                                   Description
1                     $A \leftarrow 0$                            Clear accumulator
2                     $A \leftarrow A \,\Sigma\, B_I(3),\ c_O$    Start ring counter
3                     $B(4) \leftarrow A$                         Store count
4                     $A \leftarrow B_R(4)$                       Load and shift ring counter left
5                     $B(4) \leftarrow A$                         Store count
6                     $B_O = B_I(4)$                              Read count
7       $B_n = 1$     Exit                                        Count ended, multiplication ended
8                     $B_O = B_L(3)$                              Read product, looking right, test
$9_0$   $B_O = 0$     $A \leftarrow B_L(1),\ B_n \leftarrow 0$    Load multiplier, shifting right, set MSD = 0
$9_1$   $B_O = 1$     $A \leftarrow B_L(1),\ B_n \leftarrow 1$    Load multiplier, shifting right, set MSD = 1
10                    $B_I(1) \leftarrow A$                       Store multiplier
11                    $A \leftarrow B_L(3)$                       Load product, shifting right
12                    $B_O = B_I(1)$                              Read multiplier, test LSD
13      $B_n = 1$     $A \leftarrow A \,\Sigma\, B_I(2)$          Accumulate multiplicand into product
14                    $B_I(3) \leftarrow A$                       Store product
                      Return to Step 4

LSD = least significant digit
MSD = most significant digit
* Assume factors in place at start.
† If condition is not satisfied, skip step.
Table III-C-8
MICROPROGRAM FOR DECODING A BINARY VECTOR

Register Use
Register   Function            Initial Value
B(1)       constant c(1)       (10101010)
B(2)       constant c(2)       (11001100)
B(3)       constant c(3)       (11110000)
B(4)       constant c(4)       (11111111)
B(5)       Input vector        (0 0 0 0 0 $x_3 x_2 x_1$)
B(6)       Temporary product   (00000000)

Program
Step   Condition    Operation                        Description
1                   $A \leftarrow B_I(4)$            Load accumulator to 1
2                   $B(6) \leftarrow A$              Store temporary product
3                   $A \leftarrow B_L(5)$            Load accumulator to input vector, shifting left
4                   $B(5) \leftarrow A$              Store input vector
5                   $A \leftarrow B_I(1)$            Load first constant
6                   $B_O = B_I(5)$                   Read next input vector bit
7      $B_O = 0$    $A \leftarrow A \oplus B_I(4)$   Complement first constant
8                   $A \leftarrow A \cdot B_I(6)$    Accumulate first logical product
9                   $B(6) \leftarrow A$              Store temporary product

Repeat steps 3 - 10 twice more, replacing B(1) first by B(2), then by
B(3)
For example, for m = 3, i = 6, the binary vector for i = 6 is 011.
Then $P_6 = (0)^0 \cdot (1)^1 \cdot (1)^1 = 1 \cdot 1 \cdot 1 = 1$.
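The decoding process can be sketched in Python with each B register modelled as an integer whose bit i is the content of stage i. This is an illustrative sketch, not the microprogram itself: stages are assumed to be numbered from 0 at the right, and the names `make_constants` and `decode` are hypothetical.

```python
def make_constants(m):
    """Constant c(k+1): bit i is 1 exactly when bit k of the stage index i is 1."""
    n = 1 << m
    return [sum(((i >> k) & 1) << i for i in range(n)) for k in range(m)]

def decode(x_bits):
    """x_bits = [x1, x2, ..., xm], least significant first.

    Returns a vector with a single 1 at the stage whose index equals the input.
    """
    m = len(x_bits)
    ones = (1 << (1 << m)) - 1               # all-ones constant
    acc = ones                               # temporary product starts at all ones
    for constant, x in zip(make_constants(m), x_bits):
        term = constant if x else constant ^ ones   # complement when the input bit is 0
        acc &= term                          # accumulate logical product
    return acc

print([format(c, "08b") for c in make_constants(3)])
print(format(decode([0, 1, 1]), "08b"))      # input vector for i = 6
```

For m = 3 the generated constants agree with the initial values listed for B(1) to B(3) in Table III-C-8, and the loop uses m passes regardless of the input value.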
An alternative way of realizing this function is simply to
subtract the number 1 from the binary vector, treated as a number, until
the value zero is reached, and at each step shift a fiducial 1 digit
one stage to the left. This process would take an average of $2^{m-1}$
major steps, compared with the m major steps of the given process.
d. Other Uses of the Module
The module described has a number of features that make it
useful for realizing general logical functions. For example, suppose
it is desired to realize an arbitrary switching function of d variables
where $d \le \log_2 k$. If the truth table (with column elements $t_1, t_2, \ldots, t_k$)
for the function is stored in the like-index $\beta$ elements of the module
and the variables are applied to the X decoder inputs, the resulting
$\beta_I$ line realizes the function
$$\beta_I = t_1 X_1 + t_2 X_2 + \cdots + t_k X_k = \beta(x_1, x_2, \ldots, x_d),$$
where $X_i$ is the i-th min-term (indexed from 1) of the set of d variables; $\beta_I$
may thus be set to be any switching function of the d variables by
proper choice of the t's.
As a further enhancement, the s output of the adder may be
set to provide the function
$$s = \beta(x_1, x_2, \ldots, x_d) \oplus V'c,$$
where V and c are single, independent variables. This form makes the
module well suited to the realization of so-called "ring-sum" canonical
compositions of arbitrary switching functions.
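The use of the module as a stored truth table can be sketched as follows. This Python fragment is illustrative only: min-terms are indexed from 0 here for convenience, the ring-sum form is written as the stored function mod-2 added to V'c (a reading of the garbled formula that agrees with the adder equations), and all function names are hypothetical.

```python
def decoder(x_bits):
    """One-hot decoder: exactly one selection line X_j is 1 for a given input."""
    index = sum(bit << i for i, bit in enumerate(x_bits))
    return [int(j == index) for j in range(1 << len(x_bits))]

def beta_I(truth_table, x_bits):
    """OR of t_j * X_j over all selection lines, i.e. a table lookup."""
    X = decoder(x_bits)
    return int(any(t & x for t, x in zip(truth_table, X)))

def adder_sum(truth_table, x_bits, V, c):
    """Sum output with the accumulator input held at 0: stored function (+) V'c."""
    return beta_I(truth_table, x_bits) ^ ((1 - V) & c)

OR_TABLE = [0, 1, 1, 1]                      # truth table of x1 OR x2
print(beta_I(OR_TABLE, [1, 0]))
print(adder_sum(OR_TABLE, [1, 0], V=0, c=1))
```

Changing only the stored table changes the function realized, which is what makes the module usable as a general logic element.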
e. Problems for Further Study
The following problems for further study are evident:
(a) Study augmentations of the given design that enlarge
the range of applications.
(b) Develop efficient means for testing the module.
(c) Study means for ensuring that the stage-shunting
action may be accomplished reliably, e.g. by
fault masking.
(d) Consider designs which incorporate register bits of
more than one index.
(e) Consider module designs that incorporate more control
functions.
4. Programmable Control Units
a. Uses of Programmability in a Control Unit
With few exceptions the control units of modern computers have
a fixed logical structure. In self-repairing reconfigurable computers,
several reasons may be distinguished for making the control unit a
variable structure, subject to external programming. The major uses
for such variability are the following:
(1) To allow for failures in functional blocks by
changing the hardware address of the block employed
for a given function
(2) To allow for modification in the microsequence for
a given order, if hardware capability for that
order is lost
(3) In the control unit of a given processor in a
multiprocessor, to allow for specialized operation
of the processor by assumption of a special order
code
(4) To accommodate faults within the control unit itself.
In the first three uses, the variability in operation could be
achieved at the program level; but providing it in the control unit
permits greater compactness in the order code, or higher normal speed
of operation, or both. The benefits of the fourth use depend upon how
the reliability of the unit is affected by the added equipment needed
for the programmability.
In the next part, several methods for achieving such program-
mability will be discussed.
b. Approaches to the Structuring of Modular Programmable
Control Units
Fixed-function control units are usually quite complex in
structure. The criteria of feasibility and modularity suggest the
use of a high degree of regularity in structure. In this part, three
approaches will be considered that emphasize such regularity. The
modularization of control units is currently a subject of widespread
investigation, and the schemes described should be taken only as
examples of possible approaches.
i) Control Based Upon a Microprogram Memory Store
The well-known microprogram structure Sa° for control is
well suited for realizing a programmable control unit. It employs an
addressable memory, the contents of which are called a microprogram,*
and the state of the central unit is defined by the memory word currently
selected. Each such word carries information specifying (1) an output
excitation, and (2) the address of the next word (state) or words which
may be its successor in a program sequence. The output excitation includes
both the specification of functional units (e.g., registers) and
of function (e.g., shift operation). In operation, the code for a given
machine order is used to address and retrieve a stored microorder;
thereafter the sequence of accesses is self-sustaining. The result is
the production of an arbitrary sequence of control signals that
implement the machine order.
* Early advocates of this scheme proposed that the stored microprogram
be alterable, but almost all realizations have employed permanent-
storage memories. The present discussion, of course, assumes variable
storage. An interesting scheme combining fixed and variable stores
has been proposed by Grasselli. 112
Several design approaches may be followed to provide for
branching within the sequence. One known scheme is to add special logic
to the access switch, so that when the address specified by a branching
instruction is applied to the access switch, the memory line selected
depends upon the state of some external logical variables.
The following scheme (which is believed to be original)
does not require any augmentation of the access switch.
Let the address be the base-2 number specified by the
m-bit vector $(X, Y) = (x_0, x_1, \ldots, x_{a-1}, y_0, \ldots, y_{b-1})$, with the x digits
having lower significance; and in each word, in addition to the X and
Y segments, let there be a bit B, which, if true, signifies that the
word is a branch point; then,
(1) in a nonbranch word, the X digits are
interpreted as the least significant
digits of the address of the successor
word, while
(2) in a branch word, the X field is interpreted
as a mask upon the external control
variables, such that if $x_i = 0$, the i-th
address digit is 0, while if $x_i = 1$, the
i-th address digit is the i-th control
variable, say $z_i$.
For example,
let (X, Y) = (0101, 10110),
for B = 0, next address = (0101, 10110), while
for B = 1, next address = ($0\,z_2\,0\,z_4$, 10110).
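The next-address rule can be sketched in Python as follows. This is a minimal illustration of the masking rule stated above; the function name `next_address` and the tuple-of-bits representation of the X, Y and z fields are assumptions made for the example.

```python
def next_address(X, Y, B, z):
    """Compute the successor address from a microprogram word.

    X, Y : tuples of address bits stored in the current word
    B    : branch bit of the current word
    z    : tuple of external control variables, one per X digit
    """
    if B:
        # Branch word: X acts as a mask on the control variables.
        X = tuple(x & zi for x, zi in zip(X, z))
    # Nonbranch word: X is used directly as the low-order address digits.
    return tuple(X), tuple(Y)

word_X, word_Y = (0, 1, 0, 1), (1, 0, 1, 1, 0)
print(next_address(word_X, word_Y, B=0, z=(1, 0, 1, 1)))
print(next_address(word_X, word_Y, B=1, z=(1, 0, 1, 1)))
```

With the mask 0101 only the second and fourth control variables influence the address, so this word can branch four ways without any change to the access switch.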
By this scheme, up to $2^a$-way branching is possible at a
given step; however, if two branches refer to the same successor, at
least one will have to pass by way of an intermediate nonbranching step,
which will have a full range of addressing. A limitation of the scheme
is that as the number a of external variables increases, the number $2^{m-a}$
of branch points decreases. One way of extending the number of usable
external variables would be to decode the X variables so as to select
one of $2^a$ external variables as a single binary condition. This method,
with some refinements, is described by Kampe. 144 A general structure
covering these variations is illustrated in Figure III-C-17. Data paths
for modifying storage are not shown.
It is clear that various schemes may be devised that do
not require use of a special access switch; this means that it is
possible to use main memory as a backup for a microprogram store, in
the event that part of the microprogram store is lost because of a
permanent failure.
Another attractive way of accommodating faults in the
store is to use an associative memory. Such a memory provides for
relocation of words within memory without change of address code; but
of course the given fault must be localized to a few words in its
effect in order for relocation to be useful.
Finally, it may be observed that all the methods of
error control for memories, such as error-correcting codes, may be
employed to increase the reliability of the control unit.
One of the main limitations of this approach is that
large numbers of branch conditions, and complex branch conditions, are
not handled with great flexibility. Further development of the approach
should seek to increase this flexibility.
FIG. III-C-17 A MICROPROGRAM CONTROL UNIT
2) Control Using a Programmable Cellular Network
The use of a network of logic elements clearly provides
more complexity of logical operation than does a similar number of
elements in a memory structure. Recent investigations (by S. Wahlstrom
and others) at Stanford Research Institute have indicated the practical
feasibility of building logic networks with substantial variability in
function. These investigations are part of a general study of cellular
networks, i.e., networks of logic modules, having a uniform, primarily
neighborly interconnection structure. 210, 211, 212, 285 The approach
considered provides for storage of information within the cells of such a
logic array. This information would specify both the logical function
performed by the cells, and the choice of particular connections of a
cell to its neighbors and to signal buses, from among the available
connections.
An example of such an array is illustrated in Fig. III-C-18.
The light lines indicate the signal paths available at each cell, the
heavy lines indicate the particular paths that are active in the illustrated
design, and the dashed lines indicate the paths used to program the array;
the x variables are the inputs, the F variables are the outputs, and the
letters a, b, ..., i represent the logic functions realized by the
FIG. III-C-18 A PROGRAMMABLE CELLULAR LOGIC NETWORK
individual cells. In the arrays studied thus far, typical functions are:
a set of combinational functions of two or three variables; a single
universal function (e.g. NOR) of six or seven variables; or a single
flip-flop.
A more detailed view of a cell is given in Fig. III-C-19.
The information stored in box f controls the functions performed by a
multifunctional logic network N, and the information stored in boxes
t, u, v and w controls the connections of the cell terminals to the
inputs and outputs of the network N. Signal paths for the introduction
of program data are indicated by dashed lines. Clearly, program data
can be designed to compensate for faulty cells.
The number of storage elements needed for control of a
cell is substantial, but with the advent of microelectronic arrays of
high component density, the cost of programmability may not be prohibi-
tive. The most natural form of storage would be flip-flops, which would
allow the use of the same technology as the controlled circuits. This
has the possible disadvantage of volatility of information with loss of
power, but it would seem to be quite a straightforward matter to record
the program state of a network in the main nonvolatile system memory.
This example is meant only to illustrate the basic ideas,
since there are many possible variations in cell functions, in inter-
connection structures, and in the means of introducing program data.
The general design problem, of course, is to develop arrays that have
FIG. III-C-19 DETAILS OF A PROGRAMMABLE CELL
a good combination of flexibility and economy. Additional design
problems of special relevance to reliability are
(1) The design of arrays that are easy to test
and diagnose
(2) The design of arrays that provide for avoidance
of faulty cells and connections with minimum
sacrifice of nonfaulty elements.
It is to be expected that cellular networks are, in
general, easier to diagnose and reconfigure than noncellular networks.
3) Control Based upon a Network of "Universal" Logic
Modules
Advances in microelectronics have resulted in an increase
in the potential complexity of prefabricated networks. Since the number
of possible switching functions grows exponentially with the number of
variables, a serious problem of standardization of fabrication arises
because of the large number of different networks that are needed to
realize arbitrary functions.
It is well known that a given m-input combinational net-
work may be used to realize a number of functions of the m variables,
by permuting and complementing the input variables at the terminals.
If it is permitted to tie terminals together or to apply constants
arbitrarily, the number of independent input variables, say n, is of
course less than m; but the fraction of the total number of possible
switching functions of the n variables that may be so realized is poten-
tially greater than the number that may be realized by permutation and
complementation at an n-input network. It is not at all obvious which
networks offer the greatest flexibility for such realizations. In
particular, it would be very useful to have a network which provides all
such functions for some appreciable number of variables. Kautz has
suggested the problem of finding universal logic modules (ULM's). These
are defined as follows:
Consider a (combinational) logic net with m in-
put terminals and two (complementary) output
terminals. The m input terminals may be connected
freely to any of 2(n + 1) source wires, carrying
the variables $x_1, x_1', \ldots, x_n, x_n'$ and the constant
signals zero (0) and one (1), respectively.
If under arbitrary connections of this sort the
output terminals produce all n-variable Boolean
functions (and their complements), f, f', then
we refer to the net as an (m,n) ULM. The principal
questions associated with such nets are:
(1) Find ULM's with m = minimum for n up to, say,
4 or 5.
(2) Determine the dependence on n of the minimum
m = M(n).
(3) Alternatively, find good estimates (upper
and lower bounds) on M(n).
This problem is currently under investigation by Kautz,
Elspas, and Stone at Stanford Research Institute under Institute
sponsorship. The function $F_3(a, b, c) = a \oplus bc$ is readily seen to be universal
for two variables. Investigations have disclosed that all functions
of three variables may be obtained from the function of five variables
!
F 5 (a, b, c, d, e) = e F 4 (a, b, c, d,) + e (abc),
where $F_4$ is the Harvard function of index 87. $F_4$ may be represented by
the set of min-term indexes (0, 1, 2, 5, 11, 12), or by the expressions
= + C + dca = a (dcb' + d'c'b).F 4 b'a' (d' ') '
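The statement above that a three-input function is universal for two variables can be checked by exhaustive search. The Python sketch below assumes the function is $F_3(a, b, c) = a \oplus bc$ (the operator is garbled in the scan, and the mod-2 sum is the reading adopted here), and it applies the ULM definition directly: each input terminal may be tied to a variable, its complement, or a constant. The helper names are hypothetical.

```python
from itertools import product

def f3(a, b, c):
    return a ^ (b & c)

def realizable_functions():
    """Truth tables of all two-variable functions obtainable from f3."""
    assignments = list(product((0, 1), repeat=2))        # all (x1, x2) pairs
    sources = [
        tuple(0 for _ in assignments),                   # constant 0
        tuple(1 for _ in assignments),                   # constant 1
        tuple(x1 for x1, _ in assignments),              # x1
        tuple(1 - x1 for x1, _ in assignments),          # x1'
        tuple(x2 for _, x2 in assignments),              # x2
        tuple(1 - x2 for _, x2 in assignments),          # x2'
    ]
    found = set()
    for a, b, c in product(sources, repeat=3):           # every wiring of 3 terminals
        found.add(tuple(f3(ai, bi, ci) for ai, bi, ci in zip(a, b, c)))
    return found

print(len(realizable_functions()), "of", 2 ** 4, "two-variable functions realized")
```

The same brute-force approach becomes expensive quickly as n grows, which is one reason bounds on M(n) are of interest.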
Minimal functions for more than three independent variables
are not known, but upper and lower bounds have been investigated. The
following limits are known for M(n), the minimum m for an (m,n) ULM
for the next few values of n:
$6 \le M(4) \le 8$,
$10 \le M(5) \le 18$, and
$17 \le M(6) \le 37$.
If the value of M(n) proved to be excessive for the n of
interest in a particular design, it would also be useful to have a small
set of modules that together would provide coverage of all switching
functions of n variables, or even of a large subset of all the functions.
For the realization of a reconfigurable control system
for a computer, it would be necessary to provide controllable means
for permuting, complementing, and busing the inputs to a ULM. A
programmable universal logic module (PULM) could then be composed, as
illustrated in Fig. III-C-20, of an (m, n) ULM (labeled U), fed by
an n-input, m-output connection network (C), which can be programmed
by an external input. The connection network itself must be rather
complex, and it is not inconceivable that using a larger value of m
than M(n) may result in a more economical overall design.
For the large number of variables found in a modern
control unit, it is clear that a number of PULM's would have to be
combined in some larger network. For maximum reconfigurability, the
interconnections within that network should have some degree of
programmability, both for modification of the control functions and
for the replacement of faulty PULM's. Such a control system is sketched
in Fig. III-C-21.
Several completely open questions pertaining to the
design of such a system, in addition to the problem of ULM minimization
already mentioned, are as follows.
(i) The design of the internal connection
network C
(2) The design of the external connection
network N to achieve a useful degree
of flexibility
(3) Suitable means for incorporating memory
within the overall structure.
The present discussion has been concerned with combina-
tional logic networks. The notion of universality may also be applied to
sequential networks, and, practically, it would also be desirable to
have modules with flexible, even if less than universal, state behavior.
This topic is currently being studied by a number of investigators. 229, 285
The design of simple multipurpose combinational-sequential modules has
also been discussed by Ledley. 177
FIG. III-C-20 LOGIC MODULE BASED ON A PROGRAMMABLE
UNIVERSAL LOGIC MODULE
FIG. III-C-21 RECONFIGURABLE CONTROL UNIT BASED UPON
PROGRAMMABLE UNIVERSAL LOGIC MODULES
IV CONCLUSIONS AND RECOMMENDATIONS FOR FUTURE STUDY
A. Conclusions
In discussing the conclusions of the report it is convenient to assess
our results on the basis of the goals of the study set forth in Sec. I-A,
which can be summarized as follows:
(1) To examine the known techniques for reliability improvement
to determine their adequacy for the achievement of
long mission life.
(2) To conceive and evaluate new schemes of system design
and operation that offer promise of advancing the state
of the art.
(3) To recommend future directions of research which will
aid in the improvement of present techniques and also in
the development and realization of new schemes.
Each of these three areas received significant attention during the
course of the study as reviewed below:
(1) In the examination of known techniques, the approach
taken was to distinguish those sections of a hypotheti-
cal spaceborne computer to which distinct problems of
reliability apply, and to assess the utility of existing
reliability enhancement techniques in the light of space-
borne requirements and of advances in device technology.
Most of the logical design concepts previously examined
have applied to static error control--in particular, to
fault-masking techniques. Several significant analytical
problems have not been satisfactorily solved; these
problems involve the optimum application of the techniques,
and the estimation of the reliability improvement realized
by the techniques. However, the results which have been
obtained indicate that the exclusive application of present
static error-control techniques cannot lead to designs
which achieve the required mission life under the severe
constraints of the spaceborne environment. We conclude,
however, that fault masking techniques are very useful
for the protection of limited crucial functions, and that
existing known error-detection schemes are useful for
diagnosis, e.g., as illustrated by the use of error-
checking codes for arithmetic processors or for sequential
circuits.
(2) In the study, a number of error-control processes were
distinguished that, if implemented reliably, are capable
of realizing substantially higher reliability than can be
achieved by fault masking alone. These schemes relate to
dynamic error control, in which a computer is subject to
reconfiguration in structure and program. This mode has
been discussed in recent literature, but the study has found
a substantial lack of design knowledge appropriate to the
practical realization of a computer exhibiting a capability
of automatic diagnosis and reconfiguration. Indeed, of
primary concern here is a new viewpoint on the overall
design of a computer system, including the design of its
structure and the coordination of the various maintenance
and computational processes.
In summary, it is suggested that in order to achieve the highest
levels of reliable performance, an advanced spaceborne computer will
need to have the following structural features to a high degree:
parallelism of logical operation, modularity and programmability of
functional modules, regularity and programmability of interconnection,
and autonomous capability for fault diagnosis and the control of re-
configuration. A number of error-control techniques, both fixed-
structure and variable-structure, will be needed to enhance the reliability
of basic functional units. These needs are summarized in the next section.
In addition to the general conclusions stated above, many conclusions
have been derived concerning particular aspects of the analysis and design
of various reliability schemes. The reader is referred to the individual
sections for detailed discussions of these conclusions.
B. Summary of Needs for Technique Development
We have distinguished the following major needs for further develop-
ment of reliability techniques:
(1) Fault-masking techniques represent the area of technical
interest that has received the most attention previously,
ranging from the protection of simple contact networks to
the construction of very complex adaptive networks that can
tolerate a wide variety of internal failures. However, it
still remains difficult to actually calculate the probability
of failure of any but the simplest network structures. Thus
new techniques are needed to facilitate the analysis of complex
fault-masked networks. An example of the more general analytic
techniques that are needed is the consideration in Sec. II-A-2
of comparative advantages between tree-like compositions of
multiple-output switching functions and compositions that
minimize the number of outputs that are affected by a given
element failure. There is a great need for further investigations
of this type for different network structures and
for different probabilities of failure amongst the elements.
(2) Most failure analyses assume independence among different
faults. This assumption is made in order to yield an
analytically tractable model, but it is usually unrealistic
from a practical point of view.
(3) An immediate consequence of this last point concerning the
independence of failures is the need to understand the design
of systems wherein it is true that faults are correlated only
over a relatively small and easily definable range of elements.
For example, in a modular system, it is desirable to constrain
fault propagation to within the module that first suffers a
fault. If this can be done, then the assumption of fault
independence between modules is tenable. One implication of
this desire to minimize fault propagation between modules is
that such systems will probably minimize communication between
modules; i.e., the modules will be sufficiently complex that
much useful computation can take place entirely within each
one, with communication between them limited as much as
possible to summary-type information that is highly protected.
This requirement on intermodule independence must also involve
the environmental facilities such as power supplies, radiation-
protective devices, and the like. Thus a power-supply failure
that results from a fault in a given module cannot be allowed
to disrupt all the other modules as well. Much work is thus
needed in extending redundancy techniques to the protection
of these peripheral factors.
(4) Another consequence of the desire for modularity and
reconfigurability is the need for the development of the family
(or families) of modules themselves that simultaneously meet
the various system requirements imposed upon them, including:
(a) Compatibility with the dimensions and constraints
in logical and topological structure characteristic
of modern semiconductor technology, especially
that of large-scale integrated networks.
(b) Sufficient complexity to allow a considerable
amount of calculation to take place entirely
within the confines of a given module.
(c) Sufficient flexibility for the same module to
be usable for several different computational
tasks depending on its particular assignment
or reassignment within the system, and for the
family of modules to be complete--i.e., so that
all tasks can be accomplished by the set.
(d) Sufficient accessibility for the modules to be
easily diagnosed on the advent of trouble, and
easily switched from one point in the system to
another.
(e) Suitable scaling of complexity so that each
replaceable module is a small enough fraction
of the overall system for the redundancy ratio
required to be minimized.
As seen in Sec. III-C-3, for the case of regularly structured
functions (e.g., arithmetic or memory) such modularity is not
hard to achieve. For more complex, irregular functions
(e.g., control) several new approaches may be seen, but more
work is needed to determine the best approach and to develop
practical designs.
(5) The problem of diagnosing modules only emphasizes the more
general fault-diagnosis situation. The formal fault-diagnosis
requirements are understood (and have been reported in this
document), but we are still a long way from understanding
approaches that are practical, particularly for systems
as large as the modules of a reconfigurable system will
probably be. As well as more efficient techniques for large
combinational networks, we need more work in serial testing
and in the design of efficient diagnosers themselves.
(6) Finally, the problem of the design of networks that are
intended to be easily diagnosable has only recently been
introduced, either in this report or in the general
literature. Much further work in the utilization of
auxiliary terminals for monitoring purposes as well as for
test inputs is needed. Also the diagnosis of sequential
machines, in general, remains almost completely an "art
form" for any but the simplest machines.
(7) A great deal of work is needed in the programming,
whether by software or hardware means, of systems that
can autonomously (and efficiently) change their program
mix in response to external problem demands, as well as
in response to gradual failures within the modules. Such
approaches should include the accommodation of problems
of highest priority first--resulting in some tasks simply
not being handled--as well as arrangements that involve a
selective degradation of the problems that are handled--
whether involving a decrease in accuracy, or an increase
in solution time.
(8)
(9)
(10)
(11)
The role of the new generation of integrated circuits must
be more fully evaluated, both in terms of the realization
of specific redundancy techniques that are appropriate for
them, and in the opening up of new possibilities they
imply because of the changing cost factors involved--e.g.,
the diminution of the total number of components as a
prime contributor to expense. Again, with regard to
modularity, it is important that these new components be
designed hand-in-hand with the design of modules that meet
the previously summarized requirements.
The need for better means of calculating the reliability
of static redundant systems has been mentioned, and the
need also exists, and is probably compounded in the case
of dynamic systems. Work is thus required on analyzing
the reliability of systems under different rules of re-
configuration, and under different reliability assumptions
concerning the switching system itself by means of which the
dynamic reconfiguration is actually determined and carried
out.
The design of reliable and efficient interconnection
switching systems for the reconfigurable spaceborne com-
puter remains an unsolved problem area. Some specific
designs were discussed in Sec. III-C, but much work remains
in the achievement of the goal of flexibility of inter-
connection in a design which is itself failure-tolerant,
and also in the design of control modules by which the
decisions are made.
All the various techniques that have been mentioned for
module design and error-control procedures, both in equip-
ment and in programming realizations, must be coordinated
in practical system designs. A number of approaches have
been suggested in the literature which differ in the extent
of the control of diagnosis and reconfiguration that is
invested in special equipment and in the balance of external
and self diagnosis for subsystems. New approaches need to
be considered and evaluated.
A number of special problems are found in the design of
memory systems. Several well-established schemes exist
for data-channel protection, and a number of potentially
useful schemes have been suggested for access-switch pro-
tection, but because of the close interaction of physical
and logical design in memory systems, there is a need to
test various schemes by carrying out complete designs.
(12) A number of possible applications have been noted for
the use of magnetic-logic networks, in which the high
reliability of this technique and its retention of
information on loss of power may be exploited
without intolerable reduction in system speed. There
is a need for detailed analytical and experimental
work to verify the validity of this preliminary view.
(13) Finally, the spaceborne system will in general be serviced
by some sort of radio link with either the ground station,
or with a "mother ship" control station. However, the role
that such a link can play is widely variable, depending upon
the distances involved, the time available, and the specific
problem mix as a function of time into the mission. Much
further analytical work needs to be done in determining first
the range of mission characteristics, and then the relative
roles to be played by the ground and spaceborne stations
with regard to diagnosis, backup computation, control,
reconfiguration specification, idle-time discourse, and storage
of programs.
C. Summary of Suggested Problems for Future Research
Detailed suggestions for future research are presented in the various
sections of this report. In this section we give brief summaries of these
suggestions, listed by sections and appendices.
(1) Consider variations on coding and adaptive logic schemes
to include redundant outputs and integrated restoration
(Sec. II-A-I).
(2) Develop improved computer-aided techniques for analysis
of complex restored nets and methods for globally opti-
mizing placement of restorers; extension of model to larger
classes of fault types (Sec. II-A-2-a).
(3) Develop more economical hybrid fault-masking switchover
realizations, and provide for noise insensitivity; extend
to multiple-output networks; and incorporate fault masking
in the switching networks (Sec. II-A-2-b).
(4) Develop efficient realizations for high-order threshold-
function networks using NOR elements (Sec. II-A-2-c).
(5) Develop techniques for applying the parity-check and state-
weight types of redundant-state encoding for error detection
in a range of useful sequential networks (Sec. II-A-3).
(6) Develop codes and encoders that allow efficient, fault-masked
instrumentation of transmission-type error-control codes
(Sec. II-B-1).
(7) Develop more easily instrumented arithmetic codes for error
detection and location, investigate possible improvements in
residue coding, and compare alternative arithmetic checking
schemes by detailed designs (Sec. II-B-2).
(8) Develop a framework for the design of maintenance programs
that are well coordinated with hardware-maintenance processes
(Sec. III-A-2).
(9) Develop and evaluate new schemes of system organization for
maintenance and general computations, especially those suited
to polymorphic (multi-processor) structures; develop techniques
for coordinating the flow of various data and control
information; specify subsystems so as to achieve high
modularity (Sec. III-A-3).
(10) Develop improved techniques for fault diagnosis of large
multiple-output combinational networks, and of important
types of sequential networks; determine good means for
utilizing test points; develop techniques for including
ease of diagnosis in the original design of a network
(Sec. III-B).
(11) Develop efficient means for control of commutation networks
and for avoidance or masking of faults within the network;
develop and evaluate practical path-seeking cellular
interconnection arrays; extend present investigations to
multiposition switching (Sec. III-C-2).
(12) Develop and evaluate more powerful logic modules; incorporate
aids for fault diagnosis; incorporate fault masking for
crucial functions (Sec. III-C-3).
(13) Develop and evaluate new schemes for realizing programmable
control units, especially to incorporate complex functions
of a large number of variables; develop schemes for
microprogram control to incorporate memory backup and branching
(Sec. III-C-4).
(14) Develop and evaluate schemes for logical error control of
memory access-switch failures; investigate error-control
needs of special types of memory systems (e.g., associative,
fixed); study the interaction of logical error-control
techniques and physical design techniques (Appendix A).
(15) Develop practical designs for distributed power-supply systems;
investigate feasibility of magnetic switching (Appendix B).
(16) Carry out detailed analytical and experimental evaluation of
proposed all-magnetic logic network schemes (Appendix C).
Appendix A
ERROR-CONTROL TECHNIQUES FOR MEMORY SYSTEMS
1. Introduction
The primary effort of this task is to assess the manner of adding
redundancy to the main memory subsystem of a spaceborne digital com-
puter and the potential gain in doing so, and to distinguish those
areas where more work is needed. Special memory types, such as per-
manent memories and associative memories are not included in this
review. The techniques described are generally applicable to these
memory systems, but special techniques may be advantageous for those
types.
Part 2 of this Appendix presents the model and the assumptions
made in the several analyses. Part 3 examines a redundancy scheme based
upon replication of whole memory modules. Part 4 examines several
schemes for error control applied to the bit-channel, word-select, and
supply and control sections of a memory module. Part 5 describes the
logical design of a parallel encoder-decoder and Part 6 presents the
conclusions and recommendations of this study.
This study has emphasized the present state of the art of memory-
error control; thus the quantitative estimates for the benefits and
costs of the particular schemes described are based upon the use of
off-the-shelf components. The main effect of expected future reductions
in the size of logic components will be to increase the feasibility of
schemes involving complex operations on data and address information.
2. General Discussion of the Problem
The primary function of the main memory subsystem is to accept data,
address and control information, store that data in a specified location
(word register) for an indefinite time, and return it error free upon
demand to the parent system. This basically simple function establishes
a requirement for four separate functional units within the system. The
data section receives, stores and delivers information, one word at a
time. The access section controls the selection of the word location
being processed. The cycle control section encompasses the sequential
control of the signal sources that accomplish the reading and writing pro-
cesses. The support section supplies operating power and thermal control.
The information and control signals for these functions appear on
a number of busses. The data bus carries the data bits (of which it is
assumed there are b) in a data word to (and from) the main memory sub-
system (MMS). The address bus carries the address information during
either the store or the fetch cycle to specify the address or location
of the word register desired. The cycle control leads provide the
timing and control signals for the individual steps of the store and
fetch cycles. During the store cycle a valid address must be on the
address bus and valid data on the data bus. The specified word register
is then cleared to all zeros and the data word is copied into that
register. The fetch cycle requires only an address. The contents of
the specified word register are copied out onto the data bus and then
rewritten into the word register without change.
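The two cycles described above can be illustrated with a minimal functional sketch. It is not a model of any particular hardware; the class name `MainMemoryModel` and its methods are hypothetical and are introduced only to show the clear/write and read/restore sequences.

```python
class MainMemoryModel:
    """Toy functional model of the store and fetch cycles (illustrative only)."""

    def __init__(self, words: int, bits: int):
        self.mask = (1 << bits) - 1      # keeps data within b bits
        self.registers = [0] * words     # one word register per address

    def store(self, address: int, data: int) -> None:
        # Store cycle: the selected word register is cleared to all zeros,
        # then the data word on the data bus is copied into it.
        self.registers[address] = 0
        self.registers[address] = data & self.mask

    def fetch(self, address: int) -> int:
        # Fetch cycle: the contents are copied out onto the data bus
        # and then rewritten into the word register without change.
        data = self.registers[address]
        self.registers[address] = 0      # assumed destructive readout
        self.registers[address] = data   # restore half-cycle
        return data
```

A fetch therefore leaves the addressed word register unchanged, so repeated fetches of the same address return the same word.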
The power supply is required to produce a minimum of three forms
of power: one for write drivers, another for read drivers, and at least
one other form for logic circuits. The term environmental control is
used to refer to any necessary sensing and compensation for temperature-
sensitive elements.
The overwhelming majority of the physical storage techniques for
the data sections of present-day main working memories involve some form
of magnetic storage elements. All of these techniques have in common a
strong, complex intermix of circuits with both logic-level binary signals
and analog signals. Furthermore, the analog signals exist in close
proximity both at the driver power level and at low (near-noise) levels.
Much of the art of memory design is in where and how these circuits are
mixed. The wiring itself is an art and must take into account the
electrical and mechanical properties of the wire, as well as the winding
patterns. The circuit problems are difficult and many. The success
of the memory depends on attention to circuit and mechanical design
details and the behavior of the parameters of magnetic material. In
this Appendix, we consider the data section of the main memory subsystem
as a single, asynchronous entity with error-control capability independent
of the rest of the computer system. Integration of internal and external
error-control schemes is an important problem for future study.
We wish to discuss here the techniques of redundancy for error
protection separately from techniques of good design. To do this we
need a simple functional model of the MMS. The model we will use is
based on the connections to the parent system as shown in Fig. A-1.
[Figure: main memory subsystem with connections labeled address bus,
cycle control, power and environment, data input bus, state of cycle
control, and data output bus]
FIG. A-1 FUNCTIONAL CONNECTIONS BETWEEN MAIN MEMORY SUBSYSTEM
AND COMPUTER SYSTEM
The data bus will be gated to a single data register; thus each bit of
any one data word going to and from the main memory system will pass
through this data register. All of the equipment for a single bit
is called a bit channel. Hence a bit channel includes the gating for a
single bit from bus to register, the gating from register to digit driver,
the digit driver itself, the array of storage cells for that bit, the
digit sense amplifier, the gating back to the register, and the gating
from register to output bus.
The model includes an address register, gated to the address bus,
word drivers for both reading and writing, and address decoders for the
selection of individual word registers. All of the necessary parts of
this portion of the MMS which are required to excite or select a single
word register and make it receptive to digit drive or make it excite a
digit sense amplifier are called an access circuit. The number of
leads threading an excited or selected word register from the access
circuitry is generally what determines the so-called dimensionality of
a memory. The model is receptive to only two types of cycle control--
store (clear/write) and fetch (read/restore). It is assumed to operate
completely asynchronously, i.e., once either cycle is started it will
go to completion in a time referred to as the cycle time, and return
the subsystem to a state of readiness to initiate either type of cycle.
The intersection of a bit channel and an access circuit is referred
to as the bit cell or storage element. The present model is sufficiently
general to represent the wide variety of storage elements available for
spaceborne computers. These include single- and multiple-aperture
ferrite cores, orthogonal-aperture ferrite cores (biax), thin planar
magnetic films (bicore) and thick or thin circumferential films (plated
wire). Monolithic-semiconductor memories do not require the conversion
from logic level to drive level nor the conversion from sense level to
logic level, since they do not exhibit the very large attenuation between
drive and sense analog levels in the bit cell.
There are three primary classes of redundancy techniques which will
be considered when comparing models. They are (1) circuit redundancy,
achieved by using series and parallel connections of components,
(2) logical fault masking, and (3) dynamic error control, achieved by
sequential fault detection and active switchover to a nonfaulty unit.
It should be noted that circuit redundancy may be applied, practically,
in only a limited number of places due to the reduction in circuit margins
which usually results. Hence, the emphasis in this section is on logical
fault masking and dynamic error control.
There are a number of criteria which could be used for the comparison
among various proposed memory organizations. For spaceborne computers,
the most important of these are (1) power consumption, (2) weight, and
(3) the system success probability as a function of failure of a functional
part within the subsystem.
To develop these criteria for comparisons of different organizational
schemes, we will assume four memory sizes: (1) 256 words × 8 bits/word;
(2) 256 words × 24 bits/word; (3) 4096 words × 8 bits/word; and
(4) 4096 words × 24 bits/word. The storage elements are assumed to be
0.03 × 0.018 inch single-aperture ferrite cores. The access is assumed
to be of the coincident-current type, with both ends of the drive lines
available for the extensively used, so-called "switch-sink" type of word-
register selection. This selection scheme requires 4 wires per core.
Data is transferred in and out in parallel, and a cycle time of 1 micro-
second is assumed.
Under these assumptions, for w words in the memory with b bits per
word, there will be (b + log2 w) register cells for data and address
registers, (4b + 2 log2 w) transfer gates to move information into and
out of the registers (in both directions in the case of data registers),
b digit drivers, b sense amplifiers, b × w cores, and finally not less
than 4w^(1/4) switch-sink line drivers. These assumptions establish the
numbers of primary components.
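As a check on these counts, the following sketch evaluates them for a given memory size. It assumes that the register-cell and transfer-gate counts depend on the number of address bits, log2 w, which is what the tabulated figures imply (for example, 16 register cells and 48 transfer gates for 256 words of 8 bits). The function name and dictionary keys are hypothetical.

```python
import math

def primary_component_counts(w: int, b: int) -> dict:
    """Primary component counts for a w-word, b-bit coincident-current memory."""
    # Assumption: the address register holds log2(w) bits, rounded up.
    address_bits = (w - 1).bit_length()
    # Smallest integer r with r**4 >= w, used for the line-driver lower bound.
    root4 = math.isqrt(math.isqrt(w))
    while root4 ** 4 < w:
        root4 += 1
    return {
        "register_cells": b + address_bits,
        "transfer_gates": 4 * b + 2 * address_bits,
        "digit_drivers": b,
        "sense_amplifiers": b,
        "cores": b * w,
        "min_line_drivers": 4 * root4,   # "not less than" bound only
    }
```

The line-driver figure is a lower bound; a particular design may use more drivers than this minimum.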
On the basis of an examination of currently used components, we
have assumed a 30 × 18 mil ferrite core and 1 inch of #30 copper wire
per core (12 milligrams), and 40 milliwatts per μs/cycle. For these values,
the weight of the copper overshadows that of the ferrite. The weight
of the frames and other packaging components is assumed to rise
approximately in proportion to the number of cores. If each core in the
word switches every cycle and the store is cycled at the maximum rate,
b × 40 mW will be dissipated in the store. (This, of course, does not
account for the power dissipated in the half-selected cores.)
For register cells (RC) and transfer gates (TG) we have assumed the
Fairchild Semiconductor DTμL line of integrated circuits and a popular
5-flatpack "mother board." Sense amplifiers (SA) and driver switches (DS)
are assumed to be gated, and operate with a duty factor of 0.5. Further-
more, no more than b + 4 drivers can be active simultaneously, i.e.,
b digit (inhibit) drivers and 1 switch and 1 sink each for X and Y word
selection.
These assumptions yield:
                                       Weight     Power
Register cell                          2 grams    80 milliwatts
Transfer gate                          1 gram     4 milliwatts
Sense amplifier (with threshold R)     4 grams    360 milliwatts × 0.5 = 180 mW
Driver/switch (with terminating R)     4 grams    800 milliwatts × 0.5 = 400 mW
These assumptions yield the following relative power and weight
figures for the basic irredundant memory sizes which are to be used as
standards for comparison.
Words ×                  Weight     Power                Core weight   Core power
Bits/Word     Part         (g)       (mW)       Cores        (g)          (mW)
256 × 8       16 RC         32      1,280       2,048        24.5          320
              48 TG         48        172
               8 SA         32      1,440
              24 DS         96      4,800
              Total        240      7,690                    24.5          320
256 × 24      32 RC         64      2,560       6,144          74          960
             112 TG        112        448
              24 SA         96      4,320
              40 DS        160     11,200
              Total        432     18,448                      74          960
4,096 × 8     20 RC         40      1,600      32,768         393          320
              56 TG         56        224
               8 SA         32      1,440
              72 DS        288      4,800
              Total        416      8,064                     393          320
4,096 × 24    36 RC         72      2,880      98,304       1,180          960
             120 TG        120        480
              24 SA         96      4,320
              88 DS        342     11,200
              Total        630     18,880                   1,180          960
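The way the unit figures combine into the tabulated totals can be sketched as follows. This is an illustration under the stated assumptions (unit weights and powers as listed, a 0.5 duty factor for sense amplifiers and drivers, no more than b + 4 drivers active at once, 12 milligrams and 40 mW per bit for the cores); the constant and function names are hypothetical. For the 4,096 × 8 memory it reproduces the listed totals of 416 g and 8,064 mW.

```python
UNIT_WEIGHT_G = {"RC": 2, "TG": 1, "SA": 4, "DS": 4}
# Power per part in mW; SA and DS values already include the 0.5 duty factor.
UNIT_POWER_MW = {"RC": 80, "TG": 4, "SA": 180, "DS": 400}
CORE_WEIGHT_MG = 12          # core plus 1 inch of wire
CORE_POWER_MW_PER_BIT = 40   # per switching core at the maximum cycle rate

def electronics_budget(counts: dict, b: int) -> tuple:
    """Return (weight in grams, power in mW) for the non-core parts."""
    weight = sum(UNIT_WEIGHT_G[part] * n for part, n in counts.items())
    power = sum(UNIT_POWER_MW[part] * n
                for part, n in counts.items() if part != "DS")
    # Only b + 4 drivers (b digit drivers, 2 switches, 2 sinks) are active at once.
    power += UNIT_POWER_MW["DS"] * min(counts["DS"], b + 4)
    return weight, power

def core_budget(w: int, b: int) -> tuple:
    """Return (core weight in grams, core power in mW) for a w × b store."""
    return w * b * CORE_WEIGHT_MG / 1000, b * CORE_POWER_MW_PER_BIT
```

For example, `electronics_budget({"RC": 20, "TG": 56, "SA": 8, "DS": 72}, 8)` gives the electronics figures for the 4,096 × 8 case, and `core_budget(4096, 8)` gives about 393 g and 320 mW for its cores.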
3. Error Protection by Replication of Whole Memories
a. Triplication with Voting
One of the conceptually simplest means of protecting against any
type of error in the MMS is the familiar scheme of replication and voting.
In this scheme input data to the MMS from the control leads and the
address and data busses is sent simultaneously to three independent,
irredundant copies of the MMS. During the fetch cycle, the selection
of a particular word register is identical in all three modules. The
output from the individual bit channels is combined with a three-input
voter (assuming triplication) before being placed on the output bus.
Thus, an individual corresponding bit cell in each data register would
be gated to the input of a single voter whose output would drive one
bit position on the output data bus.
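The voting step can be sketched in a few lines. The example below assumes each module is represented simply as a list of integer words, and treats the voters as fault-free; `vote3` and `tv_fetch` are hypothetical names used only for illustration.

```python
def vote3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three words: one three-input voter per bit position."""
    return (a & b) | (a & c) | (b & c)

def tv_fetch(modules, address: int) -> int:
    """Fetch the same address from three module copies and vote bit by bit."""
    a, b, c = (module[address] for module in modules)
    return vote3(a, b, c)
```

Because each bit position is voted independently, the output is correct even when all three copies are in error, provided no two copies are wrong in the same bit position.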
The advantages of this scheme are numerous. First, no new design,
redesign, or alteration to the design of an existing irredundant MMS
is required. The only equipment needed in addition to the two full
replicas of the MMS is a three-input voter for each bit position in a
data word. This redundancy scheme, called "TV" for triplication with
voting, would protect against failures in any one module regardless of
whether that failure occurred in the bit-channel hardware, the access
circuits, the cycle control, or the power supply. The cost of such
potential gain is quite high; that is, the redundancy ratio is 3, plus a
small fraction for the voting circuits. Alternatively, for a given size
and weight of memory, the capacity is cut down by approximately a factor
of three. It may be noted that the same redundancy ratio is applied to
all components. Since the components differ widely in their reliability,
the relative reliability improvement for the various components is quite
uneven in this scheme.
In order to be able to compare different schemes, we assume a single-
aperture, coincident-current ferrite-core design, with a destructive-
readout fetch cycle. The bits of the data words (or characters) are
entered and retrieved in parallel. In accordance with common practice,
it is assumed that both ends of the word selection drive lines are
available; hence a switch-sink (or driver-switch) scheme is applicable.
It is also assumed that address decoding is done in a diode decoding
tree before the conversion from logic level to power level is made.
In order to assess the potential gain in performance, we let X be
the probability that one bit is in error at the output of a set of bit
channels for a single MMS module.
The probability, Q, that the system is faulty is:
Q_{TV} = 1 - [(1 - X)^3 + 3(1 - X)^2 X]^b, for the redundant system
Q_{NR} = 1 - (1 - X)^b, for the non-redundant system.
For the 8-bit or character-access case, neglecting higher orders
of X, this reduces to
Q_{TV} = 24X^2 versus Q_{NR} = 8X.
For the word-access case, we have
Q_{TV} = 72X^2 versus Q_{NR} = 24X.
In each case, Q_{TV}/Q_{NR} = 3X. Thus, the effectiveness of the redundancy
increases with channel reliability.
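The approximations above can be checked numerically. The sketch below evaluates the exact expressions for an assumed bit-error probability; the value X = 0.0001 is chosen only for illustration and is not taken from the study, and the function names are hypothetical.

```python
def q_tv(x: float, b: int) -> float:
    """Failure probability with triplication and voting (voters assumed perfect)."""
    per_bit_ok = (1 - x) ** 3 + 3 * (1 - x) ** 2 * x   # at least 2 of 3 copies good
    return 1 - per_bit_ok ** b

def q_nr(x: float, b: int) -> float:
    """Failure probability of the non-redundant memory."""
    return 1 - (1 - x) ** b

x = 1e-4                      # illustrative value only
ratio = q_tv(x, 8) / q_nr(x, 8)
print(q_tv(x, 8), q_nr(x, 8), ratio)
```

For small X the printed ratio is close to 3X, in agreement with the first-order result.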
These probabilities hold regardless of the word capacity of the
memory, although X will increase with increased memory size.
If we assume that a "voter" is approximately the equivalent of a
register cell in power and weight, we can compute the relative costs as
follows.
                          Weight     Power                Core weight   Core power
Size          Part          (g)       (mW)       Cores        (g)          (mW)
3(256 × 8)    56 RC         112      4,480       6,144          74          960
             144 TG         144        576
              24 SA          96      4,320
              72 DS         288     14,400
              Total         640     23,776                      74          960
3(256 × 24)  128 RC         256     10,240      18,432         221        2,880
             336 TG         336      1,344
              72 SA         288     12,960
             168 DS         672     34,600
              Total       1,552     59,144                     222        2,880
3(4K × 8)     68 RC         136      5,440      98,304       1,180          960
             144 TG         144        576
              24 SA          96      4,320
             120 DS         480     14,400
              Total         904     24,736                   1,180          960
3(4K × 24)   132 RC         264     10,560     294,912       3,539        2,880
             360 TG         360      1,440
              72 SA         288     12,960
             264 DS       1,056     34,600
              Total       1,968     59,560                   3,539        2,880
It is noted that for a large memory (say 4000 words and 24 bits/word)
the cores account for approximately 64 percent of the total system weight,
and the inhibit drivers account for a significant portion of the total
power. Since both the core weight and the inhibit drive power increase
proportionately with the replication order, it is of interest to consider
redundancy schemes which do not require constant replication of all com-
ponents. One example of such a technique involves the use of error-
correcting codes, as described in Sec. 4-a of this appendix. Various
other techniques can be conceived which involve augmenting the storage
section with additional data channels (comprising less than 100-percent
redundancy) and also with additional bits/word for which simple parity
checks are applied; the access circuitry can be protected by triplica-
tion and voting. Such a scheme is particularly attractive for memories
in that the cores probably comprise the most reliable section of the
memory and hence require the least protection.
b. Duplication with Parity Checking
The word capacity of the main memory is frequently the limiting
feature in the capabilities of a spaceborne computer system. This makes
it highly desirable to seek techniques which provide suitable measures
of protection and permit the realization of the maximum storage capacity
within given constraints of power and weight.
A considerable measure of protection can be afforded by merely
duplicating (rather than triplicating) the MMS, provided some means is
available for identifying and selecting the valid output. Such a
"duplex" scheme was proposed by Kemp 157 and has been designed and used
on the Saturn V computer. 51 In order to provide for a validity check,
an additional bit channel is added to each block of the MMS to permit
parity checking on the output. Inputs are fed to both modules of the MMS
simultaneously. During the fetch cycle, the output word is checked for
parity, and in normal operation only one module's data register is
connected to the output data bus. This data register is used in
conjunction with the one module so long as parity is correct. With each
restore half-cycle, the data from this correct data register is used to
restore the information read from the selected module into both duplex
modules. On detection of a parity error, operation is transferred
entirely to the other module.
This scheme, called DP for duplex-parity, offers good protection
against all single-bit failures that are detectable by a parity check.
This may be an error in a bit channel or an error in the access circuits
which partially stimulates multiple locations in turn yielding a number
of improper bit channel outputs. By using odd parity, even the failure
of the cycle control circuits to give any output would be protected
against. Errors which cannot be detected by a single parity checker,
cannot, of course, be protected against.
Another important feature of the scheme is that good protection is
afforded against failures in access circuitry that affect a small number
of words. In an extreme case, up to w single-word faults could be
tolerated between the two sections.
The probability of system success, P, for the duplex-parity checker
(DP) arrangement can also be found as a function of the probability of
failure in a single bit channel, X.
    P = 1 - Pr(system failed)
      = 1 - [Pr(one module failed)]^2
The probability of one module failing is 1 - Pr(all bit channels in
a module are good), or 1 - (1 - X)^b. Hence the probability of system
success is
    P = 1 - [1 - (1 - X)^b]^2
      = 1 - [1 - 2(1 - X)^b + (1 - X)^{2b}]
      = 2(1 - X)^b - (1 - X)^{2b}
Neglecting higher powers of X, this is P \approx 1 - b^2 X^2. For the
character-access case (b = 8) this is
    P_{DP} = 1 - 64X^2
and for the word-access case (b = 24)
    P_{DP} = 1 - 576X^2
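As a numerical check on this approximation, the exact duplex expression can be compared with its second-order form, 1 - (bX)^2. The following Python sketch is illustrative only; the function names and the sample value X = 1e-4 are assumptions and do not come from the original analysis.

```python
def duplex_parity_success(x, b):
    """Exact P = 2(1 - X)^b - (1 - X)^(2b) for two modules of b bit channels."""
    module_good = (1.0 - x) ** b           # one module has no failed bit channel
    return 2.0 * module_good - module_good ** 2


def duplex_parity_approx(x, b):
    """Second-order approximation P ~ 1 - (b X)^2, reasonable when b*X << 1."""
    return 1.0 - (b * x) ** 2


x = 1e-4  # assumed bit-channel failure probability (illustrative value)
for b in (8, 24):
    exact = duplex_parity_success(x, b)
    approx = duplex_parity_approx(x, b)
    print(f"b={b:2d}  exact={exact:.10f}  approx={approx:.10f}")
```

For small X the two values agree closely, so the quadratic form is adequate for comparing schemes; it becomes optimistic or pessimistic only when bX is no longer small.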
We can extend this analysis to include errors in access and/or
control by considering the following:
Let
    X = probability that a bit channel is in error
    S = probability that an access and control group (A-C) is
        in error;
then the probability of system success is
    P = Pr(all A-C good) Pr(at least 2 out of 3 bit channels good)
        + Pr(exactly 2 A-C good) Pr(all bit channels good)
    P = (1 - S)^3 [3(1 - X)^2 - 2(1 - X)^3]^b + 3S(1 - S)^2 (1 - X)^{2b}
    P \approx 1 - 3bX^2 - 6bXS - 6S^2
which reduces to
    P_{DP} = 1 - 24X^2 - 48XS - 6S^2     for character access
    P_{DP} = 1 - 72X^2 - 144XS - 6S^2    for word access.
While the memory capacity, w, does not appear in these results, it
must be understood that the probability of error in access or control (S)
certainly increases with increasing w. As would be expected, as S
approaches zero, the probability of success, P, approaches the value
indicated in the previous paragraph where S was neglected.
The relative power and weight for the duplex modes can be found by
assuming the exclusive-OR gate for the parity check requires approximately
the same power and weight as the register cell. The 9-input parallel
checker for the character requires 11 gates and the 25-input checker for
the word requires 28 gates. These will, of course, be required on both
modules within the duplex (DP) organization. These assumptions yield the
following (the notation 2[256(8 + 1)] indicates a system composed of two
256-word memories, with eight data bits and one check bit per word).
Size                 Part       Weight    Power     Cores   Weight   Power*
                                 (gr)      (mW)              (gr)     (mW)
2[256(8 + 1)]:       54 RC        108     4,320     4,608      55      720
                     96 TG         96       384
                     18 SA         72     3,240
                     50 DS†       200    10,400
  Total                           476    18,344                55      720
2[256(24 + 1)]:     122 RC        244     9,760    12,800     154    2,000
                    232 TG        232       928
                     50 SA        200     9,000
                    114 DS§       456    23,200
  Total                         1,132    42,888               154    2,000
2[4,096(8 + 1)]:     64 RC        128     5,120    73,728     885      720
                    112 TG        112       448
                     18 SA         72     3,240
                     82 DS        328    10,400
  Total                           640    19,208               885      720
2[4,096(24 + 1)]:   130 RC        260    10,400   204,800   2,457    2,000
                    248 TG        248       992
                     50 SA        200     9,000
                    178 DS        712    23,200
  Total                         1,420    43,592             2,457    2,000
† A maximum of 2[9 + 4] driver/switches (DS) can be "on" at any
one time.
§ A maximum of 29 driver/switches can be "on".
Although the duplex scheme is not as costly as the triplication
method (and of course not as powerful from the standpoint of
error correction) it still requires strict duplication of all
memory channels--producing a costly increase in power and weight.
The techniques of the following part are more economical, although
providing comparable protection.
4. Error Protection by Redundancy Within a Memory
There are many ways of controlling transient or permanent errors
which do not involve replicating the entire MMS. These fall generally
into two rather natural groups: (1) those that increase the number of
bits per word--i.e., redundant bit channels--and (2) those that increase
the number of address locations--i.e., redundant storage registers. Both
approaches will be discussed in this section. In applying redundancy,
precautions should be taken to avoid overloading heavily stressed circuits.
For example, if a redundancy scheme increases the load on the current
drivers, which are usually stressed more heavily than other parts, they
may be expected to be more susceptible to failures, unless separate
access circuits are added to handle the increased load.
a. Redundant Bit Channels
Many faults may occur independently among the bit channels; hence
the use of error-correcting codes may provide an efficient method of
fault masking.
One manner in which this might be accomplished is illustrated in
Fig. A-2. Since the incoming data is assumed here to contain no redundant
bits, the generation of the necessary bits to store in the redundant bit
channels must be done within the MMS during the store cycle. During the
fetch cycle these bits are used to correct erroneous bits from the MMS
before they are placed on the output bus. The number of errors per word
that are detectable or correctable by this technique depends on the design
of the particular error-correcting code. We wish to illustrate here the
practical implementation of such codes for several word sizes.
A familiar and very effective family of codes is the family of
single-error-correcting Hamming codes. Circuits to implement this
scheme for the 8-bit character, the 24-bit word, and the 24-bit word
broken into three 8-bit bytes are discussed in Sec. 5.
[Figure: block diagram with labeled elements: input data bus, parity
generators, data register, bit channels, parity checkers, bit
correctors, output data bus.]
FIG. A-2 REDUNDANT BIT-CHANNEL CONNECTION
For an 8-bit byte or character, four redundant bit channels are
required. Besides adding to weight and power dissipation, the individual
drive lines must now drive more cells or cores; hence they must either be
redesigned or suffer some loss in expected reliability. Protection is
afforded against failures in gating, the data register, digit drivers,
storage cells, and sense amplifiers. No protection is afforded against
failure to select the correct word, the encode-decode equipment, or the
cycle-control hardware.
The probability of system success for the single-byte (or
single-character) case can be computed as follows. There must now be
12 bit channels--8 data and 4 parity. Again X is the probability that
a single bit (or bit channel) is in error, P the probability that the
system is good, and Q the probability that the system is not good. Then
    P = (1 - X)^{12} + 12X(1 - X)^{11}
Neglecting high order terms,
    Q = 1 - P = 66X^2, for the character, versus Q_{NR} = 8X.
Since this does not take into account failures in the access subsystem,
this holds for both word capacities considered. If we now assume
3 bytes, similarly protected to make up a word, we have
    P = [(1 - X)^{12} + 12X(1 - X)^{11}]^3
and
    Q = 1 - P = 198X^2, versus Q_{NR} = 24X.
For both cases,
    Q/Q_{NR} = 8.25X.
The minimum number of redundant bits to protect a 24-bit word against
single-bit errors is 5. Hence, to find the system success probability for
29 bit channels using single error correction, we have the following.
    P = (1 - X)^{29} + 29X(1 - X)^{28}
    Q = 1 - P = 406X^2
Each of these cases neglects higher powers of X.
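The coefficients 66, 198, and 406 are the leading terms of the probability that two or more bit channels fail, i.e. C(n,2) for n channels per protected group. The following Python sketch compares the exact expressions with these second-order forms; the function names and the sample value X = 1e-4 are illustrative assumptions.

```python
from math import comb


def sec_success(x, n):
    """P = (1 - X)^n + n X (1 - X)^(n-1): at most one of n channels in error."""
    return (1.0 - x) ** n + n * x * (1.0 - x) ** (n - 1)


def sec_failure(x, n, groups=1):
    """Exact Q = 1 - P^groups for `groups` independently protected groups."""
    return 1.0 - sec_success(x, n) ** groups


def sec_failure_approx(x, n, groups=1):
    """Leading term Q ~ groups * C(n, 2) * X^2 (higher powers of X neglected)."""
    return groups * comb(n, 2) * x ** 2


x = 1e-4  # assumed bit-channel failure probability (illustrative value)
for label, n, groups in (("8+4 character", 12, 1),
                         ("3 bytes of 8+4", 12, 3),
                         ("24+5 word", 29, 1)):
    print(label, sec_failure(x, n, groups), sec_failure_approx(x, n, groups))
```

Under these assumptions, the three-byte organization has roughly half the failure probability of the single 29-channel word (198X^2 against 406X^2), at the cost of 36 rather than 29 bit channels.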
The relative power and weight are only slightly increased over the
irredundant case and are as follows.
For the character case:
Size                 Part       Weight    Power     Cores   Weight   Power
                                 (gr)      (mW)              (gr)     (mW)
256 (8 + 4):         20 RC         40     1,600     3,072      37      480
                     60 TG         60       240
                     12 SA         48     2,160
                     28 DS        112     6,400
  Total                           260    10,400                37      480
4,096 (8 + 4):       24 RC         48     1,920    49,152     590      480
                     68 TG         68       272
                     12 SA         48     2,160
                     76 DS        304     6,400
  Total                           468    10,752               590      480
For the word case:
Size                 Part       Weight    Power     Cores   Weight   Power
                                 (gr)      (mW)              (gr)     (mW)
256 (24 + 5):        37 RC         74     2,960     7,424      90    1,160
                    127 TG        127       508
                     29 SA        116     5,220
                     61 DS        244    13,200
  Total                           561    21,888                90    1,160
4,096 (24 + 5):      41 RC         82     3,280   118,784   1,426    1,160
                    135 TG        135       540
                     29 SA        116     5,220
                     93 DS        372    13,200
  Total                           705    22,140             1,426    1,160
Thus we see that the cost of the protection here is significantly less
than the cost of either the triplication scheme or the duplex scheme.
Although the protection afforded by the triplication scheme is somewhat
greater than the error-correcting scheme discussed here, if the probability
of an element failure is low then the protection levels are approximately
the same.
For the case where a word is treated as 3 bytes of 8 bits each, the
following costs apply:
Size                 Part       Weight    Power     Cores   Weight   Power
                                 (gr)      (mW)              (gr)     (mW)
256 x 36:            44 RC         88     3,520     9,216     111    1,440
                    148 TG        148       592
                     36 SA        144     6,480
                     68 DS        272    16,000
  Total                           652    26,592               111    1,440
4,096 x 36:          48 RC         96     3,840   147,456   1,770    1,440
                    156 TG        156       624
                     36 SA        144     6,480
                    100 DS        400    16,000
  Total                           796    26,944             1,770    1,440
b. Redundant Words
The use of error-correcting codes on the bit channels is of no help
in masking faults in the access section. Faults in an access switch
usually invalidate at least one word, and usually a block of words.
Certain faults in the access switch or in the decoding circuits may
invalidate all words.
One straight-forward way of preventing such catastrophic failure is
to subdivide the access equipment into independent parts, each giving
different sets of words, so as to limit the extent of propagation of a
given fault. In use, this requires that the system have the addressing
flexibility to permit transfer of data to different locations in memory.
Such flexibility is well developed in modern computers designed for
multiprogramming or for multiprocessing. Its implementation is aided by
use of a special directory table, which specifies the physical location
of memory addresses, in groups of addresses known as blocks, or pages.
This table may be located in an ordinary memory, or perhaps replicated
in several memories, for safety. No discussion of this technique has
been found in the literature, but it would seem to have sufficient merit
to justify further development.
c. Accommodation to Access Faults
A fault in an access system may invalidate one or more words. It
would be desirable to shift the contents of such words to new locations.
One way to accomplish this relocation is to change the address code of
all instructions that call upon that word.
For those words that are directly addressed, the change may be
accomplished by changing the contents of the address field within the
instruction word stored in memory. Determination of those instructions
that address a given word may be accomplished in one of two ways: either
by exhaustively searching all instructions within memory or by waiting
until such references actually occur (at which time the location of the
calling instruction is revealed).
For those words that are accessed by the stepping of a counter or by
some other arithmetic operation, or by a composition of several address
subfields, such changes are impractical. One solution for these kinds
of access is to use an associative memory in which such exceptional
addresses are stored together with the address of the substituted words,
and to drive the memory with address data in parallel with the main access
switch. The output of the associative memory may then be used to sub-
stitute for the nominal, unusable address.
The merit of such schemes is that faults in access equipment may
be accommodated with low equipment redundancy compared to other known
schemes.
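The address-substitution idea can be illustrated with a small software model in which a dictionary plays the part of the associative memory. This Python sketch is an assumption-laden illustration, not a description of any hardware in this report: the class name AddressRemapper, its methods, and the pool of spare locations are all hypothetical.

```python
class AddressRemapper:
    """Software stand-in for an associative memory of exceptional addresses."""

    def __init__(self, spare_addresses):
        self._spares = list(spare_addresses)  # unused locations held in reserve
        self._substitute = {}                 # faulty address -> substitute address

    def mark_faulty(self, address):
        """Record a faulty word and assign it the next spare location."""
        if address in self._substitute:
            return self._substitute[address]
        if not self._spares:
            raise RuntimeError("no spare locations left")
        self._substitute[address] = self._spares.pop(0)
        return self._substitute[address]

    def translate(self, address):
        """Return the substitute for an exceptional address, else the address itself."""
        return self._substitute.get(address, address)


remapper = AddressRemapper(spare_addresses=[4094, 4095])
remapper.mark_faulty(17)
print(remapper.translate(17), remapper.translate(18))
```

Because every reference passes through translate, the substitution also covers addresses formed by counters or by arithmetic, which is the case that rewriting instruction address fields cannot handle.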
d. Addition of Access Redundancy
A very interesting scheme of error control for memories has been
suggested by C. A. Allen.* This scheme uses redundant bit channels plus
the encoder--decoder for error-correcting codes exactly as in the previous
section. The essential addition is that each bit channel is provided
with its own word-access circuits for address decoding, and with read/
write drivers. In this organization, failures in any one set of access
circuits affect only a single bit channel. The error is then corrected
as the word is transferred to the output data bus during the fetch cycle.
This scheme has obvious disadvantages for magnetic memories. The
access circuits must be replicated b + k times. Each of these circuits
must not only decode the address input but amplify the decoded signal
from logic level to drive level. The replication of amplifiers (drivers)
is not likely to be practical. However, future memories may well be
built from techniques which require only decoding--that is, in which the
access information remains at logic levels. Integrated circuits with a
flip-flop for a bit-storage cell are an example.
Much of cycle control is associated with the access circuits. The
rest is associated with transfers, such as from register to bus. If
circuit redundancy is applied to this portion of cycle control it is
possible to design an MMS which is completely protected against single
part failures throughout the subsystem, and which costs considerably
less in power and weight than its TV counterpart.
* Described in a graduate seminar at Stanford University, Fall 1965.
e. Redundant Access Circuits
Probably more design effort has gone into the development of reliable
access switches than any other part of the MMS. In the area of design
for fault prevention, some knowledge of the mechanism of failures is
required. A very interesting study has been done by Minnick, 110,111
based on the assumption that the ferrite and wire portions of memories
are far more reliable than the semiconductor portions. Techniques are
then developed which employ magnetic switching elements for address
decoding, and transistors are assumed to be restricted to the driving
of the magnetic switches. Thus, a number of drivers are turned on to
accomplish the selection of a single storage register, and the energy
from these drivers is combined in a magnetic access switch to do both
the selection and driving of the specified storage register. The switch
is wired according to certain codes, such as those based on block designs.
The design is such that the failure of a single driver changes only the
amplitude of the current supplied to the selected memory line. A further
development in the design consists of placing a load core on the ground-
return side of the memory lines so that its switching resistance assists
in regulating the amplitude of current which passes through the storage
module, in order to tolerate the variation in current due to a faulty
driver. A further requirement of these selection switches is that the
access lines to memory be single-ended; thus it does not permit a
switch-sink type of decoding. The utility of this technique is greatly
increased if the address information is generated in a redundant code
at the computer, since the drivers require a redundant code for their
excitation. If irredundant addresses are transmitted from the computer
to the MMS, an encoder from the address register to the drivers or from
the address bus to the address register must be provided. Minnick also
considered design techniques for a number of recoders using magnetic
elements. 111
It would be desirable to evaluate the total weight and power costs
of several of the schemes described by Minnick. These costs are probably
very high compared to all-semiconductor realizations, but recent advances
in the miniaturization of magnetic elements may increase the feasibility
of the approach. A disadvantage of the scheme is the small number of
faults that may be accommodated with reasonable cost.
* See Section III-C-2-c of this report for an illustration of this method.
Very little has been published on detection schemes for telling
whether or not the proper storage register has been selected. An
interesting scheme published by the General Electric Company 157
detects whether no word has been selected or whether more than one
word has been selected. This scheme involves the addition
of one core plane or bit channel, which lacks an inhibit driver. Each
storage register then has an additional core, and since there are no
inhibit circuits, the core switches on read and switches back on restore,
giving a sensible output on every cycle. Two separate sense amplifiers
are used on reading--one set with a threshold level for a single core
switching and the other with a threshold level for two cores switching.
If neither amplifier senses a switched core during a cycle, no storage
register has been selected and there has been improper operation in the
word control. If the single-threshold amplifier switches but not the
double-threshold amplifier, proper operation is assumed. This scheme
gives no indication as to which storage register was selected, but con-
sideration of several access-switch schemes indicates that an exchange
of single register selections due to a fault within an access switch is
extremely unlikely. If the double-threshold amplifier switches, it
indicates that two or more storage registers have been simultaneously
selected during that cycle. This information could be used in conjunction
with parity checks on the data to switch over to an entirely different
block of core memory. This scheme has been given the term "memory-cycle
validation check."
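The decision logic of the two-threshold check can be summarized in a few lines. The following Python sketch is a behavioral model only; the names CycleStatus, sense_check_plane, and validate_cycle are hypothetical, and the amplifiers are idealized as exact counters of switched cores.

```python
from enum import Enum


class CycleStatus(Enum):
    NO_SELECTION = "no storage register selected"
    SINGLE_SELECTION = "one storage register selected (proper operation assumed)"
    MULTIPLE_SELECTION = "two or more storage registers selected"


def sense_check_plane(cores_switched):
    """Idealized outputs of the single- and double-threshold sense amplifiers."""
    return cores_switched >= 1, cores_switched >= 2


def validate_cycle(single_fired, double_fired):
    """Classify a memory cycle from the two amplifier outputs."""
    if double_fired:
        return CycleStatus.MULTIPLE_SELECTION
    if single_fired:
        return CycleStatus.SINGLE_SELECTION
    return CycleStatus.NO_SELECTION


for n in (0, 1, 2):
    print(n, validate_cycle(*sense_check_plane(n)).value)
```

As the text notes, this check does not reveal which register was selected, so it would be combined with a parity check on the data before any switchover decision is taken.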
f. Redundant Material in the Storage Module
Single-aperture ferrite cores have proven to be exceedingly reliable
elements and little has been published on redundancy on the element level
within the storage module. The fact that readout requires destruction of
the stored information and relies upon the electronics external to the
storage module to restore the information has caused considerable concern
about the possible loss of information due to transient electronic failures.
Considerable effort has gone into the development of techniques to provide
nondestructive readout (NDRO).
It is difficult to design an NDRO memory to the same tolerances as a
DRO memory of the same size, speed, and power consumption. The principal
technique of construction is based on the use of some multiaperture
(usually two-aperture) magnetic element. Problems of drive-current
tolerance have led to the use of large structures, with consequent heavy
use of power. Use of more complex (three-aperture) elements can be help-
ful, but this requires extra drivers and wire. These costs of weight,
power, and reduced reliability should be evaluated with respect to the
extra costs of protecting DRO memories by special control circuits.
These weaknesses apply to an NDRO memory which is required to record
new information during its mission. It is useful in spaceborne missions
to have NDRO memories in which writing occurs only prior to launch,
perhaps using externally fed writing currents. In this case, the main
disadvantage is the extra weight due to the use of (generally) heavier
memory elements.
g. Redundant Cycle Control
The function of the circuits within the cycle control is to accept
the two commands from the computer, distinguish between them, and then
generate the detailed sequence of control pulses which will turn on
gates for data transfer, initiate drive pulses, and gate sense amplifiers
at appropriate times. The two primary types of circuits that are used
for this function are tapped delay lines and special counters. Delay
lines may of course be replicated and their outputs combined in majority
voters. For lines in which the expected failure mode is dropout, only
duplication is needed. Fault-tolerant counter circuits have been described
in the literature and they are generally smaller and require less power
than delay lines.
h. Redundant Power and Environment Control
Current must move through the access lines, and hence through the
storage cores, in opposite directions during the write and read portions
of a cycle. This is generally accomplished by having separate current
drivers which operate from power supplies having opposite polarities.
As temperature increases in a ferrite core, the drive current required to
switch it decreases. This change in drive current as a result of change
in temperature is usually built into the power supply with a temperature-
sensing device to control the voltage of the driver supplies. While
many designs are available to permit this variation in drive voltage with
temperature, no literature has been found on the use of redundant techniques
in the power supplies. A third supply voltage is generally required,
separate from the two just mentioned, to supply all of the logic circuits
within the main memory. Since historically power supplies and turn-on
transients have occasioned a great deal of lost data, it is surprising
that so little attention has been paid to this part of the memory design.
This problem is discussed further in Appendix B of this report.
5. Design of a Parallel Encoder/Decoder
Several techniques based on the use of error-correcting codes have
been described which permit the addition of redundant bit channels to
achieve an improvement in reliability, with redundancy ratios less than 2.
In order to conserve operating speed, it is desirable that the code-
processing functions be performed on all data channels in parallel. The
design of parallel encoders has been studied in the past, but there is
little information as to the practical costs of such networks. The use
of error-correcting codes with additional bit channels to store the
redundant digits would provide a method of masking any single-error fault
within the bit-control module. In this section we study the design of
an encoder placed between the input data bus and the data register, with
the mating decoder placed between the data register and the output data
bus. Such placement of the encoder/decoder will mask faults within the
data register, the digit drivers, the storage module, the sense amplifiers,
and the transfer gates, but of course not in the encoder/decoder itself.
The design of the encoder/decoder was carried out assuming a commercial
line of integrated circuits in order to get some reasonable estimate of
size, speed, and power consumption. The design is based upon the following
parity matrix.
            A  B  C  D  E  F  G  H  I  J  K  L
            1  0  0  1  1  0  1  0  1  0  0  0
    P  =    1  1  0  0  0  1  1  1  0  1  0  0
            0  1  1  0  1  0  1  1  0  0  1  0
            0  0  1  1  0  1  0  1  0  0  0  1
This matrix is based on the (12,8) single-error-correcting Hamming code.
The data bits are represented by bit positions A through H and the four
redundant check bits are represented by positions I, J, K, and L. Hence
twelve bit channels are required for storing a given data word. The task
of the encoder is to generate the information which will be stored in the
redundant bit channels. As the operation of the encoder/decoder is
described, it will be helpful to follow Fig. A-3.
The encoding is accomplished by checking parity over those bit
positions in the data word for which ones exist in a given row of the
parity matrix. Thus, for example, bit I is obtained as the mod-2 sum
of the information in channels A, D, E, and G.
Decoding is done in the following manner. Four parity checkers are
built, whose outputs are designated W, X, Y, and Z, one corresponding to
each row of the parity matrix and taken over all bits containing a one
in that row, now for all 12 columns. The four bits from these checker
outputs are termed the "syndrome character." If there have been no
errors, these 4 parity-checked outputs will all be zeros. An error in
any one bit channel will cause one or more of the parity checkers to
give a one output, generating a syndrome character that is not all
zeros. For a single-bit error in a data channel, the syndrome character
will correspond to one of the columns A through H. That column, then,
must have its bit output from the data register in error. Occurrence of
a single error thus results in one channel being indicated, and the
error in that channel may be corrected by inverting its data before
transferring it to the output bus. This is done by including an
exclusive-OR gate in the output transfer, so that any line containing a
one bit from the syndrome decoder will invert the output of that bit
channel as it is placed on the data bus.
* The reader is referred to Sec. II-B-2 for a general discussion of
error-correcting codes for storage.
[Figure: circuit diagram of a typical data-bit channel and a typical
check-bit channel, with labeled elements: Y switch, X sink, Y sink,
parity generator, data register, parity checker.]
FIG. A-3 ERROR-CORRECTING CODE IN BIT CHANNELS
The parity generators for generating the Ith and Kth bits during
encoding are assumed to be built from Fairchild integrated circuits,
model DTμL 930, and are shown in Fig. A-4. Similar parity generators
are required for the generation of the J and L bits and also in the
decoder for the generation of W, X, Y, and Z. Fig. A-5 shows the
necessary integrated circuits to accomplish the syndrome decoding, the
bit correction, and the data register to data bus transfer gates for
bit channels A and B.
It should be noted that a single four-input gate can be used to
detect the fact that an error has occurred somewhere even though
the error is masked in the transfer. That is, if the syndrome character
contains a 1, i.e., if it is not all 0's, then an error has been
detected and masked. This fact can be transmitted to the computer for
possible status-of-equipment analysis.
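The encoding, syndrome generation, correction, and error-flag steps described in this section can be modeled in software. The following Python sketch is an illustration only, not the gate-level design of Figs. A-4 and A-5. It assumes the rows of the (12,8) parity matrix exactly as printed above (so that the first row checks channels A, D, E, G, and I), and the names encode, syndrome, and decode are hypothetical.

```python
CHANNELS = "ABCDEFGHIJKL"

# Rows of the parity matrix as printed; outputs W, X, Y, Z in the decoder.
PARITY_ROWS = [
    "100110101000",  # W: A, D, E, G and check bit I
    "110001110100",  # X: A, B, F, G, H and check bit J
    "011010110010",  # Y: B, C, E, G, H and check bit K
    "001101010001",  # Z: C, D, F, H and check bit L
]
H = [[int(c) for c in row] for row in PARITY_ROWS]

# Map each column of H (a possible single-error syndrome) to its channel index.
COLUMN_OF = {tuple(row[j] for row in H): j for j in range(12)}


def encode(data):
    """Append check bits I..L to 8 data bits A..H (mod-2 sum over each row)."""
    checks = [sum(row[j] & data[j] for j in range(8)) % 2 for row in H]
    return list(data) + checks


def syndrome(word):
    """Return the syndrome character (W, X, Y, Z) for a 12-bit stored word."""
    return tuple(sum(row[j] & word[j] for j in range(12)) % 2 for row in H)


def decode(word):
    """Return (corrected data bits A..H, error_detected flag)."""
    s = syndrome(word)
    corrected = list(word)
    if any(s) and s in COLUMN_OF:
        corrected[COLUMN_OF[s]] ^= 1  # invert the indicated channel
    return corrected[:8], any(s)


stored = encode([1, 0, 1, 1, 0, 0, 1, 0])
stored[CHANNELS.index("D")] ^= 1  # inject a single-bit fault in channel D
print(decode(stored))
```

The second value returned by decode corresponds to the four-input gate mentioned above: it reports that an error was detected even though the data delivered to the bus has been corrected.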
The Hamming-code equations for a parity matrix to permit single error
detection and correction in a 24-bit word require 5 redundant bit channels,
or a parity matrix with 29 columns and 5 rows. Such a parity matrix can
be obtained from a binary counting sequence by eliminating the all-0
column, the all-1's column, and one column with four 1's. The parity
generator then requires the design of parity checkers over 15 bits.
The parity matrix for 24 information bits plus 5 redundant bits is:
     C1 C2 A  C3 B  C  D  C4 E  F  G  H  I  J  K  C5 L  M  N  O  P  Q  R  S  T  U  V  W  X
     1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1
     0  1  1  0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1  0  0
P =  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1  0  0  0  0  1  1
     0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  1  1  1  1  1  1
     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1
The columns with single 1's are chosen to be the check bits, since
they can be generated with a single parity generator over the 1's in
that row.
[Figure: logic diagram of the parity generators for check bits i and k.]
FIG. A-4 PARITY GENERATORS FOR i AND k
[Figure: logic diagram of the syndrome decoder, bit correctors, and
data register (DR) to bus (BU) transfer gates.]
FIG. A-5 SYNDROME DECODER, BIT CORRECTOR AND DATA REGISTER TRANSFER GATES
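The construction of the 29-column parity matrix from a binary counting sequence can be sketched as follows. This Python fragment assumes that the columns are the binary representations of 1 through 29 with the least significant bit in the top row, which is an inference from the printed matrix; the function names are hypothetical.

```python
def counting_parity_matrix(n_columns=29, n_rows=5):
    """Rows of the parity matrix; column c (1-based) is c written in binary."""
    columns = range(1, n_columns + 1)
    return [[(c >> r) & 1 for c in columns] for r in range(n_rows)]


def check_bit_columns(matrix):
    """Zero-based indices of the columns that contain a single 1."""
    n_cols = len(matrix[0])
    return [j for j in range(n_cols) if sum(row[j] for row in matrix) == 1]


P = counting_parity_matrix()
for row in P:
    print("".join(str(bit) for bit in row))
print("check-bit columns:", [j + 1 for j in check_bit_columns(P)])
print("widest parity check:", max(sum(row) for row in P), "bits")
```

Under this assumption the check bits fall in columns 1, 2, 4, 8, and 16, and no row contains more than 15 ones, which agrees with the statement that parity checkers over 15 bits are required.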
6. Conclusions
Several schemes have been described for the control of errors within
the various sections of a random-access main-memory system. The relative
costs in power and weight and the relative improvement in reliability
have been compared for a number of major approaches, for several parameters
of word size and count. The evaluation of costs is made on the basis of
cost parameters for present off-the-shelf components. Future developments
in technology promise to reduce the proportional weight costs of the
electronic subsystems.
The use of error-correcting codes for masking independent faults
in bit channels is practical and very beneficial.
Several schemes have been discussed for masking, avoiding, or
otherwise accommodating faults in access equipment, but their complexity
is such that a good comparison would require very extensive analysis, and
the analysis of some schemes would have to be tied to particular circuit
schemes. Further investigation of these schemes is recommended. Of
the various schemes, paging appears to be the easiest to employ.
The need for new techniques for the protection of power supplies
has been noted.
It is recommended that techniques of error control be investigated
for various special kinds of memory systems not considered here; e.g.,
associative memories, fixed memories, and buffer registers, assuming
the various appropriate device technologies. Also, although the schemes
described here are generally applicable to NDRO memories, special
redundancy techniques may be useful to particular memory structures.
These possibilities should be investigated further.
Because of the close physical interaction among the several func-
tional sections of a memory, it is necessary to evaluate a set of schemes
with respect to the overall reliability of an integrated memory system.
Such evaluation requires the conducting of design exercises based upon
the choice of particular sets of operational requirements and particular
device technologies. It is recommended that such design studies be
undertaken.
Appendix B
DISTRIBUTED POWER-SUPPLY SYSTEMS
1. Introduction
The design of power supplies has been too often considered as a
separate (and usually secondary) system problem. It is very clear that
as high-performance systems grow more complex, the power-supply and power-
conditioning equipment must be carefully designed as an integral part of
a system, rather than being added on as a separate system component.
There are many possible power-supply methods for ultra-reliable computers,
ranging from single (nonredundant) designs to methods of using distributed
power supplies, or at least distributed power conditioning.
Early in the present project we felt that the distributed power supply
(having a separate power unit for each logic group) had merit; we now feel
that some form of distributed, noncentralized power supply is essential to
the success of complex, long-life computer systems for space use.
In the following pages the advantages and disadvantages of distributed
power conditioning systems and the interdependence between supply and logic
circuits will be discussed. Three examples of possible ways to design
power-supply systems are given, along with comments on these examples, and
some notes on areas where additional work is needed.
2. Advantages of Distributed Power-Supply Systems
(1) There is nearly complete independence between a logic circuit
group and other subsystem parts.
(2) Various logic circuit types of different origins could easily
be merged to fulfill new system needs, without being required
to power the units from available supply voltages. This is
particularly true if ac coupling is used at data interfaces
between modules.
(3) A power cutoff switch could be included in each module, so
that unused modules would not consume power.
(4) Current limiting is much easier, making it easy to protect
the raw power source from damage.
(5) There is very little electrical interaction between modules
through the power supply.
(6) Semiconductor device reliability is better, since a large
number of small-junction transistors and diodes are used,
rather than a few large-junction units. 278
(7) Switching a single power-input line to a multiple-output
logic module, in combination with the use of decoupling
diodes at each output, provides an economical means for
switching the whole module in and out of the system.
3. Disadvantages of Distributed-Power-Supply Systems
(1) A DPS system is heavier than a single nonredundant system,
possibly heavier than two single systems, unless higher
frequency converters are used in the DPS.
(2) Circuits are more complex, since each DPS module supply is
nearly as complex as a single large supply.
(3) Electrical efficiency is lower, particularly if some con-
ditioning is required before the raw dc power is distributed.
(4) The load requirements of each logic group should be similar,
to avoid too many different types of DPS in a given system.
Further comparisons are not particularly useful unless specific
systems are compared. The entire computer system should be considered
without attempting to treat the power supply separately.
4. The Interdependence Between Power Supply and Logic Circuits
a. Noise Problems
Many digital system problems can be traced to the power supply, or
are blamed on the power supply. Usually the troubles are due to "noise"
caused by logic-circuit loads being switched on and off abruptly in the
course of normal computation tasks. Unless the power-supply output im-
pedance is extremely low, the voltage on the power supply varies rapidly
as the loads are switched, causing noise on the output bus. If the noise
voltage is of sufficient magnitude, false triggering of circuits can occur.
The usual remedy is to use bypass capacitors distributed among the groups
of circuits, so that the short-duration currents are drawn from a local
capacitor. The most widely used form of digital logic uses dc level
shifts to define a 1 or 0, rather than ac coupling of pulses between
circuits or groups of circuits.
If several different power supplies are used in a system employing
dc coupled logic, the power supplies must have nearly equal voltages, or
the noise margins of the circuits will suffer.
The ratio between the 1-to-0 voltage swing and the noise-voltage
magnitude is a good measure of the susceptibility of a digital system
to noise problems. Some circuits require only a few tenths of a volt,
and are affected by very small noise voltages on the power supply (or on
data leads), while others require several volts to trigger, and are
therefore relatively unaffected by power-supply noise. High-speed cir-
cuits are more susceptible to supply problems, unless great care is
taken to minimize the length of leads.
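The swing-to-noise ratio described above can be made concrete with a short calculation. The following Python sketch is illustrative only: the function names, the simple bypass-capacitor droop model (dV = I * dt / C), and every numerical value are assumptions chosen for the example, not figures from this study.

```python
def bypass_droop_volts(load_step_amps, duration_s, capacitance_f):
    """Voltage droop on a local bypass capacitor that supplies a
    short-duration load current by itself (dV = I * dt / C)."""
    return load_step_amps * duration_s / capacitance_f


def swing_to_noise_ratio(logic_swing_volts, noise_volts):
    """Ratio of the 1-to-0 voltage swing to the noise-voltage magnitude."""
    if noise_volts <= 0:
        raise ValueError("noise voltage must be positive")
    return logic_swing_volts / noise_volts


# Hypothetical example: a 50 mA load step lasting 100 ns, bypassed by 0.1 uF.
noise = bypass_droop_volts(50e-3, 100e-9, 0.1e-6)

# Assumed swings: a low-level circuit (0.3 V) and a high-level circuit (3 V).
low_level = swing_to_noise_ratio(0.3, noise)
high_level = swing_to_noise_ratio(3.0, noise)
```

With the same supply noise, the circuit with the larger swing has a ratio ten times higher, which is the sense in which high-level circuits are relatively unaffected by power-supply noise.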
If distributed supplies are used with dc coupled logic systems
having poor noise immunity, ac coupling at the data interfaces would
be very desirable, since a few tenths of a volt difference between the
individual module supplies would not degrade the noise immunity of the
logic circuits.
If high-level devices such as field-effect transistors were used in
logic circuits, ac coupling at the data interface would probably not be
required, and the design of a workable DPS system would
be much easier.
Note that the ac noise problems in a computer using DPS are not as
serious as in single-power-supply systems, since individual module regu-
lators are in effect cascaded (in series), greatly reducing mutual coupling
between modules.
b. Fault Location, Isolation, and Corrective Action
When a malfunction occurs in a computing system, it is desirable to
determine where the fault is, to isolate the defective elements, and to
take the best available course of action to restore as much computational
ability as possible.
When a power-supply fault occurs, the first order of business is to
protect as much of the system from consequential damage as possible. If
the fault results in a transient overload, it may be desirable to simply
limit the fault current to prevent damage to the raw supply, and restore
voltage to the module after a brief interval. If excessive current de-
mands persist, then the module should be disconnected from the raw supply
to prevent energy loss to a useless module. If voltage can be restored
after a momentary fault, then logical checks should be made to determine
whether the computational performance of the module has been impaired.
Note that a logic check will always determine whether significant
damage has occurred, so that the main function of the module supply itself
is to prevent consequential damage to other system elements. Permanent
disconnect of the module is thus a supervisory function, while protection
from damage is a local, self-contained function of the module supply
itself.
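The fault-handling policy just described can be sketched as a small decision function. This is a minimal illustration, assuming a hypothetical retry limit that separates a "transient" overload from a "persistent" one; the names `Action`, `supply_fault_action`, and `max_retries` are invented for the example and do not come from the report.

```python
from enum import Enum


class Action(Enum):
    LIMIT_AND_RETRY = "limit fault current, restore voltage after a brief interval"
    DISCONNECT = "disconnect the module from the raw supply"
    RUN_LOGIC_CHECK = "voltage restored; run logical checks on the module"


def supply_fault_action(overload_present, retries_so_far, max_retries=3):
    """Choose the response to an overload on one module supply.

    overload_present: True while the module still demands excessive current.
    retries_so_far:   number of restore attempts already made.
    max_retries:      assumed limit after which the overload counts as persistent.
    """
    if not overload_present:
        # Voltage could be restored: check whether computation is impaired.
        return Action.RUN_LOGIC_CHECK
    if retries_so_far < max_retries:
        # Treat as transient: protect the raw supply and try again.
        return Action.LIMIT_AND_RETRY
    # Excessive demand persists: stop feeding energy to a useless module.
    return Action.DISCONNECT
```

In the division of labor described in the text, current limiting and retry would be local to the module supply, while the permanent disconnect decision would be issued by the supervisory level.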
Future computers will operate on very low power, as evidenced by the
work being done by Fairchild. 343
5. Examples of Three Possible Designs for Power-Supply Systems
Of the many designs that could be used for supplying power to a
spacecraft computer, three methods have been selected as examples of power
conditioning. These examples are shown in Fig. B-1.
Method 1 is the most conventional of the three, and is used in some
systems already designed. Raw dc power is "chopped" in a dc-to-ac
converter, rectified and filtered, and regulated by a series regulator.
Method 2 is similar to Method 1, except that the regulators are
distributed and located at each logic group. This kind of system has
been tried, but not used to any extent.
FIG. B-1 POWER CONDITIONING SYSTEMS
(Methods 1, 2, and 3; X = switch; in Method 3 the converter, rectifier,
filter, and regulator for each logic group form a single unit)
Method 3 represents a true distributed power-conditioning approach,
since raw dc is wired to each power-supply converter/regulator, and no
elements are common to all logic groups.
The scheme used for switchover to a spare supply (Methods 1 and 2)
is very elementary: open 1 and 2 and then close 3 and 4. More complex
schemes using more switches and crossovers have been proposed. These
schemes allow a spare regulator, for instance, to be switched into the
existing supply. If the switch reliability is not considered, the com-
plex schemes could have much higher reliability than the simple spare-
supply concept. Switch reliability is important, and the control prob-
lems of the complex rerouting schemes are serious. Weight penalties of
two to three times the normal supply weights are also involved. The
complex schemes have therefore not been widely used.
Although individual regulators are not presently employed at each
logic group, series filter inductors or resistors and bypass capacitors
are frequently used.
Method 3 certainly involves more parts than the other examples, but
it should be easier to manage from a supervisory standpoint, and the weight
penalties are not severe.
6. A Possible Configuration for a Power-Control System
The converter/regulator associated with each logic group should also
include a disconnect function so that (1) the logic group can be removed
from the source of energy in the event of a fault within the group, and
(2) unused groups can be disconnected to conserve power.
It is believed that such disconnect circuits should have a toggle
action, so that if directed "off" or "on," they will remain in the
desired position, even though a general power failure has occurred.
A possible scheme for such a system is shown in Fig. B-2. Each power-
control element has a square-loop magnetic core associated with it, so
that when the core is "set," voltage is applied to the logic group.
Power would be applied to the computer by applying coincident pulses
to crosspoint J-1 (Logic Group A energized), J-2 (B energized), etc.;
if logic group B failed, or was suspected to have malfunctioned, cross-
point J-2 would be pulsed off and the spare group turned on by pulsing
K-1. Magnetic control of this type would be easy to address, and has the
additional advantage of being electrically isolated from both the data
and power-supply components.
FIG. B-2 POSSIBLE POWER-CONTROL SYSTEMS
(raw power bus feeding a power-control element for each logic group and
for the spare logic group, addressed by matrix control wires)
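The addressing scheme described above can be sketched in software as a matrix of bistable elements. This is a toy model under stated assumptions: the class name `PowerControlMatrix`, its methods, and the matrix size are hypothetical, and the model represents only the logical behavior (a core changes state only under coincident row and column pulses, and holds its state without power), not the magnetics.

```python
class PowerControlMatrix:
    """Toy model of a coincident-pulse matrix of square-loop cores.

    Each crosspoint (row, column) holds one core. A core changes state only
    when its row wire and its column wire are pulsed together. The stored
    state is simply retained, standing in for nonvolatile core storage.
    """

    def __init__(self, rows, columns):
        self.rows = list(rows)
        self.columns = list(columns)
        # False = core reset (logic group off), True = core set (group on).
        self.core_set = {(r, c): False for r in self.rows for c in self.columns}

    def pulse(self, row, column, turn_on):
        """Apply coincident pulses to one crosspoint, e.g. ("J", 2)."""
        if (row, column) not in self.core_set:
            raise KeyError(f"no crosspoint {row}-{column}")
        self.core_set[(row, column)] = turn_on

    def powered(self):
        """Crosspoints whose logic groups currently receive voltage."""
        return sorted(k for k, is_set in self.core_set.items() if is_set)


matrix = PowerControlMatrix(rows="JK", columns=[1, 2, 3])
matrix.pulse("J", 1, True)    # Logic Group A energized
matrix.pulse("J", 2, True)    # Logic Group B energized
matrix.pulse("J", 2, False)   # Group B suspected faulty: pulse J-2 off
matrix.pulse("K", 1, True)    # turn on the spare group
```

After this sequence, `matrix.powered()` lists only crosspoints J-1 and K-1, which corresponds to Group A and the spare being energized while Group B is removed from the raw supply.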
A carefully designed system for controlling power would probably
simplify data-switching problems since a failed logic unit would not
deliver any erroneous signals to adjacent units. As pointed out earlier,
power can be conserved by turning off unused units. A simple method for
doing this by means of a series transistor is described in a very recent
article by Clift. 42 Note that this could be the same transistor used in
the regulator, or a separate unit. At least one commercial computer
already uses power switching to conserve power within an integrated
circuit memory. 253
7. Weight and Power Required for a Distributed Power Supply
As an example of how much one would have to pay in terms of weight
and power loss for a distributed supply system, we have considered a single-
voltage 20-watt supply. A well-regulated conventional supply would weigh
350-500 grams and consume about 22 watts from the raw supply under nominal
input-voltage conditions. In this power range, the chop frequency would
probably be approximately 1000 Hz. Weight could be reduced by increasing
the frequency, but efficiency would suffer.
Small supplies with oscillating converters can be made light and
efficient. As an example, consider the power supplies built at SRI for
the NASA PIONEER experiments. These units have the following speci-
fications:
Input voltage: 28 ± 9 volts, 80 mA nom. (3.1 watts)
Output voltages: -3, +5, +12, +2.5
Regulation factor: over 10,000
Chopper frequency: 30-40 kHz
Efficiency: 85% under nominal conditions, input may vary
from 19 to 37 volts
Power output: 2.6 watts, 90% of the power being in the
12-volt circuit
Weight: 55 grams, including transformer and filter
capacitors.
We have estimated that a 1-watt, single-voltage unit would weigh
about 25 grams.
Four degrees of split-up of the power supply have been considered;
a 1-section, a 16-section, a 36-section, and a 64-section system. The
individual supply sections would have to deliver slightly over 1 watt
for the 16-section unit, and about 1/3 watt for the 64-section.
Figure B-3 is a plot of the weight of the distributed system and the
amount of power needed for operation. Note that even for a 64-element
system, the weight is less than triple that for a single supply. In
making this estimate, we have assumed the transformer and capacitor
weights per unit would decrease from 10 grams for a 16-section system
to about 5 grams for a 64-section system, that the semiconductor weight
would decrease from 12 grams to about 4 grams, and that the core-switch
weight would be constant at about 4 grams. Efficiency was assumed constant
at 85 percent, except that each unit consumes about 10 mW for voltage-
reference circuits.
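The estimate above can be reproduced with a few lines of arithmetic. The sketch below is one reading of the stated assumptions, not the authors' worksheet: it takes the 12-gram and 4-gram semiconductor weights to apply to the 16-section and 64-section cases respectively, and it treats the 10 mW per unit as power drawn from the raw supply in addition to the 85-percent conversion loss. All names are invented for the example.

```python
TOTAL_LOAD_W = 20.0     # single-voltage supply considered in the text
EFFICIENCY = 0.85       # assumed constant for every section count
REFERENCE_W = 0.010     # 10 mW per unit for voltage-reference circuits

# Per-unit weights in grams:
# (transformer and capacitors, semiconductors, core switch)
UNIT_WEIGHTS_G = {
    16: (10.0, 12.0, 4.0),
    64: (5.0, 4.0, 4.0),
}


def system_weight_g(sections):
    """Total weight of a supply split into `sections` identical units."""
    return sections * sum(UNIT_WEIGHTS_G[sections])


def input_power_w(sections):
    """Raw power drawn: conversion loss plus per-unit reference power."""
    return TOTAL_LOAD_W / EFFICIENCY + sections * REFERENCE_W


def load_per_section_w(sections):
    """Power each individual section must deliver."""
    return TOTAL_LOAD_W / sections


for n in sorted(UNIT_WEIGHTS_G):
    print(n, system_weight_g(n), round(input_power_w(n), 2),
          round(load_per_section_w(n), 2))
```

Under these assumptions the 16-section system weighs 416 grams and the 64-section system 832 grams, which is consistent with the statement that even the 64-element system is less than triple the 350-500 grams of a single supply. The per-section loads come to 1.25 watts and about 0.31 watt.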
FIG. B-3 WEIGHT AND POWER REQUIREMENTS FOR 20-WATT DISTRIBUTED SUPPLY
(weight and input power plotted against number of logic groups)
Note that the small supplies can operate efficiently with a high
oscillation frequency (50 kHz or more), allowing the use of very small,
light transformers and smaller filter capacitors. With the possible
exception of the series regulator transistor, power dissipation is low
enough so that integrated circuits could be used for control and reference
amplifiers. Some integrated circuits are already available for this
purpose. 339
8. Conclusions
The distributed power-supply notion can be applied to the design of
a system of "self-powered" logic groups using individual dc to dc conver-
ters with very slightly more power loss than conventional single-supply
systems. The weight, including on/off control circuitry, should be only
about twice the weight of a single-unit supply, since the chop frequency
of the individual units is higher than for the large (single) unit.
The design of distributed power-supply systems is closely allied with
the design of the logic groups themselves. Unless the data-interface
circuits are ac coupled, slight differences in the relative voltages of
the supplies will cause loss of operating margins for data signals between
logic groups.
The average power consumption of a distributed power-supply computing
system can be reduced by turning off unused units. On/off cycling should
not present a serious reliability problem.
Although the work we have done deals mainly with dc to dc converters,
and indicates that a workable system could be made with such units, other
possibilities such as ac distribution systems should be investigated.
Either sine-wave or pulse-waveform ac systems could be used, but a careful
comparison needs to be made between dc, sine-wave, and pulse-waveform
systems. The desirability of using rough preregulation of the raw power
source also needs attention.
The use of a magnetic core for memory in the on/off switches for power
supply is feasible, but perhaps the possibility of using a thin-film memory
element should be investigated, since thin-film elements would allow con-
struction of small, reliable switches for power-control purposes.
Appendix C
APPLICATION OF MAGNETIC LOGIC
1. Introduction
In this section of the report we give the results of a brief survey
and evaluation effort that we have undertaken to ascertain the role, if
any, that magnetic logic should have in future spaceborne computers
employing integrated semiconductor circuits. We think it is important
to conduct such a survey in a program aimed at ultrareliability because
digital magnetic elements have proven to be highly reliable in their
application to memories and logic. To anticipate our conclusions, we find
that it is feasible to use magnetic circuits in conjunction with inte-
grated semiconductor circuits in a high-speed (MHz region) ultrareliable
computer to effect a substantial increase in the overall system relia-
bility. We reach this conclusion chiefly on the basis (1) that magnetic
elements per se are extremely reliable, and (2) that the speed of operation
of magnetic logic circuits is adequate for performing certain functions.
By way of contrast to this general conclusion, an example is described
wherein an attempt to apply magnetic logic circuits leads to a question-
able increase in reliability.
In addition to the reliability of magnetics, there are other charac-
teristics that make their use attractive. Magnetic circuits provide
nonvolatile information processing with no standby power; they are
essentially immune to many types of noise; and they are highly resistant
to all nuclear-radiation components, both steady-state and transient.
One approach to the use of magnetics in future space programs is to
follow the past pattern of decentralization of functions rather than
using a complex centralized computer. In such an approach magnetics are
used in systems (sub-systems) that are essentially separate from the
systems that use integrated semiconductors. A magnetic programmer,
a time-sequencer, and an A/D converter* are examples of separate magnetic
systems. While this is a concept that bears serious consideration, it is
not one that we shall deal with here.
In what follows, there is a brief discussion of the reliability of
magnetics, and then possible approaches to applying magnetics are dis-
cussed. We have categorized these approaches according to the kinds of
functions magnetics can perform in conjunction with high-speed integrated
semiconductor circuits.
The first category is that of monitoring the performance of the
semiconductor circuits to determine when (and perhaps what) corrective
measures are required. Two types of monitors are discussed: one is a
sophisticated current meter and one performs digital operations that are
the same as certain portions of the semiconductor unit.
The second category we discuss is that of magnetic switches.
Several different types and functions of switches are included.
The third and last category is that of a "hard-core" backup control.
In certain of these backup-control schemes the nonvolatile characteristic
of magnetics is essential. All of the schemes rely on the long life of
magnetics.
2. Reliability of Magnetics
The basic premise that motivated this survey is that magnetic
elements are ultrareliable. As a part of a recent SRI project for NASA
an attempt was made to determine a reliability figure for magnetic
ferrite elements. 122 The conclusion was that ferrite cores do not fail
in service, so long as they are operated within their physical limits,
even when operating at temperatures up to at least 250°C.† The reason
that no quantitative reliability figure has been established is that
there is no failure data. One method that has been used is to estimate
the number of cores used in memories that a particular company has pro-
duced, and then estimate the number of failure-free operating hours for
these cores. This number of core-hours is then taken to be the mean-
time-between-failures. For ferrite cores the MTBF has been variously
estimated to be between 10^10 and 10^12 hours (i.e., 10^-10 to 10^-12
failures/hour). 340,294
* For one published example of a magnetic control unit, see Ref. 277.
† More specifically, this conclusion is for manganese-magnesium ferrite
cores. Most square-loop ferrite cores are of this type and hence
failure information (really the lack of it) could be gathered only
on this type of ferrite. There is no known failure mechanism for
these cores.
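The core-hours method described above amounts to a multiplication followed by a reciprocal. The sketch below illustrates it with an invented fleet size and operating time; the function names and the numbers are assumptions for the example only. Because no failures have been observed, a figure obtained this way is better read as a rough lower estimate of the MTBF than as a measured value.

```python
def core_hours(cores_in_service, failure_free_hours_each):
    """Accumulated failure-free operating time, in core-hours."""
    return cores_in_service * failure_free_hours_each


def failure_rate_per_hour(mtbf_hours):
    """Constant failure rate corresponding to a given MTBF."""
    return 1.0 / mtbf_hours


# Hypothetical fleet: one million cores, each run 10,000 hours without failure.
mtbf = core_hours(1_000_000, 10_000)
rate = failure_rate_per_hour(mtbf)
```

Here `mtbf` is 10^10 core-hours and `rate` is 10^-10 failures per hour, the lower end of the range quoted in the text.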
A reliability analysis of a system that uses magnetics must include
the reliability of windings, connections, and associated nonmagnetic
components. It appears that it is really these parts that determine
the reliability of a magnetic system. As a part of the NASA work
mentioned above, a reliability analysis was made of a magnetically im-
plemented digital system. .24 For this particular purpose the following
failure rates were supplied by Langley Research Center:
Part                           Failures per 10^8 Hours*
transistor (discrete part)     2
ferrite core (wound)           0.01 (assumed value for worst case)
solder joints (inspectable)    0.01
Note that the failure rate for the ferrite core was supplied as a worst-
case value, and is consistent with the core-failure data above. We see
from this table that inspectable solder joints are highly reliable--this
is important in a core-wire system. Furthermore, redundant joints could
be used to further increase reliability if necessary. 156, 345
* Rates are based on high component reliability employing 100-percent
screening for known weaknesses, approved derating policies (stress
level assumed to be 50 percent), and approved fabrication techniques.
Failure rates correspond to 65°C maximum ambient and 15°C temperature
rise for part.
It is instructive to compare the failure rate of ferrite cores with
that of integrated semiconductor circuits; the latter may be as low as
2 failures per 10^8 hours. This number comes from a recent survey conducted
by TRW Systems Inc., in which many users of integrated circuits were
contacted. The average failure rate assigned to integrated-circuit
devices that are 100 percent screened is 7 failures per 10^8 hours, and
the lowest rate reported is 2 failures per 10^8 hours. We estimate that
a typical integrated circuit to which these rates are applicable comprises
about 30 components. (In this context it is important to note that a mag-
netic implementation of a function often requires fewer components than
does a semiconductor version.)
A comparison of these failure rates indicates that a single ferrite
toroid may have from 2 to 5 orders of magnitude lower failure rate than
an integrated circuit. We recognize that a comparison such as we are
making here is open to question from several points of view. Neverthe-
less, the large difference in the failure rates ascribed to the ferrite
core and an integrated circuit is significant and supports the contention
that magnetics should be considered for ultrareliable applications.
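The "2 to 5 orders of magnitude" figure can be checked directly from the rates quoted in this section. The sketch below uses the integrated-circuit rates of 2 and 7 failures per 10^8 hours and the ferrite-core range of 10^-10 to 10^-12 failures per hour; the variable names are invented for the example.

```python
import math

HOURS_BASE = 1e8  # rates in the text are quoted per 10^8 hours

# Lowest and average integrated-circuit rates, converted to failures/hour.
ic_rates_per_hour = [2 / HOURS_BASE, 7 / HOURS_BASE]

# Ferrite-core rates corresponding to an MTBF of 10^10 to 10^12 hours.
core_rates_per_hour = [1e-10, 1e-12]

# Narrowest gap: best integrated circuit against the worst-case core.
smallest_gap = math.log10(min(ic_rates_per_hour) / max(core_rates_per_hour))

# Widest gap: average integrated circuit against the best-case core.
largest_gap = math.log10(max(ic_rates_per_hour) / min(core_rates_per_hour))
```

The narrowest gap is a little over 2 orders of magnitude and the widest a little under 5, which matches the range stated above.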
3. A Magnetic Monitor Concept
a. A Metering Monitor
One method of using reliable magnetic circuits to increase system
reliability is to use magnetics to monitor critical electrical variables
within the main computer, such as power supplies and high-current pulse
drivers.
The nature of the monitor and the functions it performs can con-
ceptually take several forms. One possibility is to use a magnetic in-
strumentation system--a metering monitor--that would measure voltages and
currents and give an electrical signal indication when an out-of-
specification condition is detected. The voltages and currents could
be dc, pulse, or both. That such a magnetic system is feasible has
recently been demonstrated by the development of a magnetic telemetry
system 122, 123 wherein currents from sensors are digitized and, in effect,
commutated.* Currents as low as 10 microamperes and as high as one-quarter
ampere can be measured with 1-percent accuracy. Voltages are measured by
determining the value of the current flowing through a known resistance.
* The sensor currents per se are not commutated in this system and no
analog amplification is required. This magnetic implementation of
telemetry does not follow the standard organization for such a
system.
b. An Information-Sampling Monitor
Another possible way to use magnetic circuitry to monitor performance
is to detect errors by processing selected data in parallel with the main
computer. In the event that an error is present in the sampled block of
information, this error will propagate through the high-speed system until
the magnetic unit has completed its testing of the block. A means of
stepping backward in the program and starting the processing again after
the equipment fault has been corrected is implied in this concept.
Alternatively, the system must tolerate the incorrect processing that
occurs during the error-detection interval.
A variation of the sampling technique is to generate error-detecting
codes in the fast processor and use the magnetic processor as a reliable
error detector, e.g., a parity or arithmetic-code checker. If such
checking is too slow, the magnetic circuits may be used for a
delayed-output verification of a high-speed checker.
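The code-checking variation can be illustrated with a simple even-parity check applied to a sampled subset of words. This is only a sketch of the idea: the 24-bit word width, the function names, and the sampling interval are assumptions for the example, and nothing here models the magnetic circuitry itself.

```python
def parity_bit(word, width=24):
    """Even-parity bit for the low `width` bits of `word`."""
    return bin(word & ((1 << width) - 1)).count("1") % 2


def encode(word, width=24):
    """Fast-processor side: attach a parity bit to a data word."""
    return word, parity_bit(word, width)


def check(word, stored_parity, width=24):
    """Checker side: True if the word still agrees with its parity bit."""
    return parity_bit(word, width) == stored_parity


def sampled_check(coded_words, sample_every):
    """Check only every `sample_every`-th word, as a slow checker might.

    Returns the indices of sampled words that fail the parity check.
    """
    return [i for i in range(0, len(coded_words), sample_every)
            if not check(*coded_words[i])]


stream = [encode(w) for w in (0b1011, 0b0110, 0b1111, 0b0001)]
# Inject a single-bit error into the third word after encoding.
stream[2] = (stream[2][0] ^ 0b0100, stream[2][1])
```

Checking every second word (`sampled_check(stream, 2)`) catches the corrupted word at index 2, while checking every third word misses it. This is the trade-off the text points out: a slower checker covers less of the data or reports errors later.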
4. Implementation of Magnetic Monitor
There are three parts into which the implementation of an information-
sampling magnetic monitor can be logically divided: input and output
buffers, a comparing circuit, and the processor. The processor is the most
complex portion of the magnetic unit. There are several possible magnetic
approaches to implementing the processor; these can be categorized into
semiconductor-magnetic and all-magnetic logic schemes. Typically, semi-
conductors are used with both types of logic schemes, but in the case of
the all-magnetic logic the semiconductors are used only to generate clock
pulses. It is possible, however, to generate clock pulses from an ac
source (e.g., a sine or square wave) without using semiconductors.*2s, 17
Because the intent here is to achieve ultrareliable operation, the use
of a clock-pulse source that does not require semiconductors should be
considered. The semiconductor-magnetic logic schemes are usually faster
than the all-magnetic logic schemes, and within the all-magnetic logic
category the nonresistance schemes are usually faster than the resistance
schemes. Resistance schemes generally have greater tolerance to tempera-
ture and drive variations than do the nonresistance schemes. 26
In this survey we cannot go in depth into the characteristics and
attributes of the various technologies that are applicable to the magnetic
processor. However, the following magnetic digital systems represent
technologies that may be applicable to the processor: Sperry Gyroscope
Magloc Computer; 338, 267 Burroughs D210 Magnetic Computer; 319, 340 Univac
Ferractor-type Magnetic Computer 30 (principally of historical interest);
Di/An Controls Core-Transistor-Logic systems; 166, 342 Stanford Research
Institute MAD Feasibility Machine; 51, 52 IBM Flux Logic Evaluation
Assembly (FLEA); 347, 348 Stanford Research Institute Magnetic Versatile
Information Corrector (MAVERIC); 72 Stanford Research Institute Atomic
Reactor Control Module; 45 and Bell Telephone Laboratories Magnetic Stored
Program Computer. 205 Because of the speed limitations of magnetic circuits,
it is pertinent to note that the Sperry Magloc Computer has a 300 kHz bit
rate, an addition time of 86 µs, and a multiplication time of 3.87 ms for
24 bits. The Burroughs D210 has a 100 kHz bit rate, an addition time of
30 µs and a multiply time of 570 µs for 24 bits. For the benefit of the
reader who is interested in the details of magnetic techniques, we have
listed bibliographies on magnetics as Refs. 27, 120, 230, and 19.
In addition to the low-speed processor, the information-sampling
magnetic monitor comprises a comparing circuit, input and output buffers,
sampling gates, and synchronizing circuitry. These do not appear to be
difficult to implement in a manner that will be compatible with the
technology used in the magnetic processor. However, in the joining of the
integrated semiconductor units and magnetic units, the interface problem
must be carefully considered. It is probable that signal amplification
will be necessary in order to drive the magnetics from the semiconductor
units. If the number of transistor amplifiers becomes a significant number
when compared to the number of components in the processor, then the re-
liability increase achieved by a magnetic implementation over an integrated
semiconductor version can become small or nonexistent. Another way of
saying this is that the magnetic-monitor approach can add significantly
to the system reliability if the functions performed by the processors
are moderately complex. (This aspect is also discussed in the section
on interconnection switches.)
5. Magnetic Switches
The function that we envision for a magnetic switch is to direct
power or information (data) to or from a module on command of an electrical
signal, e.g., a signal from a magnetic monitor. Magnetic switches are
attractive because they offer high reliability, dc isolation (for noise
immunity and for coupling between subsystems that have different signal
levels), low susceptibility to noise from external sources, and nonvolatile
storage of information. In the case of an information switch the power
level will be low and the rate of change of the information could be in the
low MHz region. For the power switch the typical power level will be
higher than for the information switch, and the frequency will be in the
low or tens of kHz range. For both types of switches, low speed in the
switchover from one module to another one is acceptable, and switching
will be required only when a malfunction has developed in the computer.
The switches that are described below are of both the information and
power types. With one exception, the investigation to date indicates that
these switches merit further investigation.
a. Converging Switch
The converging switch is the name given to a unit that gives access
to one of N information sources from any one of K data processors. 32 The
processors supply the switch with the address of an information source.
The converging switch contains address-decoding circuits and magnetic-
gating structures. The magnetic-gating structure is a multipath ferrite
device that operates on the balanced-circuit principle. An information
signal upsets the balanced condition to store or transfer a logic 1. The
particular device reported in Ref. 32 uses a low-coercivity ferrite material
(between 0.15 and 0.2 oersteds) and requires 30 mA to upset the balance.
An early production model built by Western Electric connects to 128
sources, each with a 26-bit capacity; it has an access time of 1 µs and
a cycle time of 2 µs. The number of semiconductors required for the
switch is not reported. The converging switch was originally developed
for a missile application and is now being used in a modified form within
the Bell System.
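The addressing function of the converging switch can be sketched as follows. This is a toy model, not the Western Electric design: the class name, its `read` method, the number of processors, and the source contents are all hypothetical, and only the logical selection of one of N sources by any of K processors is represented.

```python
def address_bits(num_sources):
    """Address lines needed to select one of `num_sources` sources."""
    return max(1, (num_sources - 1).bit_length())


class ConvergingSwitch:
    """Toy model: any of K processors reads one of N sources by address."""

    def __init__(self, sources, num_processors):
        self.sources = list(sources)
        self.num_processors = num_processors

    def read(self, processor, address):
        """Return the contents of the addressed source for one processor."""
        if not 0 <= processor < self.num_processors:
            raise ValueError("unknown processor")
        if not 0 <= address < len(self.sources):
            raise ValueError("address out of range")
        return self.sources[address]


# 128 sources as in the production model; two processors is an assumption.
switch = ConvergingSwitch(sources=range(128), num_processors=2)
```

For the 128-source model mentioned in the text, `address_bits(128)` gives 7, so each processor would need to supply a 7-bit address.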
b. Interconnection Switch
This switch exists as a paper design and is the result of work under-
taken by SRI for NASA. s14 The switch is part of a redundant computer
that is implemented by integrated semiconductor circuits. The inter-
connection switch is used in two different places. In the first appli-
cation, one of a set of three arithmetic units is connected to a processor,
and in the second application one of a set of three memories is connected
to the processor. Information flows through the switch in both directions
at a 1 MHz rate; voting also takes place in the switch. Toroids and
multiaperture devices are used in the switch implementation. The investi-
gation of this switch revealed that it was possible to perform the necessary
functions magnetically, but there are undesirable features. Perhaps the
major problem lies in the fact that a transistor driver is required at
each input terminal to the magnetic unit in order to increase the signal
power level sufficiently. This results in a large number of added semi-
conductors. Another major problem arises from operating at 1 MHz--the
power dissipation is high. (By way of reference, it takes about 0.1 watt
to switch a 30/50 memory core at 1 MHz.) The attributes of magnetic
implementation of this switch as compared to a semiconductor version of
the same switch are summarized in the table below.
Advantages of Magnetic Version         Liabilities of Magnetic Version
Reliability is increased somewhat      Power is greater (10 times at 1 MHz,
Has nonvolatile characteristic           equal at 100 kHz)
Number of semiconductors used is       Weight and volume are greater
  between 1/2 and equal to number
  required in an all-semiconductor
  version
This switch is an illustration of an application where the use of a
magnetic implementation is questionable. The functions required of the
magnetics are quite simple, and the interface between the integrated cir-
cuits and the magnetics therefore becomes a major problem.
It is apparent from this example that if a block of magnetic circuits
is interposed between blocks of integrated semiconductor circuitry, the
functions required of the magnetics should be at least as complex as those
that could be performed by the equipment in the interface circuits. An
exception to this generalization arises if the power level of the inte-
grated circuits is adequate to drive the magnetics directly without added
amplifiers. Balanced magnetic circuits have reportedly operated with an
input current of only 30 mA (but probably at less than 1 MHz). 32 It is
also worth noting that 30 mA is about the amount of current that is re-
quired to "tip" the flux in coherent-rotation switching in a metallic thin
film. 113 Since it is possible to get 30 mA (and more) from integrated
circuits that include a line driver as one of the available units, these
integrated-circuit units could be used to drive magnetic logic circuits.
(These units have an amplifying junction that is on the same silicon chip
as other junctions and components.) The reliability of an integrated-
circuit system is more a function of the number of chips than of the number
of junctions, so it may be possible in some systems to incorporate a line
driver on the chip without reducing the reliability of the integrated-
circuit portion of the system. This would mean that the semiconductor-
magnetic interface would be effected without a reliability penalty at this
point in the system.
It should be pointed out that the balanced magnetic scheme cited above
is a bipolar scheme that equates a logic 1 with a particular polarity of
output signal and a logic 0 with the opposite polarity signal. This
bipolar characteristic gives rise to some problems; for example, it
cannot be used as a switch to connect and disconnect a pulse train.
Additionally, the "balanced" feature of the scheme means that a clock
source switches flux repeatedly in a magnetic structure, irrespective of
the logic state of the circuit. This can mean high power consumption.
The problem of the semiconductor-magnetic interface should be
further investigated.
c. Data-Path Switch
The switch to be described has the interesting features of dc
transmission with dc isolation. Figure C-1 is a sketch of the switch.
[Figure C-1 (drawing TA-5580-45): schematic showing the input, the oscillator with its feedback loop, the square-loop core with its control current, and the dc power monitor.]
FIG. C-1 DATA-PATH SWITCH
In this switch the oscillator is powered from the same module that
supplies the input data to the switch, and the output transistor is
powered by the module that receives the data. When an input signal is
received the circuit will oscillate if the feedback conditions are
correct. The oscillator output is coupled through a transformer and,
upon rectification, it becomes the input to the following stage. Units
operating in a manner similar to that just described are available from
Dynamics Instrumentation Company (floating digital drivers) and from
Radiation Incorporated (modular solid-state telegraph relay). In the
data-path switch shown above we have added to the circuit a magnetic
core that controls the feedback of the oscillator. The state of the
magnetic core determines whether or not the circuit will oscillate upon
receipt of an input signal; therefore, the switch can be activated under
logic control. This means the switch could be used in redundant systems
to switch modules in and out of the system upon receipt of an electrical
signal. The details of this switch have not been worked out, but we think
that it is feasible and potentially useful for reliable computer applica-
tions and that it bears further consideration. The advantages we see for
such a switch are the following.
(1) Dc coupling is achieved so that logic levels, rather than
pulses, are transmitted.
(2) Isolation is complete--there are no grounding problems.
(3) There are no power-supply interaction problems.
(4) Low-power, coincident-current setting of the magnetic
element is possible because only feedback power is
controlled.
(5) The state of the switch (on or off) is nonvolatile (with
power failure) because of the magnetic element.
(6) It is easy to monitor data flow by adding a winding to
the output transformer, or by coupling through a small
capacitor.
(7) The geometry of the magnetic control element is simple;
that is, a toroid is adequate and a multipath device is
not required.
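
The logical behavior of the proposed switch can be sketched in a few lines. The following is a behavioral model only, written under the assumption that the switch passes a dc level when, and only when, an input level is present, the core enables feedback, and both modules are powered. The class and method names are hypothetical, and no analog detail (oscillator frequency, rectifier, drive currents) is represented.

```python
from dataclasses import dataclass


@dataclass
class DataPathSwitch:
    """Behavioral sketch of the data-path switch; all names are hypothetical."""

    core_enabled: bool = False   # state of the square-loop core (nonvolatile)
    source_powered: bool = True  # oscillator is powered by the sending module
    sink_powered: bool = True    # output transistor is powered by the receiver

    def set_core(self, enabled: bool) -> None:
        # Logic control: the core state decides whether feedback is possible.
        self.core_enabled = enabled

    def power_cycle(self) -> None:
        # Power is lost and then restored; the core state is left untouched
        # because the magnetic element needs no power to hold its state.
        self.source_powered = False
        self.sink_powered = False
        self.source_powered = True
        self.sink_powered = True

    def output(self, data_in: bool) -> bool:
        # The oscillator runs only with an input level present, feedback
        # enabled by the core, and power from the sending module.
        oscillating = data_in and self.core_enabled and self.source_powered
        # The rectified oscillator output is a dc level at the receiver.
        return oscillating and self.sink_powered
```

In this model a switch that has been enabled with `set_core(True)` still passes data after `power_cycle()`, which corresponds to advantage (5) above.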
Another switching method that has merit for data-path applications
makes use of a multiaperture magnetic element (a MAD). Only a single
highly reliable oscillator is required for the entire system for this
switch implementation; however, an amplifier is required between the
input to the MAD and the integrated semiconductor circuitry. The required
dc output is obtained by rectification, as in Fig. C-1.
d. Power Switching
In a redundant system we believe it is beneficial to switch power on
and off to individual modules for several reasons: (1) it simplifies
information switching; (2) it simplifies testing; and (3) it permits a
reduction in operating power. It may be possible to replace information-
path switching with power-supply switching. Magnetic technology is an
attractive candidate here because of its reliability and nonvolatility.
If all the power goes off and is subsequently restored, memory of the
information states at the subsystem level immediately preceding the
failure will be valuable information.
Since a magnetic core responds to alternating current, magnetic im-
plementation of a power switch means switching ac power. For this reason
the only real differences between a magnetic switch for power and a switch
for data pulses is the power level and frequency. A module in a computer
typically requires dc power, so using a magnetic switch requires that
rectification be carried out in each module. This rectification require-
ment is consistent with the concept of individual power supplies for
modules rather than one big supply for the entire system. This concept
is discussed in Appendix B of this report.
Because of the similarity between an information-flow switch and a
power switch when magnetic implementation is employed, the data-path
switch described above is a candidate for an ac power switch. Likewise,
the MADs that are used in the interconnection switch and in one version of
the data-path switch can be used for controlling the flow of ac power.
The suitability of these approaches has not been evaluated in detail in
this survey.
A switching method that was not discussed as an information-flow
switch is that of a ferroresonant circuit. 292 This type of nonlinear,
bistable circuit relies upon the change in inductance of a reactor as it
is driven into and out of the saturation region. There are two charac-
teristics of these circuits that lead us to consider them for power switching
rather than information switching: (1) they are capable of handling reason-
ably large amounts of power, and (2) they control a continuous ac wave
rather than pulses.
A particular ferroresonant circuit is shown in Fig. C-2. 135 When
T1 is saturated the inductance of the winding L1 has just the right
value to resonate with capacitor C1 at the applied carrier frequency.
The same is true for T2, L2, and C2; however, only one of the series
circuits can be resonant at a given point in time. If both circuits were to
be resonant simultaneously then the voltage drop across Cc would increase
beyond the value it has when a single branch is resonant. With this
increased voltage drop, the current magnitude possible in the two branches
is not sufficient to maintain both T1 and T2 in saturation. Therefore
only one branch can be resonant and draw a large current. The other
branch draws only a small current. Output power is obtained across
C1 or C2.
[Figure C-2: schematic showing the carrier source, the two series branches (T1, C1 and T2, C2), control inputs No. 1 and No. 2, and outputs No. 1 and No. 2.]
FIG. C-2 FERRORESONANT SWITCH CIRCUIT
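
The resonance condition stated above can be made concrete with the standard series-resonance relation f = 1/(2*pi*sqrt(L*C)). The sketch below computes the saturated inductance a branch would need for a given capacitor and carrier frequency. The component values (0.1 microfarad, 10 kHz) are hypothetical and are not taken from the report or from the cited reference; they give an inductance of roughly 2.5 mH.

```python
import math


def resonant_inductance(capacitance_f: float, carrier_hz: float) -> float:
    """Inductance in henries that series-resonates with the given capacitor."""
    omega = 2.0 * math.pi * carrier_hz
    return 1.0 / (omega ** 2 * capacitance_f)


def resonant_frequency(inductance_h: float, capacitance_f: float) -> float:
    """Series-resonant frequency in hertz of an ideal L-C branch."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


# Hypothetical branch: C1 = 0.1 microfarad, carrier at 10 kHz.
C1 = 0.1e-6
CARRIER_HZ = 10e3
L1_SATURATED = resonant_inductance(C1, CARRIER_HZ)
```

An unsaturated reactor has a much larger inductance, so the same branch is far from resonance at the carrier frequency and draws only a small current.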
Ferroresonant circuits have been the basis for digital circuits
such as flip-flops and shift registers, and have been used in steady-
state ac circuits as voltage regulators.
Up to this point we have considered using the ac power switch as a
means for controlling the dc power supplied to a module. An additional
possibility is to use such a switch to control the distribution of
clock power to the various modules. The feasibility and utility of a
magnetic clock switch need to be further investigated, as does the
possibility of applying the principle of ferroresonance to ac power
switching.
6. Backup Control
There are two types of backup control systems that we point out
here: one that is used to re-initiate operation of the integrated-
semiconductor computer after it has become inoperative, and one where
certain portions of the semiconductor computer are replaced by magnetic
units in the event of semiconductor failure. The purpose of the first
type of backup is to protect against a power failure that shuts down
the entire computer. When such a failure occurs but is not permanent,
it then becomes important to reestablish the operation of the computer
when system power is restored. It is not sufficient to simply allow the
power to be reapplied to the semiconductor logic circuits, because the
logic state that the flip-flops would assume is indeterminate. Even if
the flip-flops did assume predetermined states it would be necessary to
execute certain functions to reach the desired position in a program
and initiate the proper mode of operation. An ultrareliable magnetic
backup unit could be beneficially used to reinitiate operation of the
semiconductor circuits when power is restored. The backup unit would
execute certain primitive functions--e.g., control processor flip-flops
would be cleared to a reference state (that may be dependent upon the
point in real time at which failure occurred), and a program sequence
would be started. The return of power to the system would in itself pro-
vide the input signal required to bring the magnetic unit out of its
passive state and give it control of the entire computer. After proper
operation conditions were established, the magnetic backup unit would
be "locked out" of operation except for periodic updating. While the
magnetic unit was locked out, the operating speed of the computer would
not be impaired. In this type of backup the nonvolatile characteristic
of magnetics is an essential ingredient.
A backup control unit of this type could also be used to protect
against transient failures other than those of the system power supply.
If a transient caused malfunction of large segments of the computer so
that the utility of the computer as a whole were impaired, then the
backup could reestablish proper operation. In this mode of operation
the backup unit would be given control of the computer upon receipt
of a signal other than power supply turn-on.
In the second type of backup, upon command, a magnetic unit per-
manently replaces a faulty semiconductor unit, such as a control se-
quencer or program counter, at a sacrifice in operating speed and possibly
with a reduction in the type of functions that can be performed. This
type of control unit can be updated periodically, like the one described
above, to prepare it for operation on demand.
An interesting variation of this second type is to duplicate, with
a magnetic computer, all computations of the semiconductor computer that
are critical for the mission. This redundant computer could operate in
parallel, or it could be started when failure occurs in the main computer.
Alternatively, the concept of a redundant magnetic computer could be
reserved for missions of very long duration; use of the magnetic unit
instead of the semiconductor computer could save an appreciable amount
of power in the phase of the mission where the vehicle is at a great
distance from earth and high-speed computation is not essential.
The two types of backup controls described above are applicable
at the system level. The principle of magnetic backup can also be
applied at the circuit level. At this level magnetic circuitry could
be switched in and out as it is at the system level, but this would
probably result in too complex a switching system to actually increase
system reliability. A better method of supplying a backup at the circuit
level is to use the magnetic elements in an intimate mix with the
semiconductors. We propose, for example, in conjunction with certain
critical semiconductor flip-flop circuits, that toroids be set and re-
set according to the state of the flip-flop. The primary purpose of the
toroids is to remember the state of the flip-flop and maintain this in-
formation in the event of power failure. The toroids are not essential
to the circuit operation except for this nonvolatile characteristic.
A circuit that does operate in this manner at modest speeds has been
reported by Harry Diamond Laboratories. 19s An alternate method of
providing the same nonvolatile feature is to use a portion of the main
memory to store the state of the flip-flops. There are advantages and
disadvantages to both approaches that need to be further evaluated.
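
The toroid-shadowing idea can be illustrated with a small behavioral model. This is a sketch under simplifying assumptions and is not the Harry Diamond Laboratories circuit: the toroid is assumed to follow every write to the flip-flop, a power failure is assumed to leave the flip-flop in a random state, and recovery is assumed to consist of copying the toroid state back. All names are hypothetical.

```python
import random


class ShadowedFlipFlop:
    """Flip-flop whose state is mirrored in a nonvolatile toroid (sketch)."""

    def __init__(self) -> None:
        self.q = False       # volatile semiconductor state
        self.toroid = False  # nonvolatile magnetic copy of that state

    def write(self, value: bool) -> None:
        self.q = value
        # The toroid is set or reset according to the flip-flop state.
        self.toroid = value

    def power_fail(self, rng: random.Random) -> None:
        # When power returns, the flip-flop state is indeterminate;
        # the toroid is unaffected.
        self.q = rng.choice([False, True])

    def restore(self) -> None:
        # Recovery step: copy the remembered state back into the flip-flop.
        self.q = self.toroid
```

The alternative mentioned above, storing flip-flop states in main memory, would replace the `toroid` attribute with a memory write on each state change or at periodic intervals.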
7. Conclusions and Recommendations
The reliability of magnetic-logic circuits is superior to that of
integrated-semiconductor circuits by several orders of magnitude, and
therefore magnetic logic should be applied to future ultrareliable
spaceborne computers. The high reliability of magnetic-logic systems
derives from the fact that ferrite-core failures are unknown and the fact
that inspectable solder joints are very reliable. In addition to long
life, magnetics have other characteristics that are important in ultra-
reliable systems: they require zero standby power and are therefore
nonvolatile, they are immune to most kinds of noise, and they are
radiation-tolerant.
The operating speed of magnetic circuits is a restriction upon
their general application in the megahertz region. However, this brief
survey indicates that there are substantial areas where magnetics can
and should be used in conjunction with high-speed integrated-semiconductor
circuits. We have discussed several different methods of applying mag-
netics in concept, and have given an indication of the kinds of schemes
and circuits that can be used.
We recommend that the following steps be taken:
(1) Further effort should be devoted to strengthening the
concepts presented here, and analytical and experimental
work should be carried out to support or refute the ideas
and circuits discussed.
(2) Because this survey is admittedly incomplete in scope and
depth, a continuing effort of this nature is recommended.
(3) In support of (1), we recommend that one or two specific
concepts be selected for a detailed design feasibility
study. Such a study should include circuit-level problems
of speed, synchronization, power required, and interface
compatibility with integrated-semiconductor circuits.
Appendix D
A SURVEY OF THE PUBLISHED LITERATURE ON THE ATTAINMENT
OF RELIABLE SYSTEMS THROUGH THE USE OF REDUNDANCY
1. Introduction
The primary thesis of this program for developing means for achieving
ultrareliable spaceborne computers has been that such systems can only be
achieved through the judicious use of redundancy. While it is of course
essential that component reliability be as high as possible, and that all
elements be operated well within the physical tolerances that guarantee
their continued operation, it is also necessary to provide backup facili-
ties that allow systems to tolerate internal failures, whether transient
or permanent, so that the computational mission can be successfully carried
out. For as the complexity of computer systems increases--as represented
most obviously by the enormous increases in the number of components re-
quired--almost any level of guaranteed reliability of individual elements
becomes insufficient to provide a satisfactory probability of successful
mission completion. These observations are particularly pertinent in the
case of extended spaceborne missions where the possibility of unprogrammed
maintenance and inspection routines is severely limited, and where use
of the radio link for such activities cannot succeed unless careful
anticipatory provisions have been made concerning
the types of spares to be installed and the interconnecting links
for installing and removing them on the detection of a fault.
These arguments are not new with this program, of course, and their
general validity has been recognized for at least a decade. As a result,
a great deal of effort has been expended in understanding just how
redundancy can be "judiciously" applied to systems, that is, in such a
fashion that the overall reliability is actually increased. A not
insignificant factor that has compounded the problem of evaluating such systems
has been the paucity of good analytical tools for actually calculating the
reliability of complex configurations, or even of providing good lower
bounds so that overall mission probabilities can be estimated. Finally,
a new technological factor has appeared that is effectively changing the
rules of the game--we refer to the imminent widespread availability of
extremely reliable, very small, batch-fabricated elements having extremely
low power dissipation. The availability of such elements implies that
the number of components involved is not the critical factor in measuring
system cost--whether from energy, volume, or weight points of view--and
makes it possible to seriously consider large ratios of redundancy, if
indeed the resultant increase in reliability of the overall system can
be demonstrated.
As a result of the unquestioned relevance of redundancy techniques
in the construction of extremely reliable systems, the technical literature
is replete with reported activities that attempt to cover one aspect or
another of the large area. An important aspect of this program, then, has
been to instigate a general and continuing appraisal and review of these
activities that have been reported and are available to us. Much of the
literature in this field has not been of a high technical quality, and
much of it is no longer relevant to today's technology. Hence, in making
a survey of activities in reliability theory it is necessary to prune away
much material in order to highlight the efforts that do seem to be
important.
Our goal in this section, then, is to present a critical and select-
ive survey of the literature that is relevant to the attainment of reli-
able networks and systems through the judicious use of redundant struct-
ures. In the following section we attempt to restate briefly the point
of view which leads to the particular categorization of topics that we
have chosen. Our concluding section contains a brief discussion of the
various activities that have been reported in the different categories
of reliability technology.
2. Summary of Subject Areas
a. Overview
The intention of this survey is to provide the interested reader
with an introduction to the literature on redundancy techniques by several
means. First of all, we provide an outline, or categorization of topics,
which serves to partition the field into the various technical areas
against which the reader may sharpen his own perception and conclusions
regarding the valid lines of technical inquiry. Admittedly, the outline
we have chosen is one relevant to the concepts of modularity and recon-
figurability that we have regarded as essential to the program addressed
in the main body of this report.
Secondly, within this outline of topics we make reference to specific,
selected articles* that are in general easily available to the researcher;
in addition, we briefly summarize their contribution to the technology.
The conclusions of the report (Sec. IV) essentially reflect the
conclusions of this literature compilation. Although valuable technical
contributions have been made in the development of systems that are more
reliable through the application of redundancy techniques, and although
these contributions are increasing in number and quality, it is clear
from a survey of the published literature that much remains to be done
before efficient and demonstrably reliable systems can be realized. Many
unsolved problems were uncovered as a result of this survey; a sampling
of these is contained in the listing of recommendations for future
research--Sec. IV.B.
* Included in the referenced articles are results from an earlier survey
of Soviet activities in the field of reliability theory. 28s This
survey, in the form of a preliminary draft, was presented at the
Workshop on the Organization of Reliable Automata earlier this year. It
was supported partially by this program, and partially by the Air
Force Cambridge Research Laboratories. Since the Soviets have also
been quite productive in addressing themselves to these technical
areas; and since most of the work is available in translation to
researchers in the United States; and finally, since work in this
country and in the Soviet Union represents the overwhelming majority of
work done anywhere in these subjects, the Soviet references have been
freely included in this survey whenever appropriate from the technical
point of view.
It will be noted that almost all of the referenced articles have
appeared within the last ten years, and the great preponderance of them
within the past four years. This results partly from the selection
process, of course, but mainly reflects the actual density of
publication, which is still on an upward slope.
Finally, it should be noted that several rather comprehensive bibliog-
raphies on reliability topics have appeared, and these are appropriately
referenced in their place below. It should be emphasized, though,
that no attempt has been made here to supplant these bibliographies in
terms of comprehensiveness, although, of course, we shall make note of
some entries that appeared subsequent to their publication. Our goal
has been strictly to provide a selective reference to the literature,
from which the reader can proceed to his own ends.
b. Categorization of Subject Areas
In this program, redundant, modular, highly reconfigurable systems
have been identified as the basic organizational structure that holds the
most promise for the successful attainment of the mission objectives of
spaceborne computers. Accordingly, this guide to the literature is
structured to partition the referenced papers in a way that best serves
this point of view. Given a module that is a subsystem within a larger
complex, we have pointed out that there are two fundamental ways in which
redundancy can play an integral part in its functioning within the system.
These two ways are differentiated by the role played by the terminals of
the module. In static redundancy techniques, faults are accommodated
within the module itself (e.g., by fault masking) and the terminal activ-
ity is unaffected. In dynamic techniques, terminal activity plays an
essential role (involving fault detection, diagnosis, and the resultant
reconfiguration). These two basic categories are reflected in Secs. 4
and 5 in the discussion below, and all the other categories are, in a
sense, supplementary to them.
Thus the introductory Sec. 3, below, is concerned with papers of a
general nature. They involve either arguments supporting the need for
redundancy, or tutorial or survey papers on the subject, or developments
in general reliability theory, including the calculation of the probabil-
ity of failure, or pertinent papers on the characteristics of specific
components. Also included are references to the several rather extensive
bibliographies that have appeared and are easily available to the reader.
The application of redundancy to other than computing networks and
subsystems (e.g., power supplies), as well as papers concerned with
environmental aspects (e.g., additional weight requirements of redundant
systems), is briefly reported on in Sec. 6.*
3. Discussions of General Background
a. On the Need for Reliable Systems
The problem of achieving reliable systems in the face of today's
severe mission requirements has been well recognized in the published
literature. We mention several of these articles in order to set the
stage for the survey of the technical contributions that have been made.
In the first place, it is clear that the use of redundant structures
is but the last step in a hierarchy of measures that can be taken to
attain fault-free operation. It has been pointed out that increased
measures to achieve reliability are necessary in all stages of system
life, from the original design to the final installation and subsequent
maintenance of the system. Is° Certainly it is necessary that the com-
ponent reliability be as high as possible, that adequate attention be
given to the details of fabrication and assembly of equipments, and that
circuit designers carefully recognize the existence of tolerances and the
need for conservative designs. s7 Nonetheless, in systems consisting of
thousands of elements, it is necessary to provide also for alternative
* All of the reference numbers in this Appendix are keyed into the
common reference list that serves the entire report. This reference
list, in turn, has been alphabetically ordered so that it may serve
a separate function as a selected bibliography to the subject (although
it has not been supplemented as a bibliography--nonreferenced articles
do not appear in it).
system responses to accommodate the almost certain event that an
unknowable number of components will fail before the end of the desired
life of the equipment--indeed, the likelihood is high that some failures
will occur before initial turn-on of power.
Furthermore the point has been made that the failure mechanisms in
modern solid-state components are such that periodic replacement and
testing simply do not make sense any more. 8 The asserted reason is that
the state of the art has advanced to the point where
the few failures that do occur are purely random in nature. Present
component failure rates are on the order of one failure per 10^10 component
hours, and all identifiable causes of trouble have been eliminated to
the point where it is impossible to obtain adequate life-test data on
what the characteristics of the components truly are. s From these facts
it is reasoned that only redundant structures can be effective in markedly
increasing mission time, and even in deriving bounds that are useful in
estimating what the effectiveness time actually is.
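
The force of this argument can be seen from a short calculation. The sketch below assumes, as the "purely random" description suggests, a constant failure rate with independent failures, and a nonredundant system that fails when any one component fails. The component count and mission length are hypothetical; only the rate of one failure per 10^10 component hours comes from the text. With 100,000 components and a five-year (43,800-hour) mission, the expected number of failures is 0.438 and the probability of a failure-free mission is roughly 0.65.

```python
import math

# From the text: about one failure per 10^10 component hours.
FAILURE_RATE_PER_HOUR = 1e-10


def expected_failures(n_components: int, mission_hours: float,
                      rate: float = FAILURE_RATE_PER_HOUR) -> float:
    """Mean number of component failures during the mission."""
    return n_components * rate * mission_hours


def prob_no_failure(n_components: int, mission_hours: float,
                    rate: float = FAILURE_RATE_PER_HOUR) -> float:
    """Probability that no component fails (independent, constant rate)."""
    return math.exp(-expected_failures(n_components, mission_hours, rate))


# Hypothetical mission: 100,000 components operating for five years.
N_COMPONENTS = 100_000
MISSION_HOURS = 5 * 8760
P_SUCCESS = prob_no_failure(N_COMPONENTS, MISSION_HOURS)
```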
Of course, the mission time is increased if it can be guaranteed
that all of the elements in a redundant system are initially working
properly, and the initial testing of such systems that are specifically
designed to ignore the occurrence of faults poses some problems. This
problem has been considered by Masters, 202 who asserts that it may not be
necessary to know that all equipment is faultless initially, but only
that enough of it is functioning to be able to assert something about
the mission probability of success. In his discussion he develops some
analytical formulations relevant to such estimates.
Redundant structures may be effective, of course, in either of the
two fundamental ways that have been emphasized in this report. One way
is through the provision of auxiliary networks and schemes for the de-
tection and diagnosis of faults, with the implication that subsequent
manual or automatic action will be taken to replace the failed part.
The other way is to design the original networks so that a certain class
of faults can occur without affecting the input-output relations, i.e.,
by utilizing what we have called static redundancy techniques. Both of
these techniques may play a role in the same system.
In the first method, involving the use of dynamic redundancy
techniques, it is not at all clear just to what degree the human operator
can effectively play his role in the closed system. In unmanned space
vehicles his role is necessarily no more than performing analysis on
the basis of telemetered data (over noisy channels), followed by appro-
priate action signalled over the same channel in order to initiate
corrective procedures that must be implemented within the craft itself.
Even in the case of manned vehicles, the need for automated repair on
spaceborne vehicles seems substantiated. In a recent interview with
Roger Chaffee, one of the astronauts in the Apollo program, it was
claimed that all of the stabilization, control and communications
electronics utilize redundancy, usually a switchable redundancy at a
rather high level. lls When asked what role he thought in-flight
maintenance could play--that is, whether the man on board could actually try
to fix a detected fault--Chaffee's answer was direct: "Not in the
electronics. As you know, these systems are pretty complex...and there
are quite a few integrated circuits .... "
On the other hand, even though there may be some argument that the
man may accomplish some of the physical aspects of repair, e.g., by
adjusting control switches, there seems to be general agreement that he
can play almost no role in the analytical functions that are required to
detect and locate troubles that may occur in the spacecraft electronics.
Indeed, even the possibility of using plug-in modules has been questioned
because of the unreliability of connectors. 3 In the same reference it
is asserted that much has already been done in the provision of automatic
malfunction detection and switchover logic--although so far mainly in
the control of relatively large subsystems, and certainly not on the
small module or component level.
Thus, at least in the Apollo program, the concept of manual in-
flight maintenance seems to be discarded as too demanding on the astro-
naut's time, and has been replaced by an automatic self-repair approach
that uses a dual form of redundancy. In these dual circuits, a
fault-detection unit continuously monitors the operational unit and automatically
switches power to the auxiliary unit whenever a fault occurs. 12s
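
The benefit of such a switched dual arrangement can be estimated under simple assumptions. The following is a minimal sketch that is not taken from the cited reference. It assumes a constant failure rate for the operating unit, an auxiliary unit that cannot fail while unpowered, and a fault detector and power switch that together succeed with a fixed probability, here called `coverage`. Under those assumptions the pair survives if the operating unit survives, or if it fails, the switchover succeeds, and the auxiliary unit survives the remaining time, which gives R = exp(-rate*t) * (1 + coverage*rate*t).

```python
import math


def single_unit_reliability(rate: float, hours: float) -> float:
    """Reliability of one unit with a constant failure rate."""
    return math.exp(-rate * hours)


def standby_pair_reliability(rate: float, hours: float,
                             coverage: float) -> float:
    """Reliability of an operating unit backed by one unpowered spare.

    `coverage` is the assumed probability that the fault is detected and
    power is switched to the spare successfully.
    """
    x = rate * hours
    return math.exp(-x) * (1.0 + coverage * x)
```

With `coverage` equal to zero the expression reduces to the single-unit case, which shows how strongly the result depends on the reliability of the fault-detection unit itself.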
We shall not attempt to document similar problems that still pertain
to ground-based systems and conventional general-purpose computer installations,
except to note that they exist; nor shall we consider the application
of redundancy to nondigital systems, e.g., continuous-control systems,
except to note that similar kinds of redundant structures and approaches
apply. 175 Even the Soviet Union, which hardly makes the activities of
its space programs a part of the accessible literature, has admitted the
general need for improved reliability in its main-line conventional com-
puter systems. Indeed it has been admitted that some of their present
computers, e.g., the BESM, STREL, and URAL computers, are simply not
reliable enough. TM
Thus it is clear that the problem of attaining reliable digital
systems continues to be paramount in the tasks of designers of spaceborne
equipment. A quotation from a recent book on the multitude of factors
involved in space exploration is appropriate: "Before the nearest
stellar systems can be probed, one of two technological breakthroughs
must take place. Either probe equipment, including electronics and power
supplies, must be given lifetimes on the order of ten years, or self-
diagnosing, self-repairing automata must be developed. Both avenues will
certainly be attempted."*
b. On the Analysis of Reliable Systems--Bibliographies
Both in this country and in the Soviet Union there has been a great
deal of reported effort devoted strictly to the analytical problem of
calculating the probability of failure of a given network configuration,
given the failure probabilities of its component parts. These analyses
vary according to the different assumptions made about the failure law,
about the nature of the replacement-and-repair process, if any, and about
the nature of the structural or time redundancy provided. With few
exceptions, a fundamental shortcoming concerning these analyses is the
basic assumption concerning the independence of events--that is, the
assumption that one failure is in no way conditioned upon the occurrence
of another.
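
The kind of calculation referred to here can be illustrated for three elementary configurations. The sketch below uses exactly the independence assumption criticized above, and for the majority case it also assumes a perfect voter; the function names are illustrative. For units of reliability 0.9, two in series give 0.81, two in parallel give 0.99, and a two-out-of-three majority arrangement gives 0.972.

```python
from math import prod
from typing import Sequence


def series_reliability(unit_reliabilities: Sequence[float]) -> float:
    """All units must work; failures are assumed independent."""
    return prod(unit_reliabilities)


def parallel_reliability(unit_reliabilities: Sequence[float]) -> float:
    """At least one unit must work; failures are assumed independent."""
    return 1.0 - prod(1.0 - r for r in unit_reliabilities)


def majority_of_three_reliability(r: float) -> float:
    """At least two of three identical units must work (perfect voter)."""
    return 3.0 * r ** 2 - 2.0 * r ** 3
```

Majority redundancy helps only when the unit reliability exceeds one half: `majority_of_three_reliability(0.4)` is 0.352, which is worse than a single unit.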
* Ref. 49, p. 34.
There have also been a large number of survey papers which purport
to provide fundamentals to anyone new to the field; some of them are
quite good--including the several books that have been devoted exclusively
to the subject of reliability and redundancy--and provide a rich trove
for one who wants to delve into the subject for the first time, or for
the specialist who wants to broaden his understanding of the field as a
whole. Finally, there have been several good bibliographies published
on the subject.
An interesting qualitative discussion of the improvement in computer
reliability through redundancy techniques is found in a nontechnical
summary by Pierce. 24s This article includes a discussion of many of the
pertinent structures--such as restoring organs, threshold gates, vote
taking, and the like--all in terms that can be easily understood by the
nonspecialist. Herwald presents a concise summary of the state of reli-
ability theory, TM while Aroian 12 gives a summary of the basic formulas
needed for the calculation of the reliability of redundant systems. A
survey of the various redundancy techniques that have been proposed is
given by Teoste, s°° including detailed descriptions of the mathematical
models for estimating the reliability improvement and for comparing the
relative advantages of the several techniques including Moore-Shannon,
majority circuits, and other kinds of redundancy structures. He concludes
that where it is applicable, the Moore-Shannon type of redundancy provides
the most significant improvement. Other similar treatments are avail-
able. ss'2ss,2°9 In particular, Creasey's paper ss includes a review of
the role that can be played by the application of error-correcting codes,
while Fedderson and Shershin 35° describe the problem of determining the
optimum number of redundant elements when various restraining factors
such as cost, weight, and volume are taken into account. A more advanced,
theoretical treatise that surveys the mathematical models useful in
solving reliability problems, but from a mathematical point of view, is
found in Barlow and Proschan. 21 Also worth mentioning is the generally
high-caliber collection edited by Wilcox and Mann, 329 which brings to-
gether a great many contributions to reliability theory, ranging from
the quite scholarly to the eminently practical.
The purely statistical approach to reliability theory is summarized
in the collection edited by Zelen 336 which cites many interesting prob-
lems and contains papers concerned with statistical models, maintenance
and replacement policies, confidence limits, and the like. In particular,
a very readable review of the literature is in the paper by Weiss, 325
including topological and time-dependent aspects of reliability models.
The statistical basis for the exploration of redundancy systems is also
treated by Moscowitz 21s and by Kuznetsov 173 who presents a method for
determining the reliability of a system from the results of tests made
on part of the system (applying standard statistical techniques to the
assumed situation where complete testing is impractical because of the
amount of test equipment required and the limited amount of time avail-
able). Drenick 67 introduces the notion of expected economic gain, which
is a random function depending upon the number of failures and their
times of occurrence, as well as upon the particular replacement policy.
He also is concerned with the probability laws by which equipments fail B6
and purports to show that the time between failures tends toward an
exponential distribution as the number of components grows large.
The relevance of these models depends strongly, of course, on the
accuracy of the assumptions on which they are based. In an early paper,
Creveling 54 points out, among other things, the need to design circuits
so that the assumption of independence of faults is justified. Either
this approach is necessary, or the statistical models must be made more
complex; for, as Pollyak 249 demonstrates, an incorrect statistical-
independence assumption will indeed lead to errors in calculating reli-
ability. (He gives examples of calculations showing the effect upon
series-connected and parallel-connected components in particular.)
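
The size of the error introduced by a wrong independence assumption can be illustrated with a minimal sketch. It is not taken from the cited paper: it assumes two identical components, each failing with probability q, whose failure events have correlation coefficient rho (restricted here to 0 <= rho <= 1), and the function name is hypothetical.

```python
def pair_failure_probs(q, rho):
    """Failure probabilities of a two-component system.

    q   -- failure probability of each component (assumed identical)
    rho -- correlation coefficient between the two failure events
           (assumption of this sketch: 0 <= rho <= 1)
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1] in this sketch")
    # P(both fail) = q^2 + covariance, with covariance = rho * q * (1 - q)
    both_fail = q * q + rho * q * (1.0 - q)
    # P(at least one fails) by inclusion-exclusion
    either_fails = 2.0 * q - both_fail
    return {
        "parallel": both_fail,    # parallel pair fails only if both fail
        "series": either_fails,   # series pair fails if either fails
    }


independent = pair_failure_probs(0.1, 0.0)
correlated = pair_failure_probs(0.1, 0.5)
```

With q = 0.1, assuming independence when rho is actually 0.5 understates the failure probability of the parallel pair and overstates that of the series pair.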
Virene 317 has discussed this problem rather directly in his pre-
sentation of nonparametric life testing--that is, the use of statistics
in which no assumption is made concerning the underlying distribution
that characterizes the operating life of an equipment--although he warns
of the possible losses of efficiency when such methods are used.
Many other discussions of the application and estimation of system
reliability have appeared,lS,21s, 2s° including attempts to relate the
probability characteristics of system reliability to the detailed failure-
distribution law of the parameters of the components 26s and discussions
of the usefulness of statistical assumptions concerning independence,
probability distributions, and the like. 222
Most of the papers in this field give adequate reference to previous
and related publications; this is certainly true of most of the documents
cited above. In addition, several lengthy bibliographies have appeared.
Of these we mention Balaban 18 and certainly the useful assemblies by
Jensen. 138,137
c. Optimum Redundancy and Other Considerations
An area of inquiry that is closely related to the previously dis-
cussed papers concerned with the various aspects of general reliability
theory, yet that asks a more system-oriented question, is that repre-
sented by a number of papers on determining the "optimum" value, or
ratio, of redundancy. These usually take into consideration other
environmental factors such as power consumption, volume, weight, etc.,
affected by the extra components in the redundant system. We have
already mentioned one of these, 35° which attempts to formulate redundancy
as a function of total costs, individual element probabilities, etc.,
for various abstract configurations. Not surprisingly, the techniques
of dynamic programming are among the tools presented for the solution of
such multiple-constraint problems.
An approach along these lines is given by Webster, 322 who bases his
work on the theorem that making a low-reliability part redundant causes
a larger numerical increase in reliability than making a high-reliability
part redundant. Hence redundancy should be progressively applied,
starting with the less reliable portions of the system, until some system
constraint--such as power consumption--is exceeded. In a later paper 323
he demonstrates the practical application of these procedures to a system
composed of 14 subsystems.
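
The progressive procedure described above can be sketched as follows. This is an illustration under stated assumptions, not Webster's published algorithm: the system is taken to be a series chain of independent stages, redundancy is modeled as ideal parallel copies, the constraint is a single additive cost budget, and all names are hypothetical.

```python
from math import prod


def stage_reliability(r, copies):
    """Reliability of a stage built from `copies` ideal parallel copies."""
    return 1.0 - (1.0 - r) ** copies


def allocate_redundancy(parts, budget):
    """Greedy allocation: keep duplicating the currently weakest stage.

    parts  -- list of (reliability, cost_per_copy), cost_per_copy > 0
    budget -- total cost allowed (e.g., power or weight); assumption
    Returns the number of copies assigned to each stage.
    """
    if any(cost <= 0 for _, cost in parts):
        raise ValueError("every cost must be positive")
    copies = [1] * len(parts)
    spent = sum(cost for _, cost in parts)
    while True:
        # Find the stage whose present reliability is lowest.
        weakest = min(
            range(len(parts)),
            key=lambda i: stage_reliability(parts[i][0], copies[i]),
        )
        cost = parts[weakest][1]
        if spent + cost > budget:
            break  # the constraint would be exceeded; stop here
        copies[weakest] += 1
        spent += cost
    return copies


def system_reliability(parts, copies):
    """Series system: every stage must work."""
    return prod(stage_reliability(r, n) for (r, _), n in zip(parts, copies))
```

The sketch stops as soon as the weakest stage can no longer be duplicated within the budget, which mirrors the stopping rule stated in the text; a more refined procedure might instead continue with cheaper stages.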
More abstract approaches to the question are presented by Barlow et
al. 22 and by Pierce, 242 who advances a procedure for synthesizing a sys-
tem to obtain the greatest reliability corresponding to a set of fixed
costs, or alternatively to obtain a given, fixed reliability specification
with the minimum set of costs. In a similar vein, Barlow and Hunter 20
obtain relations for determining the number of components that maximizes
the expected life of a circuit, given an exponential failure law.
On a different question, Esary and Proschan 77 have considered the
relation between the failure rate of a system of (identical) components
and the failure rates of the components themselves. In particular,
they have treated the important case of "k out of n" circuits--that is,
circuits which function properly if any k of the n components are still
functioning.
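
For reference, the reliability of a "k out of n" circuit is easy to compute under the simplest assumptions. The sketch below assumes identical components that work independently with probability p; it gives the standard binomial expression and is not a reproduction of the failure-rate analysis in the cited paper. The function name is hypothetical.

```python
from math import comb


def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n independent components work,
    each component working with probability p."""
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# A 2-out-of-3 circuit built from components of reliability 0.9.
r_2_of_3 = k_out_of_n_reliability(2, 3, 0.9)
```

Setting k = n gives a series system and k = 1 gives a simple parallel system, so both familiar cases fall out of the same expression.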
In yet another area, an interesting point of view is represented by
the work of Merekin, 204 Malyugin, Is3 and Muroga, 224 who show how to form
expressions for the reliability of certain types of combinational circuits
directly from their switching functions, without analysis of the circuits
themselves. Gendler, ss on the other hand, analyzes a very particular
circuit logic structure by considering the probability that a given
threshold function is indeed realized by a threshold device that suffers
statistical variations in its weights and in its threshold.
4. Discussions of Static Redundancy Applications
a. Fault-masking Techniques
In a broad sense the terms "static redundancy" and "fault masking"
may be considered synonymous; as pointed out in this report, the basic
concept embraced by both terms is that the redundancy is provided as an
internal, integral, autonomous part of the network and operates without
intervention through the input-output terminals--at least until the net-
work fails completely because the number of faults has become too large
to be covered by the masking provisions. In this sense it matters not
what the form of the redundancy is--whether a multiple-line voting scheme,
series-parallel configurations, an internal error-correcting code process,
or some other scheme. For the purpose of this exposition, however, we
shall reserve the term "fault masking" for the obvious structural type
of replication represented by voting schemes, for example. We shall
treat the signal type of redundancy based upon coding theory in the
following section. This is strictly for convenience, however, as close
inspection reveals that the distinction is quite artificial. Further-
more, we shall separate the structural fault-masking techniques into two
parts: voting schemes and nonvoting schemes, primarily because such a
split roughly represents an equally weighted division of the work that
has been done in the field.
1) Nonvoting Schemes
For our purposes, we shall consider the basic paper by Moore
and Shannon 214 to be the starting point for nonvoting redundancy schemes.
Here relay-contact networks are explicitly considered as the components
of interest in primarily combinational network construction. It is shown
that as long as the failures in contacts can be considered statistically
independent, then arbitrarily reliable networks can be built regardless
of how unreliable the individual contacts may be. The construction pro-
ceeds by an iterative process wherein the individual contacts of the
network are replaced by a certain network of contacts. Moore and Shannon
showed the number of contacts required to achieve a given reliability as
a function of the individual relay characteristics. Kochen Iss has
directly extended their results by showing that the required redundancy
in such networks is also a function of the particular logic function
being realized. Further extensions to more general types of networks,
including the important "k-out-of-n" structures, have been made by
Esary et al., 76'29,7s who also consider the case where the components
may have differing reliabilities. Many other papers have appeared that
are concerned with various other extensions or special network proper-
ties.172,s2,118,132,234, 111 Asymptotically, it turns out that the cor-
rection of any single fault in a contact network does not increase the
complexity of the network; this is shown for the case of shorted contacts
in a paper by Potapov and Yablonskii 252 and for contacts that fail by
opening by Madatyan. *ss For two or more shorts or opens, however, the
asymptotic complexity of the network must increase. In a similar fashion,
it can be shown that the asymptotic complexity of two-level rectifier
gate circuits must increase with the number of faults to be tolerated. 227
Some economies can be achieved by taking advantage of "don't-care" con-
ditions, however, as is the case with straightforward switching-function
realizations. Dunning et al. s9 consider the specific logic-block case
of general NOR-gate trees (including considerations and comparisons of
quadding, voting, etc., in such networks), while Weinstock 324 has treated
the mathematical problem of reducing arbitrary network structures to a
series-parallel form resulting in systematic methods for deriving the
reliability parameters of any network involving a flow of information
between two terminals.
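
The iterative contact-replacement construction described at the start of this discussion can be illustrated numerically. The sketch below assumes one simple choice of replacement network, two series pairs of contacts connected in parallel, together with statistically independent contacts; it is not claimed to be the specific network used in the cited paper, and the function names are hypothetical.

```python
def quad(p):
    """Probability that a network of two series pairs in parallel is
    closed, when each contact is independently closed with probability p."""
    return 1.0 - (1.0 - p * p) ** 2


def iterate(p, levels):
    """Replace every contact by the four-contact network `levels` times.
    The resulting network uses 4**levels contacts."""
    for _ in range(levels):
        p = quad(p)
    return p


# A contact that closes with probability 0.9 when it should be closed,
# and with probability 0.1 when it should be open (assumed values).
closed_when_wanted = iterate(0.9, 3)
closed_when_unwanted = iterate(0.1, 3)
```

For this particular network the map has a fixed point at p = (sqrt(5) - 1)/2, about 0.618. Iteration helps only when the two contact probabilities lie on opposite sides of that value, since the higher one is then driven toward one and the lower one toward zero.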
Another important core technique for implementing a nonvoting
kind of redundancy is due to Tryon, 305 who introduced what he calls
"quadded" logic--a circuit construct wherein the components appear in
quadruplicate so that errors are corrected one or two levels downstream
from their inception by a mixing of good signals from neighboring units.
Many papers have been devoted to extending Tryon's original results
(e.g. Ref. 139 for NOR-gate networks). Of especial note are the valuable
extensions and generalizations developed by Pierce 244,243 in his "inter-
woven" logic, where Tryon's work was shown to apply to other than the
AND/OR/NOT logic blocks, and to much more general patterns of correction.
In some cases double, triple, and other errors can be corrected, as well
as single errors.
Taking a quite different approach, but in a similar vein--
namely, the replacement of each component by a network of components--
Urbano 310,311,312 has investigated what he calls polyfunctional networks.
These are an iterated network construction in which each component is
replaced by a copy of the entire network, and the iteration may be
carried on to any desired degree. Questions such as the stability of
such networks, and their convergence to a fixed function set, are
examined. Sethares TM has also examined the characterization of functions
in polyfunctional networks.
In yet another direction, Muchnik and Gindikin 220 have examined
the question of whether the conditions which establish the completeness
of a set of logical primitives need to be modified when some of the
primitives are unreliable. They find that some modification is necessary,
and determine the new conditions.
2) Voting Schemes
The large number of papers on vote-taking redundancy can be
traced back to the fundamental paper of Von Neumann, 31s where multiple-
line redundancy was first established as a mathematical reality for the
provision of arbitrarily reliable systems. In this paper it was demon-
strated that arbitrarily specified reliability could be achieved using
unreliable components (of bounded unreliability, however). It was
essentially a mathematician's answer to the question of whether redundancy
could indeed pay off. A spate of engineering efforts has followed,
seeking to adapt the fundamental results to practical computing sys-
tems. In some sense, the voting schemes can be distinguished from the
nonvoting schemes (such as quadding) by the fact that the restoration
unit, i.e., the vote-taker, can be physically distinguished from the
units actually performing the logic. In essence, voting schemes simply
replicate the function to be realized in several different "lines" and
then take a weighted measure, usually a majority vote, over the outputs
of the independent lines.
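
A minimal sketch of the simplest such scheme, three replicated lines with an unweighted majority vote, is given below. It assumes a perfect vote-taker and statistically independent lines, each correct with probability r; both assumptions are idealizations, and the function names are hypothetical.

```python
def majority(bits):
    """Unweighted majority vote over an odd number of binary lines."""
    if len(bits) % 2 == 0:
        raise ValueError("use an odd number of lines to avoid ties")
    return int(2 * sum(bits) > len(bits))


def triple_line_reliability(r):
    """Probability that at least two of three independent lines are
    correct, each with probability r, assuming a perfect vote-taker."""
    return 3.0 * r**2 - 2.0 * r**3


masked = majority([1, 0, 1])            # one faulty line is out-voted
improved = triple_line_reliability(0.9)
```

Under these assumptions triplication improves on a single line only when r exceeds 0.5; below that value the voted system is less reliable than one line alone.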
Subsequent papers treating such systems in general are numer-
ous. 33'156,ss'196 Jensen has developed a lower bound on the reliability
of multiple-line networks, 140 using a minimal-cut concept to describe
the system and analyze its failures. The fact that the reliability of
majority voting circuits is sensitive to the particular modes of failure
(i.e., whether the probabilities of failing to zero and to one are
different) is discussed by Rhodes; 264 an extension to components that
can exhibit three possible states of behavior is discussed by Rau. 2s2
Lyons and Vanderkulk ls8 consider voting circuits in which the vote-takers
are assigned either to the inputs of the logic modules or to the outputs.
It is shown that the output voting scheme is generally to be preferred,
from the standpoints both of improved reliability and of reduced redun-
dancy ratio. The important concept of adaptive vote-taking, in which
inputs are weighted according to their error history, was conceived by
Pierce TM and is discussed by Angell s and in Sec. II-A-2-b of the body
of this report.
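
The idea of weighting inputs by their error history can be sketched as follows. This is an illustration only: it assumes log-odds weights computed from a smoothed running estimate of each line's error rate, with errors judged against the voter's own output, and the smoothing constants are arbitrary choices of this sketch. The class and method names are hypothetical, and no claim is made that this matches the design in the cited work.

```python
import math


class AdaptiveVoter:
    """Weighted vote-taker whose weights follow each line's error history."""

    def __init__(self, n_lines):
        self.errors = [0] * n_lines   # disagreements with the voted output
        self.trials = 0

    def weights(self):
        ws = []
        for e in self.errors:
            # Smoothed error-rate estimate (the constants 1 and 4 are
            # arbitrary assumptions that keep 0 < p < 1).
            p = (e + 1) / (self.trials + 4)
            ws.append(math.log((1.0 - p) / p))
        return ws

    def vote(self, bits):
        # Each line votes +w for a one and -w for a zero.
        score = sum(w if b else -w for w, b in zip(self.weights(), bits))
        out = 1 if score >= 0 else 0
        # Update the error history using the voted output as reference.
        self.trials += 1
        for i, b in enumerate(bits):
            if b != out:
                self.errors[i] += 1
        return out
```

In a five-line system where two lines have long been stuck at zero, such a voter can still give the correct output when a third line suffers a transient error, a case in which a plain majority vote would fail.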
A number of papers have been concerned with the detailed design
of the all-important vote-taker itself. s°,ls5 Farrell s° also considers
the problem of the optimum decomposition of a system, i.e., exactly where
the vote-takers are to be placed, as does Cohn, 47 who determines the
optimum level for a system composed of identical components. If the
vote-takers and connecting wires are also vulnerable to failures (an
often ignored assumption), it can be shown that it is not always best to
make systems redundant on the component level. Knox-Seith ls4 carries on
in this tradition and determines the best placement and resulting reli-
ability for cascaded logic systems. Gurzi x15 also has some recent results
on the cascaded three-line situation; she derives upper and lower bounds
on the possible improvements through triplication, and compares the
single-voter and three-voter schemes. Jensen et al. TM and Rubin 272
have also developed synthesis techniques which determine the minimum-cost
voter placement locations, as well as new techniques for estimating the
reliability of the resulting systems.
Many papers have been devoted to comparing the results that
can be obtained for the various schemes, voting and otherwise (e.g.,
Refs. 300 and 37). Attempts in this direction have also been made by
Domanitskii and Prangishvili, e3 who compare the failure probabilities of
systems using multiple-line voting structures with those of systems using
more elemental replication on the component level, specifically for the
case of transistor NOR circuits. Their results indicate that as the
reliability of the individual components increases, so does the preference
for replication at the component level. In the same vein, but for
different redundancy procedures, Polovko and Zaynashev TM compare fault-
masking schemes with replacement schemes; they observe that the situation
is ambiguous (perhaps not surprisingly): under one set of assumptions
one approach is preferable, and under another the reverse holds. They
conclude that the widest range of tolerance is achievable by using a
combined scheme employing both masking and replacement strategies--
certainly consistent with the conclusions of this report.
Variations on the majority scheme have also been mentioned in
the literature. For example, Depian and Grisamore 58 discuss the re-
storing element which averages its inputs, rather than taking a majority,
and conclude that the averaging method can provide greater reliability in
some cases. On the other hand, Lowrie questions the efficiency of the
triple-line scheme itself lss and proposes instead the use of a duplex
approach. In essence this is simply the parallel operation of two
identical logic circuits or computer subsystems, with associated cir-
cuitry to detect any discrepancy between the two. The system is not
truly fault-masked, however, for at this point the redundancy becomes
dynamic--not only must the discrepancy be detected but it must then be
responded to by external diagnostic equipment to localize the faulty
system and to initiate switching action to disconnect the offending
system. Nonetheless it is in the replicated-line family and it is claimed
that in some cases it can achieve a higher reliability than triple-line
systems, while using fewer parts. This assertion obviously stands or
falls on the complexity of the associated equipment needed for detection,
diagnosis, and switching, and on whether the lost time is acceptable
within the operation scheme of the system.
b. Applications of Coding Theory
The connection between error-detecting and correcting codes, which
have been largely developed and formalized under the aegis of communica-
tion theory for noisy-channel applications, and the application of re-
dundancy in computing networks has long been recognized, and an extensive
literature exists which formalizes the relation. These papers involve both
abstract determinations of bounds and specific implementations to various
network types. We shall point to a few of these contributions, recog-
nizing that the differentiation between static and dynamic systems
becomes a little hazy at this point. We also wish to mention fault
masking in sequential machines in a separate, subsequent section, and it
is obvious that we shall suffer some overlap, since clearly one very
direct application of error-correcting codes is in the judicious encoding
of the states of sequential machines in order to obtain certain proper-
ties of the code for the machine behavior. Nonetheless, coding is such
a well-developed formalism in its own right, as well as in its easily
identifiable relevance to computing networks, that we shall risk the
possible redundancy in this discussion.
An early, elementary discussion of the use of checking codes in
digital-computer applications is presented by Diamond; 60 in this paper,
for example, he pointed out the required distance properties for various
applications. Shortly thereafter, Elias 71 considered the problem of
reliable computation with a model resembling a noisy channel and on the
basis of his assumptions showed that truly arbitrarily reliable computation
may be obtainable only at the expense of reducing the computational
capacity, in an information-theoretic sense, to zero.
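
The distance properties mentioned above can be made concrete with a short sketch. It relies on the standard result that a code of minimum Hamming distance d can detect up to d - 1 errors or correct up to floor((d - 1)/2) errors; the two example codes and the function names are illustrative choices and are not drawn from the cited papers.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two codewords differ."""
    return bin(a ^ b).count("1")


def min_distance(codewords):
    """Minimum pairwise Hamming distance of a code (needs >= 2 words)."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def capabilities(d):
    """Errors detectable, or alternatively correctable, at distance d."""
    return {"detect": d - 1, "correct": (d - 1) // 2}


even_parity = [0b000, 0b011, 0b101, 0b110]   # single parity-check code
repetition = [0b000, 0b111]                  # triple-repetition code

parity_caps = capabilities(min_distance(even_parity))
repetition_caps = capabilities(min_distance(repetition))
```

The parity-check code has distance two and so serves for single-error detection only, while the repetition code has distance three and can correct a single error.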
A summary of the applications of error-correcting codes is found in
Massey, 200 as well as some original work on "reversible" codes which make
it possible to read data from memory starting with either end of a data
block. Peterson 238 has a useful chapter on arithmetic codes, and Kautz 14s
presents an evaluation of the use of several code families in digital
systems, including a consideration of some new codes, and of the simpli-
fications that are possible if simply detection is of interest and cor-
rection is not.
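
One well-known family of arithmetic codes, the AN codes, illustrates how detection alone can be obtained cheaply. The sketch below is an assumption-laden example rather than a reproduction of any code from the cited works: each integer N is represented as A*N with A = 3, so that the sum of two codewords is again a codeword and any result not divisible by A signals an error. The names are hypothetical.

```python
A = 3  # check constant; an illustrative choice


def encode(n):
    """Represent the integer n by the codeword A * n."""
    return A * n


def is_valid(word):
    """A word is a codeword exactly when it is divisible by A."""
    return word % A == 0


def decode(word):
    """Recover n, refusing words that fail the divisibility check."""
    if not is_valid(word):
        raise ValueError("arithmetic check failed")
    return word // A


# Addition is performed directly on codewords.
total = encode(5) + encode(7)
assert decode(total) == 12

# A single-bit error changes the word by plus or minus a power of two,
# which is never a multiple of 3, so the check always catches it.
corrupted = total ^ (1 << 4)
```

With A = 3 the check detects every single-bit error in the result but cannot correct it; correction would require a larger check constant and a more elaborate decoder.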
Mandelbaum 194 applies cyclic codes for burst-error detection to the
case of arithmetic operations, and attempts to accommodate single,
double, triple, and double-burst errors, while Armstrong treats the use
of nonbinary single error-correcting codes, I° developing both bounds on
the required redundancy and some specific codes that meet the bounds.
Ray-Chaudhuri 263 has extended Armstrong's work and has shown that mul-
tiple error-correcting codes can be applied to the correction of failures
in several different units. Many other papers are available that treat
various aspects of the problem (e.g., Refs. 293, 304, and 237), and much
work has gone into the development of codes that are peculiarly well-
suited for certain arithmetic operations (e.g., Refs. 28, 31, 239, 89, 90).
Avizienis 13,1s has pointed out the use of codes for diagnostics in a
dynamic sense, as well as in the static sense.
Homan has described the logical design of an adder used in the IBM
Stretch computer TM which utilizes four-bit groupings that are tied
together by a carry-look-ahead system. This paper is one of the first
to describe high-speed (25 ns for the add operation) checking circuits
for arithmetic operations. The automatic correction of burst errors
originating in a computer memory has been described by Daher ss in con-
nection with the automatic error-correction unit for a disk memory.
The Soviets have also been active in the application of codes. For
example, following Gavrilov's lead SS,ss,s7,ss several attempts have been
made to apply error-correcting codes to the design of switching net-
works. 64'172'234'31s,296 For sequential circuits, Kurdyukov 172 has
developed a single fault-correcting counter based upon a Hamming code,
and Gavrilov 9s'98 has designed some redundant sequential relay networks
that are insensitive to the malfunction of any one relay. A similar
approach has been taken by Svechinskii 29s in his use of error-correcting
codes for redundant state assignment. Sagalovich 27s has also recently
considered the application of codes to the state-encoding problem in
automata so that failures can be tolerated.
Rubio 273 has also considered the design of self-correcting counters
using coding results. Other more abstract treatments of the general
interrelevance of coding theory and redundant networks are avail-
able. 203,332,333
c. Static Redundancy in Sequential Machines
There have been a number of more or less abstract papers dealing
with the careful encoding of the secondary state assignments of sequen-
tial machines, often directly utilizing the properties of error-correcting
codes. These machines are encoded redundantly, hence have redundant
states, and if properly designed can possess any of several error-
tolerant properties. For example, the assumption of a redundant state
may signal an error-detection circuit that an improper transition has
occurred; or the redundant states may be so organized that the machine
returns in a finite number of steps to the proper primary state, thus
truly masking the error (after a period of time, during which the output
behavior may be insensitive to the fact that something was in error).
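
The second behavior described above, a return to the proper state after an erroneous transition, can be sketched for a four-state counter. The example assumes a five-bit state assignment of minimum distance three and next-state logic that treats any register value as the nearest codeword; the code, the names, and the one-step recovery are illustrative assumptions and not a design from the cited papers.

```python
# State i of a modulo-4 counter is stored as CODE[i]; the four
# codewords are pairwise at Hamming distance 3 or more.
CODE = [0b00000, 0b01011, 0b10101, 0b11110]


def nearest_state(register):
    """Index of the codeword closest to the register contents."""
    return min(
        range(len(CODE)),
        key=lambda s: bin(register ^ CODE[s]).count("1"),
    )


def next_register(register):
    """Next-state function defined over all 32 register values: advance
    from the nearest valid state, so a single flipped bit is masked."""
    state = nearest_state(register)
    return CODE[(state + 1) % len(CODE)]


# State 1 with one bit of the register flipped by a fault.
faulty = CODE[1] ^ 0b00100
recovered = next_register(faulty)   # equals CODE[2], the proper successor
```

Because the redundant (non-codeword) register values are all assigned transitions, the machine here recovers in a single step; designs that recover only after several steps trade recovery time for simpler next-state logic.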
Levenshtein TM has made an attempt to define the class of sequential
circuits which are insensitive to an accidental change of state, in the
sense that they will return to proper state behavior after a finite
amount of time following the occurrence of an internal error. Thus the
present state of the network is dependent only on the inputs and a finite
amount of past history. This situation has also been investigated by
Dauber sG and may be compared with the model described by Winograd TM and
Harrison, 119 where the errors are due to accidental input changes.
We have already noted the contributions of Sagalovich 27s and
Svechinskii 296 in the use of error-correcting codes to make secondary
assignments such that failures can be tolerated. Another paper which
considers a similar model that tolerates error in its state transitions
is that of Frank and Yau. ss The special problems in relay machines have
been discussed by Mullin 223 including the particular problems of analyzing
the reliability of such machines.
The general problem of analysis of such machines is still difficult
and not well understood. Tsertsvadze 3°S,3°7 has suggested the use of a
stochastic automaton model as the appropriate general approach to the
design of finite automata with specified reliability characteristics,
composed of unreliable components. His methods, which are based on the Von
Neumann as well as the Moore-Shannon models, appear to be addressed both to
permanent and to intermittent faults within a system.
Several papers have been directed toward the design of specific
machine types; in particular, counters serve as a convenient vehicle.
We have mentioned Rubio's 273 paper in this regard; Russo 274 has also
used codes to design counters--using distance-three codes in the input
equations, thus assuring a tolerance to single-bit errors, either per-
manent or transient.
5. Discussions of Dynamic Redundancy Applications
a. Approaches to Fault Diagnosis
For systems where dynamic redundancy is used to increase the reli-
ability, it follows by definition that auxiliary equipment is involved
in the response of the overall system to a fault in one of its subsystems.
Thus the terminals of the subsystem are involved, with the help of equip-
ment external to the subsystem proper, in the detection, location, and
analysis of the nature of the fault (which we lump under the term diag-
nosis for this discussion). Secondly, external equipment (perhaps with
human intervention) decides the nature of the response and carries through
on its execution. Diagnostic procedures are reviewed here; the second
phase is reviewed in the next section. In the last section we mention
complete system organizations that have been proposed to facilitate
self-repair.
All structural redundancy is, after all, in some sense simply the
provision of spare parts in a mechanism to enable it to tolerate failures.
In static redundancy these spare parts are permanently part of the
mechanism and the toleration (up to some saturation point) takes place
autonomously (and anonymously) within the mechanism itself. In dynamic
redundancy, although the spare part may be fully powered--even performing
the same job--there must be a switchover process on the advent of a fail-
ure. Hence the task of diagnosis need be no more than to identify the faulty
replaceable unit--whether that be an individual component, a network, or
a subsystem. Thus, depending on the system, simple fault detection may
be all the diagnostics that are needed if the replaceable unit is the
entity signalling the detected fault. On the other hand, the fault de-
tector may embrace a number of replaceable units, or modules, in which
case fault-location routines must be instigated in order to find the
particular offending module that must be removed. Finally, if the re-
placeable units are the actual components within a physically contained
module, then what is more conventionally described as a diagnostic test
routine must be used to locate the particular component that has failed.
Thus all of the terms "detection," "location," and "diagnosis," which
cause some semantic confusion, we identify as describing general diag-
nosis activities.
The formal problem statement and solution of the fault-diagnosis
situation for combinational networks has been presented in this report,
as well as in an earlier summary. 149 Given a network with inputs and
outputs, a fault can be diagnosed if and only if it produces an output
column in the fault table that is distinguishable from all other columns
(including the column for the network without faults). The problem then
evolves into one of determining which rows of the fault table to use as
inputs in a test schedule so that some criterion of the schedule is
optimized (e.g., perhaps the maximum length is important, or perhaps the
average length). Kautz' procedures 149 will formally give minimal sched-
ules but become unwieldy for large networks. He has also introduced the
notion of serial test schedules and has shown that these may result in
much shorter routines.
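
The fault-table formulation can be sketched for a very small network. The example assumes the function f = (a AND b) OR c with single stuck-at faults on each of its five lines, merges faults whose columns are identical, and finds a shortest test schedule by exhaustive search. It is an illustration of the problem statement, not of the procedures in the cited work, and exhaustive search is practical only for tiny networks. All names are hypothetical.

```python
from itertools import combinations, product

LINES = ("a", "b", "c", "g", "f")   # g is the internal AND output


def evaluate(inputs, fault=None):
    """Output of f = (a AND b) OR c; `fault` is (line, stuck_value)."""
    def line(name, value):
        return fault[1] if fault and fault[0] == name else value
    a = line("a", inputs[0])
    b = line("b", inputs[1])
    c = line("c", inputs[2])
    g = line("g", a & b)
    return line("f", g | c)


ROWS = list(product((0, 1), repeat=3))             # all input combinations
FAULTS = [(name, v) for name in LINES for v in (0, 1)]


def column(fault):
    """One column of the fault table: the output for every input row."""
    return tuple(evaluate(row, fault) for row in ROWS)


def distinct_columns():
    """Fault-free column plus one column per distinguishable fault class."""
    return sorted({column(None)} | {column(f) for f in FAULTS})


def minimal_schedule(columns):
    """Smallest set of row indices on which all columns differ pairwise."""
    for size in range(len(ROWS) + 1):
        for rows in combinations(range(len(ROWS)), size):
            signatures = {tuple(col[i] for i in rows) for col in columns}
            if len(signatures) == len(columns):
                return rows
    return None


schedule = minimal_schedule(distinct_columns())
tests = [ROWS[i] for i in schedule]
```

For this network the ten single faults collapse into six distinguishable classes, and four of the eight possible input combinations suffice both to detect and to locate them.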
The similar problem in the Soviet Union stems from an early work by
Chegis and Yablonskii 41'33s who first introduced the notions of the fault
table and the use of a minimum-length test schedule for the isolation of
single faults (opens and shorts) in combinational relay-contact networks.
They proposed the use of special switching functions, and showed how
manipulations of these expressions could be considerably simplified when
the network is planar, by working with the dual network. Alexander 4 in
this country has also shown that special switching-circuit notations can
be applied to the drawing up of test and maintenance schedules.
Later Soviet work has relied heavily on the earlier work; for
example Kogan 167 improved the Chegis-Yablonskii procedure somewhat by
using the notion of proper-cut sets of the graph of the contact network,
and also evaluated some special networks based upon disjunctive and con-
junctive normal-form expansions. Kogan, 1ss in another paper, and later
Vaksov, 313 showed how to minimize the length of the test schedule for
the location of any number of shorts and opens in the class of nonrepe-
titious networks (i.e., those networks having just a single contact in
each input variable). Later, Glagolev I°6 specialized the Chegis-Yablonskii
approach to iterative contact networks, and others 291,287 have extended
the concepts to gate-type networks, including location to either the
faulty gate or the faulty subsystem. The possibility of utilizing
auxiliary inputs is examined by Karibskii, 145 and a special form of serial
testing is proposed by Kinsht 162 in an attempt to shorten the average
test-schedule length.
On the level of entire digital systems, there have been two interest-
ing analyses; one by Fleyshman, 83 who studied the self-monitoring concept
and concluded that such a system needs an ultrareliable "hard core" which
can be used to check out in succession increasingly larger portions of the
system hierarchy, and one by Lyabutov, 187 who solved the general problem
of optimizing a sequence of tests on a multimodule system, given for each
module its probability of failure, the length of time required to test
the module, and the probability that the test will not reveal a fault
within the module. (An optimal sequence is defined here to be one whose
average length is minimal.) In addition, the area of programming diag-
nostics has been examined for particular systems, e.g., the URAL com-
puter. 79 There have also been more sophisticated diagnostic approaches
which involve diagnosing over successively less gross portions of a
computer until finally the particular offending gate is isolated. 316
This sort of approach appears closely related to similar philosophies of
serial testing that have been mentioned previously.
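
A simplified version of the test-sequencing problem can be sketched as follows. The sketch assumes that exactly one module is faulty, that every test is perfect, and that testing stops when the fault is found; under those assumptions, testing modules in decreasing order of fault probability per unit test time minimizes the average length. It deliberately omits the probability of a test missing a fault, which the cited analysis includes, so it should not be read as that author's solution. The names are hypothetical.

```python
from itertools import permutations


def expected_test_time(order, modules):
    """Average time to locate the fault when modules are tested in `order`.

    modules -- list of (fault_probability, test_time); the probabilities
               are assumed to sum to one (exactly one faulty module).
    """
    elapsed = 0.0
    expected = 0.0
    for i in order:
        p, t = modules[i]
        elapsed += t               # time spent once module i has been tested
        expected += p * elapsed    # fault found here with probability p
    return expected


def ratio_order(modules):
    """Test first the modules with the highest probability per unit time."""
    return sorted(
        range(len(modules)),
        key=lambda i: modules[i][0] / modules[i][1],
        reverse=True,
    )


modules = [(0.5, 4.0), (0.3, 1.0), (0.2, 2.0)]
best = ratio_order(modules)
# Exhaustive check over all orderings, feasible only for a few modules.
brute = min(permutations(range(len(modules))),
            key=lambda order: expected_test_time(order, modules))
```

In this example the most failure-prone module is not tested first, because its test is long relative to its fault probability.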
Several papers have discussed the importance of diagnostic pro-
cedures in complex systems (e.g., Refs. 182 and 136), the nature of the
diagnostic equipment, and routines (e.g., Refs. 34 and 87). Lee Is°
proposes a method for the situation where normal operation can be inter-
rupted, while Schneider and Wagner 276 are concerned with the case where
uninterrupted operation is crucial. Several papers concern themselves
with diagnostic programs in particular machines,27s,158, TM and Tsiang
and Ulrich TM have described the routines used in the modern
electronic central office of a telephone system. The operation appears
to have the same flavor as the duplex-redundancy scheme previously
mentioned; that is, the central-office control is duplicated; both units
work together on the same input and the resulting outputs undergo con-
tinual comparison. Any discrepancy calls for fault-diagnosis routines
to be started during the next available period.
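
The duplex comparison scheme just described can be sketched as follows. This is only an illustration: the two "units" are hypothetical Python functions, and the faulty unit is assumed to have its low-order output bit stuck at one.

```python
def run_duplex(unit_a, unit_b, inputs):
    """Feed the same inputs to both units and compare every pair of outputs."""
    outputs, mismatches = [], []
    for step, x in enumerate(inputs):
        ya, yb = unit_a(x), unit_b(x)
        if ya != yb:
            # A discrepancy only says that one unit is wrong, not which one;
            # fault-diagnosis routines would be scheduled at this point.
            mismatches.append(step)
        outputs.append(ya)
    return outputs, mismatches

def healthy(x):
    """Hypothetical fault-free unit."""
    return (3 * x + 1) % 8

def faulty(x):
    """Same unit with the low-order output bit stuck at one (assumed fault)."""
    return healthy(x) | 1

outputs, mismatches = run_duplex(healthy, faulty, range(4))
```

Note that the comparison detects the fault only on inputs for which the stuck bit actually changes the output.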
Poage 246 has also considered efficient schedules for combinational
networks, and has proposed an analytical procedure that yields an output
expression which reflects network structure; he then uses this output
expression to develop tests to detect single faults, and extends the
method to multiple faults. Chang 40 has also investigated the problem of
efficient diagnostic tests for combinational networks and
develops an algorithm for reducing the redundancy in the rows of the
fault table selected for a schedule. Armstrong 11 uses a "path-sensitizing"
concept in the treatment of the class of faults wherein connections become
stuck at logical one or zero, and describes a procedure that can be
applied to larger networks than can be treated by exact methods. Roth 270
has also developed diagnostic algorithms for faults in combinational net-
works. One way of developing a set of tests is to simply simulate the
network on a computer and let the computer generate the failure cases;
this approach is proposed by Seshu and Freeman, 280 who use a decision tree
to successively isolate the fault to a smaller and smaller set. Johnson
considers the cost of a given test and the information gained by it to
develop a figure of merit to help in developing the most efficient test
procedure. 142
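
The simulation approach and the fault-table reduction mentioned above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the network is the hypothetical function y = (a AND b) OR c, the fault class is single stuck-at faults on its five lines, and the test selection is an ordinary greedy covering heuristic rather than any of the cited algorithms.

```python
from itertools import product

LINES = ["a", "b", "c", "g", "y"]  # g is the output of the AND gate

def simulate(a, b, c, fault=None):
    """Evaluate y = (a AND b) OR c with an optional (line, stuck_value) fault."""
    def fix(name, value):
        return fault[1] if fault and fault[0] == name else value
    a, b, c = fix("a", a), fix("b", b), fix("c", c)
    g = fix("g", a & b)
    return fix("y", g | c)

FAULTS = [(line, v) for line in LINES for v in (0, 1)]
TESTS = list(product((0, 1), repeat=3))

# Fault table: for each input pattern, the set of faults it detects.
fault_table = {
    t: {f for f in FAULTS if simulate(*t, fault=f) != simulate(*t)}
    for t in TESTS
}

def greedy_schedule(table):
    """Pick tests until every detectable fault is covered (greedy covering)."""
    remaining = set().union(*table.values())
    schedule = []
    while remaining:
        best = max(table, key=lambda t: len(table[t] & remaining))
        schedule.append(best)
        remaining -= table[best]
    return schedule

schedule = greedy_schedule(fault_table)
```

For this network four of the eight input patterns suffice to detect all ten single faults; a greedy choice is not guaranteed to be minimal in general.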
Kautz 147 has described constructive procedures for the design of net-
works that provide an output to indicate a failure within the network--
that is, an automatic fault detection signal--while Kilmer 159 has discussed
an idealized computer organization composed entirely of circuitry of the
fault-detecting type, which can correct a bounded number of transient
failures that might occur anywhere within the computer.
Thus the area of combinational network diagnostics would appear to
be very well represented, and the formal problems seem to be well under-
stood. There are still problems in large networks, in the handling of
transient errors in general, in the utilization of auxiliary inputs and
outputs, etc., but much comprehensive work has been done and directions
for further research are relatively clear.
Such is not the case for the diagnosis of sequential machines,
however. Some excellent work has been done, but it is still largely
of a theoretical nature and certainly is not practicable for machines
larger than a very few states.
In one very basic approach, the problem is formulated in a fashion
rather analogous to the fault-table representation for combinational net-
works. If the given machine is designated by S, then the class of faults,
k in number say, corresponds to faulty machines S_1, ..., S_k. The obvious
approach, then, is to determine a set of inputs so that the resulting out-
put sequences will distinguish all the S_i from each other, as well as
from the good machine S. This is essentially the formulation proposed by
Gill 103 and by Poage and McCluskey, 247 wherein procedures for developing
appropriate input sequences are presented. It is reiterated that the
procedures rapidly become unwieldy, even with computer aids, beyond a
few states and a small set of faulty machines.
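
The formulation above can be made concrete with a small sketch. It is not the procedure of the cited papers: it assumes Mealy machines given as tables, a known common starting state, and it simply searches breadth-first for the shortest input sequence on which two machines produce different outputs. The three machines are hypothetical.

```python
from collections import deque
from itertools import combinations

def distinguishing_sequence(m1, m2, start, inputs):
    """Shortest input sequence on which two Mealy machines differ in output.

    Each machine maps (state, input) -> (next_state, output). Returns None
    if the machines cannot be told apart from `start`.
    """
    queue = deque([(start, start, ())])
    seen = {(start, start)}
    while queue:
        s1, s2, seq = queue.popleft()
        for x in inputs:
            n1, o1 = m1[(s1, x)]
            n2, o2 = m2[(s2, x)]
            if o1 != o2:
                return seq + (x,)
            if (n1, n2) not in seen:
                seen.add((n1, n2))
                queue.append((n1, n2, seq + (x,)))
    return None

STATES, INPUTS = (0, 1), (0, 1)
# Good machine S: input 1 toggles the state; the output is the new state.
S = {(s, x): (s ^ x, s ^ x) for s in STATES for x in INPUTS}
# Assumed fault S_1: the state flip-flop is stuck at 0.
S1 = {(s, x): (0, 0) for s in STATES for x in INPUTS}
# Assumed fault S_2: the output line is stuck at 1.
S2 = {(s, x): (s ^ x, 1) for s in STATES for x in INPUTS}

machines = [S, S1, S2]
tests = {
    (i, j): distinguishing_sequence(mi, mj, 0, INPUTS)
    for (i, mi), (j, mj) in combinations(enumerate(machines), 2)
}
```

The search runs over pairs of states, so its cost grows with the square of the number of states for each pair of machines, which gives some feeling for why such procedures become unwieldy.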
Hennie 12_ takes a similar approach, related to early work by Moore, 21s
and asks for appropriate "checking experiments" (i.e., input sequences
again) which will allow the determination of whether the assumed state
table for the machine is actually the one being traversed by the machine
being checked. An important consideration is whether the given machine
has a "synchronizing sequence," that is, a sequence which will return it
to a given starting state--not all machines do. Kime 160,161 proposes a
specific testing method based on Hennie's testing philosophy and considers
the problem of designing circuits that have "distinguishing" sequences.
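
Whether a machine has a synchronizing sequence can be decided by a search over sets of states. The following is a minimal sketch of that standard subset search, not the method of the cited papers; the transition tables are hypothetical.

```python
from collections import deque

def synchronizing_sequence(delta, states, inputs):
    """Shortest input sequence that drives every state to one common state.

    `delta` maps (state, input) -> next_state. Returns None if the machine
    has no synchronizing sequence.
    """
    start = frozenset(states)
    queue = deque([(start, ())])
    seen = {start}
    while queue:
        current, seq = queue.popleft()
        if len(current) == 1:
            return seq
        for x in inputs:
            nxt = frozenset(delta[(s, x)] for s in current)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + (x,)))
    return None

# Input "a" merges states toward state 2; input "b" only permutes the states.
delta = {
    (0, "a"): 1, (1, "a"): 2, (2, "a"): 2,
    (0, "b"): 1, (1, "b"): 2, (2, "b"): 0,
}
sequence = synchronizing_sequence(delta, (0, 1, 2), ("a", "b"))

# A machine whose only input permutes the states can never be synchronized.
rotate_only = {(s, "b"): (s + 1) % 3 for s in (0, 1, 2)}
no_sequence = synchronizing_sequence(rotate_only, (0, 1, 2), ("b",))
```

The number of state subsets grows exponentially with the number of states, so this direct search is practical only for small machines.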
Soviet treatment of the problem of diagnosing faults in sequential
networks is also rather skimpy. We mention only Karibskii et al., 146
where Gill's procedure l°s has been successfully generalized and a few
bounds on the lengths of the test schedules are given. However, the
problem is far from solved, and there is much work to do.
b. Spare-Equipment Considerations
There have also been a great many papers concerned with the
statistical aspects of systems with standby parts, under various assump-
tions on the lifetime of such spares, on their number, on their
repairability, and so on. It is convenient to mention, in this same
context, questions concerning maintenance procedures and the effects
that various maintenance policies have on system life, depending on
various assumptions on checking periods.
Weiss 326 has considered the determination of the optimum checkout
interval for a system which does not exhibit the usually assumed expo-
nential failure characteristics. Flehinger 81 provides a treatment of a
number of different probabilistic models corresponding to different
preventive-maintenance policies.
Teoste 299 presents the design of a computer which can continue
operating satisfactorily even while a part of it is being repaired. He
provides a general discussion of machines of this type; the first gross
approach is to consider several computers all operating on the same
inputs, with the output taken from one unit till it fails, then from the
next, and so on; then the system can be broken into smaller parts and
essentially the same process applied to these parts, etc. Gaver 94 also
discusses the time to failure and the availability of redundant systems
in which repair is permitted. He assumes that failures are immediately
identifiable, and again takes primarily a duplex approach. In a later
paper 93 he loosens the assumptions somewhat and analyzes the failure time
of such a system when the two units are not the same--in particular, when
they possess different kinds of statistical failure properties. He
develops explicit formulas and approximations for the mean time to failure,
and shows how these are affected by various repair-time distributions.
Other papers that consider various assumptions on the number and failure
distributions of spares include Refs. 207, 217, 163, 327, and 334.
Rosenheim and Ash 2ss assume that the redundancy in spares extends to
entire machines, rather than just to components or small units.
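
The effect of repair on a duplex system can be seen in the simplest special case, which is only a sketch and much narrower than the analyses cited above. The assumptions are ours: two identical units, both active, each failing at constant rate lam; a failed unit is detected at once and repaired at constant rate mu; the system fails when both units are down. Writing T2 and T1 for the mean time to system failure with two units and one unit working gives T2 = 1/(2 lam) + T1 and T1 = 1/(lam + mu) + (mu/(lam + mu)) T2.

```python
def duplex_mttf(lam, mu):
    """Closed-form mean time to failure: (3*lam + mu) / (2*lam**2)."""
    return (3 * lam + mu) / (2 * lam ** 2)

def duplex_mttf_from_equations(lam, mu):
    """Solve the two mean-time equations by substitution, as a cross-check."""
    a = 1.0 / (lam + mu)   # mean time spent in the one-unit-working state
    b = mu / (lam + mu)    # probability that repair finishes before failure
    c = 1.0 / (2 * lam)    # mean time spent in the two-units-working state
    t1 = (a + b * c) / (1 - b)
    return c + t1

# Hypothetical rates: one failure per 100 hours, one repair per hour.
mttf_with_repair = duplex_mttf(0.01, 1.0)
mttf_without_repair = duplex_mttf(0.01, 0.0)
```

With these illustrative rates the mean time to failure rises from 150 hours without repair to 5150 hours with repair; other repair-time distributions, as treated in the cited papers, would change the formula.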
Flehinger 82 has presented a comparison between, on the one hand, the
situation where the redundancy is applied over complete independent
machines, and on the other hand, the provision of spares on the basis
of much smaller units. She calculates the resulting reliability in each
case as a function of the number of units in the nonredundant machine
and of the degree of replication.
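
The comparison can be sketched with the textbook formulas for active parallel redundancy. These are an assumption on our part and not necessarily the models of the cited paper: a nonredundant machine consists of n units in series, each with reliability r, failures are independent, switching is ideal, and the degree of replication is m.

```python
def system_level(r, n, m):
    """m complete machines in parallel, each a series chain of n units."""
    return 1 - (1 - r ** n) ** m

def unit_level(r, n, m):
    """n stages in series, each stage being m copies of one unit in parallel."""
    return (1 - (1 - r) ** m) ** n

# Hypothetical values: ten units of reliability 0.9, duplicated.
whole_machines = system_level(0.9, 10, 2)
small_units = unit_level(0.9, 10, 2)
```

Under these idealized assumptions replication at the level of small units always does at least as well, and the advantage grows with n; imperfect switching, which is ignored here, works against the finer subdivision.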
Other special assumptions are reported by Sinitsa, 2ss who includes
the possibility of statistical dependence between the active and spare
components of a system, and by Muth, 225 who assumes limited repair capabil-
ities and plots expected time to failure as a function of the capability
for one repair, two repairs, and so on. Nerber 22s and Buckley 35 both
examine the necessity of including failure rates for the idle standbys,
even in the power-off mode, and including the problem of detecting whether
a turned-off part has failed.
The Soviets have also been fairly aggressive in examining these
areas. General but fairly conventional formulations, concerned with
determining the probability of failure-free operation of systems which
are operable until a given number of failures have occurred, seem to be
well represented.lS4,22s,2ss, TM These are generally concerned with the
calculation of the mean time to failure under different probability
assumptions. Closely related to these papers is the work of Shcherbakov 282
who investigates the characteristics of the "up-time" distribution of
systems under various assumptions on the time for repair once a break-
down occurs. Also worth noting in this regard is Zhozhikashvili and
Raikin's ss7 extension to include the consideration of systems in which
there is an elemental amount of self-diagnostic capability included in
order to signal the loss of some fault-masking capabilities as a system
progressively deteriorates. On the other hand, Malev 190 assumes that
maintenance procedures are instituted periodically and calculates the
average period of correct operation under this regimen, while the effects
of incorporating both standby-reserve equipment and a maintenance-and-
repair schedule are considered by Zubova TM and by Raikin et al. 2ss
Zubova assumes duplicate systems, with one in idle standby, under
different assumptions on failure times for each system, and different
times within which a system is assumed to have fully recovered.
Similar studies of standby systems have been reported by Gnedenko 107, 108
and others. 23, 102, 2s7 The possibility of using one of the computers in
an interconnected system to check out the other for faults automatically
has been recognized, and programs for accomplishing this possibility have
been developed. TM
In a sequence of papers Raikin, 255,2se,259 Kel'mans, 155 Smolitskii
and Chukreev, 290 and Alekseev and Yakushev 2 have studied the problem of
determining the optimal number of spare modules of each type in a modular
system, taking into account the "cost" of each type of spare module, the
probability of failure of each module, and whether or not the module may
fail while it is a spare. Several algorithms for the optimization are
discussed in these papers. Gertsbakh 102 and Barzilovich 23 have
investigated the same type of system, under the condition of minimizing the
average loss which occurs between the times of fault occurrence and fault
repair, and algorithms are derived for optimal maintenance based upon
this condition. Raikin 257 has extended his results on the average number
of standby modules that remain available as a function of time.
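
The flavor of the spare-allocation problem can be conveyed by a simple sketch. It is not one of the algorithms of the cited papers: it assumes one module type per position in a series system, spares that are active (and so can fail while they are spares), independent failures, and a fixed cost budget, and it uses a greedy rule that is a heuristic rather than a guaranteed optimum.

```python
import math

def type_reliability(r, spares):
    """Probability that at least one of 1 + spares independent units works."""
    return 1 - (1 - r) ** (1 + spares)

def system_reliability(r, spares):
    """Series system: every module type must have a working unit."""
    return math.prod(type_reliability(ri, si) for ri, si in zip(r, spares))

def allocate_spares(r, cost, budget):
    """Greedy heuristic: keep buying the spare with the best gain per cost."""
    spares = [0] * len(r)
    while True:
        best, best_gain = None, 0.0
        for i in range(len(r)):
            if cost[i] > budget:
                continue
            gain = (math.log(type_reliability(r[i], spares[i] + 1))
                    - math.log(type_reliability(r[i], spares[i]))) / cost[i]
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            return spares
        spares[best] += 1
        budget -= cost[best]

# Hypothetical module reliabilities, spare costs, and budget.
r = [0.9, 0.8, 0.95]
cost = [1, 2, 1]
spares = allocate_spares(r, cost, budget=4)
```

The gain is measured on the logarithm of the system reliability so that the contributions of the different module types add.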
Korman 169 has addressed himself to the inverse of the problem of
determining the optimal reserve for a modular system; in particular he
assumes a given ratio of spare modules and determines the necessary re-
liability characteristics of the module itself in order to achieve a
specified overall system probability of trouble-free operation.
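
The inverse problem can be illustrated under the same idealized model, which is our assumption and not necessarily the one used in the cited paper: n positions in series, each filled by k identical modules in active parallel, with independent failures. The system reliability is then (1 - (1 - r)^k)^n, and the required module reliability r follows by solving for it.

```python
def required_module_reliability(target, n_positions, units_per_position):
    """Smallest module reliability r with (1 - (1 - r)**k)**n >= target."""
    per_position = target ** (1.0 / n_positions)
    return 1 - (1 - per_position) ** (1.0 / units_per_position)

# Hypothetical requirement: 0.99 overall, ten positions, two modules each.
r_needed = required_module_reliability(0.99, 10, 2)
achieved = (1 - (1 - r_needed) ** 2) ** 10
```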
The possibility of using memories which are segmented into units is
an important aspect of complex reconfigurable systems. In this regard,
we note that in studies using the URAL-I and the M-20 computc[_ .7° it
has been concluded that the use of substitute, switchable memory spares
is feasible, even under the assumption that the switching system itself
is no more reliable than the balance of the system. In a related approach,
a duplexed memory system has been reported TM which may be operated in
either the duplex mode--with both memories being read simultaneously and
updated by the particular section being used--or in a simplex mode. In
the simplex mode, both memories are actively used and the choice of mode
is determined by the nature of the mission.
Cosgrove and Masters, 50 and other workers at Westinghouse TM have
considered an interesting self-repair variation on the conventional
multiple-line voting scheme: they make provision for the shifting
around of blocks of circuitry as a function of the failure history. As
failures occur and leave certain networks more vulnerable than others to
succeeding failures, circuit blocks are switched about in order to in-
crease the reliability of the more vulnerable networks. In computer
simulation programs they produce results that show that only a modest
amount of switching is necessary to produce significant reliability
gains.
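
A highly idealized sketch shows why shifting blocks about can help. The assumptions are ours and far stronger than in the cited work: every network is a three-block majority-voted unit needing at least two good blocks, all blocks are interchangeable, and switching is perfect and free, so the second function is an upper bound on what switching could achieve.

```python
def survives_static(failed_per_network):
    """Fixed triplets: each network needs at least 2 of its own 3 blocks."""
    return all(f <= 1 for f in failed_per_network)

def survives_with_switching(failed_per_network):
    """Pooled blocks: good blocks may be switched to any network."""
    networks = len(failed_per_network)
    working = 3 * networks - sum(failed_per_network)
    return working >= 2 * networks

# Two failures in the same network: fatal for fixed triplets, but one good
# block borrowed from a healthy network restores a voting majority.
history = [2, 0, 0]
static_ok = survives_static(history)
switched_ok = survives_with_switching(history)
```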
Goldberg 109 has developed several network schemes that use both
static-masking techniques and adaptive-replacement techniques to augment
the reliability of multiple-line systems. (See also Sec. II-A-2-b of
the body of this report.)
c. System Organizations that Facilitate Self-Repair
There is also a growing number of papers concerned with the organi-
zation of entire systems that optimize the capability for self-diagnosis
and repair, reconfigurability, or modularity. For example, Terris and
Melkanoff 303 describe a self-repairing system with a switching mechanism
for the replacement of failed circuitry which has been automatically
diagnosed. They utilize a "master machine" to aid in the control functions,
and find that a thirty-percent increase in required equipment results in a
fourfold increase in the mean lifetime between failures.
Landers 174 attempts to categorize self-repairing systems in a
general way, notes that very little reduction to practice has been made
of the many concepts that have been formulated, and offers the debatable
opinion that the difficulties stem from the fact that such concepts border
on self-reproducing systems--and hence share their fundamental difficulties.
Doyle 65 describes a program that is used to enable a digital system to
repair itself, and in application to the SAGE system it is asserted that
the system provided automatic recovery from over 90 percent of the failures
occurring during the period of study. Kruus 171 has also studied self-
repairing systems composed of a number of identical machines, spare parts,
and the necessary interconnecting mechanisms. He has simulated machines
operating in parallel, with faults being detected by observing a difference
in outputs. He also notes the problem of the initial setting of a newly
substituted machine in order that its outputs agree with the outputs of
the operating machines.
Agnew et al. 1 have presented the design of a hypothetical aerospace
computer; they use a partitioning technique to determine the appropriate
diagnostic subsystems. Interestingly enough, they conclude that diagnosis
and self-repair (i.e., dynamic redundancy) are not sufficient for maximum
system availability, but that static redundancy techniques must be used
also in order to achieve the optimum configuration. Avizienis 14 has pre-
sented complete processor organizations using his codes for continuous
generation of real-time diagnostic information in order to initiate repair,
replacement, or reorganization of the system.
Forbes et al. 84 describe the organization of a computer specifically
designed to maximize up-time through self-diagnostic routines, but consider
only manual replacement of faulty modules. Manning 197 has reported on an
extension of earlier work 198 on the self-diagnosis problems in a single-
processor machine to those in a large multiprocessor machine. Joseph 143
also reports on a specific multiprocessor configuration, and England 74
describes a space-guidance computer configuration that exhibits both
static and dynamic redundancy techniques at different hierarchical levels.
6. Peripheral Considerations
Redundancy techniques can be applied to systems other than computing
networks: for example, to power supplies, mechanical systems, etc.
There have also been a number of papers concerned with the other environ-
mental aspects exhibited by redundant systems: the increased volume and
weight requirement, the adjustment of loads within a system when fault-
masked components fail, and so on. These problem areas were not directly
attacked in developing this survey, but several relevant articles can be
noted that were accumulated while searching for articles pertinent to the
main body of interest.
It has already been noted that redundancy techniques can be trans-
formed directly to application in continuous systems. 17s Raikin has
investigated the problem of redistribution of loads in a series or
parallel-connected system; if an element fails during operation the re-
maining elements must take over its load or voltage, and specific reli-
ability formulas are derived for given dependencies of the damage incurred.
Herron 12s considers weight as a constraint in maximizing reliability,
and shows how his method can be extended to include other environmental
constraints as well.
Finally, Paynter and Mathis TM consider the problem of redundancy
in power-supply design; they treat replication of entire units, as well
as component redundancy within units. The important point is made that
while a component failure in a signal-processing unit need not be cata-
strophic, a failure in a power supply usually is--hence, under the
philosophy that the weakest unit should be strengthened first by the use
of redundant structures, the attainment of power-supply reliability is
extremely important. Techniques for this purpose are considered in
Appendix B of this report.
REFERENCES
1. Agnew, P. W., D. H. Rutherford, R. J. Suhocki, and C. M. Yen, "An Architectural
Study for a Self Repairing Computer," Final Technical Documentary Report No.
SSD-TR-65-159, U.S. Air Force, Space Systems Division, Air Force Systems Command
(November 1965), (AD 474 976).
2. Alekseyev, O. G., and V. L. Yakushev, "Algorithm for Optimum Reserve (Redundancy)
of an Apparatus," Engr. Cyb., No. 3, pp. 55-61 (1964).
3. Alelyunas, Paul, "Checkout: Man's Changing Role," Space/Aeronautics, Vol. 44, No. 7,
pp. 66-73 (December 1965).
4. Alexander, S., "Application of Boolean Notation to the Maintenance of Switching
Circuits," Electrical Eng., Vol. 33, No. 6, pp. 372-374 (June 1961).
5. Allen, C., "Design of Digital Memories That Tolerate All Classes of Defects,"
SEL Technical Report No. 4662-1, Stanford Electronics Laboratories, Stanford,
California (May 1966).
6. Amarel, S., and J. A. Brzozowski, "Theoretical Considerations on Reliability
Properties of Recursive Triangular Switching Networks," Redundancy Techniques for
Computing Systems, R. H. Wilcox and W. C. Mann, eds. (Spartan Books, 1962).
7. Amarel, S., G. Cooke, and R. O. Winder, "Majority Gate Networks," IEEE Trans.
on Electronic Computers, Vol. EC-13, No. 1, pp. 4-13 (February 1964).
8. Angell, J. B., "The Need and Means for Fault Detection in Redundant Systems,"
Working Paper presented at the Workshop on the Organization of Reliable Automata,
Pacific Palisades, California (2-4 February 1966).
9. Angell, J. B., "The Need and Means for Self-Repairing Circuits," IEEE Int. Conv.
Record, Part 2, pp. 193-199 (March 1963).
10. Armstrong, D. B., "A General Method of Applying Error Correction to Synchronous
Digital Systems," Bell System Tech. J., Vol. 40, No. 2, pp. 577-593 (March 1961).
11. Armstrong, D. B., "On Finding a Nearly Minimal Set of Fault Detection Tests for
Combinational Logic Nets," IEEE Trans. on Elec. Computers, Vol. EC-15, No. 1,
pp. 66-73 (February 1966).
12. Aroian, L. A., "The Reliability of Serial Systems and Redundant Systems," Proc.
10th Ann. Symp. on Reliability and Quality Control, pp. 174-185 (January 1964).
13. Avizienis, A., "Detection and Correction of Failure in Digital Arithmetic Units,"
Space Programs Summary 37-25, Vol. IV, pp. 21-24, Jet Propulsion Laboratory,
Pasadena, California (February 1964).
14. Avizienis, A., "A Diagnosable Arithmetic Processor," Working paper presented at
the Workshop on the Organization of Reliable Automata, Pacific Palisades, California
(2-4 February 1966).
15. Avizienis, A., "A Set of Algorithms for a Diagnosable Arithmetic Unit," Tech.
Report No. 32-546, Jet Propulsion Laboratory, Pasadena, California (1964).
16. Baechler, D. O., and D. E. Van Tijn, "Computer Reliability Study," Final Report
No. 240011374, Arinc Research Corp. (July 1963).
17. Baer, J. A., and C. H. Heckler, Jr., "Research on Reliable and Radiation Insen-
sitive Pulse Drive Sources for All-Magnetic Logic Systems," Final Report, Contract
950104 under NASw-6, SRI Project 3729, Stanford Research Institute, Menlo Park,
California (June 1962).
18. Balaban, H. S., "A Selected Bibliography on Reliability," IRE Trans. on Reliability
and Quality Control, Vol. RQC-11, pp. 86-103 (July 1962).
19. Barker, R. C., "Computer Magnetics--A Selective Review of Periodical Literature,"
Electro-Technology, Vol. 69, No. 3, pp. 74, 75, 140 (March 1962).
20. Barlow, R. E., and L. C. Hunter, "Criteria for Determining Optimum Redundancy,"
IRE Trans. on Reliability and Quality Control, Vol. 9, pp. 73-77 (April 1960).
21. Barlow, R. E., and F. Proschan, Mathematical Theory of Reliability (John Wiley &
Sons, New York, N.Y., 1965).
22. Barlow, R. E., L. C. Hunter, and F. Proschan, "Optimum Redundancy when Components
are Subject to Two Kinds of Failure," J. SIAM, Vol. 9, No. 1 pp. 64-73 (March 1963).
23. Barzilovich, Ye. Yu., "Determination of Optimum Periods of Preventative Maintenance
in Automatic Systems," Engr. Cyb., No. 3, pp. 32-39 (1964).
24. Benes, V. E., Mathematical Theory of Connecting Networks and Telephone Traffic
(Academic Press, New York, N.Y., 1965).
25. Benes, V. E., "Permutation Groups, Complexes, and Rearrangeable Connecting Networks,"
Bell System Tech. J., Vol. 43, No. 4, part 2, pp. 1619-1656 (July 1964).
26. Bennion, D. R., and H. D. Crane, "All-Magnetic Circuit Techniques," Advances in
Computers, Vol. 4, pp. 53-133 (Academic Press, New York, N.Y. 1963).
27. Bennion, D. R., H. D. Crane, and D. C. Engelbart, "A Bibliographical Sketch of All-
Magnetic Logic Schemes," IRE Trans. on Electronic Computers, Vol. EC-10, No. 2,
pp. 203-206 (June 1961).
28. Bernstein, A. J., and W. H. Kim, "Linear Codes for Single-Error Correction in
Symmetric and Asymmetric Computational Processes," IRE Trans. on Information Theory,
Vol. IT-8, No. 1, pp. 29-34 (January 1962).
29. Birnbaum, Z. W., J. D. Esary, and S. C. Sanders, "Multicomponent Systems and
Structures and Their Reliability," Technometrics, Vol. 3, No. 1, pp. 55-77
(February 1961).
30. Bonn, T. H., "Magnetic Computer Has High Speed," Electronics, Vol. 30, No. 8, pp.
156-160 (August 1957).
31. Brown, D. T., "Error Detecting and Correcting Binary Codes for Arithmetic Opera-
tions," IRE Trans. on Elec. Computers, Vol. EC-9, No. 3, pp. 333-337 (September
1960).
32. Brown, J. N., and E. E. Newhall, "The Storage and the Gating of Information Using
Balanced Magnetic Circuits," Proc. of INTERMAG Conf., pp. 11-1-1 to 11-1-5
(April 1964).
33. Brown, W. G., J. Tierney, and R. Wasserman, "Improvement of Electronic Computer
Reliability Through the Use of Redundancy," IRE Trans. on Elec. Computers,
Vol. EC-10, No. 3, pp. 407-416 (September 1961).
34. Brule, J. D., R. A. Johnson, and E. J. Kletsky, "Diagnosis of Equipment Failures,"
IRE Trans. on Reliability and Quality Control, Vol. RQC-9, pp. 23-34 (April 1960).
35. Buckley, J. J., "The Non-Operating Failure Rate--a Valuable Reliability Tool,"
Proc. Eleventh Natl. Sym. on Reliability and Quality Control, pp. 434-438,
(January 1965).
36. Bukharaev, R. G., "Computation of Contact-Circuit Reliability," Automation and
Remote Control, Vol. 25, No. 8, pp. 1085-1090 (August 1964).
37. Burt, M. W., and D. C. James, "How Much Does Redundancy Improve Reliability?"
Control Engineering, Vol. 10, No. 6, pp. 71-76 (June 1963).
38. Buzzell, G., W. Nutting, and R. Wasserman, "Majority Gate Logic Improves Digital
System Reliability," IRE Natl. Conv. Record, Pt. 2, pp. 264-270 (1961).
39. Campbell, R. L., and W. Thomis, "A New Approach to System Maintenance," Bell
Laboratories Record, pp. 251-255 (June 1965).
40. Chang, H. Y., "An Algorithm for Selecting an Optimum Set of Diagnostic Tests,"
IEEE Trans. on Elec. Computers, Vol. EC-14, No. 5, pp. 705-711 (October 1965).
41. Chegis, I. A., and S. V. Yablonskii, "Logical Methods for Controlling the Opera-
tion of Electrical Networks," Trudy Matematicheskvo Instituta imeni V. A. Steklova,
Vol. 51, pp. 270-360 (1958). Trans: No. 60-41069, Office of Technical Services,
U.S. Dept. of Commerce, Washington, D.C. (1960).
42. Clift, R. A., "Power Switching Trims Digital System Weight, Cost," Electronics,
Vol. 39, No. 12, pp. 135-138 (June 1966).
43. Coates, C. L., and P. M. Lewis II, "A Threshold Gate Computer," IEEE Trans. on
Elec. Computers, Vol. EC-13, No. 3, pp. 240-247 (June 1964).
44. Cobham, A., R. Fridshal, and J. H. North, "An Application of Linear Programming
to the Minimization of Boolean Functions," Proc. 2nd Annual Symp. Switching
Circuit Theory and Logical Design, AIEE Special Publication S-134, pp. 3-9
(September 1961).
45. Cochran, W. G., and G. M. Cox, Experimental Designs (John Wiley & Sons, New York,
N.Y., 1950).
46. Cohen, J. J., and L. A. Whitaker, "An Approach to Diagnostic Programming," 3rd.
Annual Meeting ACM, 23-25 August 1960.
47. Cohn, M., "Redundancy in Complex Computers," 1956 IRE National Conference Proc.
on Aeronautical Elec., pp. 231-235 (May 1956).
48. Condon, D. C., "Atomic Reactor Control System," Final Report Contract N259-1577,
SRI Project 4529, Stanford Research Institute, Menlo Park, California
(October 1963).
49. Corliss, William R., Space Probes and Planetary Exploration (D. Van Nostrand Co.,
Inc., Princeton, N. J., 1965).
50. Cosgrove, M. R., and C. G. Masters, "Self-Repair Techniques for Failure-Free
Systems," Special Technical Report 2, Westinghouse Elect. Corp. (September 1963).
51. Crane, H. D., "Design of an All-Magnetic Computing System: Part II-Logical Design,"
IRE Trans. on Electronic Computers, Vol. EC-10, No. 2, pp. 222-232 (June 1961).
52. Crane, H. D., and E. K. Van De Reit, "Design of an All-Magnetic Computing System:
Part I-Circuit Design," IRE Trans. on Electronic Computers, Vol. EC-10, No. 2,
pp. 207-220 (June 1961).
53. Creasey, D. J., "Redundancy Techniques for Use in an Air Traffic Control Computer,"
in Microelectronics and Reliability, Vol. 3 (Pergamon Press, MacMillan Co.,
New York, N.Y., 1964).
54. Creveling, C. J., "Increasing the Reliability of Electronic Equipment by the Use
of Redundant Circuits," Proc. IRE, Vol. 44, No. 4, pp. 509-15 (April 1956).
55. Daher, P. R., "Automatic Correction of Multiple Errors Originating in a Computer
Memory," IBM J. Res. Dev., Vol. 7, No. 4, pp. 317-324 (October 1963).
56. Dauber, P. S., "An Analysis of Error in Finite Automata," Inf. and Control,
Vol. 8, No. 3, pp. 295-303 (June 1965).
57. Davis, R. A., "A Checking Arithmetic Unit," Proc. of the Fall Joint Computer
Conference, Vol. 27, Part 1, pp. 705-713 (1965).
58. De Pian, L., and N. T. Grisamore, "Reliability Using Redundancy Concepts,"
IRE Trans. on Reliability and Quality Control, Vol. RQC-9, pp. 53-60 (April 1960).
59. De Pian, L., and N. T. Grisamore, "Two Approaches to Incorporating Redundancy into
Logical Design," Redundancy Techniques for Computing Systems, R. H. Wilcox and
W. C. Mann, eds. (Spartan Books, Washington, D.C., 1962).
60. Diamond, J. M., "Checking Codes for Digital Computers," Proc. IRE, Vol. 43, No. 4,
pp. 487-488 (April 1955).
61. Dickinson, M. M., J. B. Jackson, and G. C. Randa, "Saturn V Launch Vehicle Digital
Computer and Data Adapter," Proc. of the Fall Joint Computer Conference, pp. 501-
516 (1964).
62. Dickinson, W. E., and R. M. Walker, "Reliability Improvement by the Use of
Multiple-Element Switching Circuits," IBM J. Res. Dev., Vol. 2, No. 2, pp. 142-147
(April 1958).
63. Domanitskii, S. M., and I. V. Prangishvili, "Reliable Logic Elements and Output
Amplifiers with Redundant Structure," Automation and Remote Control, Vol. 25, No. 4
pp. 511-515 (April 1964).
64. Domanitskii, S. M., and I. V. Prangishvili, "An Accurate Design Method for Semi-
conductor Switching Circuits Used in Industrial Automation," Automation and Remote
Control, Vol. 24, No. 5, pp. 606-613 (1963).
65. Doyle, R. H., R. A. Meyer, and R. P. Pedowitz, "Automatic Failure Recovery in a
Digital Data Processing System," IBM J. Res. Dev., Vol. 3, No. 1, pp. 2-12
(January 1959).
66. Drenick, R. F., "The Failure Law of Complex Equipment," J. Soc. Indust. Appl. Math.,
Vol. 8, No. 4, pp. 680-690 (December 1960).
67. Drenick, R. F., "Mathematical Aspects of the Reliability Problem," J. Soc. Indust.
Appl. Math., Vol. 8, No. 1, pp. 125-149 (March 1960).
68. Dreyfus, P., "Programming Design Features of the Gamma 60 Computer," Proc. of the
Eastern Joint Computer Conference (December 1958).
69. Dunning, M., B. Kolman, and L. Steinberg, "Reliability and Fault Masking in n-Variable
NOR Trees," Proc. of the 6th Annual Symposium on Switching Circuit Theory and Logical
Design, IEEE Publication 16c13, pp. 126-142 (October 1965).
70. Eckert, J. P., J. C. Chu, A. B. Tonik, and W. F. Schmitt, "Design of UNIVAC LARC
System: I," Proc. of the Eastern Joint Computer Conference, pp. 59-65 (1959).
71. Elias, P., "Computation in the Presence of Noise," IBM J. Res. Dev., Vol. 2, No. 4,
pp. 346-353 (October 1958).
72. Elspas, B., "Design and Instrumentation of Error-Correcting Codes," Final Report,
Contract AF 30(602)-2327, SRI Project 3318 Stanford Research Institute, Menlo Park,
California (October 1962).
73. Elspas, B., et al., "Investigation of Propagation Limited Computer Networks," Final
Report, SRI Project 4523, pp. 68-75, Stanford Research Institute, Menlo Park,
California (April 1964) AD-603 165.
74. England, W. A., "Improving Reliability by the Practical Application of Selected
Redundant Techniques," Working paper presented at the Workshop for the Organi-
zation of Reliable Automata, Pacific Palisades, California (2-4 February 1966).
75. Ergott, H. L., and D. P. Rozenberg, "On the Analysis of Reliability Improvement
through Redundancy," Report No. 62-825-494, IBM Federal Systems Div., Oswego, New
York (1964).
76. Esary, J. D., and F. Proschan, "Coherent Structures of Non-Identical Components,"
Technometrics, Vol. 5, pp. 191-209 (May 1963).
77. Esary, J. D., and F. Proschan, "Relationship between System Failure Rates and
Component Failure Rates," Technometrics, Vol. 5, pp. 183-189 (May 1963).
78. Esary, J. D., and F. Proschan, "The Reliability of Coherent Systems," in Redundancy
Techniques for Computing Systems (Spartan Books, Washington, D.C., 1962).
79. Fakharov, V. V., "An Overall Diagnostic Routine for the 'URAL-I' Computer,"
Dokl. (3rd) Sibirsk, Konferentsii po Matem. i Mekhan Tomsk, p. 267 (1964).
80. Farrell, Edward J., "Improving the Reliability of Digital Devices with Redundancy;
An Application of Decision Theory," IRE Trans. on Reliability and Quality Control,
Vol. RQC-11, No. 1, pp. 44-50 (May 1962).
Fedderson, A. P.--see item 350 of this Bibliography.
81. Flehinger, B. J., "A General Model for the Reliability Analysis of Systems Under
Various Preventive Maintenance Policies," Ann. Math. Statistics, pp. 137-156
(March 1962).
82. Flehinger, B. J., "Reliability Improvement Through Redundancy at Various System
Levels," IBM J. Res. Dev., pp. 148-158 (April 1958).
83. Fleyshman, B. S., "The Statistical Theory of Reliable Functioning in Finite
Automata During Interferences," in Relay Systems and Finite Automata (Burroughs
Corp., Paoli, Pa., 1964).
84. Forbes, R. E., D. H. Rutherford, C. B. Streglitz, and L. H. Tung, "A Self-Diagnosable
Computer," Proc. of the Fall Joint Computer Conference, Vol. 27, Part 1, pp. 1073-
1086 (1965).
85. Frank, H., and S. S. Yau, "Improving Reliability of a Sequential Machine by Error-
Correcting State Assignments," IEEE Trans. on Electronic Computers, Vol. EC-15,
No. 1, pp. 111-113 (February 1966).
86. Frankel, S. P., "On the Minimum Logical Complexity Required for a General Purpose
Computer," IRE Trans. on Electronic Computers, Vol. EC-7, No. 4, pp. 282-285
(December 1958).
87. Galey, J. M., R. E. Norby, and J. P. Roth, "Techniques for the Diagnosis of Switching
Circuit Failures," IEEE Trans. on Comm. and Electr., Vol. 83, No. 74, pp. 509-514
(September 1964).
88. Gallager, R. G., "Low-Density Parity-Check Codes," IRE Trans. on Information Theory,
Vol. IT-8, No. 1, pp. 21-28 (January 1962).
89. Garner, H. L., "Error Codes for Arithmetic Operations," working paper presented
at the Workshop on the Organization of Reliable Automata, Pacific Palisades,
California (2-4 February 1966).
90. Garner, H. L., "Generalized Parity Checking," IRE Trans. on Electronic Computers,
Vol. EC-8, No. 1, pp. 25-30 (March 1959).
91. Garner, H. L., "The Residue Number System," IRE Trans. on Electronic Computers,
Vol. EC-8, No. 2, pp. 140-147 (June 1959).
92. Garner, H. L., et al., "A Study of Iterative Circuit Computers," Tech. Doc.
Report AL-TDR-64-24, Air Force Avionics Laboratory, Research and Technology
Division, Air Force Systems Command, Wright-Patterson Air Force Base, Ohio
(April 1964) AD-601 212.
93. Gaver, D. P., "Failure Time for a Redundant Repairable System of Two Dissimilar
Elements," IEEE Trans. on Reliability and Quality Control, pp. 14-22 (March 1964).
94. Gaver, D. P., "Time to Failure and Availability of Paralleled System with Repair,"
IEEE Trans. on Reliability and Quality Control, p. 30 (June 1963).
95. Gavrilov, M. A., "Structural Redundancy and Reliability in the Operation of Relay
Circuits," Izdatelstvo Akademii Nauk, SSSR (Moscow) pp. 16 (1960).
96. Gavrilov, M. A., "Structural Redundancy and Reliability of Relay Circuits,"
Automatic and Remote Control Proceedings of the First International Congress of the
International Federation of Automatic Control (IFAC) Moscow, 1960, Vol. 2, pp. 838-844
(Butterworths, London, 1961).
97. Gavrilov, M. A., "The Synthesis of Remote Control Signals by Means of the Combin-
ational Use of Pulse Modes," Automation and Remote Control, Vol. 17, No. 12 (1956).
98. Gavrilov, M. A., et al., "The Realization of Digital Corrector Networks," Automation
Express, Vol. 1, No. 8, pp. 10-12 (1959); Soviet Phys.--Doklady, Vol. 3, No. 6,
pp. 1102-1105 (1958).
99. Gendler, M. B., "Analysis of Conditions of Reliable Realizability of Threshold
Functions Using Real Threshold Elements," Engineering Cybernetics, Vol. 1, pp. 29-38
(1965).
100. Germeroth, J. H., "Casting Out Three in Binary Numbers," IRE Trans. on Electronic
Computers, Vol. EC-9, No. 3, p. 373 (September 1960).
101. Gerrand, F. and H. S. Rasmussen, "Self-Correction in Large-Scale Digital Computers,"
Proc. 7th Intl. Symp. on Reliability and Quality Control, Philadelphia, pp. 351-359
(9-11 January 1961).
102. Gertsbakh, I. B., "Optimum Rule for Maintenance of a System with Many States,"
Engineering Cybernetics (1964).
103. Gill, A., Introduction to the Theory of Finite-State Machines (McGraw-Hill Book Co.,
New York, N.Y., 1962).
104. Gill, A., "Minimum-Scan Pattern Recognition," IRE Trans. on Information Theory,
Vol. IT-5, No. 2, pp. 52-58 (June 1959).
105. Gimpel, J. F., "A Reduction Technique for Prime Implicant Tables," IEEE Trans. on
Electronic Computers, Vol. EC-14, No. 4, pp. 535-541 (August 1965).
106. Glagolev, V. V., "Formulation of Tests for Iterative Networks," Soviet Phys.--Doklady,
Vol. 7, No. 6, pp. 480-482 (1962).
107. Gnedenko, B. V., "Duplication with Repair," Engr. Cyb., Vol. 5, pp. 102-108 (1964).
108. Gnedenko, B. V., "Idle Duplication," Engr. Cyb., Vol. 4, pp. 1-9 (1964).
109. Goldberg, J., "Network Schemes for Combined Fault Masking and Replacement," Working
paper presented at the Workshop on the Organization of Reliable Automata, Pacific
Palisades, California (2-4 February 1966).
110. Goldberg, J., J. A. Baer, and R. C. Minnick, "Development of Techniques for
Improving the Reliability of Digital Systems Through Logical Redundancy," Final
Report, Phase III, Stanford Research Institute, Menlo Park, California
(August 1963).
111. Goldberg, J., R. C. Minnick, and W. H. Kautz, "Development of Techniques for
Improving the Reliability of Digital Systems Through Logical Redundancy," Final
Report, Phase II, Stanford Research Institute, Menlo Park, California (May 1962).
112. Grasselli, A., "The Design of Program-Modifiable Microprogrammed Control Units,"
IRE Trans. on Electronic Computers, Vol. EC-11, No. 3 (June 1962).
113. Green, M. W., "High Speed Components for Digital Computers," Final Report, Part A,
Contract AF 33(616)-5804, SRI Project 2548, Stanford Research Institute, Menlo
Park, California (December 1959).
114. Gunderson, D. C., C. W. Hastings, and G. J. Penn, "Spaceborne Memory Organization,"
Interim Report, Contract NAS 12-38, Honeywell Systems & Research Division,
Minneapolis, Minnesota (15 December 1965).
115. Gurzi, K. J., "Estimates for Best Placement of Voters in a Triplicated Network,"
IEEE Trans. on Electronic Computers, Vol. EC-14, No. 5, pp. 711-717 (October 1965).
116. Haavind, R., "Reaching for the Moon--and Beyond," Electronic Design, Vol. 14, No. 7,
pp. 30-37 (29 March 1966).
117. Hamming, R. W., "Error Detecting and Error Correcting Codes," Bell System Tech. J.,
Vol. 29, pp. 147-160 (1950).
118. Harrison, M. A., Introduction to Switching and Automata Theory (McGraw-Hill Book
Co., New York, N.Y., 1965).
119. Harrison, M. A., "On the Error-Correcting Capacity of Finite Automata," Inf. &
Control, pp. 430-450 (August 1965).
120. Haynes, J. L., "Logic Circuits Using Square-Loop Magnetic Devices: A Survey,"
IRE Trans. on Electronic Computers, Vol. EC-10, No. 2, pp. 191-203 (June 1961).
121. Haynes, J. L., and R. C. Minnick, "Magnetic Core Access Switches," Tech. Supplement
to RADC-TR-61-117A, Stanford Research Institute, Menlo Park, California (May 1961).
122. Heckler, C. H., Jr., and J. A. Baer, "PCM Telemetry: A New Approach Using All-Magnetic
Techniques," Phase I Report, Contract NAS1-3380, SRI Project 4687, Stanford Research
Institute, Menlo Park, California (June 1964). Also issued as NASA CR-229 (May 1965).
123. Heckler, C. H., Jr., and J. A. Baer, "Feasibility Breadboard of an All-Magnetic PCM
Telemetry System," Phase II Report, Contract NAS1-3380, SRI Project 4687, Stanford
Research Institute, Menlo Park, California (June 1965).
124. Heckler, C. H., Jr., and J. A. Baer, "All-Magnetic PCM Telemetry: A Review of the
System Breadboard," Phase III Report, Contract NAS1-3380, SRI Project 4687,
Stanford Research Institute, Menlo Park, California (October 1965). Also issued
as NASA CR-435 (April 1966).
125. Henderson, D. S., Logical Design for Arithmetic Units, Ph.D. Thesis (Harvard
University, Cambridge, Mass.; May 1960).
126. Henderson, D. S., "Residue Class Error Checking Codes," Paper presented at 16th
Annual Meeting of the ACM (1961).
127. Hennie, F. C., "Fault-Detecting Experiments for Sequential Circuits," Proc. of the
5th Annual Symp. on Switching Circuit Theory and Logical Design, Princeton,
pp. 95-110 (October 1964).
128. Herron, D. P., "Optimizing Trade-offs of Reliability vs. Weight," IEEE Trans. on
Reliability, Vol. R-12, pp. 50-54 (December 1963).
129. Hersperger, S., and C. Caudill, "Apollo PCM Subsystems Come of Age," Missiles &
Rockets, pp. 36-39 (28 February 1966).
130. Herwald, S. W., "Evaluation and Reliability," Elect. Eng., Vol. 81, No. 8, pp. 614-618
(August 1962).
131. Herwald, S. W., "Reliability as a Parameter in the Systems Concept," in Systems:
Research and Design, by D. P. Eckman, ed. (Proc. of the First Systems Symposium at
Case Institute of Technology; John Wiley & Sons, New York, N.Y., 1961).
132. Hill, J., "Failsafe Circuits," Report 120, Digital Computer Laboratory, University of
Illinois, Urbana, Illinois (June 1962).
133. Holland, J., "A Universal Computer Capable of Executing an Arbitrary Number of
Subprograms Simultaneously," Proc. of the Eastern Joint Computer Conference, p. 108
(December 1959).
134. Homan, M. E., "A 4-Megacycle 24-bit Checked Binary Adder," Comm. and Elec., pp. 443-450
(September 1961).
135. Osborn, P., "Ferroresonant Flip-Flops," Electronics, pp. 121-123 (April 1952).
136. James, D. C., A. H. Kent, and J. A. Holloway, "Redundancy and the Detection of First
Failures," IRE Trans. on Reliability and Control, Vol. RQC-11, pp. 8-27 (October 1962).
137. Jensen, P. A., "Bibliography of Redundancy Techniques," in Failure Tolerant Computer
Design, W. H. Pierce (Academic Press, New York, N.Y., 1965).
138. Jensen, P. A., "Bibliography on Redundancy Techniques," in Redundancy Techniques for
Computing Systems, R. H. Wilcox and William C. Mann, eds. (Spartan Books, Washington,
D.C., 1962).
139. Jensen, P. A., "Quadded NOR Logic," IEEE Trans. on Reliability, Vol. R-12, pp. 22-31
(September 1963).
140. Jensen, P., "The Reliability of Redundant Multiple-Line Networks," IEEE Trans. on
Reliability, Vol. R-13, pp. 23-33 (March 1964).
141. Jensen, P. A., W. C. Mann, and N. R. Cosgrove, "The Synthesis of Redundant
Multiple-Line Networks," First Annual Report, Contract Nonr. 3842(00), Westinghouse
Elec. Corp., Baltimore, Maryland (May 1963).
142. Johnson, R. A., Abstract of "Information Theory Approach to Diagnosis," IRE
Trans. on Reliability and Control, p. 35 (1960).
143. Joseph, E. C., "Self Repairing Multiprocessor Design," Working paper presented
at the Workshop on the Organization of Reliable Automata, Pacific Palisades,
California (2-4 February 1966).
144. Kampe, T. W., "The Design of a General-purpose Microprogram-Controlled Computer
with Elementary Structure," IRE Trans. on Electronic Computers, Vol. EC-9, No. 2,
pp. 208-213 (June 1960).
145. Karibskii, V. V., "Analysis of Systems for Checking Operability and Diagnosing
Faults," Automation and Remote Control, Vol. 26, No. 2, pp. 305-310 (1965).
146. Karibskii, V. V., P. P. Parkhomenko, and E. S. Sogomonyan, "Some Problems in Checking
the Performance and Locating Failures in Finite Automata," Soviet Phys.--Doklady,
Vol. 10, No. 3, pp. 182-184 (1965).
147. Kautz, W. H., "Automatic Fault Detection in Combinational Switching Networks,"
Proc. of the 2nd Annual Symposium on Switching Circuit Theory and Logical Design,
Detroit, Michigan, pp. 195-214 (1961).
148. Kautz, W. H., "Codes and Coding Circuitry for Automatic Error Correction within
Digital Systems," in Redundancy Techniques for Computing Systems, Wilcox and Mann,
eds. (Spartan Books, Washington, D.C., 1962).
149. Kautz, W. H., "Fault Diagnosis in Combinational Networks," Working Paper presented
at the Workshop on the Organization of Reliable Automata, Pacific Palisades,
California (2-4 February 1966).
150. Kautz, W. H., "Totally Sequential Switching Circuits," in Switching Theory in
Space Technology, Aiken and Main, eds. (Stanford University Press, Stanford, Calif.,
1963).
151. Kautz, W. H., "Unit Distance Error-Checking Codes," IRE Trans. on Electronic Computers,
Vol. EC-7, No. 2 (June 1958).
152. Kautz, W. H., and R. C. Singleton, "Non-Random Superimposed Codes," IEEE Trans. on
Information Theory, Vol. IT-10, No. 4, pp. 363-377 (October 1964).
153. Keir, Y. A., P. W. Cheney, and M. Tannenbaum, "Division and Overflow Detection in
Residue Number Systems," IRE Trans. on Electronic Computers, Vol. EC-11, No. 4,
pp. 501-507 (August 1962).
154. Kel'mans, A. K., "Some Problems of Network Reliability Analysis," Automation and
Remote Control, Vol. 26, No. 3, pp. 564-573 (March 1965).
155. Kel'mans, A. K., "On Certain Optimal Problems in the Theory of Reliability for
Information-Transmission Systems," Automation and Remote Control, Vol. 25, No. 5,
pp. 601-607 (May 1964).
156. Kemp, J. C., "Redundant Digital Systems," in Redundancy Techniques for Computing
Systems, Wilcox and Mann, eds. (Spartan Books, Washington, D.C., 1962).
157. Kemp, J. C., W. R. Hiatt, and T. M. Grenewitz, "Automatic Recovery from Transient
Malfunctions in Space Computers," Tech. Documentary Report No. AL TDR 64-142,
General Electric Advanced Electronics Center, Ithaca, N.Y. (August 1964),
AD-604 064.
158. Kidd, M. C., "Self-Test of Digital Evaluation Equipment," IEEE Trans. on Aerospace,
Vol. AS-1, No. 2, pp. 1283-1296 (August 1963).
159. Kilmer, W. L., "An Idealized Overall Error-Correcting Digital Computer having only
an Error-Detecting Combinational Part," IRE Trans. on Electronic Computers, Vol.
EC-8, No. 3, pp. 321-325 (September 1959).
160. Kime, C. R., "A Failure Detection Method for Sequential Circuits," Technical
Report 66-13, Dept. of Elec. Eng., University of Iowa, Iowa City, Iowa (January 1966).
161. Kime, C. R., "An Organization for Checking Experiments on Sequential Circuits,"
IEEE Trans. on Electronic Computers, Vol. EC-15, No. 1, pp. 113-115 (February 1966).
162. Kinsht, N. V., "One Procedure for Malfunction Search," Novosibirsk Avtometriya,
No. 3, pp. 34-38 (1965). Abstract: USSR Scientific Abstracts, C.C.A.T., No. 11,
p. 134.
163. Kletsky, E. J., "Upper Bounds on Mean Life of Self-Repairing Systems," IRE Trans.
on Reliability and Quality Control, pp. 43-48 (October 1962).
164. Knox-Seith, J. K., "Improving the Reliability of Digital Systems by Redundancy and
Restoring Organs," Tech. Report No. 4816-2 SU-SEL-64-094, Stanford Electronics
Laboratory, Stanford University, Stanford, California (August 1964).
165. Kochen, M., "Extension of Moore-Shannon Model for Relay Circuits," IBM J. Res.
Dev., Vol. 3, No. 2, pp. 169-186 (April 1959).
166. Kodis, R. D., "Telemetering Buffer Storage for Space Vehicles," Data Systems
Engineering, Vol. 18, No. 8, pp. 10-14 (October 1963).
167. Kogan, I. V., "Monitoring the Operation of Logical Devices," Doklady na
Mezhdynarodnom Simposiume po Teorii Relenykh Ustroistv i Konechnykh Avtomatov
(Moscow, 1962); Trans.: Relay Systems and Finite Automata, pp. 116-137
(Burroughs Corp., Paoli, Pa., 1964).
168. Kogan, I. V., "Tests for Nonrepetitious Contact Networks," Problemy Kibernetiki,
Vol. 12, pp. 39-44 (1964).
169. Korman, A. G., "Optimal Reserve of Equipment," Engr. Cyb., Vol. 4, pp. 21-33 (1964).
170. Korolenok-Gorskii, L. K., "Some Problems in the Reliability Analysis of Memories,"
Izd-vo Energiya (Leningrad), pp. 15-27 (1965); Abstract: USSR Scientific Abstracts,
C.C.A.T., No. 11, p. 131.
171. Kruus, J., "Upper Bounds for the Mean Life of Self-Repairing Systems," Coordinated
Science Laboratory Report R-172, University of Illinois, Urbana, Illinois (July 1963),
AD-418 174.
172. Kurdyukov, K. P., "The Reliability of Relay-Contact Networks," Automation and Remote
Control, Vol. 21, No. 4, pp. 366-371 (1960). See also Automation Express, Vol. 2,
No. 4, pp. 30-34 (1960).
173. Kuznetsov, S. M., "Estimating the Reliability of Automatic Systems from the Results
of Testing in Incomplete Set of Equipment," Automation and Remote Control, Vol. 22,
No. 8, pp. 990-998 (February 1962).
174. Landers, R. R., "Achieving Higher Reliability Through Self-Repair," IEEE Trans.
on Aerospace, Vol. AS-1, pp. 735-46 (August 1963).
175. Lawrence, L. A. J., "High-Security Control through Redundant Channels," Control
Engineering, Vol. 12, No. 3, pp. 74-79 (March 1965).
176. Lechner, J. A., "Simplified Reliability Calculations for Complicated Systems,"
IEEE Trans. on Systems Science and Cybernetics, Vol. SSC-1, No. 1, pp. 31-36
(November 1965).
177. Ledley, R. S., and J. B. Wilson, "Integrated Circuitry Implications in Logical
Design," in Microelectronics and Large Systems, Mathis, Wiley and Spandorfer, eds.
(Spartan Books, Washington, D.C., 1965).
178. Lee, C. Y., "Representation of Switching Circuits by Binary Decision Programs,"
Bell System Tech. J., Vol. 38, pp. 985-999 (1959).
179. Lee, C. Y., and Paul, M. C., "A Content Addressable Distributed Logic Memory with
Applications to Information Retrieval," Proc. IEEE, Vol. 51, pp. 924-932 (June 1963).
180. Lee, F., "An Automatic Self-Checking and Fault-Locating Method," IRE Trans. on
Electronic Computers, Vol. EC-11, No. 5, pp. 649-654 (October 1962).
181. Levenstein, V. I., "Concerning Self-Adjusting Automata," Doklady na Mezhdynarodnom
Simposiume po Teorii Relenykh Ustroistv i Konechnykh Avtomatov (Moscow) pp. 147-192
(1962).
182. Liddell, D. W., "Integration and Automatic Fault Location Techniques in Large Digital
Data Systems," Proc. Spring Joint Computer Conference, pp. 213-224 (1962).
183. Liu, C. L. and J. W. S. Liu, "On a Multiplexing Scheme for Threshold Logical Elements,"
Information and Control, Vol. 8, pp. 282-294 (1965).
184. Lofgren, L., "Automata of High Complexity and Methods of Increasing their
Reliability by Redundancy," Information and Control, Vol. 1, pp. 126-147 (1958).
185. Lowenschuss, O., "Restoring Organs in Redundant Automata," Information and Control,
Vol. 2, pp. 113-136 (1959).
186. Lowrie, R. L., "High-Reliability Computers Using Duplex Redundancy," Electronic
Industries, pp. 116-128 (August 1963).
187. Lyabutov, Ya. V., "Optimal Procedures for Localization of Breakdown in a Modularized
Radioelectronic System," Engr. Cyb., No. 4, pp. 14-21 (1964).
188. Lyons, R. E., and W. Vanderkulk, "The Use of Triple-Modular Redundancy to Improve
Computer Reliability," IBM J. Res. Dev., Vol. 6, No. 2, pp. 200-209 (April 1962).
189. Madatyan, Kh. A., "Synthesis of Circuits which Correct the Breaking of Contacts,"
Soviet Phys.--Doklady, Vol. 9, No. 11, pp. 949-951 (May 1965).
190. Malev, V. V., "Reliability of Reserve (Redundant) Systems with Periodic Maintenance,"
Engr. Cyb., No. 3, pp. 46-49 (1964).
191. Malikov, I. M., Rokhmistrov, A. N., Determining Computer Reliability, Leningradskii
Inzhenerno-Ekonomicheskii Institut Trudy, No. 55 (1965); Abstract: USSR Scientific
Abstracts, C.C.A.T., No. 13, p. 95.
192. Maling, K., and E. L. Allen, "A Computer Organization and Programming System for
Automated Maintenance," IEEE Trans. on Electronic Computers, pp. 887-895 (December
1963).
193. Malyugin, V. D., "Reliability of Switching Circuits," Automation and Remote Control,
Vol. 25, No. 9, pp. 1235-1242 (September 1964).
194. Mandelbaum, D., "Arithmetic Error-Detecting Codes for Communications Links Involving
Computers," IEEE Trans. on Communications Technology, Vol. COM-13, pp. 165-171
(June 1965).
195. Mann, W. C., "Restorative Processes for Redundant Computing Systems," in Redundancy
Techniques for Computing Systems, Wilcox and Mann, eds., pp. 267-284
(Spartan Books, Washington, D.C., 1962).
196. Mann, W. C., "Systematically Introduced Redundancy in Logical Systems," IRE
International Conv. Record, Pt. 2, pp. 241-63 (March 1961).
197. Manning, E., "On Self-Diagnosis of Large Multiprocessor Computers," working paper
presented at Workshop on the Organization of Reliable Automata, Pacific Palisades,
California (2-4 February 1966).
198. Manning, E. G., "Self-Diagnosis of Electronic Computers--An Experimental Study,"
Coordinated Science Laboratory Report R-259, University of Illinois, Urbana,
Illinois (May 1965).
199. Marcus, I. R., "Transistor Flip-Flop and Ring Counter with Nonvolatile Memory,"
TR-1198, Harry Diamond Laboratory, Washington, D.C., HDL Project 46300, Army
Project 1P523801A300 (February 1964).
200. Massey, J. L., "Error-Correcting Codes Applied to Computer Technology," Proc.
National Electronics Conference, Vol. 19, pp. 142-147 (1963).
201. Massey, J. L., "Survey of Residue Coding for Arithmetic Errors," ICC Bulletin,
Vol. 3, No. 4 (October 1964).
Massey, J. L., Threshold Decoding--see item 352 of this Bibliography.
202. Masters, C. G., "Reliability Estimation for Redundant Systems," Working Paper
presented at the Workshop on the Organization of Reliable Automata, Pacific
Palisades, California (2-4 February 1966).
203. Maxwell, L., "Synthesis of Contact Networks from Prescribed Probability Functions,"
J. Franklin Inst., Vol. 281, No. 3, pp. 214-234 (March 1966).
204. Merekin, Yu. V., "Arithmetical Forms of Writing Boolean Expressions and Their Use in
Calculating Circuit Reliability," Vychislitelnye Sistemy, No. 7, pp. 13-23 (1963).
205. McCluskey, E. J., Jr., "Determination of Redundancies in a Set of Patterns,"
IRE Trans. on Information Theory, Vol. IT-3, No. 2, p. 167 (June 1957).
206. McCluskey, E. J., Jr., "Minimization of Boolean Functions," Bell System Tech. J.,
Vol. 35, pp. 1417-1444 (November 1956).
207. Mikhaelov, L. N., "On the Reliability of Multichannel Systems with Continuous
Service," Automation and Remote Control, Vol. 23, No. 11 (November 1962).
208. Mina, K. V., and E. E. Newhall, "Magnetic Circuits as Logic Packages," presented
at the 1966 INTERMAG Conference, Stuttgart, Germany (Publication in IEEE Trans.
on Magnetics in 1966 expected).
209. Mine, H., "Reliability of Physical Systems," IRE Trans. on Circuit Theory, Vol.
CT-6 (Special Supplement), p. 138 (May 1959).
210. Minnick, R. C., "Cobweb Cellular Arrays," Proc. Fall Joint Computer Conference,
Vol. 27, Part 1, pp. 327-341 (1965).
211. Minnick, R. C., "Cutpoint Cellular Logic," IEEE Trans. on Electronic Computers,
Vol. EC-13, No. 6, pp. 686-698 (December 1964).
212. Minnick, R. C., et al., "Cellular Arrays for Logic and Storage," Final Report,
Contract AF 19(628)-4233, SRI Project 4087, Stanford Research Institute, Menlo Park,
California (April 1966).
213. Moore, E. F., "Gedanken-Experiments on Sequential Machines," Automata Studies,
pp. 129-153 (Princeton University Press, Princeton, N.J., 1956).
214. Moore, E. F., and C. E. Shannon, "Reliable Circuits Using Less Reliable Relays,"
J. Franklin Inst., Vol. 262, pp. 191-208 (September 1956); pp. 281-297 (October 1956).
215. Moore, R. C., "Application of Residue Class Codes to the Cape System," Tech. Doc.
RORD 6359, Minneapolis-Honeywell Inc., Military Products Group Research Laboratory,
St. Paul, Minnesota (September 1964).
216. Moore, R. C., "An Investigation of Residue Check Symbols to Improve Residue Computer
Reliability," Tech. Doc. R-Rd 6352, Minneapolis-Honeywell Inc., Military Products
Group Research Laboratory, St. Paul, Minnesota (December 1964).
217. Morrison, D. F., and H. A. David, "The Life Distribution and Reliability of a
System with Spare Components," Ann. Math. Stat., Vol. 31, pp. 1084-1094 (December
1960).
218. Moscowitz, F., "The Analysis of Redundancy Networks," AIEE Trans. on Comm. & Electr.,
No. 39, pp. 627-632 (November 1958).
219. Moscowitz, F., "The Statistical Analysis of Redundant Systems," IRE International
Convention Record, Part 6, pp. 78-89 (1960).
220. Muchnik, A. A., and S. G. Gindikin, "The Completeness of a System of Unreliable
Elements which Realize Functions in Logical Algebra," Soviet Phys.--Doklady,
Vol. 7, No. 6, pp. 477-479 (1962).
221. Muller, D., "Asynchronous Logic and Application to Information Processing,"
Switching Theory in Space Technology, Aiken and Main, eds. (Stanford University
Press, Stanford, California, 1963).
222. Mullin, A. A., "On the Nature of Reliability of Automata," in Redundancy Techniques
for Computing Systems, Wilcox and Mann, eds. (Spartan Books, Washington, D.C., 1962).
223. Mullin, A. A., "Reliable Stochastic Sequential Switching," Communication and
Electronics (November 1958).
224. Muroga, S., "Preliminary Study of the Probabilistic Behavior of a Digital Network
with Majority Decision Elements," Rome Air Development Center, Technical Note
RADC-TN-60-146 (August 1960).
225. Muth, E. J., "Reliability of a System Having Standby Spare Plus Multiple-Repair
Capability," IEEE International Convention Record, Part 10, pp. 9-15 (1965).
226. Naumchenko, V. V., "The Reliability of Ideally Redundant Systems," Automation and
Remote Control, Vol. 25, No. 3, pp. 376-378 (March 1964).
227. Nechiporuk, E. I., "Self-Correcting Diode Networks," Soviet Phys.--Doklady,
Vol. 9, No. 6, pp. 422-425 (December 1964).
228. Nerber, P. O., "'Power-off' Time Impact on Reliability Estimates," IEEE Int. Conv.
Rec., Part 10, pp. 1-8 (March 1965).
229. Nichols, A. J., III, "Modular Synthesis of Sequential Machines," IEEE Conf. Record
on Switching Circuit Theory and Logical Design, pp. 62-70 (October 1965).
230. Nitzan, D., "Flux-Switching in Multipath Cores," Report I, Contract 950095 under
NASw-6, Jet Propulsion Laboratory, Pasadena, Calif., SRI Project 3696, Stanford
Research Institute, Menlo Park, California (November 1961).
231. Olefir, A. K., "Search for Faulty Elements in a System of Two Computers,"
Vychislitelnye Sistemy, No. 6, pp. 21-31 (1963). Abstract: USSR Scientific
Abstracts, C.C.A.T., No. 9, p. 59.
232. Ord-Smith, R. J., "An Extension of Block Design Methods and an Application in the
Construction of Redundant Fault Reducing Circuits for Computers," Computer Journal,
Vol. 8, No. 1, pp. 28-32 (April 1965).
233. Ore, O., Graphs and Their Uses (L. W. Singer Co., New York, N.Y., 1963).
234. Ostianu, V. M., "Binary Signal Correction Networks," Automation and Remote Control,
Vol. 21, No. 5, pp. 426-431 (1960); Automation Express, Vol. 3, No. 2, pp. 12-15
(1960).
235. Paderno, I. P., "Reliability of Systems Containing Reserve (Redundant) Equipment,"
Eng. Cyb., No. 2, pp. 33-37 (1963).
236. Paynter, D. A., and V. P. Mathis, "Redundancy Techniques in Reliable Power Supply
Design," Proc. 1961 International Solid State Circuit Conf., pp. 50-51.
237. Peterson, W. W., "Binary Controls for Error Control," AIEE Trans. on Communications
and Electronics, pp. 648-652 (January 1962).
238. Peterson, W. W., Error-Correcting Codes (John Wiley and Sons and MIT Press,
New York, N.Y., 1961).
239. Peterson, W. W., "On Checking an Adder," IBM J. Res. Dev., Vol. 2, No. 2,
pp. 166-168 (April 1958).
240. Pierce, W. H., "Adaptive Vote-Takers Improve the Use of Redundancy," in Redundancy
Techniques for Computing Systems, pp. 229-250 (Spartan Books, Washington, D.C.,
1962).
241. Pierce, W. H., "A Proposed System of Redundancy to Improve the Reliability of
Digital Computers," Appl. Electr. Lab Report No. TR 1552-1, Stanford University,
Stanford, California (July 1960).
242. Pierce, W. H., "Asymptotic Properties of Systems Synthesized for Maximum Reliability,"
Inf. & Contr., Vol. 7, No. 3, pp. 340-359 (September 1964).
243. Pierce, W. H., Failure-tolerant Computer Design (Academic Press, New York, N.Y.,
1965).
244. Pierce, W. H., "Interwoven Redundant Logic," J. Franklin Institute, Vol. 277,
No. 1, pp. 55-85 (January 1964).
245. Pierce, W. H., "Redundancy in Computers," Scientific American, Vol. 210, pp.
103-109 (February 1964).
246. Poage, J. F., "Derivation of Optimum Tests to Detect Faults in Combinational
Circuits," Proceedings of the Symposium on Mathematical Theory of Automata,
pp. 483-528 (Polytechnic Press, New York, N.Y., April 1963).
247. Poage, J. F., and E. J. McCluskey, "Derivation of Optimum Test Sequences for
Sequential Machines," Proc. of the 5th Annual Symp. on Switching Circuit Theory
and Logical Design, Princeton, pp. 121-132 (October 1964).
248. Pohm, A. V., R. J. Zingg, J. H. Roper and R. M. Stewart, Jr., "Analysis of 10^8
Element Magnetic Film Memory Systems," Proc. Intermag Conf., pp. 5-3-1 to 5-3-4
(1964).
249. Pollyak, Yu. G., "On Error in Forecasting Reliability based on the Statistical
Relationship between Failures of Elements," Elektrosvyaz, Vol. 17, No. 4 (April
1963).
250. Polovko, A. M., "Calculating the Dependability of Complex Automatic Systems," Izv.
AN SSR, Ofn. Energetika i Avtomatika, Issue No. 5, pp. 174-178 (1960); translated
in ASTIA 267736.
251. Polovko, A. M., Zaynashev, N. K., "Increasing the Reliability of Apparatus by a
Combination of General Reserve by Replacement and a Separate System with a
Constantly Connected Reserve," Engr. Cyb., No. 5, pp. 114-120 (1960).
252. Potapov, Yu. G., and S. V. Yablonskii, "On the Synthesis of Self-Correcting Relay
Circuits," Soviet Physics Doklady, Vol. 5, No. 51, pp. 932-935 (1960).
253. Potter, G. B., and J. Mendelson, "Integrated Scratch Pads Sire New Generation of
Computers," Electronics, pp. 118-126 (April 1966).
254. Pyne, I. B., and E. J. McCluskey, Jr., "The Reduction of Redundancy in Solving
Prime Implicant Tables," IRE Trans. on Electronic Computers, Vol. EC-11, No. 4,
pp. 473-482 (August 1962).
255. Raikin, A. L., "The Problem of Synthesizing a Surplus Structure in the Presence
of Restrictions in the Form of Linear Inequalities," in Relay Systems and Finite
Automata, pp. 531-538 (Burroughs Corp., Paoli, Pennsylvania, 1964).
256. Raikin, A. L., "Determining the Optimal Reserve for a System While Taking Into
Account the Damage of Blocks in the Reserve Mode," Automation and Remote Control,
Vol. 23, No. 11, pp. 1437-1442 (1963).
257. Raikin, A. L., "Additional Estimates for a Fractional Redundance Scheme," Automation
and Remote Control, Vol. 25, No. 4, pp. 533-535 (April 1964).
258. Raikin, A. L., A. F. Rubtsov, and V. S. Penin, "Problem of Reliability of Technical
Systems with Regularly Renewable Reserve," Engr. Cyb., No. 4, pp. 9-14 (1964).
259. Raikin, A. L., "Redundancy Optimization in the Presence of Constraints," Automation
and Remote Control, Vol. 26, No. 2, pp. 381-392 (February 1965).
260. Raikin, A. L., "Reliability of Passive Redundancy Circuits with Permanently
Connected Redundant Elements in the Case of Redistribution of Loads or Voltages,"
Automation and Remote Control, Vol. 24, No. 4, pp. 517-521 (April 1963).
261. Randa, G. C., and C. V. McNeil, "Self-Correcting Memory--The Basis of a Reliable
Computer," Electronic Design, Vol. 13, No. 18, p. 28 (1965).
262. Rau, J. G., "Redundancy and Trichotomous Systems," J. Soc. Ind. Appl. Math., Vol. 12,
No. 4, pp. 827-837 (December 1964).
263. Ray-Chaudhuri, D. K., "On the Construction of Minimally Redundant Reliable System
Designs," Bell System Tech. J., Vol. 40, pp. 595-611 (March 1961).
264. Rhodes, L. J., "Effects of Failure Modes on Redundancy," Proc. 10th Natl. Symp.
on Rel. and Qual. Control, pp. 360-364 (Washington, D.C., January 1964).
265. Rieff, G. A., "Interplanetary Spacecraft Telecommunication Systems," IEEE Spectrum
(April 1966).
266. Robins, R. S., "On Models for Reliability Prediction," IRE Trans. on Reliability and
Quality Control, Vol. 11, No. 1, pp. 33-43 (May 1962).
267. Rogers, J., and J. King, "The Case for Magnetic Logic," Electronics, Vol. 34, No. 17,
pp. 40-47 (June 1964).
268. Rosenheim, D. E., and R. B. Ash, "Increasing Reliability by Use of Redundant Machines,"
IRE Trans. on Electronic Computers, Vol. EC-8, pp. 125-130 (June 1959).
269. Rostkovskaya, S. E., "On the Probability Characteristics of Component Reliability,"
Automation and Remote Control, Vol. 22, No. 11 (April 1962).
270. Roth, J. P., "Diagnosis of Automata Failures: A Calculus and a Method," IBM J. Res.
Dev., Vol. 10, No. 4, pp. 278-291 (July 1966).
271. Rothstein, J., "Residues of Binary Numbers Modulo Three," IRE Trans. on Electronic
Computers, Vol. EC-8, No. 2, p. 229 (June 1959).
272. Rubin, D., "Placement of Voters in Modularly Redundant Digital Systems," Working
paper presented at the Workshop on the Organization of Reliable Automata, Pacific
Palisades, California (2-4 February 1966).
273. Rubio, J., "A Study of Some Self-Correcting Sequential Networks," Philips Res.
Reports, Vol. 17, No. 4, pp. 315-328 (August 1962).
274. Russo, R. L., "Synthesis of Error-Tolerant Counters using Minimum Distance Three
State Assignments," IEEE Trans. on Electronic Computers, Vol. EC-14, pp. 359-366
(June 1965).
275. Sagalovich, Yu. L., "A Method of Increasing the Reliability of Finite Automata,"
Problemy Peredachi Informatsii, Vol. 1, No. 2, pp. 27-35 (1965). Abstract: USSR
Scientific Abstracts, C.C.A.T., No. 11, p. 134.
276. Schneider, S. and D. H. Wagner, "Error Detection in Redundant Systems," Proc. of
the Western Joint Computer Conference, pp. 115-21 (1957).
277. Schultz, W. R., and J. H. Brenner, "Application of Multi-Aperture Devices in
Spaceborne Digital Control Equipment," Conference Proceedings - National Convention on
Military Electronics (IEEE), pp. 425-429 (September 1963).
278. Schwartz, Samuel A., "High-Efficiency Power Supply Uses Micrologic," Electronic
Industries (February 1966).
279. Seshu, S., "On an Improved Diagnosis Program," IEEE Trans. on Electronic Computers,
Vol. EC-14, No. 1, pp. 76-79 (February 1965).
280. Seshu, S., and Freeman, D. N., "The Diagnosis of Asynchronous Sequential Switching
Systems," IRE Trans. on Electronic Computers, Vol. EC-11, pp. 459-465 (August 1962).
281. Sethares, G., "Closed Sets of Boolean Functions and the Reliability Problem for
Polyfunctional Nets," IEEE Trans. on Electronic Computers, Vol. EC-15, No. 1,
pp. 115-117 (February 1966).
282. Shcherbakov, O. V., "Determination of the Reliability of Systems with Arbitrary
Laws Governing Repair," Automation and Remote Control, Vol. 25, No. 8,
pp. 1091-1095 (August 1964).
283. Short, R. A., "A Theory of Relations Between Sequential and Combinational Realizations
of Switching Functions," No. 098-1, Stanford Electronics Laboratories, Stanford
University, Stanford, California (1960).
284. Short, R. A., "The Design of Complementary-Output Networks," IRE Trans. on Electronic
Computers, Vol. EC-11, pp. 743-752 (1962).
285. Short, R. A., "Two-Rail Cellular Cascades," Proc. of the 1965 Fall Joint Computer
Conference, AFIPS, Vol. 27, Part 1, pp. 355-369 (1965).
286. Short, R. A., and W. H. Kautz, "A Survey of Soviet Activities in Reliability,"
Working paper presented at the Workshop on the Organization of Reliable Automata,
Pacific Palisades, California (2-4 February 1966).
287. Sindeev, I. M., "Synthesizing Logical Schemes for Failure Detection and Control
of Complex Systems," Eng. Cyb., No. 2, pp. 16-23 (1963).
288. Slotnick, D. L., W. C. Borck, and R. C. McReynolds, "The Solomon Computer," Proc.
Fall Joint Computer Conference (AFIPS), Vol. 22, p. 97 (1962).
289. Sinitsa, M., "On Reservation by Replacement Method," IRE Trans. on Reliability and
Quality Control, Vol. 9, pp. 6-13 (April 1960).
290. Smolitskii, Kh. L., and P. A. Chukreev, "The Problem of Optimum Redundancy of an
Apparatus," Izvestiya Akademii Nauk SSSR, Otdelenie Teknicheskikh Nauk; Energetika i
Avtomatika, No. 4 (1959).
291. [illegible], "Monitoring Operability and Finding Failures in Functionally
Connected Systems," Automation and Remote Control, Vol. 25, No. 6, pp. 874-882
(June 1964).
292. Spitzer, C. F., "The Ferroresonant Trigger Pair: Analysis and Design,"
Communication and Electronics, No. 26, pp. 407-416 (September 1956).
293. Steinbuch, K. and F. Zendeh, "Self-Correcting Translator Circuits," Proc. IFIP
Congress, 1962, pp. 359-365 (North Holland Publishing Co., Amsterdam, Holland, 1963).
294. Stuart-Williams, R., "Magnetic Cores, Characteristics and Applications," Automatic
Control, Vol. 14, pp. 44-47 (April 1961).
295. Susskind, A. K., D. R. Haring, C. L. Liu, and K. S. Menger, "Synthesis of Sequential
Switching Networks," Report ESL-FR-216, Contract AF 33(657)-11677, MIT Research
Laboratory of Electronics, Cambridge, Mass. (1964), AD-608 881.
296. Svechinskii, V. B., "Self-Correcting Design of Finite Automata," Automation and
Remote Control, Vol. 25, No. 5, pp. 623-628 (1964) and Automation Express, Vol. 7,
No. 1, p. 24.
297. Svoboda, A., and M. Valach, "Operational Circuits (Operatorove Obvody)," Stroje Na
Zpracovani Informaci, Vol. III, Prague, Czechoslovakia, pp. 247-295 (1955) (In
Czechoslovakian). Air Technical Information Center, WADD Translation No. F-TS-10126/V.
298. Szabo, N., "Sign Detection in Nonredundant Residue Systems," IRE Trans. on Electronic
Computers, Vol. EC-11, No. 4, pp. 494-500 (August 1962).
299. Teoste, R., "Design of a Repairable Redundant Computer," IRE Trans. on Electronic
Computers, Vol. EC-11, No. 5, pp. 643-649 (October 1962).
300. Teoste, R., "Digital Circuit Redundancy," IEEE Trans. on Reliability, Vol. R-13,
pp. 42-61 (June 1964).
301. Teoste, R., "Reliability of Redundant Computers," Lincoln Lab Report No. 21G-0029
(ASTIA 260494), MIT, Lexington, Mass. (March 1961).
302. Terris, I., "Self-Repairing Digital Computers," Hughes Aircraft Co. Technical
Report No. SSD-TR-65-52 (June 1965), AD 466123.
303. Terris, I., and M. A. Melkanoff, "Investigation and Simulation of a Self-Repairing
Digital Computer," 1965 IEEE Conf. Record on Switching Circuit Theory and Logical
Design (October 1965).
304. Tooley, John, "Network Coding for Reliability," Comm. Elec., pp. 407-413 (January
1963).
305. Tryon, John, "Quadded Logic," in Redundancy Techniques for Computing Systems,
Wilcox and Mann, eds., pp. 205-228 (Spartan Books, Washington, D.C., 1962).
306. Tsertsvadze, G. N., "Stochastic Automata and the Problem of Constructing Reliable
Automata from Unreliable Elements I," Automation and Remote Control, Vol. 25,
No. 2, pp. 198-210 (February 1964).
307. Tsertsvadze, G. N., "Stochastic Automata and the Problem of Constructing Reliable
Automata from Unreliable Elements II," Automation and Remote Control, Vol. 25, No. 4,
pp. 458-464 (1964).
308. Tsiang, S. H., and W. Ulrich, "Automatic Trouble Diagnosis of Complex Logic Circuits,"
Comm. Elec., pp. 575-583 (January 1963).
309. Tsiang, S. H., and W. Ulrich, "Automatic Trouble Diagnosis of Complex Logic
Circuits," Bell System Tech. J., Vol. 41, No. 4, pp. 1177-1200 (July 1962).
310. Urbano, R. H., "Matrix Criteria for Arbitrary Reliability in Iterated Neural Nets,"
IEEE Trans. on Electronic Computers, Vol. EC-14, pp. 627-629 (August 1965).
311. Urbano, R. H., "Reliability, Redundancy, Capacity and Universality in Polyfunctional
Nets," Draft Paper presented at the "Workshop on the Organization of Reliable
Automata," Pacific Palisades, California, 2-4 February 1966.
312. Urbano, R. H., "Some New Results on the Convergence, Oscillation, and Reliability
of Polyfunctional Nets," IEEE Trans. on Electronic Computers, Vol. EC-14, No. 6,
pp. 769-781 (December 1965).
313. Vaksov, V. V., "On Tests for Irreversible Switching Circuits," Automation and
Remote Control, Vol. 26, No. 3, pp. 515-518 (March 1965).
314. Van De Riet, E. K., D. R. Bennion, and J. M. Yarborough, "Feasibility Study for
Reliable Magnetic Connection Switch," Final Report--Phase I, Contract 951232 under
NAS7-100, Jet Propulsion Laboratory, Pasadena, California, SRI Project 5669,
Stanford Research Institute, Menlo Park, California (February 1966).
315. Varshamov, R. R. and V. M. Ostianu, "Application of the Theory of Finite Fields
to the Synthesis of Relay Mechanisms Using Structural Redundancy," Relay Systems
and Finite Automata, pp. 369-372 (Burroughs Corp., Paoli, Pennsylvania, 1964).
316. Vedeshnikov, V. A., A. F. Volkov, V. D. Zenkin, V. A. Trapeznikov, and
T. A. Turkovskaya, "A Digital Computer with Programmed Control," Byulletin Izobretenii
i Tovarnykh Znakov, No. 18, pp. 91-92 (1964).
317. Virene, E. P., "Reliability Abstracts and Technical Reviews," Vol. 5, No. 7,
Serial Numbers 2051-2100, NASA (July 1965).
318. Von Neumann, J., "Probabilistic Logics and the Synthesis of Reliable Organisms
from Unreliable Components," Annals of Mathematical Studies, No. 34, pp. 43-98
(Princeton University Press, Princeton, New Jersey, 1956).
319. Walendziewicz, "The D210 Magnetic Computer," Proceedings 1962 IRE Spaceborne
Computer Engineering Conference, pp. 117-127 (1962).
320. Watson, R. W., "Error Detection and Correction and other Residue Interacting
Operations in a Redundant Residue Number System," Ph.D. Thesis, University of
California, Berkeley (1965).
321. Watson, R. W., "Preliminary Report on Operations in Redundant Residue Number
System," Tech. Document R-RD 6330, Honeywell Military Products Group, St. Paul,
Minn. (April 1964).
322. Webster, L. R., "Choosing Optimum System Configurations," Proc. Tenth Annual
Symposium on Reliability and Quality Control, pp. 345-359 (1964).
323. Webster, L. R., "Optimum Redundancy for a Satellite Power System," Proc. Eleventh
Annual Symposium on Reliability and Quality Control, pp. 568-574 (1965).
324. Weinstock, G. D., "Topological Analysis of Non-Series-Parallel Redundant Networks,"
IEEE Intl. Conv. Record, Part 6, pp. 34-40 (1963).
325. Weiss, G. H., "A Survey of Some Mathematical Models in the Theory of Reliability,"
in Statistical Theory of Reliability, Marvin Zelen, ed. (University of
Wisconsin Press, Madison, Wis., 1963).
326. Weiss, G. H., "Optimum Periodic Inspection Programs for Randomly Failing Equipment,"
J. Res. National Bureau of Standards, Vol. 67B, pp. 223-228 (1963).
327. Weiss, G. H., "Reliability of a System in Which Spare Parts Deteriorate in
Storage," J. Res. National Bureau of Standards, Vol. 66B, No. 4, pp. 157-168 (1962).
328. Welch, P. D., "On the Reliability of Polymorphic Systems," IBM Systems Journal,
Vol. 4, No. 1, pp. 43-52 (1965).
329. Wilcox, R. H. and W. C. Mann (eds.), Redundancy Techniques for Computing Systems
(Spartan Book Co., Washington, D.C., 1962).
330. Wilkes, M. V., W. Renwick, and D. J. Wheeler, "The Design of the Control Unit
of an Electronic Digital Computer," Proc. IEE, Vol. 105, Pt. B., pp. 121-128
(March 1958).
331. Winograd, S., "Input-Error-Limiting Automata," J. Assoc. Comp. Mach., Vol. 11,
pp. 338-351 (1964).
332. Winograd, S., "Redundancy and Complexity of Logical Elements," Information and
Control, Vol. 5, pp. 177-194 (1963).
333. Winograd, S. and J. D. Cowan, Reliable Computation in the Presence of Noise
(MIT Press, Cambridge, Mass., 1963).
334. Winter, B. B., "Introduction to Cyclic Replacement Systems," IEEE Trans. on
Reliability, Vol. R-12, pp. 36-40 (December 1963).
335. Yablonskii, S. V., and I. A. Chegis, "Tests for Electrical Networks," Uspekhi
Matematicheskikh Nauk, Vol. 10, No. 4, pp. 182-184 (1955).
336. Zelen, Marvin, ed., Statistical Theory of Reliability (University of Wisconsin
Press, Madison, Wis., 1963).
337. Zhozhikashvili, V. A., and A. L. Raikin, "Analysis of Reliability of Systems with
Fault Signaling," Automation and Remote Control, Vol. 23, No. 3, pp. 352-357
(March 1962).
338. "Astronavigation Computer Research," ASD-TDR-63-337, Contract AF 33(616)-8268,
Sperry Gyroscope Co., Great Neck, N.Y., for Research and Technology Division,
Air Force Systems Command, Wright-Patterson Air Force Base, Ohio (October 1963),
AD-423 704.
339. "Data Sheets for BR-801 Regulator," Bunker-Ramo Corporation, Canoga Park, California
(June 1966).
340. "Final Report for a Magnetic Computer Study," SSD-TR-64-208, Contract AF 04(695)-320,
Burroughs Corporation for Space Systems Division, Air Force Systems Los Angeles
Air Force Station, Los Angeles, California (September 1964), AD 447 910.
341. "Final Report on Phase II of Research on Failure Free Systems," NASA CR-343,
Contract NASw-572, Westinghouse Electric Corporation, Baltimore, Maryland (December
1965).
342. "Magnetic Logic in Space (A Report of Applications and Reliability)," NASA Accession
Number N64-21384, DI/AN Controls, Inc. (January 1964).
343. "Micropower Functional Electronic Blocks," Technical Reports 2 and 3, Fairchild
Semiconductor Division, Syosset, N.Y. (January and March 1966).
344. "Research on Failure Free Systems," NASA CR-105, Contract NASw-572, Westinghouse
Electric Corporation, Baltimore, Maryland (November 1964).
345. "Survey/Study of the Interconnection Problems in Microelectronics," Final Report
of First Phase, NASA Contract NASw-919, Moore-Peterson Associates, Santa Barbara,
California (20 July 1965).
346. "Technology Survey--Microelectronics in Space Research," NASA SP-5031, Research
Triangle Institute (August 1965).
347. "The Development and Use of Multiaperture-Core Flux Logic Devices to Perform
Logical Functions in Digital Data Processing," Vol. 1, WADC-TR-59-648, Contract
AF 33(600)-31315, International Business Machines Corp., Owego, N.Y. (December
1959), AD-231 176.
348. "The Development and Use of Multiaperture-Core Logic Devices to Perform Logical
Functions in Digital Data Processing," Vol. 2, WADC-TR-59-648, Contract
AF 33(600)-31315, International Business Machines Corp., Owego, N.Y. (December 1959),
AD-231 177.
349. "The RW-400--A New Polymorphic Data System," Datamation, Vol. 6, No. 1, pp. 8-14
(January/February 1960).
350. Fedderson, A. P., and A. C. Shershin, "Redundancy Concepts and Optimality
Considerations," TM 024-43-RSA-16, Autonetics Div., Downey, California (18 March 1946),
AD-459 712.
351. Zubova, A. F., "On a Method of Calculating the Reliability of Redundant Systems,"
Automation and Remote Control, Vol. 26, No. 4, p. 701 (April-September 1965).
352. Massey, J. L., Threshold Decoding (MIT Press, Cambridge, Massachusetts, 1963).
Security Classification
DOCUMENT CONTROL DATA - R & D
(Security classification of title, body of abstract and indexing annotation must be entered when the overall report is classified)
1. ORIGINATING ACTIVITY (Corporate author) 2a. REPORT SECURITY CLASSIFICATION
Stanford Research Institute
333 Ravenswood Avenue
Menlo Park, California 94025
2b. GROUP
UNCLASSIFIED
n/a
3. REPORT TITLE
TECHNIQUES FOR THE REALIZATION OF ULTRA-RELIABLE SPACEBORNE COMPUTERS
4. DESCRIPTIVE NOTES (Type of report and inclusive dates)
Final Report--Phase 1
5. AUTHOR(S)(First name, middle initial, last name)
Jacob Goldberg, Karl N. Levitt, Robert A. Short
6. REPORT DATE 7a. TOTAL NO. OF PAGES 7b. NO. OF REFS
September 1966 400 352
8a. CONTRACT OR GRANT NO.
NAS 12-33
b. PROJECT NO.
9a. ORIGINATOR'S REPORT NUMBER(S)
Final Report--Phase 1
SRI Project 5580
9b. OTHER REPORT NO(S) (Any other numbers that may be assigned
this report)
10. DISTRIBUTION STATEMENT
11. SUPPLEMENTARY NOTES
12. SPONSORING MILITARY ACTIVITY
National Aeronautics and Space Administration, Electronics Research Center, 575
Technology Square, Cambridge, Mass. 02139
13. ABSTRACT
This is a report of a study of techniques for the realization of ultrareliable,
high-performance, spaceborne computers. The study included the evaluation of, and
several new contributions to, the most significant known techniques and the proposal
and investigation of several promising new techniques. The state of the art of
existing redundancy techniques for fault-detecting and fault-masking is assessed,
with special emphasis on multiple-line voting redundancy, error-correcting codes,
and redundant-state schemes for sequential networks. A number of directions for the
improvement of these techniques are described. Significant potential improvements in
reliability are available in designs allowing for a high degree of reconfigurability
in structure and programs, and system schemes and design techniques needed for such
behavior are proposed and investigated. In particular, we discuss the design of
minimal test schedules for fault detection and diagnosis, the design of highly
modular processing networks and of programmable interconnection networks, and the
overall organization of maintenance and computation functions in a computer system.
The application of error control techniques to memory systems and to power supplies
is considered, and the possible use of all-magnetic logic networks is examined.
Included in the report is a critical and selective survey of the literature that is
relative to the attainment of reliable systems and networks through the judicious use
of redundant structures. Finally, recommendations are made for further research into
the development of techniques for ultrareliable system design.
DD FORM 1473 (PAGE 1)
UNCLASSIFIED
S/N 0101-807-6801 Security Classification
UNCLASSIFIED
Security Classification
14. KEY WORDS LINK A LINK B LINK C
ROLE WT ROLE WT ROLE WT
spaceborne computers
reliability
redundancy
fault detection
fault diagnosis
fault masking
reconfigurable computers
error-correcting codes
reliable power supplies
reliable memories
literature on redundancy
DD FORM 1473 (BACK)
UNCLASSIFIED
(PAGE 2) Security Classification
