Development of a fault-tolerant microprocessor based computer system for space flight by Montgomery, V. T.
  
 
 
N O T I C E 
 
THIS DOCUMENT HAS BEEN REPRODUCED FROM 
MICROFICHE. ALTHOUGH IT IS RECOGNIZED THAT 
CERTAIN PORTIONS ARE ILLEGIBLE, IT IS BEING RELEASED 
IN THE INTEREST OF MAKING AVAILABLE AS MUCH 
INFORMATION AS POSSIBLE 
https://ntrs.nasa.gov/search.jsp?R=19810025284 2020-03-21T11:01:43+00:00Z
(NASA-CR-161873) DEVELOPMENT OF A                    N81-33827
FAULT-TOLERANT MICROPROCESSOR BASED COMPUTER
SYSTEM FOR SPACE FLIGHT Final Report
(Southern Univ.) 49 p HC A03/MF A01                  Unclas
CSCL 09B G3/60 39010
FINAL REPORT
OF RESEARCH CONTRACT NSG-8053
"DEVELOPMENT OF A FAULT-TOLERANT
MICROPROCESSOR BASED COMPUTER SYSTEM
FOR SPACE FLIGHT"
SUBMITTED TO
THE NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
SUBMITTED BY: DR. V. T. MONTGOMERY
Electrical Engineering Department
Southern University
Baton Rouge, LA 70813
Phone: (504) 771-5357
This report is concerned with developing a methodology to
be followed when attempting to design a tightly coupled, highly
reliable microprocessor based computer system. The concept
of Triple Modular Redundancy (TMR) with Sparing is used. The
notion of synchronizing by using a single crystal oscillator
is examined. Decoders are used in place of voters.
The decoders not only isolate the failed module but also allow
error identification to be accomplished. Each module is to
have its own RAM memory. The necessary circuitry to select a
correct memory and the corresponding DMA controller has been designed.
TABLE OF CONTENTS
I. Introduction
II. Approach
III. System Partitioning and Configuration
IV. Synchronization
V. Error Definition
V. TMR Controller
VI. Error Detection
VII. Error Reporting
VIII. Memory Exchange
IX. Summary
LIST OF FIGURES
Figure
1. Design Overview
2. Minimum System Configuration
3. A Synchronization Technique
4. Typical Instruction Cycle
5. Op-code Fetch Machine Cycle
6a. Memory READ Cycle
6b. Memory WRITE Cycle
7. Interrupt Acknowledge Cycle
8. TMR Controller State Diagram
9. TMR Controller Block Diagram
10. New Processor Selectors
11. Error Detect & Identification
12. Error Concentrator
13. Correct Output Generator
14. DMA Controller for Memory Update
15. Control (ALE) Concentrator
16a. Software to Store the State of Processors
16b. Software to Restore State of Processors
17. State Diagram for READ Select
18. Essential ROM Contents for Memory Select
19. READ Select Network
LIST OF TABLES
Table
1. Frequency Variation of Single Crystal Technique
2. Error ID Codes
Introduction
The work in this project involves using commercially available
microprocessors in an environment which requires a highly
reliable microcomputer system. The project specifically addresses
synchronization of microcomputer systems and the organization and
development of a redundant system with sparing. The Intel 8085
system design kit was used to form the basis of the discussions
within this report. However, the concepts presented in this report
can be applied to any of the commercially available microprocessor
based systems.
It is assumed that commercially available microcomputer systems
will meet the specifications as stated by their manufacturer. These
conditions are not as stringent as space-flight requirements.
Consequently, the failure rate will be significantly higher. We do not
attempt to calculate that failure rate. Rather, emphasis has been
placed on timely, reliable response to a failed module.
Much attention was given to checking in real time with the use
of hardware to perform as many checking functions as possible. By
using hardware, the system can run at the designed speed until an
error occurs. If or when an error does occur, hardware recovery
assures the designers of the fastest and most accurate recovery.
What has been attempted here is the development of a very
tightly coupled Triple Modular Redundancy (TMR) with Sparing.
A system designed using this philosophy will run exceptionally
fast. Errors will be detected in a minimum amount of time and
the recovery time will also be minimized. The result is a system
which is not slowed down by the checking features, and in most
instances microcomputer modules can be swapped out in a real time
environment with a minimum effect on real time operations.
Approach
A triple modular redundancy (TMR) with sparing system normally
has three basic computer modules which are being compared constantly
to detect errors. The errors are detected by voting on the addresses,
data and status information. If an error does occur, the module that
is in error is replaced by one of the spares.
Our approach involves using three microcomputer modules in the
basic TMR configuration and three spares (Figure 1). Addresses and
data are checked constantly. Every bus transfer is observed and
checked. The data selectors of Figure 1 are responsible for the
selection of microcomputer modules in the TMR configuration. The
decoders detect errors and identify which module is in error. If
the system is not in a critical I/O operation when an error occurs,
the control network will first store the state of the computer
network. The failed computer module is taken out of the network
by changing the code on the data selector. Next, the error
detector is deactivated to allow the RAM memory to be updated.
One of the correct microcomputer modules is selected. Its
content is assumed to be correct and is broadcast to all
RAM memory. We note that the state of the system is in RAM and
is transferred along with other data. The module counter now
points to the spare to be activated. This number is transferred to
the data selector that handles the module that was in error.
Finally, a vector interrupt is forced by the controller, causing
all microcomputer modules to reload the state and begin to
run. The controller will go back to look for another failure.
Figure 1. Design Overview.
System Partitioning and Configuration
The configuration of the system will determine where and
how the TMR and sparing can be invoked into an overall reliable
system. The main factors which determine the overall configuration
are the number of I/O devices, the amount of code that is
written for the system, the amount of connected memory, and the
amount of data space required for the system to function properly.
The 8085 is an example of a very large scale integrated (VLSI)
chip. A minimum configuration (Figure 2) for basic control functions
consists of the processor, an I/O and read-only memory (ROM)
chip, and a random access memory (RAM) and I/O chip. The 8085
processes the information which it obtains from either the ROM
or the RAM memory and the I/O chips. The ROM memory contains
code which is permanently needed to run the system. The system we
refer to has a keyboard and display chip which is used to input
control functions and data, and output register and memory values.
The RAM & I/O chip has two functions. The first is to hold the
user's program and temporary data. The second purpose is to
interface with external devices.
The total memory for a microcomputer system consists of both
ROM and RAM memory. The program and data size will determine
how the TMR microcomputer will be designed. There are several
choices. The first choice is to have a separate memory for each
microprocessor based system. If we configure the system in this
manner, it will be necessary to change memories when a processor
malfunctions. But the new memory will have to be updated so far as
data is concerned. It is obvious that we want to keep the ratio
of ROM to RAM as high as possible so that recovery time is as short
as possible.
Figure 2. Minimum System Configuration (Intel).
A variation on this scheme is to allow data to be written to
all RAM memories whether that RAM is currently a part of the basic
TMR configuration or a part of one of the spares. By writing to
all memories we can decrease the recovery time because updating would
be eliminated. The disadvantage of this method is that all RAM has
been running for the same time and thus is just as likely to be
faulted as the system that it is replacing.
Another choice of configurations is to have three RAMs triplicated
such that we vote on the correct output when data is read from these
memories. If an error is encountered, it will be masked by the
values of the two correct memories.
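The masking behaviour of triplicated memory can be illustrated with a
short Python sketch. It is not part of the original design: the names
`vote` and `voted_read` are hypothetical, and each RAM is modeled as a
plain list of byte values.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority of three byte values."""
    return ((a & b) | (a & c) | (b & c)) & 0xFF


def voted_read(rams, address: int) -> int:
    """Read one address from three RAM copies and return the voted byte."""
    a, b, c = (ram[address] for ram in rams)
    return vote(a, b, c)


# Three copies of a 4-byte RAM; the second copy is corrupted at address 2.
rams = [
    [0x10, 0x20, 0x30, 0x40],
    [0x10, 0x20, 0xFF, 0x40],
    [0x10, 0x20, 0x30, 0x40],
]
masked = voted_read(rams, 2)  # the two correct copies outvote the bad one
```

A single faulty copy is masked on every bit position; two copies that
fail on the same bit would not be.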
Checking I/O operations presents a special problem because
the I/O chips are connected to the bus on one side and to the
I/O devices on the other. Once data is transmitted to the I/O
chip, we assume that the I/O chip is performing properly. Data
transferred into the processor can be checked by the error
detecting network because this data is on the main general
purpose bus. We can triplicate the output of the I/O chips.
This would ensure that the I/O devices are receiving the correct
data.
At this point, we see that we can be certain that our system is
performing properly simply by triplicating memories and memory
transfers, and by triplicating the output of the I/O chips. We
must do the same for the bus which interfaces all these devices.
However, the amount of added logic far exceeds that of the basic
system configuration. This implies that if a fault should occur,
it is likely to be in the checking portion as opposed to the
basic computer system. In an attempt to obtain an optimum
configuration, it was decided to check transfers on the main data
and address busses only. Each microcomputer module will have its
own memory. The memory will be divided into ROM and RAM. The
RAM will be as small as possible, so that recovery time will be
minimal.
The configuration decided upon is shown in Figure 1. Three
of the microcomputer modules form the basic TMR configuration.
The remaining three are spares. The solid lines interconnecting
the processors to the data selectors represent the address and
data buses. The number of actual lines depends on processor size
and the bus configuration. Typically, there are sixteen address
lines and eight data lines. In the case of the 8085, there are
sixteen total lines because the lower address and data lines are
multiplexed on the same pins. If the system is designed such that
the address and data lines are demultiplexed between the processor
and the remainder of the system, then we have a normal address
and data bus. Otherwise, the checking network must be responsible
for checking when addresses are valid and when data is valid.
As can be seen from Figure 1, the selection of which
microcomputer module is currently in the TMR configuration is done
by the data selectors. The number of control lines determines
how many spares the system can have. A two-to-four line selector
can have three spares. A three-to-eight line selector can have
seven spares. Therefore, the number of spares is two to the n
minus one, where n is the number of control lines for the selector.
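The relation between control lines and spares can be written as a
one-line function. This is only an illustration; the name `max_spares`
is hypothetical, and it assumes one selector input is always taken by
the module that the spares back up.

```python
def max_spares(control_lines: int) -> int:
    """Spares reachable by a selector with the given number of control lines.

    A selector with n control lines has 2**n inputs; one input carries the
    primary module, so the remaining 2**n - 1 inputs are available for spares.
    """
    if control_lines < 1:
        raise ValueError("a selector needs at least one control line")
    return 2 ** control_lines - 1


two_line = max_spares(2)    # two-to-four line selector
three_line = max_spares(3)  # three-to-eight line selector
```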
The output of the selectors is fed into the voters and error
detector network. This section is responsible for the
identification of the failed microcomputer module and the generation
of the data signals. The error recovery network is responsible for the
sequencing of control information if an error does occur. This
network will interrupt the microcomputer modules and cause them
to store current state information.
The Next Available Spare (NAS) counter will be incremented and
this information will be directed to the data selector which is
in error. This causes another processor to be selected. The
TMR Controller will then cause the memories of the active processor
modules to re-write themselves. During this period, the error detector
is ignored but the voter is relied upon to bring the new memory
up to date.
Finally, the TMR Controller will cause the microcomputer modules
in the active processor configuration to restore the state. At this
point, the processing can continue and the Error Detector can look for
another fault.
Synchronization
One of the first problems to be solved is that of synchronization.
It may or may not be a problem. It depends on how the checking is to
be done on the microcomputer modules. It will also depend on what is
to be checked. Let's look at some possible checking schemes. One
approach is to check data after a given number of instructions have
been executed. There are several advantages to this scheme: 1)
timing is not as critical because we can simply count op-code fetch
cycles until we are ready to test data, 2) microcomputer modules can
be slightly out of sync during an interval, and 3) the checking will
be minimized because we only use a few cycles to check. The
disadvantages are: 1) an error may occur during the timing period that
will go undetected until a checking cycle is entered, 2) incorrect
output data or status can alter handshaking and data flow between
I/O devices and the microcomputer module, and 3) to make this scheme
efficient, there must be restrictions on the programmer and the
software that is written for the system.
Another scheme is to only check when data is being transferred.
The philosophy here is "until incorrect data or status is
transmitted beyond the system, no fault has occurred." The disadvantage
of this philosophy is that the memory may be completely degraded by the
time the fault is discovered. One solution to this problem is to
triplicate the memory so that incorrect data cannot degrade the
system. Of course this means that more logic has to be added to the
system.
It was decided that minimizing the logic is the best approach.
A parts count was the determining factor that led to this decision.
If the number of components is increased beyond that of the system,
then the checking network is more likely to be in error than is the
original microcomputer module. Consequently, it was decided that the
best approach was to check address and data transfers only. More
precisely, all addresses and data transfers to and from the processors,
memories and I/O devices are checked.
Checking all data transfers implies a tight coupling between
all microcomputer modules as well as the checking network. It
also means that all microcomputer modules must be executing the
same code in perfect sync with one another. Hence, it was necessary
to devise a scheme where only one clock is used for the entire
TMR system including the sparing modules as well as the voting
system.
By observing the timing mechanism of Figure 2, we can see another
problem. The clocking circuit for microprocessors has been placed
inside the chip with the logic so that the chip count can be reduced.
Synchronized timing can be obtained by the intelligent use of external
timing control lines such as READY, RESET-IN, or HOLD.
An attempt was made to use the READY input to synchronize the
microprocessors. The circuit of Figure 3 was used to hold
the READY line of all processors until each was in the op-code fetch
state. When a processor enters the op-code fetch state the status
lines IO/M, S1 and S0 are in state (011). This causes the first
one-shot to fire, changing the state of the op-code fetch flip-flop. When
all other processors have reached the same state, the second one-shot
will fire. If the single-step switch is inactive, the RDY signal will
be sent to all processors. The op-code fetch flip-flop is reset after
a short delay and waits for the next instruction to go to the op-code
fetch state.
Figure 3. A Synchronization Technique.
In observing an oscilloscope, all processors appeared to be in
perfect sync. However, we were not satisfied with this result because
the start-up conditions were unknown. It was decided to see
what would happen if the crystal oscillator of the first
microprocessor was placed across the clock circuit of all
microprocessors at the same time. With a counter attached to the
CLK-OUT (clock-out) signal, we measured the variation in frequency.
Table 1 shows the result of attempting to synchronize by using
a single crystal oscillator. As can be seen, the frequency will
decrease somewhat as more processors are connected.
Number of Processors     Measured Frequency
1                        3.07437 MHz
2                        3.07300 MHz
3                        3.07259 MHz
4                        3.07225 MHz
Table 1. Frequency Variation of Single Crystal Technique.
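To put the frequency variation in perspective, the short Python sketch
below expresses each measurement as a drop relative to the
single-processor case. It is an illustration only: the function name is
hypothetical, and it assumes the four measured values correspond to
one, two, three and four processors in that order.

```python
# Measured CLK-OUT frequencies in MHz, keyed by the number of processors
# sharing the crystal (the key 3 is an assumption about the table layout).
measured_mhz = {1: 3.07437, 2: 3.07300, 3: 3.07259, 4: 3.07225}


def drift_from_single(freqs):
    """Return {processors: (drop in Hz, drop in parts per million)}."""
    base = freqs[1]
    return {
        n: ((base - f) * 1e6, (base - f) / base * 1e6)
        for n, f in freqs.items()
    }


drift = drift_from_single(measured_mhz)
drop_hz_4, drop_ppm_4 = drift[4]  # roughly 2.1 kHz, under 0.1 percent
```

Under these assumptions the total change with four processors stays
below one part in a thousand of the nominal clock-out frequency.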
Error Definition
Once we are sure that a group of processors are executing
the same code at the same time, the next problem is to define what
is considered to be an error. Since the bus is being monitored
to determine if a fault has occurred, addresses, data, and
status must be continually observed. Addresses may be generated
by the processors. Data may be generated by the processors,
memory or input devices. Status may be generated by the processor
or input devices. In the case of the Intel 8085, the status lines
are WRITE, READ, IO/M, INTA, RESET-OUT and the processor state
lines S1 and S0.
In our study, research was limited to commercial microprocessors.
The Intel 8085 System Design Kits (SDK-85) obtained to form the basis
of this study have extra board space which allowed a TMR
network to be placed in close proximity to the system. The size of the
board also limited the error detecting to the addresses and the data
only. This was due to the limited space for adding the busses necessary
for monitoring.
Errors can exist on address, data and status lines. In
developing a TMR system, we must know when to check for errors.
By observing the instruction cycle, Figure 4, we find that it is
divided into a number of machine cycles, namely the opcode fetch,
memory read, memory write, I/O read, I/O write, interrupt
acknowledge, and bus idle machine cycles. All instructions consist
of at least an opcode fetch cycle.
Figure 4. Typical Instruction Cycle.
Let's look at a typical instruction, store register.
First the processor will enter the opcode fetch cycle in which
the contents of the program counter are placed on the address
bus. The READ command will be issued and the memory will respond
by placing the opcode on the data bus. Next we enter two memory
read cycles, one for the least significant address byte and one
for the most significant address byte for the storage of the
register content. In each of these memory read cycles, Figure 6a,
the contents of the program counter will be placed on the address
bus, the READ command will be issued and the memory will place a byte
on the data bus. Finally, the processor will enter a memory
write cycle (Figure 6b). The address obtained in the two memory read
cycles will be placed on the address bus, the WRITE command will be
issued and the contents of the register will be placed on the
data bus.
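The four machine cycles just described can be listed as the bus
transfers that a checker would observe. The Python sketch below is a
simplified model, not the report's hardware: the function name and the
dictionary-based memory are hypothetical, and the example uses the 8085
STA (store accumulator direct) opcode 32H as the "store register"
instruction.

```python
def store_direct_bus_trace(memory, pc, accumulator):
    """Return the (machine cycle, address, data) transfers of a 3-byte store."""
    opcode = memory[pc]           # opcode fetch: PC on the address bus
    low = memory[pc + 1]          # memory read: low byte of the target address
    high = memory[pc + 2]         # memory read: high byte of the target address
    target = (high << 8) | low    # address assembled from the two reads
    return [
        ("opcode fetch", pc, opcode),
        ("memory read", pc + 1, low),
        ("memory read", pc + 2, high),
        ("memory write", target, accumulator),
    ]


# STA 2050H located at 2000H, with A5H in the accumulator.
memory = {0x2000: 0x32, 0x2001: 0x50, 0x2002: 0x20}
trace = store_direct_bus_trace(memory, 0x2000, 0xA5)
```

Every tuple in `trace` corresponds to one address and one data value
that the error detector would have to compare across the three modules.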
The ultimate goal in this project is to detect errors
instantaneously and correct the system by bringing in a new spare as
soon as possible. This means every address and every data transfer
must be checked. By observing Figures 5, 6a, 6b, and 7 we see
that addresses are valid whenever the ALE line is active and data is
valid whenever WRITE, READ, or INTA is active. These lines
can be used to derive the times for checking addresses and data.
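The rule for deriving check times can be sketched as a small function.
This is an illustrative model only: the name `check_window` is
hypothetical, and each argument is a logical flag that is True when the
line is asserted, regardless of its electrical polarity on the 8085.

```python
def check_window(ale: bool, read: bool, write: bool, inta: bool):
    """Return which bus field should be compared right now, if any."""
    if ale:
        return "address"   # address is valid while ALE is active
    if read or write or inta:
        return "data"      # data is valid during READ, WRITE or INTA
    return None            # nothing to check on this clock


samples = [
    check_window(True, False, False, False),
    check_window(False, True, False, False),
    check_window(False, False, False, True),
    check_window(False, False, False, False),
]
```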
Figure 5. Op-code Fetch Machine Cycle (Intel).
Figure 6a. Memory READ Cycle (Intel).
Figure 6b. Memory WRITE Cycle (Intel).
Figure 7. Interrupt Acknowledge Cycle (Intel).
There are limitations on the size of the error detecting
network based on the speed of the microprocessor. The 8085
uses a 6.144 megahertz (MHz) crystal. This frequency is
divided by a factor of two to obtain a clock out signal of
3.072 MHz. The period of this clock cycle is 325.5 nanoseconds
(ns). From Figure 5 we see that addresses are valid on one
clock cycle and data is valid on the very next clock cycle. This
means that the checking must be accomplished in one clock cycle
time period (325.5 ns) or less, or the system has to be slowed by
using a READY signal.
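The timing budget follows directly from the crystal frequency. The
Python sketch below reproduces the arithmetic; the helper
`fits_in_one_clock` and the example delays are hypothetical and only
show how a candidate checker delay would be compared with the budget.

```python
CRYSTAL_HZ = 6.144e6          # crystal frequency
CLK_HZ = CRYSTAL_HZ / 2       # clock-out frequency after divide-by-two
PERIOD_NS = 1e9 / CLK_HZ      # one clock period in nanoseconds


def fits_in_one_clock(checker_delay_ns: float) -> bool:
    """True if the checking network settles within one clock period."""
    return checker_delay_ns <= PERIOD_NS


fast_enough = fits_in_one_clock(300.0)   # hypothetical checker delay
too_slow = fits_in_one_clock(400.0)      # would need READY wait states
```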
Should an error occur, the system must store the state of the
processors, update a new memory, and restore the state before a
spare can replace a failed microprocessor module. This will require
many machine cycles and hence the required time to perform this
operation is greater than 325.5 ns. Therefore, the computer
system must be temporarily stopped when a new spare is to be made
active.
The question arises, "What if an error occurs during a critical
timed I/O operation?" This condition can be handled by the
hardware of the Error Recovery System and the software generated by
the programmer for this system. Prior to entering a critical
timed I/O operation, the software programmer can disable the
Error Recovery until the system exits the critical timing period.
The addresses and data used during this period can be the "voted"
addresses and data.
TMR Controller
The key to the entire design is the TMR controller. Its purpose
is to 1) bring the microcomputer modules up in a TMR configuration
by selecting three modules as the heart of the system, 2) allow the
system to continue to run until an error is detected, 3) disallow
any interruptions during a critical I/O operation, 4) detect and
identify failed modules, 5) cause all modules to save their state of
operation when an error is discovered, 6) select a correct module's
memory to copy and write the RAM section of the new processor, 7)
restore the state of operation after the memory update, and 8)
start looking for another error.
The state diagram for the TMR controller is shown in Figure 8.
After the system is reset, the TMR controller will set the processor
selectors to 00. Thus, the controller observes the buses of the three
main microcomputer modules. The controller then goes to a RUN state.
The controller will remain in this state until either there is a
critical I/O operation or an error is discovered.
If a critical I/O operation is to take place the controller
will set the Critical I/O flip-flop (CIO). When this operation
is complete, the CIO flip-flop will be reset.
If an error is detected, the controller will exit the RUN
state and the CIO flip-flop will be tested. If the CIO flip-flop
is set, the error detect and error recovery circuits will be
disabled. The controller will continue to RUN with an error until
the CIO flip-flop is reset. When this happens, the controller
will issue a vector interrupt to all active modules. The modules
store all pertinent information regarding the present condition
of the task. The failed module may have completely deteriorated
by this time but the remaining good modules will have accurate
information in their state vector.
Figure 8. TMR Controller State Diagram.
The TMR controller must now select a correct module and update
the memory of the new module to be brought into the main TMR
configuration, or a correct memory can be read and the information
broadcast to all memories of all processors. Along with the
information transfer will be the state vector of the correct processor.
The next available spare (NAS) counter will be incremented
to select the next module to be brought into the main configuration.
The command is issued to restore the state of the machine.
When this happens, all processors have the same state
vector. Finally, the TMR controller will go back to the RUN
state and look for new errors.
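The controller sequence described in this section can be summarized as
a small state machine. The Python sketch below is a behavioural
approximation, not a transcription of Figure 8: the state names, the
`WAIT_CIO` holding state and the function `next_state` are assumptions
introduced for illustration.

```python
from enum import Enum, auto


class State(Enum):
    RUN = auto()            # normal operation, error detector enabled
    WAIT_CIO = auto()       # error seen, critical I/O still in progress
    STORE_STATE = auto()    # vector interrupt: modules save their state
    UPDATE_MEMORY = auto()  # copy RAM from a correct module
    SELECT_SPARE = auto()   # increment the NAS counter, switch selector
    RESTORE_STATE = auto()  # modules reload the state vector


# Fixed order of the recovery steps once recovery has started.
_RECOVERY = {
    State.STORE_STATE: State.UPDATE_MEMORY,
    State.UPDATE_MEMORY: State.SELECT_SPARE,
    State.SELECT_SPARE: State.RESTORE_STATE,
    State.RESTORE_STATE: State.RUN,
}


def next_state(state, error=False, cio_set=False):
    """Advance the controller by one step."""
    if state is State.RUN and not error:
        return State.RUN
    if state in (State.RUN, State.WAIT_CIO):
        # Recovery is deferred while the Critical I/O flip-flop is set.
        return State.WAIT_CIO if cio_set else State.STORE_STATE
    return _RECOVERY[state]


# An error arrives during critical I/O, then the CIO flip-flop is reset.
path = [State.RUN]
path.append(next_state(path[-1], error=True, cio_set=True))
path.append(next_state(path[-1], cio_set=False))
while path[-1] is not State.RUN:
    path.append(next_state(path[-1]))
```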
The TMR controller block diagram is shown in Figure 9.
The purpose of the Error Decoder is two-fold. First, this
component will be responsible for detecting errors and second,
it will identify which microcomputer module is in error. The
diagram of Figure 9 shows the Error Decoder as having only a
few output lines. In actuality the lines shown must be multiplied
by the number of address, data and status lines to be checked.
Since only one error line and three ID lines are required, the
concentrator circuits of Figure 9 are required. The ALE, WRITE,
and READ lines are required to establish timing signals. These
signals tell the TMR controller when addresses and data are valid.
Since we do not know which ALE, WRITE, or READ signal will be
valid, it is necessary to concentrate all of them and vote
on when data is valid.
Figure 9. TMR Controller Block Diagram.
The new processor selector (NPS) function is to keep track
of which processor is next to be brought into the TMR network.
This is accomplished by using a clock and a counter. The counter
is enabled only when there is an error. The output of the counter
goes to a selector latch only if the corresponding microcomputer
module is the one at fault. The network is shown in Figure 10.
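The counter and latch behaviour can be modeled in a few lines of
Python. This is a sketch under stated assumptions, not the circuit of
Figure 10: the class and attribute names are hypothetical, selector
code 0 is assumed to select the original module of each TMR position,
and codes 1 and up are assumed to select the spares in order.

```python
class NewProcessorSelector:
    """Counter plus one selector latch per TMR position."""

    def __init__(self, control_lines: int = 2, positions: int = 3):
        self.last_code = 2 ** control_lines - 1  # highest selector code
        self.counter = 0                         # next-spare counter
        self.latches = [0] * positions           # code 0: original modules

    def on_error(self, failed_position: int) -> int:
        """Advance the counter and load it into the failed position's latch."""
        if self.counter >= self.last_code:
            raise RuntimeError("no spare left")
        self.counter += 1                        # counter enabled only on error
        self.latches[failed_position] = self.counter
        return self.counter


nps = NewProcessorSelector()
first = nps.on_error(1)    # position 1 fails: it now selects spare code 1
second = nps.on_error(0)   # position 0 fails: it now selects spare code 2
```

Only the latch of the failed position changes; the other two positions
keep the modules they were already using.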
The DMA controller has several functions. It is responsible
for 1) creating the correct RAM addresses, 2) sending the READ
command out to a correct microcomputer's memory, 3) receiving
the data into a buffer, 4) broadcasting the WRITE command to all
memories, 5) checking for the last address, and 6) incrementing the
address counter. The basic part of the DMA controller is shown
in Figure 14.
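The six DMA functions map onto a simple copy loop. The Python sketch
below is a software analogy of the hardware controller; the function
name, the argument names and the use of `bytearray` objects as RAM are
assumptions made for illustration.

```python
def dma_update(memories, source, start, last):
    """Copy RAM[start..last] from one trusted module into every module.

    memories: list of bytearray, one per microcomputer module
    source:   index of the module whose RAM is assumed correct
    Returns the number of bytes transferred.
    """
    address = start                          # 1) create the RAM address
    while True:
        buffer = memories[source][address]   # 2) READ, 3) receive into buffer
        for ram in memories:                 # 4) broadcast WRITE to all memories
            ram[address] = buffer
        if address == last:                  # 5) check for the last address
            break
        address += 1                         # 6) increment the address counter
    return last - start + 1


good = bytearray([1, 2, 3, 4])
memories = [bytearray(good), bytearray(4), bytearray([9, 9, 9, 9])]
copied = dma_update(memories, source=0, start=0, last=3)
```

Because the system state is kept in RAM, the same loop also carries the
saved state vector to the newly selected module.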
Figure 10. New Processor Selectors.
Error Detection
In this section we want to identify the point at which
we detect errors. In the TMR configuration, the error detector
is constantly observing data transfers of the three microcomputer
modules. Let's look at the first data line D(0) of the three
modules that make up the TMR portion of the network. Table 2
is a list of the eight possible states of this data line. If
the error detector receives 000 or 111, then it is assumed that
all modules are performing as required and there are no faults.
If the result is either 011 or 100, then it is assumed that
the first module is at fault. A "011" would indicate that modules
2 and 3 are reporting a '1' and module 1 is reporting a '0'.
On the other hand, if "100" is recorded, then module 1 is
reporting a '1' while modules 2 and 3 are reporting '0'.
Using the same argument, we can see that "101" and "010" would
indicate a fault on module 2 and a "110" or a "001" would
indicate a fault on module 3.
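The eight codes can be written out as a lookup table. The Python
sketch below restates the argument above for a single bus line; the
names `ERROR_ID` and `failed_module` are hypothetical.

```python
# Three-bit code (module 1, module 2, module 3) -> module at fault, or None.
ERROR_ID = {
    0b000: None,
    0b001: 3,
    0b010: 2,
    0b011: 1,
    0b100: 1,
    0b101: 2,
    0b110: 3,
    0b111: None,
}


def failed_module(a: int, b: int, c: int):
    """Return the number of the disagreeing module for one bus line."""
    return ERROR_ID[(a << 2) | (b << 1) | c]


example = failed_module(0, 1, 1)  # modules 2 and 3 agree, module 1 differs
```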
One method of detecting these errors is with the exclusive-or
tree. However, the codes above could also be entered into a
decoder. The advantage of using a decoder is obvious. If line 0 or
7 of the decoder is activated, then there is no fault. If line 3
or 4 is activated, microcomputer module #1 is at fault. Module
#2 is at fault when line 2 or 5 becomes active, and module #3 is at
fault when line 1 or 6 is active. This portion of the error detection
network is shown in Figure 11 as the check decoders.
Figure 11. Error Detect & Identification.
We can use the same decoders to determine the correct output.
By merely connecting the output of the decoder to a different
network (Figure 13) the correct value can be obtained. This
may be necessary in situations where we cannot stop the processors
to make a switch. By observing Figure 13 we find that what is required
is to take lines 3, 5, 6, and 7 to a four-input NAND gate. If the
output is voted to be a '1', then one of these four lines will be
low, causing the output to be a logical '1'.
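That the four-input NAND gate yields the majority value can be checked
with a small model. The Python sketch below assumes a 3-to-8 decoder
with active-low outputs, with module 1 on the most significant input;
the function names are hypothetical.

```python
def decode_active_low(a: int, b: int, c: int):
    """3-to-8 decoder: the selected output line is 0, all others are 1."""
    code = (a << 2) | (b << 1) | c
    return [0 if line == code else 1 for line in range(8)]


def nand(*inputs: int) -> int:
    """NAND gate: output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1


def correct_output(a: int, b: int, c: int) -> int:
    """Voted value of one bus line, taken from decoder lines 3, 5, 6 and 7."""
    lines = decode_active_low(a, b, c)
    return nand(lines[3], lines[5], lines[6], lines[7])


voted = correct_output(1, 0, 1)  # module 2 disagrees; the other two report 1
```

Lines 3, 5, 6 and 7 are exactly the codes in which at least two of the
three modules report a '1', which is why the gate acts as a voter.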
There is also a need for this error detector to be
self-checking. In observing Figure 11 we see that whether there is a
fault or not, the NAND gates must maintain a condition in which
there is only one logical '1'. The most probable fault is that
the decoder does not decode, that is, all outputs remain high.
Of course, another possible fault is that two or more lines are
decoded at the same time. In either case, circuitry can be added
to detect these failure conditions. The simplest condition to
detect is all outputs remaining high. This can also be done with
a single 4-input gate.
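The self-checking condition can be expressed as a test on the four
NAND gate outputs. The Python sketch below is an assumption-laden
model: the pairing of decoder lines to gates 0 through 3 follows the
line assignments given in the text, and the function names are
hypothetical.

```python
def nand(*inputs: int) -> int:
    """NAND gate: output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1


def gate_outputs(lines):
    """Outputs of gates 0..3 fed by the active-low decoder lines."""
    return [
        nand(lines[0], lines[7]),  # gate 0: no error
        nand(lines[3], lines[4]),  # gate 1: module 1 at fault
        nand(lines[2], lines[5]),  # gate 2: module 2 at fault
        nand(lines[1], lines[6]),  # gate 3: module 3 at fault
    ]


def decoder_looks_healthy(lines) -> bool:
    """True when exactly one gate output is a logical '1'."""
    return sum(gate_outputs(lines)) == 1


stuck_high = [1] * 8                      # decoder does not decode at all
double = [0, 1, 1, 0, 1, 1, 1, 1]         # lines 0 and 3 decoded together
stuck_detected = not decoder_looks_healthy(stuck_high)
double_detected = not decoder_looks_healthy(double)
```

This check does not catch two lines decoded together when both feed
the same gate (for example lines 0 and 7), so it is a partial
self-check only.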
Error Reporting
The scheme for detecting faults on one line is shown in
Figure 11. The first problem is to determine how many lines of the
bus we want to check. It was decided that the checking would be
done over the address and data lines only. This decision was
reached by considering the cabling width and cable count of the
hardware that we used. In an actual system, the cabling size
would not be a determining factor. Reliability would have highest
priority.
The output of 1jand Sates 1	 through	 of Figure 11	 is fed to
the input of F i gure 12 such that, the signal line EDG1	 represents Ji
the error status of TMR module-1, EDGS2 represents the error status
of 1.11,1k module-2,	 and EDGS3 represents the error status of TMIR
modtile-3.	 These three signals are latched to preserve the error
information.
The output of Nand gate 0 of Figure 11	 is fed into the upper
network of Figure 12 to determine if a fault has occurred or not.
The output of this network is fed into a one-shot to produce a
latch pulse for the error detector network.
The next problem is to determine when we should look for an
error. Since the address and data lines are to be checked, it is
obvious that we must check whenever addresses or data are to be
transferred over the bus. Observing Figures 4, 5, 6a, 6b and 7
we see that addresses are valid whenever ALE is high and data
is valid when either WRITE, READ, or INTERRUPT ACKNOWLEDGE is
valid. We use this information (in TMR form) to fire the one-shot,
which will produce a latch pulse if there is a fault.

Figure 12. Error Concentrator.

Figure 13. Correct Output Generator.

A B C   RESULTS
0 0 0   No Error
0 0 1   Error on Processor 3
0 1 0   Error on Processor 2
0 1 1   Error on Processor 1
1 0 0   Error on Processor 1
1 0 1   Error on Processor 2
1 1 0   Error on Processor 3
1 1 1   No Error

Table 2. Error ID Codes.

Figure 14. DMA Controller for Memory Update.

Figure 15. Control (ALE) Concentrator.

One problem that should be solved is that of knowing
which control lines (ALE, READ, or WRITE) to use to generate
the timing pulses for the error checking. Under the single fault
assumption, the failure is assumed to be associated with the
address or data bus. We can therefore use the circuit of
Figure 15 to concentrate the control line signals.
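
The latching rule of this section can be sketched in Python. The class
and parameter names are hypothetical, the control signals are modeled
as already concentrated and active-high, and the latched code is
assumed to be active-low (a 0 marks the failed module), which is
consistent with the EIDL values 111, 011 and 101 used in the next
section.

```python
# Sketch only: names, polarities and the latch model are assumptions.

class ErrorIdLatch:
    """Holds the error status of the three modules until the next fault."""

    def __init__(self):
        self.eidl = (1, 1, 1)   # 111 means no module has been flagged

    def clock(self, failed_module, ale=0, read=0, write=0, inta=0):
        """Latch a new error code only while the bus content is valid.

        failed_module is 1, 2, 3 or None (no fault on the checked lines).
        """
        bus_valid = bool(ale or read or write or inta)
        if bus_valid and failed_module is not None:
            # The one-shot fires: record which module disagreed.
            self.eidl = tuple(0 if m == failed_module else 1 for m in (1, 2, 3))
        return self.eidl
```

A disagreement seen while no control line is valid is ignored, and a
latched code is preserved through later fault-free bus cycles.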
Memory Exchange
The exchange of data between a good processor's memory and
a new processor to become a part of the TMR configuration is
handled in four steps. The first step is to store the present
state of the processors. This is accomplished by using a vector
interrupt and the algorithm of Figure 16a. The second step is
to select a correct memory to read from. The third step is to use
a DMA controller to read data into a buffer, one datum at a time,
and to write the contents of that buffer to all memories. The
fourth step is to restore the state of the processors by using the
algorithm of Figure 16b.

A machine to select a correct memory to read from can be designed
by using the state diagram approach. The state diagram is shown in
Figure 17. The input to the state diagram is taken from the Error
ID Latch (EIDL) of Figure 12. If EIDL is 111, then there is no
error and the machine will remain in state S0. If EIDL is 011,
then the first microcomputer module has failed. Microcomputer module
2 or 3 can be used as the correct memory. The machine will go to
state S1. The output associated with state S1 must indicate that
memory 2 or 3 is available to be copied. We chose 2 to be specific.
Let's suppose that EIDL has 101 while the machine is in state S0.
Microcomputer module 2 has failed. The machine will go to state S2.
Memories 1 and 3 are available for copying. Since we must choose
one of these, we select the first one to copy. The selection of
memories continues until there are two or fewer processors in the
system. Presently, the machine is designed to go to a HALT state.
However, more work can be done on this [illegible]
degrade more slowly. One final note on the machine. The bottom
part of each state indicates the status of the processors. Where
there is a 0 in a position, this represents an inactive processor.
The positions with a 1 indicate a good processor and a position
with an underlined 1 indicates an active processor.

PUSH    PSW             / save accumulator and flags
PUSH    B               / save register pairs B, D and H
PUSH    D
PUSH    H
MVI     B, '00FF'
PUSH    B
LXI     H,0             / clear H,L
DAD     SP              / H,L = H,L + SP, so H,L holds the stack pointer
STORE   H,L             / at special address
PUSH    H
STA     8000

Figure 16a. Software to Store the State of Processors.

LOAD    SP              / from special address
POP     B
POP     H
POP     D
POP     B
POP     PSW
POP     PC

Figure 16b. Software to Restore State of Processors.

The machine is implemented in ROM. Figure 18 indicates the
essential locations and their contents. The Address Field is
partitioned into two parts, the input from Figure 12 and the present
state. The Content Field is also partitioned into two parts, the
encoded output and the next state. Figure 19 is a physical
realization of the Memory Select ROM network.
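
The ROM-based state machine can be sketched in Python using the first
few rows of Figure 18. The bit packing is an assumption (the 3-bit
input is taken as the high address bits and the 3-bit encoded output as
the high content bits), and the names are hypothetical. Only seven of
the listed locations are included.

```python
# Sketch only: bit packing and names are assumptions; rows are from Figure 18.

# (IN, present state, OUT, next state), copied as bit strings.
ROWS = [
    ("111", "00000", "000", "00000"),
    ("011", "00000", "010", "00001"),
    ("101", "00000", "001", "00010"),
    ("110", "00000", "001", "00011"),
    ("011", "00001", "010", "00100"),
    ("101", "00001", "011", "00101"),
    ("110", "00001", "010", "00110"),
]

def pack(high3, low5):
    """Join a 3-bit field and a 5-bit field into one 8-bit word."""
    return (int(high3, 2) << 5) | int(low5, 2)

# ROM image: 8-bit address -> 8-bit content.
ROM = {pack(i, ps): pack(out, ns) for i, ps, out, ns in ROWS}

def step(eidl, state):
    """Look up one transition; returns (encoded output, next state)."""
    word = ROM[(eidl << 5) | state]
    return word >> 5, word & 0b11111
```

With EIDL = 011 in state S0, step(0b011, 0) gives output 010 and next
state 00001. Reading the output as a memory number, this agrees with
the choice of memory 2 described above. Likewise step(0b101, 0) gives
output 001, the first memory, and next state 00010.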

The DMA controller consists of a small counter and decoder
to generate DMA microcommands and four hexadecimal counters to
generate the addresses for the memory exchange. The hexadecimal
counters are concatenated such that the carry out of one stage
becomes the clock-in of the next stage. The counters can be
parallel loaded to start at any address. The DMA flip-flop
is controlled by external logic and is reset by the Last Address
Gate when the cycle is complete. The generated ALE, READ, and
WRITE signals are produced by the decoder. First the ALE signal
is generated to indicate that the addresses are valid. Then the
READ command is issued to only one processor as specified by the
ROM network. At this point data is transferred into a buffer.
Next, the ALE signal is again generated and a WRITE command is
sent to all processors. Finally, the Cycle Complete signal is
generated. The controller is finished when a Cycle Complete
signal and a Last Address signal are received.
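
The microcommand sequence above can be sketched in Python as a simple
behavioral model. It is not the hardware of Figure 14: the function and
variable names are hypothetical, memories are modeled as Python lists,
and the four cascaded hexadecimal counters are assumed to form one
16-bit address.

```python
# Sketch only: a behavioral model with hypothetical names.

def dma_update(memories, source, start, last):
    """Copy memories[source][start..last] into every memory.

    memories maps a module number to a list of stored values.
    Returns the trace of microcommands that were issued.
    """
    trace = []
    address = start                          # counters are parallel loaded
    while True:
        trace.append(("ALE", address))       # address valid
        buffer = memories[source][address]   # READ from the selected module only
        trace.append(("READ", source))
        trace.append(("ALE", address))       # address valid again
        for memory in memories.values():     # WRITE goes to all modules
            memory[address] = buffer
        trace.append(("WRITE", "all"))
        trace.append(("CYCLE_COMPLETE", address))
        if address == last:                  # Last Address Gate ends the transfer
            break
        address = (address + 1) & 0xFFFF     # four 4-bit stages, carry chained
    return trace
```

For instance, with memories = {1: [1, 2, 3, 4], 2: [1, 2, 3, 4],
3: [0, 0, 0, 0]}, calling dma_update(memories, 1, 0, 3) leaves memory 3
equal to [1, 2, 3, 4] and issues five microcommands per address.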

  ADDRESS      CONTENT        ADDRESS      CONTENT
IN   P.S.    OUT  N.S.      IN   P.S.    OUT  N.S.
111  00000   000  00000     110  00111   011  10000
011  00000   010  00001     011  01000   100  01101
101  00000   001  00010     101  01000   001  10000
110  00000   001  00011     110  01000   001  10000
011  00001   010  00100     011  01001   010  01100
101  00001   011  00101     101  01001   001  10000
110  00001   010  00110     110  01001   001  01111
001  00010   011  00101     011  01010   111  10000   HALT
110  00010   001  01000     011  01100   111  10000   STATE
011  00011   010  00110     011  01101   111  10000
101  00011   001  01000     011  01110   111  10000
110  00011   001  01001     011  01111   111  10000
011  00100   010  01010     101  01010   111  10000
101  00100   011  01011     101  01011   111  10000
110  00100   010  01100     101  01100   111  10000
011  00101   011  01011     101  01101   111  10000
101  00101   011  01011     101  01110   111  10000
110  00101   100  01101     101  01111   111  10000
011  00110   010  01100     110  01010   111  10000
101  00110   100  01101     110  01011   111  10000
110  00110   010  01100     110  01100   111  10000
011  00111   011  01011     110  01101   111  10000
101  00111   001  01110     110  01110   111  10000
                            110  01111   111  10000

Figure 18. Essential ROM Contents for Memory Select.

Figure 19. READ Select Network.

Summary
This project was started during a period when 8-bit
microprocessors were the state of the art. Now 16-bit and 32-bit
microprocessors are almost commonplace. During this period
microcomputers on a single chip have also become common, but they
have little memory available on the chip. The methodology presented
in this report is compatible with any microprocessor based computer
system. However, if there were more test points available, the job
of increasing reliability would be much easier.

Typical microprocessors have 40 pins available to the user.
This means that little more than addresses and data are available.
By placing the clock circuit inside the processor, almost no checking
can be done on the timing circuit which is required to synchronize
a TMR system. A tightly coupled real-time system is required for
efficient aerospace computers. Therefore, the system designer must
make use of every test point available.

There are a number of different approaches. We can triplicate
memories, add software checks or insert software breakpoints. But
the most efficient system is one in which testing is transparent to
the processing of the system. This report presents such an approach.

Synchronization has been accomplished by using a single crystal.
As the number of processors is increased, the frequency decreases.
The decreases are generally less than 100 hertz per processor.
This is a minimal decrease in performance, and a significant reduction
in part count for this function.

Instead of using voters, our approach uses logic decoders for
detecting an error and identifying the failed module. Again, the
part count is reduced by getting two functions performed for the
price of one.

The design also uses a ROM as a circuit element to select
a good memory to copy when an error occurs. The alternative was to
use a triplicated memory. This technique has several shortcomings.
First, if one of the memory modules is lost due to the first
failure, then the second failure takes the system down. Second,
a system that is to be designed around one of the microcomputers-
on-a-chip cannot use this technique.

Overall, the techniques examined in this report will aid
in the design of a high-speed aerospace computer regardless of
whether that system is designed around 8, 16, or 32 bit
microprocessors or microcomputers-on-a-chip.
