Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis by Babikyan, C. A. et al.
NASA Contractor Report 189632, Volume II
Advanced Information Processing System:
The Army Fault Tolerant Architecture
Conceptual Study
Volume II- Army Fault Tolerant Architecture
Design and Analysis
R. E. Harper, L. S. Alger, C. A. Babikyan, B. P. Butler, S.
A. Friend, R. J. Ganska, J. H. Lala, T. K. Masotto, A. J.
Meyer, D. P. Morton, G. A. Nagle, C. E. Sakamaki
The Charles Stark Draper Laboratory, Inc.
Cambridge, MA
Contract NAS1-18565
July 1992
National Aeronautics and
Space Administration
Langley Research Center
Hampton, Virginia 23665-5225
https://ntrs.nasa.gov/search.jsp?R=19920023857 2020-03-17T11:21:40+00:00Z
Executive Summary
Digital computing systems needed for Army programs such as the Computer-Aided
Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM)
vehicles may be characterized by high computational throughput and input/output
bandwidth, hard real-time response, high reliability and availability, and maintainability,
testability, and producibility requirements. In addition, such a system should be affordable
to produce, procure, maintain, and upgrade.
To address these needs the Army Fault Tolerant Architecture (AFTA) is being designed
and constructed under a three-year program comprising the Conceptual Study, Detailed
Design and Fabrication, and Demonstration and Validation phases. This report describes
the results of the Conceptual Study phase of the AFTA development. The scope of the
Conceptual Study was quite broad and covered topics ranging from mission requirements
to architectural synthesis and analysis to life cycle cost modeling.
AFTA is a militarized version of the Fault Tolerant Parallel Processor (FTPP)
developed by the Charles Stark Draper Laboratory, Inc. AFTA is a hard-real-time Byzantine
resilient parallel processor which is programmed in the Ada language. It supports testability
and redundancy management strategies which permit the dynamic reconfiguration of
processing sites to enhance sortie availability and mission reliability. It is composed largely of
Non-Developmental Items to reduce the development risk and cost and to facilitate
upgrades. Extensive analytical models and predictive verification and validation techniques
are provided with AFTA to allow application designers to engineer a configuration for
specific missions with a high degree of confidence that the fielded configuration will meet the
mission requirements. As a part of AFTA, a fault tolerant data bus (FTDB) is being
developed to provide a highly reliable, fault tolerant networking system between AFTA and
other digital systems. The conceptual design of the FTDB covers many aspects of network
design, including media technology, media access control, topology, routing, OSI protocol
stacks, and fault detection and recovery. In addition to these traditional network topics, the
FTDB also encompasses techniques from the area of fault-tolerance, including Byzantine
resilience and authentication protocols.
AFTA's architectural theory of operation, the AFTA hardware architecture and
components, and the architecture of the AFTA Operating System have been defined during the
Conceptual Study, as well as a test and maintenance strategy for use in fielded AFTA
installations. A format has been developed for representing mission requirements in a
manner suitable for first-order AFTA sizing and analysis. Preliminary requirements have been
obtained for two Army missions: a rotary winged aircraft mission and a ground vehicle
mission. An approach to be used in reducing the probability of AFTA failure due to
common-mode faults has been developed, as have analytical models for AFTA performance,
reliability, availability, life cycle cost, weight, power, and volume. A plan has been
developed for verifying and validating key AFTA concepts during the Dem/Val phase, especially
those which cannot be cost-effectively validated by accelerated life cycle testing. The
analytical models and partial Army mission requirements developed under the Conceptual
Study have been used to evaluate AFTA configurations for the two selected Army
missions. To assist in documentation and reprocurement of AFTA components, VHDL is
used to describe and design AFTA's developmental hardware. Finally, the requirements,
architecture, and operational theory of the AFTA Fault Tolerant Data Bus have been defined
and described.
The next phase of the development has begun and will result in a Brassboard AFTA for
demonstration and validation.
Table of Contents
Executive Summary .......... iii
Table of Contents .......... v
List of Figures .......... xv
List of Tables .......... xix
Introduction to Volumes I and II .......... xxi
4. AFTA Hardware Architecture ............................................................. 4-1
4.1. AFTA Physical Configuration .......... 4-1
4.2. AFTA Virtual Configuration .......... 4-2
4.3. AFTA Functional Overview .......... 4-3
4.4. AFTA Network Element .......... 4-6
4.4.1. Network Element Addressing Convention .................................... 4-6
4.4.2. Network Element Functional Description ..................................... 4-8
4.4.2.1. Data Exchange Primitives .............................. 4-9
4.4.2.1.1. Class 0 .......... 4-10
4.4.2.1.2. Class 1 .......... 4-10
4.4.2.1.3. Class 2 .......... 4-11
4.4.2.1.4. Broadcasts .......... 4-11
4.4.2.2. Configuration Table Updates .......... 4-11
4.4.2.3. Initial Synchronization .......... 4-12
4.4.2.4. Transient NE Recovery .......... 4-14
4.4.2.5. Voted Resets/Monitor Interlocks .......... 4-15
4.4.2.6. Syndrome Reports .......... 4-16
4.4.2.7. Timestamps .......... 4-18
4.4.2.8. NE Debug Commands .......... 4-19
4.4.3. Network Element Programming Reference ................................... 4-20
4.4.3.1. Processor/Network Element Interface .................................... 4-20
4.4.3.2. Memory Map ................................................................. 4-21
4.4.3.2.1. Data Segment ............................................................. 4-21
4.4.3.2.1.1. Outgoing (Transmit) Buffering .................................. 4-22
4.4.3.2.1.2. Incoming (Receive) Buffering ................................... 4-23
4.4.3.2.1.3. Information Block Fields ......................................... 4-24
4.4.3.2.1.3.1. Class .......... 4-24
4.4.3.2.1.3.2. ToVID .......... 4-26
4.4.3.2.1.3.3. FromVID .......... 4-26
4.4.3.2.1.3.4. User Field .......... 4-26
4.4.3.2.1.3.5. Vote errors .......... 4-26
4.4.3.2.1.3.6. Clock Errors .......... 4-27
4.4.3.2.1.3.7. Link Errors .......... 4-27
4.4.3.2.1.3.8. OBNE timeout .......... 4-27
4.4.3.2.1.3.9. IBNF timeout .......... 4-28
4.4.3.2.1.3.10. Scoreboard Vote Error .......... 4-28
4.4.3.2.1.3.11. Timestamp .......... 4-29
4.4.3.2.2. Buffer Manager .......... 4-29
4.4.3.3. Packet Formats .......... 4-31
4.4.3.3.1. Data Packet .......... 4-31
4.4.3.3.2. CT Update Packet .......... 4-31
4.4.3.3.3. Transient NE Recovery Packet .......... 4-34
4.4.3.3.4. Voted Reset Packet .......... 4-35
4.5. AFTA Component Physical Descriptions .......... 4-36
4.5.1. Processing Element (PE) Characteristics .......... 4-41
4.5.2. Network Element (NE) Characteristics .......... 4-47
4.5.2.1. Network Element Overview .......... 4-47
4.5.2.1.1. VMEbus Interface .......... 4-48
4.5.2.1.2. Network Element Data Paths .......... 4-49
4.5.2.1.3. Inter-FCR Communication System .......... 4-51
4.5.2.1.4. Fault-Tolerant Clock .......... 4-52
4.5.2.1.5. Global Controller .......... 4-53
4.5.2.1.6. Scoreboard .......... 4-53
4.5.2.2. Network Element Physical Characteristics .......... 4-55
4.5.2.2.1. Circuit Board Layout .......... 4-55
4.5.2.2.2. Military Qualification of Baseline Network Element .......... 4-57
4.5.2.3. Effect of Implementation Technology on Network Element Physical Characteristics .......... 4-60
4.5.2.3.1. RISC Processor Scoreboard .......... 4-60
4.5.2.3.2. FPGA Implementation .......... 4-61
4.5.2.3.3. High-End Network Element ............................................ 4-62
4.5.3. Input/Output Controller (IOC) Characteristics ................................ 4-67
4.5.4. Power Conditioner (PC) Characteristics ...................................... 4-68
4.5.5. Cooling System .................................................................. 4-70
5. AFTA Software Architecture .............................................................. 5-1
5.1. Overview .......... 5-1
5.2. System Specification and Initialization .............................................. 5-2
5.2.1. Virtual Group Configuration ................................................... 5-3
5.2.2. Rate Group Task Configuration ................................................ 5-7
5.3. Rate Group Tasking Services .......... 5-10
5.3.1. Rate Group Tasking Initialization .............................................. 5-11
5.3.2. Rate Group Dispatcher .......................................................... 5-13
5.3.3. Rate Group Tasks ................................................................ 5-15
5.4. Time Management ..................................................................... 5-18
5.4.1. Time Management Initialization .......... 5-18
5.4.2. Time Management Operation ................................................... 5-19
5.5. Communication Services ............................................................. 5-20
5.5.1. Message and Packet Structure .................................................. 5-21
5.5.2. Communication Services Initialization ........................................ 5-24
5.5.3. Message Transmission .......................................................... 5-30
5.5.4. Message Reception .............................................................. 5-32
5.6. Fault Detection, Identification and Recovery ....................................... 5-37
5.6.1. System and Test Modes ......................................................... 5-37
5.6.2. Off-Line Fault Detection, Isolation and Recovery .......... 5-38
5.6.3. Local Fault Detection, Isolation and Recovery ............................... 5-39
5.6.4. System Fault Detection, Isolation and Recovery ............................. 5-39
5.6.5. Operational Modes ............................................................... 5-39
5.6.6. Fault Detection Mechanisms .................................................... 5-40
5.6.6.1. Enumeration of Mechanisms ............................................... 5-41
5.6.6.1.1. Processor Self Tests .......... 5-41
5.6.6.1.1.1. CPU Tests .......... 5-41
5.6.6.1.1.2. Cache Tests .......... 5-41
5.6.6.1.1.3. Memory Tests .......... 5-42
5.6.6.1.1.4. MMU Tests .......... 5-42
5.6.6.1.1.5. I/O Tests .......... 5-44
5.6.6.1.1.6. Miscellaneous Tests .......... 5-44
5.6.6.1.2. Network Element Self Tests .......... 5-44
5.6.6.1.2.1. Processor-Network Element Interface .......... 5-44
5.6.6.1.2.2. Network Element Data Paths .......... 5-45
5.6.6.1.2.3. Network Element Global Controller .......... 5-45
5.6.6.1.2.4. Scoreboard .......... 5-45
5.6.6.1.2.5. Inter-Fault Set Communication Links .......... 5-45
5.6.6.1.2.6. Voted Reset .......... 5-45
5.6.6.1.2.7. Fault Tolerant Clock .......... 5-45
5.6.6.1.3. FCR Backplane Bus Self Tests ......................................... 5-46
5.6.6.1.4. Input/Output Device Self Tests ......................................... 5-46
5.6.6.1.5. Power Conditioner Self Tests .......................................... 5-46
5.6.6.1.6. Mass Memory Self Tests .......... 5-46
5.6.6.1.7. System Tests .......... 5-46
5.6.6.2. Operational Constraints of Fault Detection Mechanisms ................ 5-48
5.6.6.2.1. I-BIT Mode Self Tests .......... 5-50
5.6.6.2.2. M-BIT Mode Self Tests .......... 5-51
5.6.6.2.3. I-BIT Mode System Tests ............................................... 5-51
5.6.6.2.4. M-BIT Mode System Tests ............................................. 5-52
5.6.6.2.5. C-BIT Mode Tests ....................................................... 5-52
5.6.6.3. Mapping of Fault Detection Mechanisms to Test Modes ............. 5-52
5.6.6.3.1. Processor Self Tests ..................................................... 5-53
5.6.6.3.2. Network Element Self Tests .......... 5-55
5.6.6.3.3. FCR Backplane Bus Self Tests .......... 5-56
5.6.6.3.4. Input/Output Device Self Tests ......................................... 5-56
5.6.6.3.5. Power Conditioner Self Tests .......................................... 5-56
5.6.6.3.6. Mass Memory Self Tests .......... 5-56
5.6.6.3.7. System Tests .......... 5-57
5.6.7. Fault Diagnosis .......... 5-57
5.6.7.1. Non-Fault Tolerant Operations ............................................ 5-57
5.6.7.2. Fault Tolerant Operations ................................................... 5-60
5.6.7.2.1. Local Fault Detection and Isolation .................................... 5-60
5.6.7.2.1.1. Intra Virtual Group Presence Test .......... 5-62
5.6.7.2.1.2. Syndrome Analysis .......... 5-62
5.6.7.2.1.3. Self Tests .......... 5-63
5.6.7.2.2. System Fault Detection and Isolation .......... 5-64
5.6.8. Recovery options .......... 5-65
5.6.8.1. Response to Failure of Test .......... 5-66
5.6.8.2. System Recovery .......... 5-67
5.6.8.2.1. Recovery from Processor Failure ...................................... 5-69
5.6.8.2.1.1. Graceful Degradation .......... 5-69
5.6.8.2.1.2. Processor Resynchronization .......... 5-70
5.6.8.2.1.3. Processor Reintegration .......................................... 5-73
5.6.8.2.1.4. Processor Replacement ........................................... 5-73
5.6.8.2.1.5. Processor Replacement with Initialization ...................... 5-73
5.6.8.2.1.6. Task Migration .......... 5-74
5.6.8.2.2. Recovery from Network Element Failure ............................. 5-74
5.6.8.2.2.1. Network Element Resynchronization ........................... 5-75
5.6.8.2.2.2. Network Element Masking ....................................... 5-75
5.6.9. Transient Fault Analysis .......... 5-75
5.6.9.1. Transient Recovery Option .......... 5-76
5.6.9.1.1. Processor Recovery .......... 5-78
5.6.9.1.2. Network Element Recovery .......... 5-79
5.6.9.2. Wait and See Transient Analysis Option .......... 5-79
5.6.9.3. No Transient Fault Analysis Option .......... 5-80
5.6.9.4. Hybrid Transient Fault Analysis Option .......... 5-80
5.6.9.5. Intermittent Fault Analysis .......... 5-82
5.6.9.6. Transient Fault Analysis Option and System Modes .................... 5-82
5.6.10. Fault Logging ..................................................................... 5-82
5.6.11. Fault Reporting ................................................................... 5-83
5.6.11.1. Cockpit Display Unit .......... 5-84
5.6.11.2. Portable Intelligent Maintenance Aid ...................................... 5-86
5.6.11.3. Fault Annunciator Panel .................................................... 5-87
5.7. I/O Services .......... 5-87
5.7.1. The AFTA I/O User Interface .......... 5-88
5.7.2. Input/Output User View ......................................................... 5-88
5.7.3. I/O Request Construction ....................................................... 5-89
5.7.4. I/O Transactions .................................................................. 5-90
5.7.5. I/O Chains .......... 5-93
5.7.6. I/O Requests ...................................................................... 5-94
5.7.7. I/O Data Access .......... 5-95
5.7.8. The Buffer Control Procedures ................................................ 5-95
5.7.9. The AFTA I/O Communication Manager ..................................... 5-98
5.7.9.1. The Nonpreemptable I/O Dispatcher ................................... 5-99
5.7.9.1.1. The I/O Request Tasks .......... 5-102
5.7.9.1.2. Dispatching ......................................................... 5-102
5.7.9.2. AFTA Input Output Services: Examples .......... 5-103
5.7.9.2.1. Example #1: All I/O Requests can be Completed in 10
ms ................................................................... 5-103
5.7.9.2.2. Example #2: All I/O Requests can not be Completed in
10 ms ............................................................... 5-103
6. Fault-Tolerant Data Bus .......... 6-1
6.1. Objective and Approach .......... 6-1
6.2. Fault-Tolerant Data Bus Requirements .......... 6-2
6.2.1. Packet Requirements .......... 6-2
6.2.1.1. Word Length .......... 6-2
6.2.1.2. Packet Length .......... 6-2
6.2.2. Network Control Requirements .......... 6-2
6.2.2.1. Access Control Modes .......... 6-2
6.2.2.2. Address Modes .......... 6-2
6.2.2.3. Uncontrolled Transmit Inhibit .......... 6-3
6.2.2.4. Flow Control .......... 6-3
6.2.3. Network Function Requirement .......... 6-3
6.2.3.1. Broadcast and Multicast Functions .......... 6-3
6.2.3.2. Periodic and Aperiodic Transfers .......... 6-3
6.2.3.3. Packet Ordering .......... 6-4
6.2.3.4. Station Identification .......... 6-4
6.2.4. Topology and Architecture Requirements .......... 6-4
6.2.4.1. Growth .......... 6-4
6.2.4.2. Topology .......... 6-4
6.2.4.3. Station Insertion and Removal .......... 6-4
6.2.4.4. Bridges for Interconnected Buses .......... 6-4
6.2.5. Physical Requirements .......... 6-5
6.2.5.1. Serial Transmission .......... 6-5
6.2.5.2. Media Support .......... 6-5
6.2.5.3. Electrical Isolation .......... 6-5
6.2.5.4. Station Separation .......... 6-5
6.2.6. Fault Tolerance Requirements .......... 6-5
6.2.6.1. Packet Delivery .......... 6-5
6.2.6.2. Synchronization .......... 6-5
6.2.6.3. Source Congruency .......... 6-6
6.2.6.4. Connectivity .......... 6-6
6.2.6.5. Station to Network Interface .......... 6-6
6.2.6.6. Redundancy .......... 6-6
6.2.6.7. Station Redundancy .......... 6-6
6.2.6.8. Error Detection .......... 6-6
6.2.6.9. Diagnosability .......... 6-7
6.2.6.10. Self-Test .......... 6-7
6.2.6.11. Byzantine Resilience .......... 6-7
6.2.6.12. Fault Isolation and Containment .......... 6-7
6.2.7. Performance Requirements .......... 6-7
6.2.7.1. Message Priorities .......... 6-7
6.2.7.2. Network Bandwidth ......................................................... 6-8
6.2.7.3. Initialization Time ........................................................... 6-8
6.3. FTDB Architecture Study .......... 6-8
6.3.1. Broadcast Buses .......... 6-8
6.3.2. Token Rings .......... 6-11
6.3.3. Circuit Switched Network ....................................................... 6-13
6.3.4. Packet Switched Network .......... 6-15
6.3.5. Fiber Optic Networks .......... 6-16
6.3.6. Authentication Protocols .......... 6-16
6.4. Existing and Proposed Standards .................................................... 6-20
6.4.1. AIPS Intercomputer Network .................................................. 6-20
6.4.2. SAVA High-Speed Data Bus .......... 6-21
6.4.3. JIAWG High-Speed Data Bus .......... 6-22
6.4.4. Fiber Distributed Data Interface (FDDI) ....................................... 6-23
6.4.5. SAFENET II .......... 6-24
6.4.6. Summary .......... 6-25
6.5. FTDB Brassboard Design Proposal .......... 6-26
6.5.1. Physical Layer .......... 6-34
6.5.1.1. Physical Layer Protocol .................................................... 6-34
6.5.1.2. Physical Layer Medium Dependent .......... 6-35
6.5.2. Data Link Layer .......... 6-35
6.5.2.1. Media Access Control ....................................................... 6-36
6.5.2.2. Station Management .......... 6-37
6.5.2.3. Logical Link Control .......... 6-37
6.5.3. Network Layer .......... 6-37
6.5.3.1. Byzantine Resilient Network Protocol (BRNP) ......................... 6-38
6.5.3.2. Authentication Protocol (ATP) .......... 6-40
6.5.3.3. Address Resolution Protocol (ARP) .......... 6-41
6.5.4. Transport Layer .......... 6-41
6.5.4.1. Periodic Datagram Protocol (PDP) ........................................ 6-42
6.5.4.2. Services Transaction Protocol (STP) ..................................... 6-42
6.5.4.3. Asynchronous Datagram Protocol (ADP) .......... 6-42
6.5.4.4. Network Data Stream Protocol (NDSP) .................................. 6-43
6.5.4.5. Network Diagnostic Protocol (NDP) .......... 6-43
6.5.4.6. Echo Protocol (EP) .......................................................... 6-43
6.5.4.7. Time Management Protocol (TMP) .......... 6-44
6.6. FTDB Development Plan .......... 6-44
6.6.1. Developmental and Non-developmental Items .......... 6-44
6.6.2. Proposed FTDB Brassboard Development Plan .......... 6-45
6.6.2.1. Subtask 1-Authentication Protocols ....................................... 6-45
6.6.2.2. Subtask 2-Byzantine Resilience .......... 6-45
6.6.2.3. Subtask 3-Network FDIR .......... 6-46
6.6.2.4. Subtask 4-Transport Layer Protocols .......... 6-46
6.6.3. FTDB Brassboard Development Schedule .......... 6-47
7. Testability and Maintainability .......... 7-1
7.1. Level of testing ......................................................................... 7-1
7.1.1. Component self tests ............................................................ 7-1
7.1.2. System tests ...................................................................... 7-2
7.2. Test Modes .......... 7-2
7.3. Operator interface ...................................................................... 7-6
7.4. FTPP (22 Network Element Tests ................................................... 7-6
7.4.1. Off-line Standalone NE Diagnostic Tests .......... 7-8
7.4.2. Functional Block: Processor-Network Element Interface ................... 7-9
7.4.3. Functional Block: Network Element Data Paths ............................. 7-11
7.4.4. Functional Block: Network Element Global Controller .......... 7-13
7.4.5. Functional Block: The Scoreboard ............................................. 7-14
7.4.6. Functional Block: The Inter-Fault Set Communication Links ............. 7-15
7.4.7. Conclusions ...................................................................... 7-15
7.5. AFTA Maintenance .................................................................... 7-16
7.6. AFTA Line Maintenance Procedure .......... 7-16
8. Common Mode Fault Study ............................................................... 8-1
8.1. Objective ............................................................................. 8-1
8.2. Approach ............................................................................... 8-1
8.3. Enumeration of Common Mode Fault Sources .................................... 8-2
8.3.1. Classification by Nature ......................................................... 8-2
8.3.2. Classification by Origin ......................................................... 8-2
8.3.3. Classification by persistence .................................................... 8-3
8.4. Enumeration of Common Mode Fault Avoidance, Removal, Tolerance
Techniques ............................................................................. 8-4
8.4.1. Common Mode Fault Avoidance .......... 8-5
8.4.1.1. Formal Methods .......... 8-5
8.4.1.2. Formally Verified Components .......... 8-5
8.4.1.3. Mature Components .......... 8-5
8.4.1.4. Design Automation Tools .......... 8-6
8.4.1.5. Architectural Considerations .......... 8-7
8.4.1.6. Design Diversity .......... 8-8
8.4.1.7. Use of Standards .......... 8-8
8.4.1.8. Good Software Engineering Practices .......... 8-9
8.4.1.9. Conservative Hardware Design Practices .......... 8-9
8.4.1.10. Shielding, Packaging and Thermal Management .......... 8-11
8.4.2. Common Mode Fault Removal .......... 8-11
8.4.2.1. Design Reviews .......... 8-11
8.4.2.2. Simulations .......... 8-11
8.4.2.3. Testing .......... 8-11
8.4.2.4. Fault Injection .......... 8-12
8.4.2.5. Discrepancy Reports .......... 8-13
8.4.2.6. Automated Theorem Provers .......... 8-16
8.4.3. Common Mode Fault Tolerance .......... 8-16
8.4.3.1. Common Mode Fault Detection .......... 8-16
8.4.3.2. Common Mode Fault Recovery .......... 8-18
8.4.3.3. Performance Overheads of Common Mode Fault Tolerance Techniques .......... 8-19
Common Mode Fault Examples .......... 8-19
8.5. Effectiveness of Common Mode Fault Avoidance, Fault Removal, Fault Tolerance Techniques .......... 8-22
8.6. Suitability of Common Mode Fault Avoidance, Fault Removal, Fault Tolerance Techniques for AFTA .......... 8-24
8.7. Plan for Implementation of CMF Avoidance, Removal, Tolerance Techniques .......... 8-30
8.7.1. Demonstration/Validation Phase .......... 8-30
8.7.1.1. AFTA System .......... 8-30
8.7.1.2. Hardware Design .......... 8-31
8.7.1.3. Software Design .......... 8-31
8.7.1.4. Hardware-Software Test and Integration .......... 8-32
8.7.2. Full Scale Development Phase .......... 8-33
8.7.3. Production Phase .......... 8-34
8.7.4. Deployment Phase .......... 8-34
8.7.5. Pre-Planned Product Improvement Phase .......... 8-35
9. Analytical Models .......... 9-1
9.1. Performance Model .......... 9-2
9.1.1. Delivered Throughput .......... 9-2
9.1.2. Intertask Communication .......... 9-7
9.1.3. Input/Output .......... 9-10
9.2. Reliability and Availability Models .......... 9-11
9.2.1. Formulation for Graceful Degradation Class of Fault Recovery .......... 9-18
9.2.2. Formulation for Processor Replacement Class of Fault Recovery .......... 9-20
9.2.3. Failure Rate Calculation Methodology .......... 9-22
9.2.3.1. Environmental Effects .......... 9-22
9.2.3.2. PE Failure Rate Calculations .......... 9-23
9.2.3.2.1. Hiatus: Ground Fixed .......... 9-24
9.2.3.2.2. Aircraft Mission .......... 9-24
9.2.3.2.3. Ground Mission .......... 9-24
9.2.3.3. NE Failure Rate Calculations .......... 9-25
9.2.3.3.1. Methodology .......... 9-25
9.2.3.3.2. Assumptions .......... 9-25
9.2.3.3.3. AFTA Hiatus NE Failure Rate .......... 9-28
9.2.3.3.4. AFTA Aircraft Mission NE Failure Rate .......... 9-30
9.2.3.3.5. AFTA Ground Mission NE Failure Rate .......... 9-32
9.2.3.3.6. Implications and Indicated Course of Action .......... 9-33
9.2.3.4. IOC Failure Rate Calculations .......... 9-37
9.2.3.5. PC Failure Rate Calculations .......... 9-38
9.2.3.5.1. Hiatus .......... 9-38
9.2.3.5.2. Aircraft Mission .......... 9-38
9.2.3.5.3. Ground Mission .......... 9-39
9.3. Physical Characteristics (WPV) Models .......... 9-39
9.3.1. Weight .......... 9-39
9.3.2. Power .......... 9-39
9.3.3. Volume .......... 9-40
9.4. Fleet Life-Cycle Cost per Service Unit (FLCCPSU) Model .......... 9-40
9.4.1. Assumptions and Analysis Inputs .......... 9-41
9.4.2. Application Scenario .......... 9-43
9.4.3. Procurement Cost .......... 9-43
9.4.4. Manpower Cost due to Repairs .......... 9-44
9.4.5. Cost due to Spares .......... 9-45
9.4.6. Cost due to Unreliability of AFTA .......... 9-46
9.4.7. Total FLCCPSU .......... 9-46
10. VHSIC Hardware Description Language .......... 10-1
10.1. VHDL Overview .......... 10-1
10.1.1. Behavioral vs. Structural Models .......... 10-2
10.1.2. Overview of a VHDL Description .......... 10-4
10.2. Use of VHDL for AFTA .......... 10-5
10.2.1. Design ............................................................................. 10-6
10.2.1.1. Behavioral.................................................................... 10-6
10.2.1.2. Structural ..................................................................... 10-7
10.2.2. Simulation ..................................................................... 10-7
10.2.3. Testing ........................................................................ 10-8
10.2.4. Documentation ................................................................... 10-8
10.2.4.1. Custom Devices ............................................................. 10-8
10.2.4.2. Standard Devices ......................................................... 10-10
10.2.5. Candidate VHDL Tools for AFTA NE Design ............................... 10-12
10.2.6. Compliance with Data Item Description ....................................... 10-12
10.2.6.1. Reference Documents ....................................................... 10-12
10.2.6.2. VHDL Model Hierarchy .................................................... 10-12
10.2.6.3. Leaf-Level Modules ......................................................... 10-12
10.2.6.4. Entity Declarations .......................................................... 10-13
10.2.6.5. Behavioral Body .................................................. 10-14
10.2.6.6. Structural Body .............................................................. 10-14
10.2.6.7. VHDL Simulation Support ............................................. 10-14
10.2.6.8. Error Messages .............................................................. 10-15
10.2.6.9. Annotations .................................................................. 10-15
10.2.6.10. Reference to Origin .......................................................... 10-15
10.2.6.11. VHDL Documentation Format ......................................... 10-15
11. AFTA Validation and Verification ......................................................... 11-1
11.1. Verifiable AFTA Attributes ........................................................... 11-4
11.2. Verification of Byzantine Resilience and Operational Correctness ............... 11-5
11.2.1. Fault Containment ............................................................... 11-6
11.2.2. NE Synchronization ............................................................. 11-6
11.2.3. Interactive Consistency .......................................................... 11-6
11.2.4. Voting ............................................................................. 11-7
11.2.5. Message-Release Authorization (Scoreboard) ................................ 11-7
11.2.6. Reconfigurability ................................................................. 11-8
11.2.7. Functional Synchronization ..................................................... 11-9
11.2.8. Byzantine Resilient Virtual Circuit Abstraction ............................... 11-9
11.2.9. Rate Group Scheduling ......................................................... 11-10
11.2.10. Intertask Communication Services ............................................. 11-12
11.2.11. I/O System Services ........................................................... 11-12
11.2.12. Redundancy Management (FDIR) Software .................................. 11-13
11.3. Verification of Performance Predictions ............................................ 11-14
11.3.1. Delivered Throughput per VG .................................................. 11-14
11.3.2. Available Memory per VG ...................................................... 11-16
11.3.3. Effective Intertask Communication Bandwidth and Latency ... 11-16
11.3.4. Effective I/O Bandwidth and Latency ......................................... 11-16
11.3.5. Iteration Rate of a Task .......................................................... 11-18
11.4. Verification of Reliability and Availability Predictions ............................ 11-18
11.4.1. Component Failure Rate ........................................................ 11-19
11.4.2. Fault Reconfiguration Time ................................ 11-20
11.4.3. Fault Reconfiguration Coverage ............................................... 11-21
11.4.4. VG Redundancy Levels ......................................................... 11-22
11.4.5. Mission/Hiatus Time ............................................................ 11-22
11.5. Verification of Cost Predictions ...................................................... 11-22
11.6. Verification of Weight, Power, and Volume Predictions ......................... 11-23
12. AFTA Architecture Synthesis .............................................................. 12-1
12.1. AFTA Architecture Synthesis ........................................................ 12-1
12.1.1. Configurable Parameters ........................................................ 12-1
12.1.2. AFTA Architecture Synthesis Procedure ...................................... 12-2
12.2. AFTA Architecture Synthesis .......... 12-3
12.2.1. AFTA Characteristics Common to Both Missions .......... 12-3
12.2.1.1. Delivered Throughput .......... 12-3
12.2.2. AFTA Configuration for TF/TA/NOE/FCS Mission .......... 12-4
12.2.2.1. Analytical Results .......... 12-5
12.2.2.1.1. Failure Rates .......... 12-5
12.2.2.1.2. Reliability .......... 12-5
12.2.2.1.3. Throughput-Reliability Tradeoff .......... 12-8
12.2.2.1.4. Effect of VHSIC/VLSI Network Element Technology on Reliability .......... 12-9
12.2.2.1.5. Unavailability .......... 12-10
12.2.2.1.6. Weight .......... 12-12
12.2.2.1.7. Power .......... 12-13
12.2.2.1.8. Volume .......... 12-14
12.2.2.1.9. Cost .......... 12-15
12.2.3. AFTA Analysis for Ground Vehicle Mission .......... 12-20
12.2.3.1. Throughput .......... 12-20
12.2.3.2. Reliability .......... 12-20
12.2.3.3. Weight .......... 12-24
12.2.3.4. Power .......... 12-25
12.2.3.5. Volume .......... 12-26
Appendix A. References .......... A-1
Appendix B. Glossary of Terms and Acronyms ............................................. B-1
List of Figures
Figure 4-1. AFTA Physical Configuration .......... 4-2
Figure 4-2. AFTA Virtual Configuration .......... 4-3
Figure 4-3. Network Element Addresses .......... 4-7
Figure 4-4. Absolute Mask .......... 4-8
Figure 4-5. Relative Mask .......... 4-8
Figure 4-6. ISYNC Procedure .......... 4-14
Figure 4-7. NE Memory Map .......... 4-21
Figure 4-8. DPRAM Memory Map .......... 4-22
Figure 4-9. Packet Class Field .......... 4-24
Figure 4-10. Vote Error Field .......... 4-26
Figure 4-11. Clock Error Field .......... 4-27
Figure 4-12. Link Error Field .......... 4-27
Figure 4-13. OBNE Timeout Field .......... 4-28
Figure 4-14. IBNE Timeout Field .......... 4-28
Figure 4-15. Scoreboard Vote Error Field .......... 4-28
Figure 4-16. Buffer Manager Memory Map .......... 4-30
Figure 4-17. Next Port Format .......... 4-30
Figure 4-18. Ready Port Format .......... 4-31
Figure 4-19. CT Update Packet Format .......... 4-32
Figure 4-20. Redundancy Level Field .......... 4-32
Figure 4-21. PE Mask Field .......... 4-32
Figure 4-22. Processor Specification Field .......... 4-33
Figure 4-23. NE Mask Format .......... 4-33
Figure 4-24. TNR Receive Packet Format .......... 4-34
Figure 4-25. TNR Result Byte Format .......... 4-35
Figure 4-26. VRESET Packet .......... 4-35
Figure 4-27. VRESET Command Byte .......... 4-36
Figure 4-28. AFTA FCR Architecture .......... 4-37
Figure 4-29. SAVA-based AFTA FCR .......... 4-38
Figure 4-30. SAVA-based AFTA LRM .......... 4-39
Figure 4-31. JIAWG-based AFTA FCR .......... 4-40
Figure 4-32. JIAWG-based AFTA LRM .......... 4-41
Figure 4-33. Functional Block Diagram of AFTA Processing Element .......... 4-42
Figure 4-34. Functional Block Diagram of the AFTA Network Element .......... 4-48
Figure 4-35. Inter-FCR Fiber-Optic Network .......... 4-51
Figure 4-36. Normal FTC Adjustment Period .......... 4-52
Figure 4-37. Self-Ahead FTC Adjustment Period .......... 4-53
Figure 4-38. Self-Behind FTC Adjustment Period .......... 4-53
Figure 4-39. Network Element Brassboard Layout .......... 4-56
Figure 4-40. Functional Block Diagram of the AFTA Power Conditioner .......... 4-69
Figure 5-1. AFTA System Software Organization .......... 5-1
Figure 5-2. Example AFTA Configuration .......... 5-3
Figure 5-3. Rate Group Frame Phasing .......... 5-5
Figure 5-4. VG Configuration Table .......... 5-6
Figure 5-5. Task Configuration Table .......... 5-8
Figure 5-6. Task Configuration Table Initialization .......... 5-9
Figure 5-7. Rate Group Task Lists .......... 5-11
Figure 5-8. Initialize Rate Group Tasking Procedure .......... 5-12
Figure 5-9. Task Priority .......... 5-13
Figure 5-10. Mapping of RG Frames to Minor Frames .......... 5-14
Figure 5-11. Frame Start Procedure .......... 5-15
Figure 5-12. Example Rate Group Task .......... 5-17
Figure 5-13. Wait for Next Frame Procedure .......... 5-17
Figure 5-14. Initialize Time Keeping Procedure .......... 5-19
Figure 5-15. Task-to-Task Message and Packet Formats .......... 5-23
Figure 5-16. Transmit and Receive Queues .......... 5-25
Figure 5-17. Transmit and Receive Packet Queue Entries .......... 5-27
Figure 5-18. Message Class and Message Data Structure .......... 5-28
Figure 5-19. CID Queue Table .......... 5-29
Figure 5-20. Initialize Communication Procedure .......... 5-29
Figure 5-21. Send Message Procedure .......... 5-30
Figure 5-22. Queue Message Procedure .......... 5-31
Figure 5-23. Send Queue Procedure .......... 5-32
Figure 5-24. Message Pending Table .......... 5-33
Figure 5-25. CID Status Table .......... 5-34
Figure 5-26. Read Message Procedure .......... 5-35
Figure 5-27. Update Frame Marker Procedure .......... 5-36
Figure 5-28. Retrieve Message Procedure .......... 5-36
Figure 5-29. System mode and test mode interactions .......... 5-38
Figure 5-30. Operational modes .......... 5-40
Figure 5-31. Test Mode Sequences .......... 5-50
Figure 5-32. Off-Line FDI Overview .......... 5-59
Figure 5-33. Synchronous FDI Overview .......... 5-61
Figure 5-34. Qualitative evaluation of recovery methods .......... 5-68
Figure 5-35. Lost Channel Synchronization .......... 5-72
Figure 5-36. Transient Recovery Algorithm .......... 5-77
Figure 5-37. Wait and See Transient Fault Analysis algorithm .......... 5-79
Figure 5-38. No Transient Fault Analysis algorithm .......... 5-80
Figure 5-39. Hybrid Transient Fault Analysis algorithm .......... 5-81
Figure 5-40. Possible Mapping of Transient Analysis Options to System Modes .......... 5-82
Figure 5-41. AFTA System Level Display .......... 5-84
Figure 5-42. LRU Level Display .......... 5-85
Figure 5-43. LRM Level Display .......... 5-86
Figure 5-44. The AFTA I/O Services .......... 5-87
Figure 5-45. The I/O User Interface and I/O Communication Manager .......... 5-88
Figure 5-46. I/O Transactions, I/O Chains, and I/O Requests .......... 5-91
Figure 5-47. An Input Transaction Record .......... 5-92
Figure 5-48. An Output Transaction Record .......... 5-92
Figure 5-49. An Input/Output Transaction Record .......... 5-92
Figure 5-50. The Create_Transaction Procedure .......... 5-93
Figure 5-51. The Create_Chain Procedure .......... 5-93
Figure 5-52. The Create_I/O Request Procedure .......... 5-94
Figure 5-53. The Lock_I/O_Request_Buffer Procedure .......... 5-96
Figure 5-54. The Lock_I/O_Request_Buffer Procedure .......... 5-96
Figure 5-55. The I/O_Buffer_Contention Exception and Error Handler .......... 5-97
Figure 5-56. The I/O User Interface Status Retrieval Procedures .......... 5-97
Figure 5-57. The I/O Communication Manager .......... 5-99
Figure 5-58. The Nonpreemptable I/O Dispatcher .......... 5-101
Figure 5-59. I/O Requests for Example #1 .......... 5-105
Figure 5-60. Example #1 .......... 5-105
Figure 5-61. I/O Requests for Example #2 .......... 5-106
Figure 5-62. Example #2 .......... 5-106
Figure 6-1. Broadcast Bus Topology .......... 6-9
Figure 6-2. Token Ring Topology .......... 6-11
Figure 6-3. Fully Braided Chordal Ring .......... 6-12
Figure 6-4. Dual Counter-rotating Ring .......... 6-13
Figure 6-5. Example Circuit Switched Topology .......... 6-14
Figure 6-6. AIPS IC Network .......... 6-21
Figure 6-7. ISO/OSI Model of FTDB .......... 6-27
Figure 6-8. FTDB Architecture .......... 6-30
Figure 6-9. Triply Redundant VG T1 Simultaneously Writes Output Data to Signer/Checker Components of Fault Tolerant Data Bus .......... 6-31
Figure 6-10. Step 1 of FTDB Message Transfer .......... 6-32
Figure 6-11. Step 2 of FTDB Message Transfer .......... 6-32
Figure 6-12. Step 3 of FTDB Message Transfer .......... 6-33
Figure 6-13. Step 4 of FTDB Message Transfer .......... 6-33
Figure 6-14. Step 5 of FTDB Message Transfer .......... 6-34
Figure 6-15. FTDB Brassboard Development Schedule .......... 6-47
Figure 7-1. System Mode and Test Mode Interactions .......... 7-4
Figure 7-2. Test Mode State Diagram .......... 7-5
Figure 7-3. Block Diagram of FTPP C2 Network Element .......... 7-7
Figure 7-4. Maintenance-Related AFTA Features .......... 7-17
Figure 7-5. AFTA LRU .......... 7-19
Figure 8-1. Typical Discrepancy Report Format .......... 8-15
Figure 9-1. AFTA Methodology Information Flow .......... 9-1
Figure 9-2. Mapping of RG Frames to Minor Frames .......... 9-3
Figure 9-3. RG Message Passing .......... 9-8
Figure 9-4. Outgoing Message Processing .......... 9-9
Figure 9-5. Graceful Degradation of Quadruplex VG 1 .......... 9-13
Figure 9-6. Processor Replacement Redundancy Management for Quadruplex VG 1 .......... 9-15
Figure 9-7. AFTA Hiatus NE Failure Rate Constituents .......... 9-30
Figure 9-8. AFTA Flight Mission NE Failure Rate Constituents .......... 9-31
Figure 9-9. AFTA Ground Mission NE Failure Rate Constituents .......... 9-32
Figure 9-10. Helicopter TF/TA/NOE Mission Scenario State Diagram .......... 9-41
Figure 10-1. Hierarchical VHDL Model .......... 10-2
Figure 10-2. Behavioral Model of a D Flip-Flop .......... 10-3
Figure 10-3. Structural Model of a D Flip-Flop .......... 10-3
Figure 10-4. Behavioral Model of a NAND Gate .......... 10-3
Figure 10-5. Reprocurement Options for Custom Devices .......... 10-9
Figure 10-6. Reprocurement Options for Standard Devices .......... 10-11
Figure 12-1. AFTA Delivered Throughput vs. Number of Processing Elements .......... 12-4
Figure 12-2. Probability of AFTA Failure for 1-hour Rotary Wing Aircraft Mission .......... 12-6
Figure 12-3. Probability of AFTA Failure for 4-hour Rotary Wing Aircraft Mission .......... 12-7
Figure 12-4. Delivered Throughput vs. AFTA Failure Probability for 1-hour Rotary Wing Aircraft Mission .......... 12-8
Figure 12-5. Probability of AFTA Failure for 1-hour Rotary Wing Aircraft Mission using VHSIC/VLSI-based Network Element .......... 12-9
Figure 12-6. Ratio of Probability of AFTA Failure for 1-hour Rotary Wing Aircraft Mission .......... 12-10
Figure 12-7. AFTA Unavailability after 23-hour Hiatus for Six-VG/Four-FCR MDC .......... 12-11
Figure 12-8. AFTA Unavailability after 23-hour Hiatus for Six-VG/Five-FCR MDC .......... 12-12
Figure 12-9. AFTA Weight for Helicopter Mission .......... 12-13
Figure 12-10. AFTA Power for Helicopter Mission .......... 12-14
Figure 12-11. AFTA Volume for Helicopter Mission .......... 12-15
Figure 12-12. AFTA Unreliability for Eight-Hour Ground Vehicle Mission .......... 12-21
Figure 12-13. AFTA Unreliability for 24-Hour Ground Vehicle Mission .......... 12-22
Figure 12-14. AFTA Unreliability for 168-Hour Ground Vehicle Mission .......... 12-23
Figure 12-15. AFTA Unreliability for 720-Hour Ground Vehicle Mission .......... 12-24
Figure 12-16. AFTA Weight for Ground Vehicle Mission .......... 12-25
Figure 12-17. AFTA Power for Ground Vehicle Mission .......... 12-26
Figure 12-18. AFTA Volume for Ground Vehicle Mission .......... 12-27
List of Tables
Table 4-1. Characteristics of Radstone PMV 68M CPU-3A Processing Element .......... 4-44
Table 4-2. Characteristics of Lockheed Sanders STAR MVP Processing Element .......... 4-45
Table 4-3. Characteristics of SAVA GPPM Processing Element .......... 4-46
Table 4-4. Characteristics of AFTA Network Element .......... 4-55
Table 4-5. Military Device Availability for Network Element .......... 4-58
Table 6-1. Comparison of Standards .......... 6-26
Table 7-1. AFTA Maintenance Time Line .......... 7-16
Table 8-1. Classification of Common Mode Faults .......... 8-3
Table 8-2. Commonly Observed Error Symptoms of Common Mode Faults .......... 8-21
Table 8-2. Commonly Observed Error Symptoms of Common Mode Faults (Cont.) .......... 8-22
Table 8-3. Effectiveness of CMF A/R/T Techniques .......... 8-23
Table 8-4. Application of CMF A/R/T Techniques to AFTA .......... 8-29
Table 9-1. Completed/Started RGs vs. Frame Boundary .......... 9-4
Table 9-2. Environmental Failure Rate Multipliers for Monolithic Microelectronic Devices .......... 9-23
Table 9-3. PE Cited Failure Rate Data .......... 9-23
Table 9-4. PE Hiatus Failure Rate Data .......... 9-24
Table 9-5. PE Aircraft Mission Failure Rate Data .......... 9-24
Table 9-6. PE Aircraft Mission Failure Rate Data .......... 9-24
Table 9-7. AFTA Baseline NE Parts Hiatus Failure Rates .......... 9-29
Table 9-8. AFTA Baseline NE Hiatus Failure Rate and Constituents .......... 9-30
Table 9-9. AFTA Baseline NE Flight Mission Failure Rate and Constituents .......... 9-31
Table 9-10. AFTA Baseline NE Ground Mission Failure Rate and Constituents .......... 9-32
Table 9-11. Summary of AFTA Baseline NE MTBF Data .......... 9-33
Table 9-12. Gates, Pins, and Power Consumption of High-End NE .......... 9-33
Table 9-13. Comparison of MTBFs of Baseline and High End AFTA Network Element .......... 9-37
Table 9-14. PC Cited Failure Rate Data .......... 9-38
Table 11-1. Verifiable AFTA Attributes .......... 11-4
Table 11-2. Completed/Started RGs vs. Minor Frame Boundary .......... 11-11
Table 11-3. Verification of Delivered Throughput .......... 11-15
Table 11-4. Verification of Intertask Communication Bandwidth and Latency .......... 11-17
Table 11-5. Verification of I/O Communication Bandwidth and Latency .......... 11-17
Table 11-6. VG Failure Probability Due to Attrition and Near-Coincident Faults .......... 11-21
Table 12-1. AFTA Component Failure Rates for Helicopter Mission Scenario .......... 12-5
Table 12-2. AFTA Component Weights for Helicopter AFTA .......... 12-13
Table 12-3. AFTA Component Powers for Helicopter AFTA .......... 12-14
Table 12-4. AFTA Component Volumes for Helicopter AFTA .......... 12-15
Table 12-5. FLCCPSU Input Parameters for Rotary Wing Aircraft Mission .......... 12-16
Table 12-6. FLCCPSU Output Parameters for Rotary Wing Aircraft Mission, Four-FCR MDC with no spare FCR .......... 12-17
Table 12-7. FLCCPSU Constituent Costs for Four-FCR MDC with no spare FCR .......... 12-18
Table 12-8. FLCCPSU Output Parameters for Rotary Wing Aircraft Mission, Four-FCR MDC with one spare FCR .......... 12-18
Table 12-9. FLCCPSU Constituent Costs for Four-FCR MDC with one spare FCR .......... 12-18
Table 12-10. FLCCPSU Output Parameters for Rotary Wing Aircraft Mission, Five-FCR MDC .......... 12-19
Table 12-11. Minimum-FLCCPSU Constituent Costs for Four-FCR MDC with one spare FCR .......... 12-19
Table 12-12. AFTA Component Failure Rates for Ground Vehicle Mission Scenario .......... 12-21
Table 12-13. AFTA Component Weights for Helicopter AFTA .......... 12-24
Table 12-14. AFTA Component Powers for Ground Vehicle AFTA .......... 12-25
Table 12-15. AFTA Component Volumes for Ground Vehicle AFTA .......... 12-26
Introduction to Volumes I and II
The long-term objective of the AFTA program is to develop and deploy the Army Fault
Tolerant Architecture (AFTA) on a variety of Army programs such as the Computer-Aided
Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM)
vehicles. Applications such as these may be characterized by a combination of
computational intensiveness, real-time response requirements, high reliability and
availability requirements, and maintainability, testability, and producibility requirements.
The AFTA architecture is based on the Charles Stark Draper Laboratory, Inc. Fault
Tolerant Parallel Processor (FTPP). AFTA is a real-time computer possessing high
reliability, maintainability, availability, testability, and computational capability. It
achieves the first four properties primarily through adherence to a rigorous theory of fault
tolerance known as Byzantine Resilience, through which arbitrary failure modes can be
tolerated. It is designed for verifiability and quantifiability of key system attributes with a
high degree of confidence, in part due to its theoretically sound basis and in part due to
plausible parameterizations of fault tolerance and Operating System overheads. Through
the use of parallel processing, AFTA achieves sufficient throughput for future integrated
avionics and control functions. To be useful for a variety of Army applications, the
number and redundancy level of processing sites in AFTA may be varied from one application
to another, and AFTA is programmed in the DoD-mandated Ada language. AFTA is
intended to be relatively easy to produce and upgrade through extensive use of Non
Developmental Items and compliance with well-accepted electrical, mechanical, and functional
standards.
Over the past few years NASA and the Strategic Defense Initiative Office (SDIO) have
sponsored the Advanced Information Processing System (AIPS) program at Draper
Laboratory. The overall goal of the AIPS program is to produce the knowledge base necessary
to achieve validated distributed fault tolerant computer system architectures for advanced
real-time aerospace applications [Har91b]. As a part of this effort, an AIPS engineering
model consisting of hardware building blocks such as Fault Tolerant Processors and
Inter-Computer (IC) and Input/Output (I/O) networks and software building blocks such as
Local System Services, IC and I/O Communications Services was constructed. AFTA can be
considered to be a high-throughput AIPS building block which can be interfaced to the
AIPS IC network. Section 3.7 describes the AIPS engineering model in more detail and
illustrates how it can be interfaced with AFTA.
This report describes the results of the Conceptual Study phase of the AFTA development,
and consists of fourteen sections in two volumes. Volume I is introductory in nature
and contains Sections 1 through 3. Section 1 introduces the AFTA program, its objectives,
and key elements of its technical approach. Section 2 defines a format for representing
mission requirements in a manner suitable for first-order AFTA sizing and analysis,
followed by a discussion of the current state of mission requirements acquisition for the
targeted Army missions. Section 3 presents an overview of AFTA's architectural theory of
operation.
Volume II contains detailed technical information and analyses in Sections 4 through
14. Section 4 describes the AFTA hardware architecture and components, and Section 5
describes the architecture of the AFTA Operating System. Section 6 describes the architec-
ture and operational theory of the AFTA Fault Tolerant Data Bus. Section 7 presents the
test and maintenance strategy developed for use in fielded AFTA installations. Section 8
describes an approach to be used in reducing the probability of AFTA failure due to com-
mon-mode faults. Section 9 develops analytical models for AFTA performance, reliability,
availability, life cycle cost, weight, power, and volume. Section 10 presents the approach
for using VHDL to describe and design AFTA's developmental hardware. Section 11 de-
scribes a plan for verifying and validating key AFTA concepts during the Dem/Val phase,
and Section 12 utilizes the analytical models and partial mission requirements to generate
AFTA configurations for the TF/TA/NOE and Ground Vehicle missions. References are
contained in Section 13, and a glossary of terms and acronyms is included in Section 14.
Because some readers may wish only to read individual volumes, Volumes I and II
contain some redundant information.
4. AFTA Hardware Architecture
Section 4 describes the hardware architecture of the AFTA. The AFTA architecture
consists of a cluster of processing sites interconnected by a fault-tolerant network system
constructed from custom-built Network Elements and fiber optic interconnect. The cluster
also contains controllers for communication between the AFTA, other computer systems,
and I/O devices.
4.1. AFTA Physical Configuration
A diagram of the physical AFTA configuration is shown in Figure 4-1. The AFTA
consists of 4 or 5 fault-containment regions (FCR). Each FCR contains a Network Element
(NE), 0 to 8 Processing Elements (PE), and 0 or more I/O controllers (IOC).
The Network Elements provide communication between PEs, keep the FCRs synchro-
nized, and maintain data consensus among FCRs. The NE is designed to implement the re-
quirements for Byzantine resilience [LSP82].
The Processing Elements are the computational sites. Each PE consists of a micropro-
cessor, private RAM and ROM, and miscellaneous support devices, such as periodic timer
interrupts. The PEs may optionally have private I/O devices, such as ethernet, RS-232, etc.
The microprocessor may be either a general-purpose processor or a special-purpose pro-
cessor for signal or image processing.
The I/O controllers connect the AFTA to the outside world. These I/O devices can be
anything that is compatible with the bus connecting the elements within the FCR. I/O con-
trollers may have a programmable processor on board which actually drives the I/O. These
devices are referred to as smart I/O. Other I/O controllers may require an off-board
processor to act as the controlling processor over the bus. These devices are referred to as dumb
I/O. Smart I/O can exist in the virtual AFTA configuration as a simplex virtual group.
Dumb I/O must be controlled by another processor, which could be either a simplex group
or a single member of a fault-masking group. Redundant I/O (such as a dual redundant in-
terface bus) is treated as multiple simplex devices by the AFTA.
[Figure 4-1 diagram not reproduced. Recoverable labels: Fault Containment Region
(independent power, independent clocking, dielectric isolation, physical isolation);
standard bus; I/O bus(es); Network Element (voting, synchronization); AFTA high-speed
fiber optic network; Input/Output Controllers; NDI components; redundancy from 1 to 4;
application software; Ada Run Time System.]
Figure 4-1. AFTA Physical Configuration
The devices in an FCR are interconnected using one or more backplane buses. PEs
communicate with the NEs and IOCs through the bus(es). Data communication is usually
between a PE and the NE, or between a PE and an IOC. Normally, direct PE to PE com-
munication should not be used. If a PE wants to communicate with another PE, the ex-
change primitives provided by the NE should be used.
4.2. AFTA Virtual Configuration
A parallel processor is usually characterized by a network that provides interconnection
between multiple processing sites. Data is passed between processing sites using a mes-
sage-passing paradigm. In the AFTA, the ensemble of Network Elements provides a virtual
bus topology connecting the processing sites.
The AFTA has the capability of grouping processors on the virtual bus into virtual
groups. Members of a virtual group execute the same code on the same data set. These
members can compare results, using the Network Elements, to mask a failure in any one of
the FCRs.
The virtual bus topology of the AFTA is shown in Figure 4-2. This figure shows sev-
eral example virtual groups. Virtual groups consisting of only one processing site are called
simplexes. These groups are not fault-masking, since there is only a single member and it
is not possible to determine the validity of a single piece of data without prior knowledge of
the proper value. The other types of virtual groups are triplexes and quadruplexes, consist-
ing of three and four processing sites, respectively. These groups are called fault-masking
groups (FMG), since a fault in any single member will be detected and masked by the other
members.
Input and output controllers are also considered in the virtual configuration. I/O devices
are assigned, either statically or dynamically, to a specific virtual group. The virtual group
to which an I/O device is assigned is responsible for executing the device driver code to
communicate with the device. There are three basic types of I/O configurations in the
AFTA: simplex I/O assigned to a non-FMG, simplex I/O assigned to an FMG, and redun-
dant I/O assigned to an FMG. Each of these configurations is shown in Figure 4-2. See
Section 5.7 for a more complete discussion of I/O drivers in the AFTA system.
[Figure 4-2 diagram not reproduced. It shows the Network Element virtual bus connecting
example virtual groups: quadruplexes, triplexes, and simplexes, some of them with I/O.]
Figure 4-2. AFTA Virtual Configuration
4.3. AFTA Functional Overview
During normal operation, members of a virtual group communicate between themselves
and with other virtual groups by passing messages through the virtual bus. The virtual bus
can be modeled using an abstraction known as a Byzantine Resilient Virtual Circuit
(BRVC) [Har87]. This abstraction has several characteristics that make it suitable for use in
a fault-tolerant system, as described below.
• packet delivery is reliable, so a virtual group which sources a packet can expect
delivery of the packet, assuming that the specified receiving virtual group exists.
• packets are delivered in the same relative order, i.e. if a virtual group sources two
packets, packet A followed by packet B, destined for the same virtual group, the
receiving virtual group receives packet A before it receives packet B.
• each member of a virtual group will receive packets in the same order as all other
members of the virtual group.
• each functioning member of a virtual group will receive bitwise identical copies of
every packet delivered to the group.
• packet delivery is synchronous among members of a virtual group.
The characteristics described above are used to implement various functions. These
functions are designed to satisfy the requirements of Byzantine resilience.
Reliable message delivery is a requirement for a fault-tolerant system, such as the
AFTA. Building reliable message delivery on top of an unreliable packet delivery system is
tedious and can never guarantee complete coverage for all random faults [BG87]. Because
the virtual bus provides reliable packet delivery, there is no need for additional software
protocols to ensure reliable message delivery.
Another requirement of fault-tolerant systems is to guarantee consensus among
functioning members of a fault-masking group. This requirement is met by voting packets in the
Network Element. The NE will perform a source congruency exchange on single-source data.
Relative packet ordering is also necessary to guarantee consensus.
Fault-tolerant computers must also be synchronized. Synchronous packet delivery is
used as a method of synchronizing processors in a virtual group. A group can synchronize
itself by sending a packet to itself, then waiting for that packet to be delivered. A timeout is
used so that if a member of a group does not respond within an allowed period the same
way as the majority of the group members, that member will be ignored and the remaining
members will continue uninhibited. This type of synchronization is known as functional
synchronization.
The Network Element implements several support operations in addition to the packet
delivery functions.
One of the most important support functions is system reconfiguration. The AFTA
supports dynamic reconfiguration, allowing the grouping of physical processing sites into
virtual groups to be changed in real-time. The mechanism for reconfiguring the system is
the CT update. The configuration table, or CT, is a table stored internally on the Network
Element. The processors have no direct access to the CT, but they can effect changes in the
CT using the CT update. Processors may keep copies of the CT in local processor memory.
In previous FTPP designs, a single distinguished virtual group referred to as the
reconfiguration authority was given sole authority for performing the CT update process.
While a reconfiguration authority-based protocol may be used for certain AFTA
reconfiguration modes, there is no hardware support planned in the AFTA for enforcing a single
reconfiguration authority. However, the NE only permits fault-masking groups to perform a
CT update.
The Network Element contains a global synchronous timer which is synchronized to
the fault-tolerant clock (FTC). This timer is used as the basis for calculating timeouts by the
scoreboard and for providing timestamps on packets. Because the timer is synchronized to
the FTC, the value can be considered congruent among all FCRs. The timer is initialized to
zero upon system reset and is realigned by voting during the reintegration process.
Another support function is initial synchronization, or ISYNC. When the AFTA is first
powered up, the Network Elements are not synchronized. ISYNC is the procedure by
which the Network Elements become synchronized. The two subsections of the NE that
require synchronization are the fault-tolerant clock and the global controller. The
fault-tolerant clocks, designed using standard FTC techniques, will become synchronized
automatically within 190 µs. The global controllers, however, must explicitly synchronize
themselves. The ISYNC procedure involves continual exchanges, using the 2 round source
congruency exchange, to determine which NEs are ready to synchronize.
Recovery of a Network Element following a transient fault in the NE uses a process
similar to the ISYNC function. An NE that is trying to recover performs continual 2 round
exchanges. The remaining NEs in the working group reintegrate the recovering NE by per-
forming a single 2 round exchange. If the working group detects the recovering NE, the
NEs are resynchronized and realignment is initiated. During the realignment process, the
configuration table and global synchronous timer are exchanged and voted. This operation
ensures that the state of the newly reintegrated NE is consistent with the rest of the system.
At the conclusion of realignment, the recovering NE is considered completely reintegrated
with the working group.
The Network Element is also responsible for monitoring certain aspects of the system
to aid in the diagnosis of faults. The Network Element maintains bit vectors, called
syndromes, to indicate when certain unusual behavior is observed. These syndromes can
indicate problems with virtual group members, Network Elements, or FCR interconnects. The
syndromes are delivered with each packet exchanged by the virtual bus. Some of the
syndromes apply specifically to the associated packet; others are simply an accumulation since
the last packet delivery. Note that not all Network Elements will necessarily see the same
error conditions; therefore, the syndromes must be treated as single-source data.
4.4. AFTA Network Element
The Network Element is the core of an AFTA cluster. The Network Element connects
on one side to a number of processing sites, and on the other side to the other Network El-
ements in the cluster. The ensemble of Network Elements forms a virtual bus network
through which the processors communicate.
4.4.1. Network Element Addressing Convention
An applications program for the AFTA can almost always ignore the physical AFTA
configuration, and use only the virtual configuration as the programming model. Using the
virtual configuration, there is no need to refer to a specific Network Element or FCR, since
these concepts only exist in the physical configuration. Systems programs (including
device drivers), however, must occasionally refer to a specific Network Element or FCR.
Examples of these situations include reading or writing I/O devices, performing system
reconfiguration (via CT updates), and diagnosing faults.
The method of referencing an NE or FCR is known as Network Element addressing.
Each Network Element is assigned a unique ID number. The NEIDs are assigned as shown
in Figure 4-3. Note that successive NEIDs can be found in a counter-clockwise pattern
around the ensemble of NEs.
[Figure 4-3 diagram not reproduced. It shows the NEID assigned to each NE, starting with NE A.]
Figure 4-3. Network Element Addresses
FCR IDs are taken from the NEID of the Network Element in the FCR. Henceforth, the
terms FCR ID and NEID will be synonymous.
Absolute addressing refers to a specific Network Element based on its position as ob-
served from outside the cluster. For absolute referencing (i.e. NE A, NE D, etc.) the IDs
are represented numerically as follows:
A = 0
B = 1
C = 2
D = 3
E = 4
Relative addressing is used to refer to a Network Element based on the referenced NE's
position relative to the local NE. In previous versions of the FTPP, relative addressing
used the form of left, right, opposite, mine. To provide consistent referencing for the
AFTA, with up to 5 NEs, relative addressing is determined by assigning the address CH 0
to the local NE. Then, continuing counter-clockwise beginning with the NE to the right of
the local NE, successive relative addresses are assigned to each NE. Relative IDs are
represented numerically as follows:
CH 0 = 0
CH 1 = 1
CH 2 = 2
CH 3 = 3
CH 4 = 4
By convention, this document uses the letter designations (i.e. NE A, NE B, etc.) to
indicate the absolute address, and the numerical channel designation (CH 0...CH 4) to
indicate the relative address. However, the AFTA hardware uses 3-digit binary numbers as
outlined above for both addresses.
For situations where bit patterns represent masks or error syndromes, the numerical
address represents the bit position within the byte. For example, an absolute mask for the
AFTA has the form:
7 6 5 4 3 2 1 0
Figure 4-4. Absolute Mask
A relative mask for the AFTA has the form:
7 6 5 4 3 2 1 0
Figure 4-5. Relative Mask
To allow for ease of future expansion, unused bit positions are left undefined rather
than used for packing multiple masks or error syndromes into a single byte.
The absolute numerical address can be derived from the relative address by the follow-
ing:
absNEID = (relNEID + myNEID) % numNEs
Conversion of bit patterns from relative to absolute reference can be accomplished by
the following (assuming that undefined bit positions are cleared):
absMASK = (relMASK << myNEID) | (relMASK >> (numNEs - myNEID))
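The two conversions above can be sketched in Python as follows. This is a minimal illustration rather than the AFTA implementation: the function names are hypothetical, and the final step that clears bits shifted beyond the defined field is an added assumption that the mask formula leaves implicit.

```python
def rel_to_abs_id(rel_neid: int, my_neid: int, num_nes: int) -> int:
    """Absolute NEID of the NE seen at relative channel rel_neid from NE my_neid."""
    return (rel_neid + my_neid) % num_nes


def rel_to_abs_mask(rel_mask: int, my_neid: int, num_nes: int) -> int:
    """Rotate a relative mask left by my_neid within the low num_nes bits."""
    field = (1 << num_nes) - 1      # bits 0..num_nes-1 are the defined positions
    rel_mask &= field               # the text assumes undefined bits are cleared
    rotated = (rel_mask << my_neid) | (rel_mask >> (num_nes - my_neid))
    return rotated & field          # assumption: drop bits shifted past the field


# Example: the local NE is C (absolute ID 2) in a five-NE cluster.
print(rel_to_abs_id(4, 2, 5))                          # absolute ID of CH 4
print(format(rel_to_abs_mask(0b00011, 2, 5), "05b"))   # CH 0 and CH 1 as absolute bits
```

Under these assumptions, a relative mask with a single bit set always maps to the absolute bit given by the ID conversion, so the two formulas stay consistent with each other.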
4.4.2. Network Element Functional Description
This section describes the functions of the AFTA Network Element. The functions
provided by the Network Element include the primary data exchange primitives and the
secondary system maintenance primitives.
4.4.2.1. Data Exchange Primitives
The AFTA Network Element provides a number of data exchange primitives for the
Processing Elements to use. The primary use of the primitives is to transfer data from one
virtual processing site to another. The primitives are also used to vote common-source
data or distribute single-source data within a virtual processing site. The primitives also
have various side effects, including synchronization, time stamping, and syndrome reporting.
A special set of primitives is provided which produce side effects that directly affect
the state of the Network Element. These special primitives include CT updates, transient
NE recovery, and voted resets. Most of these primitives are solely for the use of the AFTA
operating system. When the application program requests inter-VG communication, the
AFTA operating system transparently maps the task's communication request to the appro-
priate data exchange primitive in the manner described in Section 5.
The processor uses the same procedure to access any of the primitives. Data is
transferred from a physical processor to the associated Network Element through the
processor's output buffers. First, the processor must select a contiguous segment of 64 bytes
within the output data block. Next, the segment is filled with the 64 bytes of data (unless
the class 0 primitive is being used). Then, the buffer descriptor located in the output info
block is filled with the appropriate information. The output info block specifies the
primitive to be executed, the destination virtual group, and the location of the data in the output
data block. Finally, the ownership of the output buffer is transferred to the Network
Element by performing the send operation on the ring buffer manager.
The Network Element transfers data to processors through the input buffers. The Net-
work Element selects the next free cell in the ring buffer for the processor and fills the data
and info fields in that cell. Then, the cell is enqueued for ownership by the processor. The
processor must access the cells in the order in which ownership is transferred from the NE
to the PE to preserve total packet ordering. When the PE detects a cell which it owns in its
input buffers, the PE must transfer the data and descriptor information from the cell into
local memory and return ownership of the cell to the Network Element. The emptying of
input buffer cells must be a high priority operation since the longer buffers are allowed to
remain full, the more likely flow control will be asserted.
Each processor uses the ring buffer manager to control ownership and determine the
ownership status of its buffer cells.
The processor uses a send operation to transfer ownership of an output buffer cell, and
a return operation to transfer ownership of an input buffer cell. Note that the processor can
only relinquish ownership of cells to the NE; the processor does not have the capability to
overtly acquire ownership of cells.
The status of a processor's buffers is determined by either the ready operation, for
output buffers, or the next operation, for input buffers. Each of these operations returns a
pointer to the buffer cell that must be used in the appropriate ring buffer. In the current NE
implementation, the output buffer only has one cell, so the ready operation always returns a
pointer of 0. The next operation returns a pointer between 0 and 31, inclusive. Each
operation also returns an invalid bit that, when clear, indicates that the pointer is valid. If the
processor does not own any cells in the associated buffer, the invalid bit will be set.
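The input-buffer ownership protocol described above can be illustrated with a small Python model. This is a sketch under stated assumptions, not the NE hardware interface: the class and method names (InputRing, ne_enqueue, return_cell) are hypothetical, and only the 32-cell size, the in-order access rule, and the pointer/invalid-bit result of the next operation are taken from the text.

```python
class InputRing:
    """Toy model of one processor's input ring buffer (32 cells, per the text)."""
    SIZE = 32

    def __init__(self):
        self.cells = [None] * self.SIZE
        self.pe_owned = [False] * self.SIZE
        self.head = 0   # next cell the NE will fill
        self.tail = 0   # oldest cell the PE has not yet returned

    def ne_enqueue(self, data, info) -> bool:
        """NE side: fill the next free cell and transfer ownership to the PE."""
        if self.pe_owned[self.head]:
            return False    # ring is full; flow control would be asserted here
        self.cells[self.head] = (data, info)
        self.pe_owned[self.head] = True
        self.head = (self.head + 1) % self.SIZE
        return True

    def next(self):
        """PE side: return (pointer, invalid); invalid is set if the PE owns no cell."""
        return self.tail, not self.pe_owned[self.tail]

    def return_cell(self):
        """PE side: copy out the oldest owned cell, then hand it back to the NE."""
        ptr, invalid = self.next()
        if invalid:
            return None
        packet = self.cells[ptr]
        self.pe_owned[ptr] = False
        self.tail = (ptr + 1) % self.SIZE
        return packet


ring = InputRing()
ring.ne_enqueue(b"packet A", "info A")
ring.ne_enqueue(b"packet B", "info B")
print(ring.return_cell())   # packet A is seen first, preserving packet order
print(ring.return_cell())
```

Because the processor can only give cells back and always reads from the oldest owned cell, the model preserves the total packet ordering that the NE established when it enqueued the cells.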
4.4.2.1.1. Class 0
The class 0 primitive is used when only the side effects of a data primitive are needed.
The class 0 primitive does not exchange any data. When a virtual group executes a class 0
primitive, all of the descriptor information in both the output and input info blocks is
defined. All of the information is valid, except for the vote syndrome, which is undefined.
The data in the output data block does not need to be defined, but the pointer in the out-
put info block must point into the processor's own output data block. The data in the input
data block is not guaranteed to be congruent among members of the destination virtual
group and must be ignored.
4.4.2.1.2. Class 1
The class 1 primitive performs a single round of exchange and vote on data from a
fault-masking virtual group (FMG). Only FMGs are allowed to execute the class 1 primitive,
since at least 3 independent copies of data are required for an unambiguous bitwise majority
vote.
The output data block of the source virtual group contains the copy of data to be voted
from the local processor. The input data block of the destination virtual group contains the
voted result. The contents of the input data block may be considered congruent among
members of the destination virtual group.
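The bitwise majority vote at the heart of the class 1 primitive can be sketched as follows. This is an illustrative model only: the function name, the use of Python bytes for packet data, and the layout of the returned disagreement mask (one bit per source member) are assumptions, not the NE's actual vote syndrome format.

```python
from typing import List, Tuple


def vote(copies: List[bytes]) -> Tuple[bytes, int]:
    """Return (voted data, mask of members whose copy differs from the voted data)."""
    n = len(copies)
    if n < 3:
        raise ValueError("a bitwise majority vote needs at least 3 copies")
    if len({len(c) for c in copies}) != 1:
        raise ValueError("all copies must have the same length")

    voted = bytearray(len(copies[0]))
    for i in range(len(voted)):
        for bit in range(8):
            ones = sum((c[i] >> bit) & 1 for c in copies)
            if 2 * ones > n:            # a strict majority has this bit set
                voted[i] |= 1 << bit

    disagree = 0
    for member, c in enumerate(copies):
        if c != bytes(voted):
            disagree |= 1 << member     # hypothetical syndrome bit for this member
    return bytes(voted), disagree


# Triplex example: the third member delivers a copy with one flipped bit.
print(vote([b"\x0f\xaa", b"\x0f\xaa", b"\x1f\xaa"]))
```

With three or four copies and at most one faulty member, at least two (or three) copies agree on every bit, so the vote is never tied and the faulty copy is masked.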
4.4.2.1.3. Class 2
The class 2 primitive performs a two round, or source congruency, exchange on data
from any virtual group. The source of the data may be a simplex virtual group or a single
member of a fault-masking group. The class 2 primitive is the only mechanism by which
simplexes are allowed to communicate with other virtual groups.
The output data block of the source virtual group member contains the data to be dis-
tributed by the class 2 primitive. The data in the output data blocks belonging to virtual
group members who are not sourcing data is ignored by the Network Element. The input
data block of the destination virtual group contains the exchanged data. The contents of the
input data block may be considered congruent among members of the destination virtual
group.
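A two-round source congruency exchange can be modeled in a few lines of Python. This is a generic sketch of the technique, assuming that in round 1 the source NE sends its value to every other NE and that in round 2 those NEs relay what they received to all NEs, which then vote; the function name, arguments, and the use of None as the "no majority" result are hypothetical and do not describe the NE microcode.

```python
from collections import Counter


def source_congruency(n, source, round1, relay_fault=None):
    """Simulate a two-round exchange among n NEs.

    round1[j] is the value the source NE sends to NE j in round 1
    (a faulty source may send different values to different NEs).
    relay_fault maps (relayer, receiver) to a corrupted round-2 value,
    modeling a faulty relaying NE.
    Returns the value each NE settles on, or None where no majority exists.
    """
    relay_fault = relay_fault or {}
    relayers = [j for j in range(n) if j != source]
    results = []
    for receiver in range(n):
        # Round 2: every relayer forwards the value it received in round 1.
        copies = [relay_fault.get((j, receiver), round1[j]) for j in relayers]
        value, count = Counter(copies).most_common(1)[0]
        results.append(value if 2 * count > len(copies) else None)
    return results


# Faulty source in a four-NE cluster: NE 0 sends inconsistent values.
print(source_congruency(4, 0, {1: "a", 2: "b", 3: "a"}))
```

In this model the non-faulty NEs always end with the same result, whether the single fault is in the source (which may send inconsistent values) or in one relayer, which is the congruency property the class 2 primitive relies on.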
4.4.2.1.4. Broadcasts
Broadcasts are a useful means of transmitting data to all active virtual groups in the
cluster. Broadcasts are more of a drain on system resources than the standard point-to-point
communication primitives, so only FMGs are allowed to send broadcasts. The use of
broadcasts should be minimized.
The broadcast primitive is invoked as a modifier to the existing exchange primitives.
Any of the primitives, including the data exchange and the special primitives, can be deliv-
ered as a broadcast. If a broadcast is used, the ToVID field in the output info block is ig-
nored. A virtual group can determine that a received packet was delivered as part of a
broadcast by examining the broadcast modifier bit in the class field of the input info block.
The contents of the input data block can be considered congruent among all operational PEs
in the cluster, unless the packet was delivered as part of a class 0 broadcast primitive.
4.4.2.2. Configuration Table Updates
The Network Element must keep track of the grouping of physical processors into vir-
tual groups. The NE uses a data structure known as the configuration table, or CT, to con-
tain this mapping. The CT also contains information for timeouts and vote masks. The CT
is modified whenever any of this information must be changed. The CT update primitive is
used to update the CT in a synchronous and atomic manner. Only a fault-masking group is
allowed to execute the CT update primitive, unless there are no FMGs in the cluster. The
CT update must be done with a class 1 exchange class, unless a simplex is performing the
CT update under the previous exception.
The configuration table on the NE consists of a number of entries, with each entry cor-
responding to a potential virtual group. Not all of the potential virtual groups will necessar-
ily be active at any one time. The CT update primitive modifies a subset of the CT entries in
one atomic action. Up to eight CT entries can be modified with a single CT update
primitive, enough to allow one complete quadruplex to be formed or disbanded from or to its
constituent simplex virtual groups. Multiple virtual groups can be formed or disbanded by
one or more successive CT updates.
Each entry in the CT update packet is a direct replacement for the selected entry in the
configuration table. The entry contains the VID number, which selects the CT entry to be
updated. The entry also specifies the redundancy level of the new virtual group, a mask
detailing which members of the virtual group are considered to be functional (for data vot-
ing), and the value to be used for timeouts on the virtual group. Finally, the entry contains
a list of the physical processors that make up the virtual group. The list specifies the Net-
work Element ID to which the processor is connected and the buffer set the processor uses
to communicate with its Network Element. The list is only as long as is indicated by the
redundancy level of the entry.
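The CT entry fields listed above, and the all-or-nothing character of a CT update, can be sketched with a pair of Python data classes. Field names, types, and the apply_ct_update helper are hypothetical; only the field list and the limit of eight entries per update come from the text, and the real table lives inside the Network Element rather than in processor memory.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Member:
    ne_id: int        # Network Element to which the processor is connected
    buffer_set: int   # buffer set the processor uses on that NE


@dataclass(frozen=True)
class CTEntry:
    vid: int                      # selects the CT entry to be replaced
    redundancy: int               # redundancy level of the new virtual group
    vote_mask: int                # members considered functional for data voting
    timeout: int                  # timeout value for the group (units assumed)
    members: Tuple[Member, ...]   # as many members as the redundancy level


MAX_ENTRIES_PER_UPDATE = 8


def apply_ct_update(table: Dict[int, CTEntry], update: List[CTEntry]) -> Dict[int, CTEntry]:
    """Validate every entry first, then replace them all; return the new table."""
    if len(update) > MAX_ENTRIES_PER_UPDATE:
        raise ValueError("a CT update carries at most eight entries")
    for entry in update:
        if len(entry.members) != entry.redundancy:
            raise ValueError("member list must match the redundancy level")
    new_table = dict(table)             # the old table is untouched on failure
    for entry in update:
        new_table[entry.vid] = entry    # direct replacement of the selected entry
    return new_table


triplex = CTEntry(vid=10, redundancy=3, vote_mask=0b111, timeout=100,
                  members=(Member(0, 0), Member(1, 0), Member(2, 0)))
table = apply_ct_update({}, [triplex])
print(table[10].redundancy)
```

Validating the whole update before touching the table is one simple way to model the atomicity requirement: either every listed entry is replaced or none is.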
4.4.2.3. Initial Synchronization
Initial synchronization, or ISYNC, is the procedure by which the Network Elements
become synchronized at power-up or after a system reset. The ISYNC process can be ini-
tiated by either a processor connected to an NE or by the global controller on the NE. The
latter is only an option for systems in which the microcode is non-volatile.
The following indicate the expected condition of the system after power-up.
• FTCs self-synchronize within S seconds
• ≤ F FCRs are faulty
• all functioning FCRs were powered-up or reset within T seconds of each other
(i.e. maximum skew is T seconds).
• Cardinality and connectivity requirements to survive F Byzantine faults are
satisfied.
From the above assumptions, it is clear that the system meets the requirements for
Byzantine resilience if a means for exchanging single source data is provided. However,
the skew is unacceptable for operational mode. The purpose of ISYNC is to reduce the
skew to acceptable levels.
The condition of being in ISYNC mode is treated as a piece of single source data. By
exchanging this data through a source congruency, each FCR reaches a consensus about
the relative synchronization state of the cluster.
A message string indicates the synchronization state of the cluster. Each FCR has one
entry in the message string. Nominally the message string is:
m0 m1 m2 m3 m4
which indicates that all FCRs are ready to synchronize. The presence of message mi
(mi = 0x0AEC6BF0, for all i) indicates that FCR i is ready to synchronize. A message xi,
where xi ≠ mi, indicates that FCR i is not ready to synchronize. All non-faulty FCRs
which are not ready to synchronize are designed to send xi rather than mi. An example of a
message string which indicates that only FCRs 1 and 4 are ready to synchronize is:
x0 m1 x2 x3 m4
Each message in the message string is a single-source message from the respective
FCR. Consequently, the messages must be exchanged using a 2-round source congruency
algorithm.
The algorithm for performing ISYNC is as follows. At startup the ISYNC message
string is exchanged using a 2 round exchange. Each FCR broadcasts the message xi until
the ISYNC initiator requests message mi. The ISYNC initiator must wait at least (T+S)
seconds after power up before requesting mi. An ISYNC timeout with a timeout period > T
is started after 2F+1 valid messages are observed in the message string. ISYNC is
terminated when one of the following conditions is met:
• all valid messages are observed in the message string
• the ISYNC timeout period expires.
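The decision each NE makes after an ISYNC exchange can be summarized in a short Python sketch. The function name, the string results, and the timeout_expired argument are hypothetical; the ready value 0x0AEC6BF0, the 2F+1 threshold, and the two termination conditions are taken from the text. The caller is assumed to pass timeout_expired as true only if the ISYNC timeout has previously been started.

```python
M_READY = 0x0AEC6BF0   # message mi: "FCR i is ready to synchronize"


def isync_decision(message_string, f, timeout_expired):
    """Classify the outcome of one exchange of the ISYNC message string.

    message_string holds one received message per FCR; any value other
    than M_READY plays the role of xi (not ready).
    """
    valid = sum(1 for msg in message_string if msg == M_READY)
    if valid == len(message_string) or timeout_expired:
        return "terminate"          # all FCRs ready, or the timeout ran out
    if valid >= 2 * f + 1:
        return "run_timeout"        # enough FCRs ready to start the ISYNC timeout
    return "keep_exchanging"


X = 0   # stands in for any xi
print(isync_decision([X, M_READY, X, X, M_READY], f=1, timeout_expired=False))
```

For the example string x0 m1 x2 x3 m4 with F = 1, only two valid messages are present, fewer than 2F+1 = 3, so the exchange simply continues.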
Figure 4-6 illustrates the timeline for 5 Network Elements performing the ISYNC al-
gorithm.
[Figure 4-6 diagram not reproduced. It shows timelines for Network Elements A through E
with the legend: power-up/reset period, FTC sync period, ISYNC period, operational mode;
and the annotations "ISYNC timeout set", "T", and "ISYNC timeout would expire".]
Figure 4-6. ISYNC Procedure
ISYNC is attempted for a period of 2T seconds. If ISYNC has not succeeded by this
time, the NE terminates ISYNC and enters the transient NE recovery procedure. The
transient NE recovery procedure is described in Section 4.4.2.4.
4.4.2.4. Transient NE Recovery
The transient NE recovery procedure, or TNR, is similar to the ISYNC procedure.
Transient NE recovery is used to reintegrate a failed NE. An NE which has suffered a
transient failure may not have any permanent faults that prevent it from functioning as a
member of the cluster. However, the NE must be resynchronized and realigned with the
working group before it can be declared non-faulty.
An NE which has suffered a transient failure may have been reset by a voted reset, the
watchdog timer, or the power-on reset. The NE will enter ISYNC mode as a part of the
boot procedure. This NE has no way of knowing that a working group exists. It will at-
tempt to perform exchanges of the ISYNC message string as defined in Section 4.4.2.3.
Since the working group is performing other packet exchanges, the ISYNC procedure will
fail. After a period of 2T, the failed NE will terminate ISYNC and enter TNR.
The TNR procedure is similar to the ISYNC procedure. The same message string is
used to determine whether a particular NE is in TNR. The major difference is
that TNR is terminated immediately whenever any new Network Elements are observed in
the message string. The current NE mask (as defined in Section 4.4.3.3.2) is used as a
reference for the message string. For this reason, the working group must set their NE mask
to reflect the current working group configuration before executing TNR.
The failed NE enters TNR after failing ISYNC. The working group enters TNR when a
virtual group requests to send a transient NE recovery (TNR) packet. A single TNR packet
exchange is performed, and the resulting message string is examined by the NEs to
determine if any new NEs, as compared to the NE mask, are present. If no new NEs are
observed, the functioning NEs return to the operational mode unscathed. Upon entering TNR
from ISYNC, a Network Element remains in the TNR state indefinitely unless and until a
successful TNR exchange is observed. If the TNR operation is successful, as indicated by
one or more new NEs observed in the message string, the CT is exchanged and voted into
the newly recovered NE(s).
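The comparison of the message string against the NE mask can be sketched as follows. This is
a minimal illustration, assuming that both the observed message string result and the NE mask
can be represented as bit masks with one bit per Network Element; that encoding and the
function names are hypothetical and are not taken from the report.

```python
def new_nes(observed: int, ne_mask: int) -> int:
    """Return the bits set in `observed` that are absent from `ne_mask`.

    Both arguments are hypothetical bit masks, one bit per Network Element.
    """
    return observed & ~ne_mask


def tnr_succeeded(observed: int, ne_mask: int) -> bool:
    # TNR succeeds when one or more new NEs appear in the message string.
    return new_nes(observed, ne_mask) != 0
```

With an NE mask of 0b00011, an observed value of 0b00111 reports one new NE, whereas an
observed value of 0b00011 reports none and the working group returns to operational mode.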
Successful completion of TNR requires alignment of the state of the reintegrated NE to
reflect the state of the working group. The configuration table is aligned by exchanging and
voting the entire contents of the CT. Timeouts in the scoreboard are aligned by resetting all
timeouts, which effectively restarts all packet ready and flow control conditions. The global
synchronous timer is realigned by exchanging and voting the timer value. One consequence
of voting the timer is that the timer value effectively stops incrementing until the
realignment of the timer is complete. The packet buffers in the recovered NE are set to their
initial condition after power up.
Normally all PEs in the reintegrated NE are declared as inactive spares (i.e. simplexes).
The state of these PEs is assumed to be the same as a freshly powered-up processor. These
spares can be integrated into an existing virtual group by a CT update following successful
completion of TNR. Alignment of these PEs with an existing virtual group is the respon-
sibility of the reconfiguration authority task. The reconfiguration authority is also respon-
sible for broadcasting the current system configuration to these new PEs.
The functioning NEs assume that the recovering NE has resynchronized its FTC to the
remaining ensemble and has entered TNR mode. For this reason, a functioning ensemble
must wait at least 2T seconds after performing a voted reset before attempting TNR.
4.4.2.5. Voted Resets/Monitor Interlocks
The AFTA architecture is designed to tolerate any single random fault. In addition, the
system is designed to allow attempts to restart a failed element or reconfigure around the
failed element. Voted resets are a mechanism to attempt recovery from transient faults, and
monitor interlocks are one mechanism for reconfiguration around permanent faults. Voted
resets and monitor interlocks are exchanged along with data exchanges on the interconnect-
ing fiber optics. No additional interconnects are required to distribute this information.
If a virtual group detects a Processing Element or Network Element in error, the virtual
group may try to restart the element by asserting a voted reset. A voted reset can be directed
to either a single Processing Element within an FCR, or to the entire FCR. Since resetting a
Network Element also resets all processors attached to the Network Element, there is no
separate provision for just resetting a Network Element. Only a fault-masking group is al-
lowed to assert a voted reset, unless there are no FMGs in the cluster. The voted reset
primitive must be done with a class 1 exchange class, unless a simplex is performing the
voted reset under the previous exception.
When an entire FCR is lost due to a catastrophic failure, all of the I/O on the
lost FCR becomes unavailable. Critical I/O is usually replicated in multiple FCRs, so that
one FCR can take over I/O activity for another FCR if the second FCR fails. The first FCR
must obtain access to the I/O device, and the second FCR must be disabled from driving
the device. A monitor interlock can be used to attempt to turn off the driver in the failed
FCR to allow the new FCR to assume ownership of the device. Only a fault-masking
group is allowed to assert a monitor interlock, unless there are no FMGs in the cluster. The
voted reset primitive must be done with a class 1 exchange class, unless a simplex is
performing the monitor interlock under the previous exception.
The built-in support on the Network Element for voted resets and monitor interlocks
includes a special packet type for executing the primitives and a set of discrete outputs for
invoking the desired side effect. The discrete outputs are only asserted for one FTC cycle
following the exchange of the voted reset or monitor interlock packet. Additional support
circuitry may be required to interface the discrete output to processing or I/O elements.
4.4.2.6. Syndrome Reports
The Network Element places syndromes in the input info block whenever a packet is
successfully delivered to a virtual group. The syndromes must not be considered congruent
across all members of a destination virtual group.
The syndromes indicate various anomalies the Network Element observes during the
execution of an exchange primitive. The syndromes can be divided into two major classes:
NE syndromes and scoreboard syndromes.
The NE syndromes indicate any anomalous behavior detected anywhere on the NE except
within the scoreboard. The NE syndromes, located in the second longword of the input
info block buffer cell, include indications of vote errors, fault-tolerant clock
synchronization errors, and fiber-optic link errors.
The vote syndrome indicates that one or more channels did not agree with the final
voted result. For a class 1 exchange, a vote syndrome means that the channel did not
produce the expected output. For a class 2 exchange, a vote syndrome means that the channel
did not forward the correct value during the second exchange round. Since the second
round of a class 2 is completely contained within the Network Element, the Network
Element is indicted by a vote syndrome on a class 2 exchange. Either the processor or the
Network Element is indicted by a vote syndrome on a class 1 exchange. The vote
syndrome is undefined for a class 0 exchange.
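The indictment rules above can be summarized in a small lookup. This is only a sketch: the
function name and the string labels are hypothetical, and the argument is the exchange class
number (0, 1, or 2) rather than the encoded value of the class field.

```python
def indicted_parties(exchange_class: int):
    """Return the parties a vote syndrome may indict for a given exchange class."""
    if exchange_class == 0:
        return None  # the vote syndrome is undefined for class 0
    if exchange_class == 1:
        # The faulty output may originate in the processor or in its NE.
        return ("processor", "network element")
    if exchange_class == 2:
        # The second round is contained entirely within the NE.
        return ("network element",)
    raise ValueError("exchange class must be 0, 1, or 2")
```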
The clock and link syndromes are not necessarily associated with the packet on which
they are delivered. Each syndrome represents an occurrence of the indicated error at some
time between delivery of the previous packet and delivery of the current packet.
The clock syndrome indicates that the rising edge of the fault-tolerant clock (FTC) sig-
nal from the associated channel did not fall within the acceptable skew with reference to the
local FTC signal. Since the local channel is always synchronized with itself, the clock syn-
drome bit corresponding to the local channel, bit 0, will always be zero.
The link syndrome indicates that a violation was reported by the TAXI receiver chip for
the associated channel. A violation indicates that the TAXI received an invalid pattern over
the fiber-optic interface. Violations are usually a sign of catastrophic failure in the affected
Network Element or a break in the physical fiber-optic link. Since there is no TAXI re-
ceiver chip associated with the local channel, the link syndrome bit for the local channel, bit
0, will always be zero.
The scoreboard syndromes indicate anomalous behavior detected by the scoreboard
during SERP processing. The scoreboard syndromes, located in the third longword of the
input info block buffer cell, include indications of scoreboard vote errors, OBNE timeouts,
and IBNF timeouts.
The scoreboard vote syndrome indicates that one or more channels of the source virtual
group did not agree with the voted result for the class, destination VID, or user byte. Data
in the SERP is exchanged using a 2 round exchange, voted in the data paths, and voted
again in the scoreboard. The scoreboard vote syndromes only indicate vote errors detected
during voting in the scoreboard; there is no indication of vote errors during data path voting
of the SERP.
The OBNE timeout syndrome indicates that the associated virtual group member(s) did
not place a packet in their output buffers within the timeout skew of the majority of the
virtual group members. The timeout skew is specified by the timeout field in the CT entry. All
members are expected to transmit a packet simultaneously, within the timeout skew. If a
majority, but not a unanimity, of virtual group members are observed with packets in their
output buffers, a timeout is initiated. If the timeout expires before the other members
transmit the packet, the remaining members are ignored, the packet is exchanged, and an
OBNE timeout is recorded.
The IBNF timeout syndrome indicates that the associated virtual group member(s) did
not deassert flow control on their input buffers within the timeout skew of the majority of
the virtual group members. The timeout skew is specified by the timeout field in the CT
entry. All members are expected to free space in their input buffers simultaneously, within
the timeout skew. If a majority, but not a unanimity, of virtual group members are ob-
served with deasserted flow control on their input buffers, a timeout is initiated. If the
timeout expires before the other members empty their input buffers, the remaining members
are ignored, flow control is deasserted for the virtual group, and an IBNF timeout is
recorded.
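The majority-plus-timeout rule shared by the OBNE and IBNF syndromes can be sketched as
follows. This is a simplified model under stated assumptions: times are abstract integer
units, the timeout starts at the instant a majority is first observed, and the function name
and return convention are hypothetical. The actual scoreboard evaluates these conditions
during SERP processing, which this sketch does not model.

```python
def resolve_timeout(ready_times: dict, timeout: int):
    """Decide when a group proceeds and which members time out.

    ready_times maps each member to the time it became ready
    (None if it never did). Returns (proceed_time, late_members);
    proceed_time is None if a majority is never reached.
    """
    members = sorted(ready_times)
    arrivals = sorted(t for t in ready_times.values() if t is not None)
    majority = len(members) // 2 + 1
    if len(arrivals) < majority:
        return None, []
    # The timeout is initiated when a majority is first observed.
    deadline = arrivals[majority - 1] + timeout
    late = [m for m in members
            if ready_times[m] is None or ready_times[m] > deadline]
    if not late:
        # Everyone arrived within the skew: proceed at the last arrival.
        return arrivals[-1], []
    # Otherwise proceed at expiry and record the stragglers.
    return deadline, late
```

For a triplex whose members become ready at times 10, 12, and 40 with a timeout of 5, the
sketch proceeds at time 17 and records a timeout against the third member.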
4.4.2.7. Timestamps
The Network Element places a timestamp in the input info block for each packet suc-
cessfully delivered to a virtual group. The timestamps are congruent across all members of
the destination virtual group, and across all active processors in the case of a broadcast.
The timestamp is a 32-bit quantity that indicates the relative time within the cluster. An
external time source (such as a GPS reference or a time-of-day clock built into the PEs) can
be used to add a constant to the timestamp to gauge absolute time. The resolution of the
timestamp value is 1.28 µs. The maximum timestamp value is 4,294,967,295, or
0xFFFFFFFF, which corresponds to approximately 5500 seconds. When the timestamp
counter reaches the maximum value, it wraps around to 0 with no indication to the
processor. The processors must obtain timestamps often enough to detect wraparound.
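One way a processor might detect wraparound is to extend each 32-bit timestamp in software.
The sketch below is illustrative only: the class and function names are hypothetical, and it
assumes timestamps are presented in delivery order and at least once per wraparound period.
It does not account for the stall in the counter during TNR.

```python
TICK_SECONDS = 1.28e-6   # timestamp resolution given in the text
WRAP = 1 << 32           # the counter wraps from 0xFFFFFFFF to 0


class TimestampExtender:
    """Extend 32-bit NE timestamps into a monotonically increasing count."""

    def __init__(self):
        self._last = None
        self._wraps = 0

    def extend(self, raw: int) -> int:
        if not 0 <= raw < WRAP:
            raise ValueError("timestamp must fit in 32 bits")
        # A smaller value than the previous one implies a wraparound.
        if self._last is not None and raw < self._last:
            self._wraps += 1
        self._last = raw
        return self._wraps * WRAP + raw


def ticks_to_seconds(ticks: int) -> float:
    return ticks * TICK_SECONDS
```

The full counter range, 2^32 ticks of 1.28 µs, is about 5498 seconds, which agrees with the
figure of approximately 5500 seconds quoted above.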
The timestamp counter is initialized to zero during ISYNC and increases monotonically
thereafter, except during transient NE recovery (TNR). When a Network Element is
reintegrated using TNR, the timestamp counters are realigned as part of the recovery process to
ensure congruency of the timestamps. This realignment causes the timestamp counters to
cease increasing until the realignment is complete. Consequently, the timestamps after TNR
will be slightly smaller than they would be if TNR were not performed. The processors can
correct this error by measuring the duration of TNR with internal timers and applying the
resulting correction constant to the new timestamps.
4.4.2.8. NE Debug Commands
The following describes the debug commands supported by the Network Element de-
bugger. These commands are implemented in a special version of the microcode. The de-
bug commands can be used to debug new Network Element hardware and for performing
stand-alone self-testing of the Network Element.
wrap_vme(byte1, byte2) - Copies byte1 to the VDAT bus register, then copies the
    VDAT bus register to byte2.
wrap_serp(proc, serp_entry) - Generates the SERP entry for proc and returns it in
    serp_entry.
wrap_to_input(pack, proc) - Copies pack (64 byte packet) to the selected input
    buffer.
reflect_from(channel) - Reflects the packet stored in the selected channel to the
    VDAT bus. Any channels which have their debug routers enabled or are
    connected to a fiber-optic wrap path will receive the reflected data.
vote_deliver1(proc) - The packet in the data path FIFOs is voted and stored in the
    input buffer selected by proc. Voting rules for Class I (voted) exchanges are
    used (PE mask ANDed with NE mask).
vote_deliver2(proc) - The packet in the data path FIFOs is voted and stored in the
    input buffer selected by proc. Voting rules for Class II (source congruency)
    exchanges are used (NE mask with source masked out).
clear_output(proc) - Clears the selected processor's output buffer.
clear_input(proc) - Clears the selected processor's input buffer.
clear_datap(channel) - Clears the selected channel's data path FIFO.
write_pattern(pattern) - Writes a pre-determined byte pattern into pattern.
write_pe_mask(mask) - Writes mask to the PE mask register in the data path voter.
write_ne_mask(mask) - Writes mask to the NE mask register in the data path voter.
ct_enter(ctentry) - The configuration table entry specified by ctentry is copied into
    the CT on the NE. A scoreboard CT update is not performed.
ct_update(ctentry) - The configuration table entry specified by ctentry is copied into
    the CT on the NE. Then, a scoreboard CT update is performed.
mask_update(mask) - Updates the NE mask in the configuration table.
process_serp(serp, class, proc, fromvid, tovid, pemask, nemask) - serp is entered into
    the DPRAM of the scoreboard. Then, the global controller requests the
    scoreboard to process serp. The results of the processing are returned to the
    debugger through the VME DPRAM.
next_message(class, proc, fromvid, tovid, pemask, nemask) - The scoreboard is
    requested to continue processing a previously entered serp and return the next
    message ready to be sent.
lerp_to_dpram(lerp) - Generates a LERP message for the local NE and copies it to
    the data DPRAM. This LERP is never exchanged or used by the scoreboard.
return_time(timestamp) - Returns the 32-bit value stored in the global synchronous
    timer.
write_debug_enables(enablepattern) - Selects each debug router (for external channels
    only) to receive data either from the external interface (via the TAXI
    receiver for that channel) or from the local data source (the VDAT bus).
infinite_loop() - The global controller jumps to a single-state infinite loop. The
    watchdog timer is not kicked during the loop, so the NE should be reset by
    the watchdog timer after the timeout period has expired.
4.4.3. Network Element Programming Reference
This section describes the procedures for accessing the Network Element from the Pro-
cessing Elements. The section defines the memory map of the NE, the format of data and
control registers on the Network Element, and packet formats used by the data exchange
primitives to transmit data between virtual groups via the Network Element.
4.4.3.1. Processor Network Element Interface
Processors communicate with the NE via the VMEbus. The processors must be capable
of performing VME bus master functions A24 and/or A32, and D32. The processors must
also function as A16 and D08(o) slaves. Packet data is transferred by the processor to and
from the Network Element using the processor master modes. The NE delivers mailbox
interrupts to the processor using the slave modes.
The Network Element responds to both supervisor and user address modifier codes,
allowing applications software to write directly to the Network Element. Should it be nec-
essary to prevent user accesses, simple modifications can be made to the Network Elements
so that only supervisor address modifiers are acknowledged.
Byte ordering in the NE VMEbus ports follows the Motorola convention, a.k.a. big
endian, with the most significant byte in the lowest address. Any Processing Element
selected for use in the AFTA is expected to comply with the byte ordering convention
specified by the VMEbus.
The base address of the NE must be on a 64K byte boundary (i.e. lower 16 bits
cleared). All accesses to the Network Element by the PEs are 32 bit cycles. Some NE ports
may not define all 32 bits of the accessed word. In fact, some locations are accessed for
their side-effects and have no data associated with them.
The Network Element has the capability of delivering a signal to the processors. The
NE can perform D08(O) master cycles in the A16 address space of the VMEbus. The actual
location and the data to be written can be specified for each processor. The location may be
a memory location in the processor's RAM or it may be a register for a processor mailbox
interrupt, depending on the Processing Element chosen. The signal can be delivered on
packet transmission, packet reception, or on the input buffer full (IBF) condition. Mi-
crocode changes select the signal delivery condition.
4.4.3.2. Memory Map
A memory map of the NE as viewed from the VMEbus is shown in Figure 4-7. The
memory map is divided into two main segments. The first is the data segment, which is
used to transfer data between the processors and the Network Element. The data segment
also contains status and control registers for each processor. The data segment maps into
the DPRAM memory on the NE. The second segment is the buffer manager. The buffer
manager regulates the use of the dual-port RAM to prevent contention for data resources
between the processors and the NE.
[Figure 4-7: the data segment (dual-port RAM) starts at NEbase+00000 and the buffer
manager segment starts at NEbase+10000.]
Figure 4-7. NE Memory Map
Each Network Element is attached to a maximum of 8 processors. Each processor has
its own window of addresses in the DPRAM and its own set of ports in the buffer man-
ager.
4.4.3.2.1. Data Segment
The data segment is implemented with four 4K x 8 dual-port RAM devices. These devices
provide data buffering between the NE and the PEs. Because it is dual-ported, each side
can access the data segment asynchronously, provided that they do not access the same
location. Arbitration to prevent simultaneous access is performed by the buffer manager. A
PE's adherence to the buffer manager arbitration must be ensured by the operating system
message passing software, i.e. there is no hardware enforcement. System software on the
PE must be written to follow the ownership rules of buffer cells in the data segment.
Failure to adhere to the ownership rules described in Section 4.4.3.2.2 could result in
overwriting an incoming packet or, in the worst case, the desynchronization of the NE.
The data segment is divided into equal sized windows for each of the maximum of eight
processors per NE. Each processor has an identical structure superimposed on this
window. The structure of the data segment is shown in Figure 4-8.
Each processor window in the data segment is divided into 5 blocks; four are used for
packet transfer and one is unused. The four used blocks are paired into input and output
data blocks. One of the blocks in each pair is used for actual data communication; these
blocks are referred to as data blocks. The other blocks, known as info blocks, contain
information pertaining to the data in the corresponding data block.
[Figure 4-8: the data segment holds one window per processor, starting at offsets 00000,
02000, 04000, 06000, 08000, 0A000, 0C000, and 0E000 from NEbase for processors 0
through 7. Each window contains, in order, the input data block, the output data block, the
input info block, the output info block, and a reserved block; the offsets legible in the
figure are 0000, 1000, 1800, 1C00, and 1C10. The worked example shows input data cells
starting with IBUF 0 at 08000 and the next cell at 08040, and an input info cell occupying
09810 through 0981C that holds the bufptr, class, To VID, From VID, user, vote error,
clock error, and timestamp fields.]
Figure 4-8. DPRAM Memory Map
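The address arithmetic implied by Figure 4-8 can be sketched as follows. The window stride
and block offsets used here are read from the figure and should be treated as assumptions
rather than a definitive register map; the 64-byte data cell and 16-byte input info cell
sizes are those given in Section 4.4.3.2.1.2. All names are hypothetical.

```python
WINDOW_STRIDE = 0x2000    # assumed spacing between processor windows
INPUT_DATA = 0x0000       # assumed block offsets within a window
INPUT_INFO = 0x1800
DATA_CELL_BYTES = 64
INPUT_INFO_CELL_BYTES = 16
INPUT_CELLS = 64
MAX_PROCESSORS = 8


def window_base(ne_base: int, proc: int) -> int:
    if ne_base & 0xFFFF:
        raise ValueError("NE base must lie on a 64K byte boundary")
    if not 0 <= proc < MAX_PROCESSORS:
        raise ValueError("processor number must be 0..7")
    return ne_base + proc * WINDOW_STRIDE


def input_data_cell(ne_base: int, proc: int, cell: int) -> int:
    if not 0 <= cell < INPUT_CELLS:
        raise ValueError("cell number must be 0..63")
    return window_base(ne_base, proc) + INPUT_DATA + cell * DATA_CELL_BYTES


def input_info_cell(ne_base: int, proc: int, cell: int) -> int:
    if not 0 <= cell < INPUT_CELLS:
        raise ValueError("cell number must be 0..63")
    return window_base(ne_base, proc) + INPUT_INFO + cell * INPUT_INFO_CELL_BYTES
```

Under these assumptions, processor 4 has its first two input data cells at offsets 0x08000 and
0x08040 and the info cell for cell 1 at 0x09810, which is consistent with the addresses
legible in the figure.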
4.4.3.2.1.1. Outgoing (Transmit) Buffering
The output ring buffer is used to send packets to the NE virtual bus. The output data
block and the output info block comprise a ring buffer of 1 cell. The output ring buffer cell
consists of 64 bytes in the output data block and 8 bytes in the output info block. Only the
locations in the output info block are associated a priori with the buffer. The bufptr field in
the info cell points to the beginning of the data cell in the output data block. The output data
block is a flat memory-mapped section that can be used at the processor's discretion. The
output data cell is considered to be a contiguous 64 byte block starting at the location
indicated by the value of bufptr. If the output info cell is owned by the NE, the data cell is also
considered to be owned by the NE. All other locations in the output data block are owned
by the PE.
When the processor wants to send a packet to the NE, it first makes sure that the output
buffer is empty by either polling the buffer manager waiting for the OBF (Output Buffer
Full) bit to be deasserted (low), or by waiting for the packet transmit signal from the previ-
ous packet transmission (if the NE-PE signal is enabled for the packet transmit condition).
Next, the processor finds an unused 64 byte cell in the output data block. A 64 byte block
must be allocated even if no data is to be exchanged. The pointer to the cell is entered in the
bufptr location of the output info cell. The header information is copied into the other loca-
tions in the info cell, and the data (if any) is copied into the data cell allocated above. Fi-
nally, the packet is sent by informing the buffer manager that the ring buffer cell contains a
valid packet. The processor is informed, either through the buffer manager OBF bit or the
packet transmit signal, when the packet is sent and the ring buffer cell can be used for an-
other packet.
When the NE observes an output buffer which contains a valid packet, the NE ensem-
ble determines, using the scoreboard, whether this packet represents a packet to be ex-
changed, voted, and delivered by the NE data path hardware. If the packet is validated by
the scoreboard, the NE reads the packet from the buffer and returns ownership of the
buffer to the processor. The processor must not access the buffer until it is returned by the
NE.
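The transmit sequence described above can be modeled in software as follows. This is a
behavioral sketch only, not driver code: the class and method names are hypothetical, the
2048-byte output data block size is an arbitrary value chosen for illustration, and the OBF
flag stands in for the bit read from the buffer manager.

```python
CELL_BYTES = 64  # size of one output data cell, from the text


class OutputBufferModel:
    """Software model of one processor's single-cell output ring buffer."""

    def __init__(self, data_block_bytes: int = 2048):  # size is an assumption
        self.data_block = bytearray(data_block_bytes)
        self.bufptr = 0
        self.header = None
        self.obf = False  # Output Buffer Full: True while the NE owns the cell

    def pe_send(self, header, payload: bytes = b"", bufptr: int = 0) -> None:
        # Step 1: the output buffer must be empty (OBF deasserted).
        if self.obf:
            raise RuntimeError("output buffer is still owned by the NE")
        if len(payload) > CELL_BYTES:
            raise ValueError("payload exceeds one 64 byte cell")
        if bufptr < 0 or bufptr + CELL_BYTES > len(self.data_block):
            raise ValueError("cell does not fit in the output data block")
        # Step 2: a full 64 byte cell is allocated even when there is no data.
        self.data_block[bufptr:bufptr + CELL_BYTES] = payload.ljust(CELL_BYTES, b"\x00")
        # Step 3: fill in bufptr and the header fields of the info cell.
        self.bufptr = bufptr
        self.header = header
        # Step 4: tell the buffer manager the cell holds a valid packet.
        self.obf = True

    def ne_take(self):
        """NE side: read the validated packet and return ownership to the PE."""
        if not self.obf:
            raise RuntimeError("no valid packet in the output buffer")
        start = self.bufptr
        packet = (self.header, bytes(self.data_block[start:start + CELL_BYTES]))
        self.obf = False
        return packet
```

A second call to pe_send before ne_take raises an error, mirroring the rule that the
processor must not access the buffer until the NE returns it.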
4.4.3.2.1.2. Incoming (Receive) Buffering
The input ring buffer is used to receive packets from the NE virtual bus. The input data
block and input info block together comprise a ring buffer of 64 cells. Each cell in the input
ring buffer consists of 64 bytes in the input data block and 16 bytes in the input info block.
Each cell in the input data block corresponds to a cell in the input info block.
Once a packet is validated for transmission, the NE exchanges and votes the packet.
The voted packet is then delivered to the receiving virtual group's input buffer along with
the header information. When the NE wants to deliver a packet to a processor, it first
obtains a ring buffer cell from the buffer manager. Then, the packet data is written into the
proper data cell, and the header information is written into the info cell. Next, the NE
transfers ownership of the buffer to the PE using the buffer manager. The buffer cell is
considered owned by the processor until the processor explicitly returns ownership of the cell to
the NE. Finally, if the packet receive signal is enabled, the NE sends a signal to the
processors which make up the receiving virtual group.
When a processor observes a packet delivery, either by polling the IBE bit or by
reception of the packet receive signal, the processor obtains the location of the new packet by
reading the ready location in the buffer manager. The ReadyIB field indicates the cell
number of the oldest unread packet in the processor's input ring buffer. The processor copies
the data and corresponding header information from the input ring buffer cell into the
processor's local memory. When the processor is done with the ring buffer cell, the cell is
freed for use by another packet by accessing the return location in the buffer manager.
Broadcast packets are delivered to all input ring buffers within an AFTA cluster,
whether or not the associated processor is a member of an active virtual group.
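The receive sequence can be modeled as a 64-cell ring in which cells pass from the NE to the
PE and back. The sketch below is a behavioral model under stated assumptions: cells are
filled and returned strictly in order, and the class and method names are hypothetical
stand-ins for the buffer manager's ready and return locations.

```python
from collections import deque


class InputRingModel:
    """Software model of one processor's 64-cell input ring buffer."""

    CELLS = 64

    def __init__(self):
        self.cells = [None] * self.CELLS
        self._next_free = 0
        self._ready = deque()  # cell numbers owned by the PE, oldest first

    def ne_deliver(self, header, data: bytes) -> int:
        """NE side: fill the next free cell and pass ownership to the PE."""
        if len(self._ready) == self.CELLS:
            raise RuntimeError("input buffer full")
        cell = self._next_free
        self.cells[cell] = (header, bytes(data))
        self._next_free = (cell + 1) % self.CELLS
        self._ready.append(cell)
        return cell

    def pe_ready(self):
        """PE side: cell number of the oldest unread packet, or None."""
        return self._ready[0] if self._ready else None

    def pe_read_and_return(self):
        """PE side: copy out the oldest packet and return the cell to the NE."""
        cell = self.pe_ready()
        if cell is None:
            return None
        packet = self.cells[cell]
        self.cells[cell] = None
        self._ready.popleft()
        return packet
```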
4.4.3.2.1.3. Information Block Fields
The information blocks contain control and status information associated with data lo-
cated in the data blocks. In previous versions of the FTPP, this information was either pre-
fixed or appended to the packets in the data FIFOs, or entered into the class FIFO. A brief
discussion of each field is contained below. In the case of the error fields, a set bit corre-
sponds to an observed error, and a cleared bit corresponds either to observed normal be-
havior or to an undefined channel or FCR.
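Since a set bit in an error field denotes an observed error, a processor might decode any of
the 8-bit error fields with a helper such as the one below. The function name is
hypothetical, and mapping bit numbers to particular NEs depends on whether the field uses the
relative or absolute NE format, which this sketch does not attempt.

```python
def set_bits(field: int, width: int = 8) -> list:
    """Return the bit numbers (0 = least significant) that are set in field."""
    return [bit for bit in range(width) if field & (1 << bit)]
```

For example, a field value of 0b00010100 decodes to bits 2 and 4.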
4.4.3.2.1.3.1. Class
The packet class selects the data exchange primitive to be executed by the NE. The
packet class is a full 8 bit field, yielding a maximum of 256 different packet classes. Indi-
vidual bit fields in the class field define particular aspects of the packet class as shown be-
low in Figure 4-9.
[Figure 4-9: bit positions 7 through 0 of the class field, divided into the mode, packet type,
and exchange class subfields.]
Figure 4-9. Packet Class Field
The exchange class defines the protocol to be used when exchanging the packet. Cur-
rently, the following values are valid:
0-class 0 (no data)
1-class 1 (one round exchange)
2-class 2 (two round exchange) from member on Network Element A
3-class 2 (two round exchange) from member on Network Element B
4-class 2 (two round exchange) from member on Network Element C
5-class 2 (two round exchange) from member on Network Element D
6-class 2 (two round exchange) from member on Network Element E
The packet type defines the contents of the ring buffer data cell. Data packets are the
normal mode of communication between virtual groups. Data packets are treated as a
contiguous stream of 64 bytes. There is no structure enforced by the NE on data packets. The
other packet types, however, have specific formats that must be adhered to as described in
Section 4.4.3.3. The following are the current valid packet types:
0-data
1-configuration table update
2-transient NE recovery
3-voted reset
The mode determines how the packet is to be distributed. Two modes are supported:
normal (bit 7 is cleared) and broadcast (bit 7 is set). In the normal mode, the packet is de-
livered to the virtual group specified in the ToVID field. In broadcast mode, all processors
(including the sender), regardless of whether or not they are a member of an active virtual
group, will receive a copy of the packet. The ToVID field is ignored for broadcast packets.
Not all packet classes are allowed in all circumstances. The following outlines the
packet exchange rules.
Only a fault-masking group is allowed to send a packet with exchange class of 1.
Only a fault-masking group is allowed to send a CT update packet.*
The CT update packet must be exchanged using a class 1 (one round) exchange.*
Only a fault-masking group is allowed to send a voted reset packet.*
The voted reset packet must be exchanged using a class 1 exchange.*
Any virtual group can send a class 0 or class 2 packet.
Any virtual group can send a data packet or an isync packet.
Any virtual group can send a normal packet.
Only a fault-masking group is allowed to send a broadcast packet.*
* Unless the HLF (higher-life-form) bit in the scoreboard is not set.
Undefined fields and values in the packet class are reserved. Undefined fields must be
set to zero.
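The packet exchange rules listed above can be expressed as a checker. This is a sketch under
stated assumptions: the function name and argument list are hypothetical, the exchange class
argument is the class number (0, 1, or 2) rather than the encoded field value, and the
starred rules are enforced only when the HLF bit is set, as the footnote states.

```python
DATA, CT_UPDATE, TNR, VOTED_RESET = 0, 1, 2, 3  # packet types from the text


def rule_violations(is_fmg: bool, exchange_class: int, packet_type: int,
                    broadcast: bool, hlf_set: bool) -> list:
    """Return the packet exchange rules a send request would violate."""
    violations = []
    # Unstarred rule: class 1 exchanges always require a fault-masking group.
    if exchange_class == 1 and not is_fmg:
        violations.append("class 1 exchange requires a fault-masking group")
    # Starred rules: waived when the HLF bit is not set.
    if hlf_set:
        if packet_type in (CT_UPDATE, VOTED_RESET):
            if not is_fmg:
                violations.append("packet type requires a fault-masking group")
            if exchange_class != 1:
                violations.append("packet type requires a class 1 exchange")
        if broadcast and not is_fmg:
            violations.append("broadcast requires a fault-masking group")
    return violations
```

A simplex sending a class 0 voted reset violates two rules while the HLF bit is set and none
once it is cleared.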
4.4.3.2.1.3.2. ToVID
The ToVID is an 8-bit field specifying the virtual group to which the packet is to be
sent. The VID numbers may range from 0 to 255. Not all VID numbers will be valid, since
there will be at most 40 active virtual groups in the system. If the NE detects an attempt to
send to a non-existent virtual group, the packet is removed from the sending virtual group's
output buffer and discarded.
The ToVID field is ignored for broadcast packets.
4.4.3.2.1.3.3. FromVID
The FromVID field is an 8-bit field specifying the virtual group that sent the packet.
The VID numbers may range from 0 to 255. This field is always valid.
4.4.3.2.1.3.4. User Field
The user field is an 8-bit field for arbitrary use by the processor. The value in the user
field is exchanged and voted along with the SERP data. SERP voting rules are used on the
user field instead of standard class 1 voting rules. The user field can be used to send out-
of-band data between virtual groups.
4.4.3.2.1.3.5. Vote errors
Vote errors indicate if the data emanating from a participant during the packet exchange
disagreed with the majority in any way. For class 1 packets (one round exchanges), the
syndrome bits are only defined for NEs on which the virtual group has members. For class
2 packets (two round exchanges), the syndrome bits are defined for all NEs except the NE
on which the source member resides. Undefined syndrome bits will be cleared. The format
of the vote error field is shown in Figure 4-10. The vote error field is in the relative NE
format.
7 6 5 4 3 2 1 0
Figure 4-10. Vote Error Field
4.4.3.2.1.3.6. Clock Errors
Clock errors indicate that sometime since the last packet was exchanged by the NE, the
FTC signal from the indicated NE fell outside the allowable skew window. A clock error
signals a potential problem with the indicated NE or the cable linking the indicated NE with
the local NE. The format of the clock error field is shown in Figure 4-11. The clock error
field is in the relative NE format. The bit corresponding to the local NE (bit 0) is undefined.
7 6 5 4 3 2 1 0
Figure 4-11. Clock Error Field
4.4.3.2.1.3.7. Link Errors
Link errors indicate that sometime since the last packet was exchanged by the NE, an
error was detected on the indicated fiber-optic link. An error detected by the TAXI receiver
devices is indicated by assertion of the VLTN (violation) signal, which usually indicates
loss of synchronization with the transmitter. A link error signals a potential problem with
the indicated NE or the cable linking the indicated NE with the local NE. The format of the
link error field is shown in Figure 4-12. The link error field is in the relative NE format.
The bit corresponding to the local NE (bit 0) is undefined.
7 6 5 4 3 2 1 0
Figure 4-12. Link Error Field
4.4.3.2.1.3.8. OBNE timeout
The OBNE timeout (Output Buffer Not Empty) field indicates that the members of the
source virtual group corresponding to the set bits did not request to send the packet within
the allowable timeout skew. These members are considered desynchronized from the other
members of their virtual group until a reintegration procedure is performed on the virtual
group.
The format of the OBNE timeout field is shown in Figure 4-13. The OBNE timeout
field is in the absolute NE format. Only bits corresponding to NEs on which the source
virtual group has members are defined.
7 6 5 4 3 2 1 0
Figure 4-13. OBNE Timeout Field
4.4.3.2.1.3.9. IBNF timeout
The IBNF timeout (Input Buffer Not Full) field indicates that the members of the desti-
nation virtual group corresponding to the set bits did not free enough space in their input
buffers to hold the incoming packet within the allowable timeout skew. These members are
considered desynchronized from the other members of their virtual group until a reintegra-
tion procedure is performed on the virtual group.
The format of the IBNF timeout field is shown in Figure 4-14. The IBNF timeout field
is in the absolute NE format. Only bits corresponding to NEs on which the destination vir-
tual group has members are defined. The IBNF timeout field is undefined for broadcast
packets.
7 6 5 4 3 2 1 0
Figure 4-14. IBNF Timeout Field
4.4.3.2.1.3.10. Scoreboard Vote Error
A scoreboard vote error indicates that the corresponding virtual group member did not
agree with the majority regarding the type of packet to be exchanged. Scoreboard vote er-
rors are only collected on data contained in the SERP, which includes the packet class, the
destination virtual group (ToVID field), and the user field.
The format of the scoreboard vote error field is shown in Figure 4-15. The scoreboard
vote error field is in the absolute NE format. Only bits corresponding to NEs on which the
source virtual group has members are defined. The scoreboard vote error field is generated
by the scoreboard and reflects discrepancies observed by the scoreboard during SERP
processing. Vote errors occurring in the data path voters during the voting of the SERP
during the second round of the SERP exchange are not detected.
7 6 5 4 3 2 1 0
NE E | NE D | NE C | NE B | NE A
Figure 4-15. Scoreboard Vote Error Field
4.4.3.2.1.3.11. Timestamp
The timestamp field is a 32 bit field representing the time that the packet was ex-
changed. To be exact, it is the time at which the scoreboard determined that a valid packet
condition existed to allow the packet to be exchanged. The timestamp value is determined
from the global synchronous timer. This timer is initialized during ISYNC or recovery and
increments synchronously with the FTC. The timer wraps around from 0xFFFFFFFF to
0x0 with no indication. The wraparound period is over 90 minutes, which should give the
operating system ample time to detect wraparound. The resolution of the least-significant
bit of the timestamp is 1.28 µs.
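The wraparound figure quoted above can be checked, and the handling of a wrapped timestamp illustrated, with a short sketch. It is not taken from the AFTA software; the function names are hypothetical, and the elapsed-time calculation assumes that fewer than 2^32 ticks (one full wraparound period) separate the two timestamps.

```python
# Illustrative sketch only: names are hypothetical, not AFTA system software.
TICK_US = 1.28        # resolution of the least-significant bit (from the text)
MODULUS = 1 << 32     # the 32-bit timer wraps from 0xFFFFFFFF to 0x0


def wrap_period_minutes() -> float:
    """Time taken for the 32-bit timer to wrap once, in minutes."""
    return MODULUS * TICK_US / 1e6 / 60.0


def elapsed_ticks(earlier: int, later: int) -> int:
    """Ticks from `earlier` to `later`.

    Assumes fewer than 2**32 ticks separate the two timestamps, so a single
    wraparound is absorbed by the modular subtraction.
    """
    return (later - earlier) % MODULUS


def elapsed_us(earlier: int, later: int) -> float:
    """Elapsed time in microseconds between two timestamps."""
    return elapsed_ticks(earlier, later) * TICK_US
```

Because the subtraction is taken modulo 2^32, software that compares timestamps more often than once per wraparound period does not need an explicit wraparound indication.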
4.4.3.2.2. Buffer Manager
The buffer manager controls the status of the input and output buffers for each of the 8
processors connected to the NE. Processors use the output buffers to send packets through
the Network Elements. The Network Elements use the input buffers to deliver packets to a
processor.
Each buffer is owned by either the processor or the Network Element. Ownership de-
pends on the type and the state of the buffer. A processor or the NE must only access
buffers which it owns. Output buffers are initially owned by their associated processor,
and input buffers are initially owned by the Network Element. A port can temporarily relin-
quish ownership of a buffer by invoking the SEND operation in the buffer manager. The
buffer is returned to the original owner when the second port invokes the RETURN opera-
tion.
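The ownership rules can be modeled in a few lines. The sketch below is a software illustration only; the class and method names are hypothetical, and the checks it performs are a modeling convenience rather than a description of behavior implemented by the buffer manager.

```python
# Illustrative model of buffer ownership; all names here are hypothetical.
from enum import Enum


class Port(Enum):
    PROCESSOR = "processor"
    NETWORK_ELEMENT = "network element"


def other(port: Port) -> Port:
    """Return the port on the opposite side of the buffer."""
    return Port.NETWORK_ELEMENT if port is Port.PROCESSOR else Port.PROCESSOR


class Buffer:
    """A buffer with an original owner and a current owner."""

    def __init__(self, original_owner: Port) -> None:
        self.original_owner = original_owner
        self.owner = original_owner

    def send(self, caller: Port) -> None:
        """SEND: the original owner temporarily relinquishes the buffer."""
        if caller is not self.original_owner or self.owner is not caller:
            raise PermissionError("SEND requires the original owner to hold the buffer")
        self.owner = other(caller)

    def return_buffer(self, caller: Port) -> None:
        """RETURN: the second port hands the buffer back to its original owner."""
        if caller is self.original_owner or self.owner is not caller:
            raise PermissionError("RETURN requires the second port to hold the buffer")
        self.owner = self.original_owner

    def check_access(self, caller: Port) -> None:
        """Raise if `caller` touches a buffer it does not currently own."""
        if caller is not self.owner:
            raise PermissionError(f"{caller.value} does not own this buffer")
```

In this model an output buffer is created with the processor as original owner and an input buffer with the Network Element as original owner, matching the initial ownership stated above.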
The buffer manager is accessed by the processor using the VMEbus. The processor
status/control ports on the buffer manager are mapped to locations in the VMEbus address
space as shown in Figure 4-16.
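Address   Contents
0000      Processor 0 BufMgr
0100      Processor 1 BufMgr
0200      Processor 2 BufMgr
0300      Processor 3 BufMgr
0400      Processor 4 BufMgr
0500      Processor 5 BufMgr
0600      Processor 6 BufMgr
0700      Processor 7 BufMgr

Detail of one processor's BufMgr block (addresses 0300 through 0314 as drawn):
Input BufMgr     offset 0: ready IB (0300)    offset 4: return IB (0304)
Output BufMgr    offset 0: next OB (0310)     offset 4: send OB (0314)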
Figure 4-16. Buffer Manager Memory Map
For output buffers, the processor has access to the next and send ports. Reading the
next port yields the OBF bit for the output buffer. Accessing (either reading or writing) the
send port activates the SEND operation, which transfers ownership of the output buffer to
the Network Element. The format of the next port is shown in Figure 4-17. The OBF
(Output Buffer Full) bit is cleared if the buffer is empty and set if the buffer is full. The data
read from or written to the send port is meaningless and must be ignored.
31 28 24 20 16
15 12 8 4 0
Figure 4-17. Next Port Format
For input buffers, the processor has access to the ready and return ports. Reading the
ready port yields the IBE bit and the Ready-IB pointer. Accessing (either reading or writ-
ing) the return port activates the RETURN operation, which transfers ownership of the in-
put buffer back to the Network Element. The format of the ready port is shown in Figure 4-
18. The IBE (Input Buffer Empty) bit is cleared if the input ring buffer contains at least one
unread packet and is set if the ring buffer is empty. The Ready-IB pointer indicates which
ring buffer cell contains the oldest unread packet. The data read from or written to the re-
turn port is meaningless and must be ignored.
31 28 24 20 16
IBE
15 12 8 4 0
Ready IB
Figure 4-18. Ready Port Format
System software on the PE must be written to follow the ownership rules of buffer
cells in the data segment. Failure to adhere to the ownership rules described above could
result in overwriting an incoming packet or, in the worst case, the desynchronization of the
NE.
4.4.3.3. Packet Formats
The following section defines the four types of packets that can be sent through the
Network Element.
4.4.3.3.1. Data Packet
A data packet can be exchanged using any combination of the available exchange
classes and modes. The format of a class 1 or a class 2 data packet is simply a contiguous
string of 64 bytes. Any structure imposed on a data packet is done so by the Network Ele-
ment driver software. A class 0 data packet has no data.
4.4.3.3.2. CT Update Packet
The CT update packet is used to modify the configuration table on the Network Ele-
ments. The configuration table (CT) is contained within the Network Element. The CT de-
scribes the mapping of physical processing sites into virtual groups. Processors make
changes to the CT using the CT update packet.
The format of a CT update packet is shown in Figure 4-19. A CT update packet can
update from 0 to 8 CT entries. Unused bits in the CT update entries (shown as shaded re-
gions in the figures below) must be set to zero.
CT update packet (eight 8-byte entries):
    CT Entry 0
    CT Entry 1
    CT Entry 2
    CT Entry 3
    CT Entry 4
    CT Entry 5
    CT Entry 6
    CT Entry 7
CT entry (8 bytes, in order):
    VID
    Redundancy Level
    PE Mask
    Timeout
    Member 0
    Member 1
    Member 2
    Member 3
Figure 4-19. CT Update Packet Format
Each CT entry consists of 8 contiguous bytes. The first byte of the CT entry in the
packet indicates which virtual group is to be updated. The next byte, shown in Figure 4-20,
specifies the redundancy level to be used for the virtual group. The redundancy level (0-4)
indicates how many members are to be included in the virtual group. A redundancy level of
0 indicates an inactive group.
7 6 5 4 3 2 1 0
| (unused, shaded) | redundancy level |
Figure 4-20. Redundancy Level Field
The next byte, shown in Figure 4-21, is the PE mask. This field is used to mask out selected
members of the virtual group during data voting. A set bit in the PE mask indicates
that the corresponding member should be included in the data vote. Bits corresponding to
NEs with no members in the virtual group must be cleared. The virtual group member is
only masked during voting of data; the member is still considered during timeouts for the
OBNE and IBNF conditions, and during scoreboard data voting. To eliminate the timeout
penalty, the virtual group should be reconfigured to eliminate the faulty member either by
reducing the redundancy level of the virtual group or by incorporating a spare processor to
replace the faulty one.
7 6 5 4 3 2 1 0
Figure 4-21. PE Mask Field
The next byte in the CT update entry is the timeout value. This value selects the timeout
to be used for the virtual group when calculating the OBNE and IBNF conditions. The
timeout is specified with a resolution of 1.28 µs. The maximum timeout value is 326.4 µs
(timeout field = 255). A timeout value of zero enables an infinite timeout for the virtual
group. The timeout field is calculated using:
$$\text{timeout field} = \left\lfloor \frac{\text{timeout value}}{1.28\ \mu\text{s}} \right\rfloor$$
Following the timeout byte is a list of the processors which make up the virtual group.
The processors are specified in a format that uniquely identifies a single processor in the
cluster. The format for the processor specification field is shown in Figure 4-22. The abso-
lute NEID refers to the FCR containing the specified processor. The PEID refers to a single
processor within the FCR. The processor listing starts with the 5th byte in the entry and
continues until enough processors are specified to satisfy the redundancy level. Any extra
bytes are unused. However, to avoid vote errors, these unused bytes must be defined.
7 6 5 4 3 2 1 0
Figure 4-22. Processor Specification Field
VID #255 is unique in that it refers to the Network Element rather than an ensemble of
processors. Updating VID #255 in the CT packet is used to modify the NE mask. The NE
reads a new NE mask from the location normally reserved for the PE mask. All other en-
tries for VID #255 are unused. The format of the NE mask field is shown in Figure 4-23.
7 6 5 4 3 2 1 0
| src unlk | NEE | NED | NEC | NEB | NEA |
Figure 4-23. NE Mask Format
The bit field <4:0> in the NE mask performs the same function as the corresponding bit
field in the PE mask. Setting a bit in this bit field enables the Network Element. Clearing
the bit disables the Network Element's data and clock inputs. If a Network Element is disabled,
all processors connected to that Network Element are also disabled, even if their PE
mask bit is set.
The src unlk bit in the NE mask allows the source of a two round (or source congru-
ency) exchange to be enabled on voting of the packet. Enabling the source during voting of
source congruency packets when the system is in a degraded mode (3 or fewer Network
Elements) allows additional anticipated failures to be tolerated. In a fault-masking mode
(either 4 or 5 Network Elements enabled), the src unlk bit must always be cleared. Setting
this bit violates the rules for Byzantine resilience and may allow single point failures to disrupt
the system. However, in a degraded mode, Byzantine resilience is undefined, so setting
the src unlk bit is permissible.
A CT update packet must have exactly eight valid CT entries. CT entries can be repeated
in the CT update packet with no undesirable (or noticeable) consequences. Also,
unused virtual groups (those with a redundancy level of 0) can be used to pad the CT update
packet to a full 8 entries.
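The entry layout and padding rule described in this section can be sketched as follows. The function names, the `pad_vid` parameter, and the choice of zero for unused member bytes are assumptions made for illustration. Member bytes are treated as opaque values, and the timeout is taken in integer nanoseconds so that the division by the 1.28 µs resolution is exact.

```python
# Illustrative sketch of CT update packet assembly; names are hypothetical.
TICK_NS = 1280            # 1.28 microsecond timeout resolution, in nanoseconds
ENTRY_BYTES = 8
ENTRIES_PER_PACKET = 8
MAX_MEMBERS = 4           # member bytes occupy bytes 5 through 8 of an entry


def timeout_field(timeout_ns: int) -> int:
    """Timeout byte for a finite timeout (a field value of 0 selects an infinite timeout)."""
    field = timeout_ns // TICK_NS
    if not 1 <= field <= 255:
        raise ValueError("finite timeout must map to a field value of 1..255")
    return field


def ct_entry(vid: int, redundancy: int, pe_mask: int, timeout: int,
             members: list[int]) -> bytes:
    """Build one 8-byte CT entry: VID, redundancy level, PE mask, timeout, members."""
    if not 0 <= redundancy <= MAX_MEMBERS:
        raise ValueError("redundancy level must be 0..4")
    if len(members) != redundancy:
        raise ValueError("one member byte is required per redundancy level")
    # Unused member bytes must be defined; zero is an assumption made here.
    padded = list(members) + [0] * (MAX_MEMBERS - redundancy)
    return bytes([vid, redundancy, pe_mask, timeout, *padded])


def ct_update_packet(entries: list[bytes], pad_vid: int) -> bytes:
    """Pad the entries to a full packet using an inactive entry for `pad_vid`.

    `pad_vid` must name a virtual group that is unused, since the padding
    entry gives it a redundancy level of 0.
    """
    if len(entries) > ENTRIES_PER_PACKET:
        raise ValueError("a CT update packet holds at most 8 entries")
    if any(len(entry) != ENTRY_BYTES for entry in entries):
        raise ValueError("each CT entry must be exactly 8 bytes")
    pad = ct_entry(pad_vid, 0, 0, 0, [])
    return b"".join(entries) + pad * (ENTRIES_PER_PACKET - len(entries))
```

A triplex group with a 100 µs timeout would be built as `ct_entry(5, 3, 0b00111, timeout_field(100_000), [m0, m1, m2])`, where `m0` through `m2` stand for processor specification bytes in the format of Figure 4-22.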
4.4.3.3.3. Transient NE Recovery Packet
The TNR packet is used to invoke the transient NE recovery (TNR) procedure. The
TNR procedure is used to reintegrate a desynchronized FCR. NEs which are reset (by a
watchdog timer, voted reset, or other means), suffer a power supply interruption, or are
powered on after the other NEs have completed ISYNC enter the TNR phase during the
boot procedure. By performing transient NE recovery, a working group of NEs can syn-
chronize a new NE with the working group.
The TNR procedure is invoked by sending a TNR packet. The format of a TNR packet
for transmit is undefined. A TNR packet cannot be used to send user data to another virtual
group. A TNR packet can be of any exchange class, and can be either normal or broadcast
mode. Regardless of the exchange class or the data specified in the output data buffer, 64
bytes, as described in Figure 4-24, will be delivered to the recipient virtual group(s).
The first 5 entries (4 bytes each) of the TNR receive packet are the TNR messages as
sourced by each Network Element. The expected TNR message for each NE is
0x0AEC6BF0.
M(A)
M(B)
M(C)
M(D)
M(E)
Result
Figure 4-24. TNR Receive Packet Format
The next entry in the TNR receive packet is a byte which indicates which NEs sourced
the expected TNR message. If the bit in the result byte corresponding to a particular NE is
set, the message entry for that NE should be the expected TNR message.
7 6 5 4 3 2 1 0
Figure 4-25. TNR Result Byte Format
The bytes in the TNR receive packet following the result byte are undefined.
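A recipient might interpret the TNR receive packet along the following lines. This is a sketch under stated assumptions: the byte order of each 4-byte message is taken to be big-endian, the result byte is assumed to sit at offset 20 immediately after the five messages, and bit 0 of the result byte is assumed to correspond to NE A as in the other absolute NE format fields. The function name is hypothetical.

```python
# Illustrative sketch of TNR receive packet interpretation; layout details
# beyond those stated in the text are assumptions.
EXPECTED_TNR_MESSAGE = 0x0AEC6BF0   # expected message value, from the text
NE_NAMES = "ABCDE"


def parse_tnr_packet(packet: bytes, byteorder: str = "big") -> dict:
    """Return, per NE, the sourced message, the result-byte flag, and a direct comparison."""
    if len(packet) != 64:
        raise ValueError("a TNR receive packet is 64 bytes long")
    result = packet[20]   # assumed: result byte follows the five 4-byte messages
    report = {}
    for index, name in enumerate(NE_NAMES):
        message = int.from_bytes(packet[4 * index:4 * index + 4], byteorder)
        report[name] = {
            "message": message,
            "flagged": bool((result >> index) & 1),   # assumed: bit 0 is NE A
            "matches": message == EXPECTED_TNR_MESSAGE,
        }
    return report
```

Comparing the `flagged` and `matches` values for each NE gives a simple consistency check between the result byte and the message entries.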
4.4.3.3.4. Voted Reset Packet
The VRESET packet is used to perform the voted reset or monitor interlock operation.
Voted resets enable a working group of FCRs to reset selected pieces of hardware in another
FCR, presumably one which has suffered a transient fault or otherwise lost synchronization
with the working group. The voted reset operation assumes that certain parts of the
FCR to be reset are functional. Therefore, the voted reset function is a best-effort function;
it is not guaranteed to work in all situations. There is no risk of catastrophic failure to the
FCRs performing the voted reset.
The format of the VRESET packet is shown in Figure 4-26. The transmit packet and
receive packet are identical. The first byte of the packet contains the VRESET command.
The remaining 63 bytes in the packet are user defined.
VRESET Command (first byte) | user defined (remaining 63 bytes)
Figure 4-26. VRESET Packet
The VRESET packet is always exchanged using a class 1 exchange protocol. Specify-
ing a class 0 or class 2 exchange class in the packet class field is illegal. Only fault-masking
groups are allowed to source a VRESET packet. The packet is delivered to the destination
virtual group (or all virtual groups, in the case of a broadcast) as any other standard class 1
data packet. The VRESET command is also loaded into the VRESET transmitter after being
voted on all NEs. The contents of the VRESET transmitter are sent to the NE to be reset on
the next falling edge of the FTC. The VRESET command is deleted from the transmitter on
the rising edge of the FTC.
The format of the VRESET command byte is shown in Figure 4-27. The FCR bit field
indicates whether a single element (bit=0) or the entire FCR (bit=1) is to be reset. The
NEID field selects one of the 5 FCRs, using the absolute addressing mode, on which to
perform the voted reset. Valid values for the NEID field range from 0 to 4. The PE/IOC bit
field selects whether a PE (bit=0) or an IOC (bit=1) is to be reset or interlocked, respectively.
The element number field selects one of the processors or I/O elements to be reset.
The element number field and the PE/IOC bit must be set to zero if the FCR bit is set.
Undefined VRESET commands are ignored.
7 6 5 4 3 2 1 0
| FCR | NEID | PE/IOC | PE/IOC Number |
Figure 4-27. VRESET Command Byte
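The field rules for the VRESET command byte can be illustrated with a small encoder. The bit positions used here are an assumption inferred from the field order in Figure 4-27 and the stated value ranges: FCR in bit 7, NEID in bits 6:4, PE/IOC in bit 3, and the element number in bits 2:0. The function name is hypothetical.

```python
# Illustrative encoder for the VRESET command byte; bit positions are assumed.
def vreset_command(fcr_reset: bool, neid: int, ioc: bool = False,
                   element: int = 0) -> int:
    """Pack the VRESET command fields into a single byte."""
    if not 0 <= neid <= 4:
        raise ValueError("NEID must be in the range 0..4")
    if not 0 <= element <= 7:
        raise ValueError("element number must fit in three bits")
    if fcr_reset and (ioc or element != 0):
        # The text requires both fields to be zero when the FCR bit is set.
        raise ValueError("PE/IOC bit and element number must be zero for an FCR reset")
    return (int(fcr_reset) << 7) | (neid << 4) | (int(ioc) << 3) | element
```

Under these assumptions, resetting all of the FCR with NEID 2 encodes as 0xA0, and interlocking IOC 5 in the FCR with NEID 4 encodes as 0x4D.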
4.5. AFTA Component Physical Descriptions
The AFTA is designed for application in hostile embedded environments, such as mili-
tary avionics bays, ground vehicles, or launch vehicles. As such, the AFTA is designed to
meet stringent military requirements for such environments. The military specifies a num-
ber of standards that can be used to build military-qualified hardware. Two examples are
the Standard Army Vetronics Architecture (SAVA) for use in ground vehicles and the Joint
Integrated Avionics Working Group (JIAWG) advanced avionics architecture (A3) for air-
craft. This report studies AFTA designs based on these two standards. It must be noted,
however, that the selection of these standards is made for concreteness of presentation; the
AFTA design is not irrevocably tied to any such decision, nor is any endorsement by
CSDL of these standards for use in the AFTA to be implied.
The AFTA is composed of either four or five fault containment regions (FCRs), each of
which is housed in a line replaceable unit (LRU). The terms LRU and FCR are used inter-
changeably in this report. Figure 4-28 depicts a block diagram of an FCR. The LRU en-
closure comprises a Faraday cage with environmental and EMI-resistant gaskets on all ac-
cess hatches and ports. Each LRU contains a number of line replaceable modules (LRMs).
An LRM is either a Processing Element (PE), network element (NE), input/output con-
troller (IOC), or power conditioner (PC). Except for the power conditioner, an LRM is
usually comprised of a single circuit board.
The LRMs are interconnected by a backplane bus for data exchange and power distribution.
The PEs access the AFTA Network Element and I/O controllers over the bus. The
backplane bus may use redundancy (for instance, a dual PI-Bus) for additional FCR relia-
bility, or a split bus (for example, a VMEbus with VME Subsystem Bus (VSB)) for en-
hanced throughput. For a SAVA-based AFTA, the backplane bus is the System Backplane
Bus (SBBUS) based on the VMEbus specification; a JIAWG-based AFTA uses the PI-
Bus.
[Block diagram. Labels: From Vehicle Power Buses; Fault Tolerant Data Bus; Backplane Bus; To/From other NEs]
Figure 4-28. AFTA FCR Architecture
Hardware faults are isolated to an LRM using the fault detection, identification, and
testing methods detailed in Section 5.6. Each LRM is separately removable from the AFTA
according to the maintenance procedure outlined in Section 7. Live insertion or removal of
an LRM depends on whether or not the backplane bus in use in the particular AFTA im-
plementation supports live insertion/removal. If live insertion/removal is not possible, the
FCR of which the LRM is a part must be powered down before replacing the LRM. The
LRMs and LRUs are packaged for exposure to a forward operating environment to permit
field replacement.
Figure 4-29 shows the overall dimensions of a SAVA-based AFTA FCR (LRU). This
diagram does not show the power conditioners. Note the fiber optical bundle emanating
from the connectors on the front of the enclosure.
[Drawing. Labels: Network Element; Conditioner; other FCRs]
Figure 4-29. SAVA-based AFTA FCR
Figure 4-30 shows the overall dimensions of a SAVA-based single-card LRM. The
96-pin DIN connectors (3 rows) may be modified to contain 4 rows (128 pins) if the LRM
is to reside in an SBBUS section of the FCR; the polyimide multi-layer board (MLB) may
be changed as well depending on military qualification criteria.
[Drawing. Dimensions: 9.187 in. by 6.299 in. Labels: Integrated Circuits (Representative Layout); Polyimide MLB; 96-Pin DIN Connectors]
Figure 4-30. SAVA-based AFTA LRM
For comparison purposes, Figures 4-31 and 4-32 depict the physical dimensions of
an AFTA fabricated to the JIAWG A3 (version 3.1) standards. The main differences visible
at the current level of detail, besides differences in the bus interface, are that the LRMs
are double-sided SEM-E cards with 250 pins available for connecting to the FCR backplane.
Figure 4-31 shows an LRU based on this packaging standard, while Figure 4-32
shows the dimensions of the LRM.
[Drawing. Labels: Gasket; Conduction or Liquid-Cooled SEM-E Module with Circuit Boards and Components on Both Sides; Baseplate Mounted and Cooled Chassis with Circular-MIL Connectors on One End and Single Side Access for Modules; Network Element; Length Dependent on I/O and Processing Suite; Optical Fiber Bundle to Other FCRs; Optical Fiber Bundle from Other FCRs; 12 in.]
Figure 4-31. JIAWG-based AFTA FCR
[Drawing. Dimensions: 5.88 in. by 6.4[illegible] in. Labels: Leaded Surface Mount Components (Representative Layout); Polyimide MLB; Modular Connector Frame, Type IV, Format E, C-28754/92-2; Modular Connector, Type IV, 250 Contacts, L-C-28754/101-2]
Figure 4-32. JIAWG-based AFTA LRM
4.5.1. Processing Element (PE) Characteristics
The Processing Elements (PEs) are the computational sites in the AFTA. Multiple PEs
can be grouped to form a virtual simplex processing site for increased reliability. Non-criti-
cal tasks can be executed by a single Processing Element to maximize utilization of the pro-
cessing resources. The mapping of PEs into virtual groups, or VGs, is maintained by the
Network Elements. The mapping can be changed in real-time upon a request from the PEs.
The AFTA may contain more than one type of PE in any given AFTA implementation.
The PEs for the AFTA are generally assumed to be available as non-developmental items
(NDI), and may vary widely from implementation to implementation. The choice of PE
and instruction set architecture (ISA) are completely up to the user of the AFTA. A func-
tional block diagram of the typical components comprising a generic AFTA PE is shown in
Figure 4-33. The PE contains a central processing unit (CPU) with an optional floating-
point unit; multiple CPUs may reside on the PE LRM and may or may not comprise mem-
bers of an AFTA VG, at the user's option. Typically, a local bus is used to communicate
with on-board RAM, ROM, and I/O devices. Timers and oscillators are provided to gen-
erate a local time-of-day clock, time interval measurements, and timer-based interrupts.
Optional local I/O may reside on the PE LRM. A system bus interface is used to gain ac-
cess to the NE over the backplane bus.
[Block diagram. Labels: CPU/FPU; RAM; ROM; Timers/Oscillators; Local I/O (Optional); Local Bus; System Bus Interface]
Figure 4-33. Functional Block Diagram of an AFTA Processing Element
For illustration purposes, three NDI PEs have been selected for use in the AFTA:
Radstone PMV 68M CPU-3A
Lockheed Sanders STAR MVP
SAVA GPPM
It should be stressed that selection of these PEs is made for concreteness of presenta-
tion; the AFTA is not irrevocably tied to any such selection, nor is any endorsement by
CSDL of these PEs for use in the AFTA to be implied. It should be borne in mind at all
times that the AFTA architecture is very flexible; a wide range of PEs can be used in the
AFTA, both in initial installations and in P3Is. Selection of appropriate Processing Elements
for a particular AFTA application should be made based on throughput, commonality,
compatibility, and other requirements as dictated by the particular situation.
Relevant characteristics of the selected PEs were gained from preliminary engineering
documentation provided from the vendors and do not reflect commitments on the part of
CSDL or the vendors.
The Radstone PMV 68M CPU-3A single board computer [Rad90] has the characteristics
listed below.
Processor Type                  25 MHz 68030/68882 FPU
Throughput                      DAIS  Whetstones  Dhrystones  VUPS  MIPS
                                2.576M*  5.13M*  8.95M*  8†
Memory                          1.5 Mbyte SRAM, 512KByte EPROM
Weight (Estimated)              2 pounds
Power (Estimated)               25W
Volume                          Height: 9.187 inches
                                Depth: 6.299 inches
                                Thickness: 0.063 inches (board), 0.800 inches (front panel)
Failure rate                    16,982h MTBF at Ground, Mobile, 45°C
Operating Temperature Range     -55 to +85°C
Storage Temperature Range       -62 to +125°C
Relative humidity (Operating)   0% to 95%, MIL-STD-810D Method 507.4 Procedure III
Cooling Requirements            Conduction cooling through thermal management layer to short card
                                edges; wedge-lock connection to ATR enclosure
Cost (1991)                     $23K (in quantity)
Table 4-1. Characteristics of Radstone PMV 68M CPU-3A Processing Element
* Calculated by Draper.
† Obtained from vendor literature.
The Lockheed Sanders STAR MVP single board computer [San90] has the characteristics
listed below.
Processor Type                  25 MHz R3000/R3010 FPU
Throughput                      DAIS  Whetstones  Dhrystones  VUPS  MIPS
                                20†
Memory (typical)                16 Mbyte DRAM, 256KByte Cache, 1MByte EPROM
Weight                          2 pounds
Power                           20W
Volume                          Height: 9.187 inches
                                Depth: 6.299 inches
                                Thickness: 0.063 inches (board), 0.8 inches (front panel)
Failure rate                    32,000h MTBF at Airborne, Uninhabited, 40°C
Operating Temperature Range     -54 to +55°C, continuous
Storage Temperature Range       -62 to +85°C
Relative humidity (Operating)   100%, condensing
Cooling Requirements            Conduction cooling through thermal management layer to short card
                                edges; wedge-lock connection to ATR enclosure
Cost (1991)                     $29K (quantity 1)
Table 4-2. Characteristics of Lockheed Sanders STAR MVP Processing Element
† Obtained from vendor literature.
The SAVA General Purpose Processing Module (GPPM) [MIL-STD-344] has the
characteristics listed below.
Processor Type                  16 MHz 68020/68881 FPU
Throughput                      DAIS  Whetstones  Dhrystones  VUPS  MIPS
                                1.03M*  2.05M#  3.58M#  3.2††
Memory                          128KByte SRAM, 128KByte EEPROM, 64KByte ROM, 4KByte DPRAM
Weight                          <2.25 pounds
Power                           <15W
Volume                          Height: 9.187 inches
                                Depth: 6.299 inches
                                Thickness: 0.063 inches (board), 0.800 inches (front panel)
Failure rate                    >31,000h MTBF at Ground, Mobile, 85°C @ module edge
Operating Temperature Range     -31 to +78°C (NB: evidently inconsistent with failure rate spec)
Storage Temperature Range       -57 to +85°C
Relative humidity (Operating)   94% @ 149°F
Cooling Requirements            Conduction cooling to short card edges
Cost (1991)                     $10K
Table 4-3. Characteristics of SAVA GPPM Processing Element
* Calculated by Draper.
# Measured using XDAda compiler.
†† Obtained from vendor literature.
4.5.2. Network Element (NE) Characteristics
The Network Elements (NEs) form the core of the AFTA. Each FCR must contain an
NE; the NE connects the FCR to all other FCRs in the AFTA cluster. The NEs include
hardware to implement functions necessary for the Byzantine resilient properties of the
AFTA. These functions include data exchanges, synchronization, syndrome recording, and
monitor interlocks.
The Network Element design presented in this section is designed for use with the
VMEbus backplane bus. Thus, the NE is compatible with both commercial and military
grade VMEbus systems. The NE is also compatible with the SAVA SBBUS, with the
simple replacement of the 96-pin Eurocard connectors specified by the VMEbus with the
128-pin SAVA backplane connectors. It must be noted, however, that the selection of the
VMEbus is made for concreteness of presentation; the AFTA design is not irrevocably tied
to this selection. Since the brassboard version of the NE is being built with a VMEbus
interface, compatibility with military VMEbus systems and SAVA systems is virtually free.
The NE can be used with other bus interfaces (PI-Bus, for example) with the expenditure
of additional design effort.
4.5.2.1. Network Element Overview
A functional block diagram of the NE is depicted in Figure 4-34. The NE is divided
into six major subsections: the VMEbus interface, the NE data paths, the inter-FCR com-
munication system (IFC), the fault-tolerant clock (FTC), the global controller (GC), and
the scoreboard. Most of these sections are tied together by the VDAT bus. This bus is used
to transfer data from the dual-port buffer RAM to the IFC transmitter, from the voter output
to the dual-port buffer RAM, and from the voter output to the scoreboard. The VDAT bus
is also used by the global controller to load initializing parameters into the scoreboard,
fault-tolerant clock, vote mask, and ring buffer manager.
[Block diagram. Labels: VMEbus; Mailbox Lookup Table; Interrupt Queue; Address Register; Dual Port RAM; Ring Buffer Manager; Global Controller; VDAT Bus; Scoreboard DPRAM; Score DPRAM Address Reg.; Voter; Debug Router; Async Data Path Control; Sync Data Path Control]
Key:
A-Address lines
D-Data lines
C-Control lines
Figure 4-34. Functional Block Diagram of the AFTA Network Element
4.5.2.1.1. VMEbus Interface
The VMEbus interface connects the Network Element to the VMEbus in the backplane
of the FCR. The NE is designed so that the VMEbus interface section contains the only
components that are specific to the VMEbus. The NE can be redesigned for a new type of
backplane bus by simply replacing the VMEbus interface section with an interface to the
new bus.
The interface to the VMEbus is further divided into two subsections. The slave interface
is used to transfer data between the PEs and the NE and for accessing the ring buffer man-
ager. The master interface is used to deliver mailbox interrupts to the PEs.
The slave section includes a dual-port buffer RAM for containing packet data either to
be exchanged by the NE, or to be delivered to a PE. Since the buffer memory is a dual-port
device, both sides may access the device simultaneously as long as they don't both access
the same location. The ring buffer manager prevents such contention by assigning owner-
ship to either the PE or the NE on a buffer cell by buffer cell basis. The PE and NE must be
designed to adhere to the ownership specified by the ring buffer manager; there is no hard-
ware enforcement of the ownership rules.
The master interface delivers mailbox interrupts to PEs. A mailbox interrupt can be de-
livered on either a packet transmission, packet reception, or input buffer full condition. Se-
lection of the interrupt condition is done by modifications to the microcode in the global
controller. Most PEs have mailbox interrupt capabilities located in the short address space
(A16) of the VMEbus. If a particular PE for the AFTA does not have mailbox interrupt
capabilities, or the capabilities do not meet the requirements expected by the NE mailbox
interrupt delivery mechanism, that PE cannot take advantage of the mailbox interrupt. However,
it must be noted that the mailbox interrupt capability is optional; a PE can determine
the same information by polling on the ring buffer manager. The purpose of the mailbox
interrupt is to minimize, if possible, the amount of polling required.
4.5.2.1.2. Network Element Data Paths
The data paths of the NE perform the necessary data exchange patterns to correctly ex-
change, vote, and deliver data in the presence of faults. The data paths consist of a voter,
synchronization FIFOs, controllers for the asynchronous and synchronous portions of the
data paths, and a transmitter/receiver pair for voted resets and monitor interlocks.
The voter handles the resolution of multiple copies of data into a single copy. The voter
has five inputs; however only one, three, or four copies are voted at a time. Voting of one
copy simply involves delivering that copy to the output. The voting of three or four copies
requires a bitwise vote of each bit in the redundant copies. The voted result will always be
correct if at most one of the inputs is faulty, unless simplex data is voted.
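The bitwise vote can be illustrated in software. The sketch below is not a description of the voter hardware; the function name is hypothetical, and the resolution of a two-against-two tie among four copies to zero is an assumption, since a tie cannot occur when at most one input is faulty.

```python
# Illustrative software model of a bitwise majority vote over byte-wide copies.
def vote(copies: list[int]) -> int:
    """Vote 1, 3, or 4 byte-wide copies bit by bit and return the voted byte."""
    count = len(copies)
    if count == 1:
        return copies[0]          # a single copy is simply delivered
    if count not in (3, 4):
        raise ValueError("only one, three, or four copies are voted")
    result = 0
    for bit in range(8):
        ones = sum((copy >> bit) & 1 for copy in copies)
        if 2 * ones > count:      # strict majority of copies have this bit set
            result |= 1 << bit
    return result
```

With three copies of which one is corrupted, for example `vote([0x5A, 0x5A, 0xFF])`, every bit of the corrupted copy is outvoted and the result is 0x5A.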
The selection of inputs to the voter is performed by the vote mask. The vote mask con-
tains a number of registers, including the PE mask register, the NE mask register, and the
source mask register. These registers are used in various combinations depending on what
kind of packet exchange primitive is being executed. The loading of the vote mask registers
and mask selection are done by the global controller.
The inputs to the voter come from a bank of five first-in/first-out (FIFO) devices. One
FIFO is associated with each FCR in the system. Reference to a FIFO is made using the
relative addressing mode. The purpose of the FIFOs is to synchronize the data coming into
the NE. The NEs are synchronized to within a predetermined skew; however this skew is
non-zero. Thus, data may arrive from each FCR at slightly different times. The FIFOs are
used to buffer data as it arrives at the NE. The NE uses its internal synchronization refer-
ence, the fault-tolerant clock, to determine when the data is expected to arrive. The data is
not read from the FIFOs until the data is guaranteed to be present (unless faults are pre-
sent.)
The data path controllers shift data into and out of the FIFOs. The asynchronous controller
shifts data into the FIFOs for each of the remote NEs, FIFOs CH1-CH4. The shift-in
signals for the asynchronous FIFOs are derived from the inter-FCR communication
(IFC) system. The synchronous controller shifts data into the FIFO for the local NE, FIFO
CH0, and also shifts data out of all FIFOs. The shift-in signal for the synchronous FIFO,
and the shift-out signals for all FIFOs, are derived from signals emanating from the global
controller (GC).
The final major section of the NE data paths is the voted reset transmitter and receiver.
The voted reset transmitter sends a voted reset command to all other NEs when a voted reset
primitive is executed by the NE. When the NE receives a voted reset command over the
IFC, the command is delivered to the voted reset receiver. The receiver votes the command
received from each FCR and selects one or more discrete signals to assert based on the
voted result. The discrete signal is asserted for a single FTC cycle following the voting of
the voted reset command. The signal can be used to reset an element within the FCR, or to
assert a monitor interlock. The one cycle assertion is sufficient to reset most elements in an
FCR. If an element requires a longer reset signal, or if the discrete signal is to be asserted
indefinitely, the signal must be latched. The latch circuitry is not a part of the NE.
4.5.2.1.3. Inter-FCR Communication System
The Network Elements are interconnected by the inter-FCR communication (IFC) sys-
tem. The IFC system includes a transmitter, a fiber-optic network (not shown in Figure 4-
34), and a bank of receivers.
The transmitter section of the NE converts incoming bytes from the NE into a serial bit-
stream. The data is encoded on the bit-stream using a 4B/5B code [AMD89b] to provide
sufficient transitions to ensure proper operation of the clock recovery circuitry on the re-
ceiver. The bit-stream is then converted to an optical signal for transmission over the fiber-
optic network.
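The effect of a 4B/5B code on transition density can be seen in a short sketch. The symbol table below is the data-symbol table used by FDDI; whether the device cited in [AMD89b] uses exactly this mapping, and the choice to encode the high nibble of each byte first, are assumptions made for illustration. The function names are hypothetical.

```python
# Illustrative 4B/5B encoder using the FDDI data-symbol table (an assumption;
# the mapping used by the device cited in the text is not given here).
FOUR_B_FIVE_B = [
    0b11110, 0b01001, 0b10100, 0b10101,
    0b01010, 0b01011, 0b01110, 0b01111,
    0b10010, 0b10011, 0b10110, 0b10111,
    0b11010, 0b11011, 0b11100, 0b11101,
]


def encode_byte(value: int) -> str:
    """Encode one byte as ten code bits, high nibble first (assumed order)."""
    high, low = value >> 4, value & 0xF
    return f"{FOUR_B_FIVE_B[high]:05b}{FOUR_B_FIVE_B[low]:05b}"


def encode(data: bytes) -> str:
    """Encode a byte string as a string of '0' and '1' code bits."""
    return "".join(encode_byte(byte) for byte in data)


def longest_zero_run(bits: str) -> int:
    """Length of the longest run of consecutive zero bits."""
    return max((len(run) for run in bits.split("1")), default=0)
```

With this table no data symbol begins with more than one zero or ends with more than two, so an encoded data stream never contains more than three consecutive zero bits, which is the property that keeps a clock recovery circuit supplied with transitions.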
The fiber-optic network that interconnects the FCRs within an AFTA cluster provides a
high bandwidth, high isolation interconnection network. Each NE drives a single output
fiber; this fiber goes to a splitter where the optical signal is replicated four times. The four
outputs of the splitter are delivered, one to each NE (all except the original transmitter). The
fiber-optic network contains a splitter for each NE. A diagram of the fiber-optic network is
shown in Figure 4-35.
FCR A
Figure 4-35. Inter-FCR Fiber-Optic Network
The data receivers on the NEs perform the inverse function of the transmitters. First,
the optical signal from the fiber-optic network is converted to an electrical signal. The data
clock is recovered from the serial signal and is used to convert the incoming signal back
into an 8-bit wide data stream. The receiver asserts a signal, based on the data clock, to
signal the asynchronous data path controller that the data on the output of the receiver is
valid.
4.5.2.1.4. Fault-Tolerant Clock
The fault-tolerant clock (FTC) circuit is a free-running digital phase-locked loop. The
FTC in each FCR tries to maintain synchronization with the perceived median FTC signal
from the other FCRs. Adjustments are made by adding or deleting a single clock cycle from
the normal FTC period. All adjustments are made during a known adjustment period; the
NE is designed to tolerate one more or one fewer clock cycles during the adjustment pe-
riod.
A signal called a bound is generated for each remote FCR from the IFC receivers. The
four bound signals are voted by a median-edge voter to select the second observed edge.
The voted signal, called median bound, is compared to the local FTC signal delayed by the
expected inter-FCR delay (from empirical measurements). If the local signal is perceived to
be ahead of the median bound, a self-ahead adjustment is made by adding a single clock
cycle to the adjustment period. If the local signal is perceived to be behind the median
bound, a self-behind adjustment is made by deleting a clock cycle from the adjustment
period.
Figures 4-36 through 4-38 illustrate the three types of FTC adjustments. The
rising edge of the median bound is compared to windows relative to the local FTC signal.
If the rising edge occurs within the normal (N) window, no adjustment is made. If the edge
occurs in the self-ahead (A) window, a self-ahead adjustment is made, and if the edge oc-
curs in the self-behind (B) window, a self-behind adjustment is made. The error window
(E) indicates that the local FTC is skewed too much from the median bound to be consid-
ered synchronized with the other FTCs. The rising edge of median bound should never oc-
cur within the error window except during initial synchronization of the FTC signals.
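The median selection and window comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the window widths are hypothetical parameters (the report does not give them here), times are in clock cycles, and the function names are not taken from the NE design.

```python
NORMAL_PERIOD_CYCLES = 4  # nominal adjustment period shown in Figure 4-36


def median_bound(edge_times):
    """Return the second observed edge among the four remote bound signals."""
    if len(edge_times) != 4:
        raise ValueError("expected one bound edge per remote FCR")
    return sorted(edge_times)[1]


def classify(local_edge, bound_edge, normal_half_width, error_limit):
    """Place the median bound edge in the N, A, B, or E window.

    local_edge is the local FTC edge after the expected inter-FCR delay has
    been added. The window widths are hypothetical parameters.
    """
    skew = bound_edge - local_edge  # positive: the local clock is ahead
    if abs(skew) > error_limit:
        return "E"
    if abs(skew) <= normal_half_width:
        return "N"
    return "A" if skew > 0 else "B"


def adjustment_period(window):
    """Number of clock cycles in the adjustment period for a given window."""
    if window == "A":
        return NORMAL_PERIOD_CYCLES + 1  # self-ahead: add one cycle
    if window == "B":
        return NORMAL_PERIOD_CYCLES - 1  # self-behind: delete one cycle
    return NORMAL_PERIOD_CYCLES  # normal; error handling is outside this sketch
```

For example, a median bound edge that arrives two cycles after the delayed local edge, with a one-cycle normal window, falls in the A window and yields a 5-cycle adjustment period.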
[Timing diagram: clock, MYFTC, and median bound, with windows E, B, N, A, E; 4-cycle adjustment period (no adjustment)]
Figure 4-36. Normal FTC Adjustment Period
[Timing diagram: clock, MYFTC, and median bound, with windows E, B, N, A, E; 5-cycle adjustment period (ahead adjustment)]
Figure 4-37. Self-Ahead FTC Adjustment Period
[Timing diagram: clock, MYFTC, and median bound, with windows E, B, N, A, E; 3-cycle adjustment period (behind adjustment)]
Figure 4-38. Self-Behind FTC Adjustment Period
4.5.2.1.5. Global Controller
The global controller coordinates the functions throughout the Network Element. The
GC asserts signals in almost all other major sections of the NE. The GC is a microcoded
finite-state machine. The GC also has the capability of driving constant data onto the VDAT
bus; this capability is used to load initialization parameters into other sections of the NE.
The microcode store for the GC is built out of registered RAMs with a serial scan-path
for initialization. The microcode is easily changed by creating a new load module which is
transferred to one of the PEs during the booting process. The PE responsible for initializing
the NE transfers the microcode to the GC before proceeding with self-tests or ISYNC.
An embedded system has no need for a flexible microcode store. Indeed, it may be
more desirable from a reliability standpoint to use non-volatile storage. Thus, the GC is
designed so that registered PROMs can be easily substituted for the registered RAMs.
4.5.2.1.6. Scoreboard
The scoreboard is the key element in the AFTA. The scoreboard is responsible for ap-
proving the execution of exchange primitives in a manner consistent with Byzantine re-
silience.
The NEs periodically perform a poll of the buffer status of each physical processor in
the AFTA. The status includes whether the processor has room in its input buffers to
contain a packet (the input buffer not full, or IBNF, condition) and whether it has a packet in
its output buffers to be exchanged (the output buffer not empty, or OBNE, condition). If
the latter is true, the exchange primitive to be executed, the destination of the packet, and a
user-defined byte are also included.
The aggregate of buffer status polls is called the system exchange request pattern, or
SERP. The SERPs are exchanged such that each functioning NE is guaranteed to have a
SERP copy that agrees with all other SERP copies; thus, each NE makes decisions based
on the SERP with confidence that the other NEs will make the same decision.
The exchanged SERP is delivered to the scoreboard for processing. The scoreboard
uses a virtual group to physical processor mapping to extract the buffer status information
from the SERP on a VG-by-VG basis. The individual buffer status bits contained in the
SERP for the virtual group members are voted to determine the overall status of the VG. If
all members of a VG, a unanimity, assert a status bit, the condition is considered true. If a
majority, but not a unanimity, of VG members assert a status bit, the condition is
considered almost true, and a timeout is started. If the timeout expires before the remaining
members assert the status bit, the condition becomes true anyway. In any other situation, the
condition is considered not true.
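The status-bit vote described above can be sketched as follows. This is a minimal sketch; the function name, the string results, and the way the timeout is passed in are illustrative assumptions, not the scoreboard's actual interface.

```python
def vote_status(bits, timeout_expired=False):
    """Vote one status bit (for example OBNE or IBNF) across the members of a VG.

    bits: list of booleans, one per VG member, extracted from the SERP.
    Returns "true", "almost true", or "not true".
    """
    if not bits:
        raise ValueError("a VG has at least one member")
    asserted = sum(1 for bit in bits if bit)
    if asserted == len(bits):
        return "true"                      # unanimity
    if 2 * asserted > len(bits):           # majority, but not unanimity
        return "true" if timeout_expired else "almost true"
    return "not true"
```

For a triplex VG, `vote_status([True, True, False])` gives "almost true" until the timeout expires, after which the same inputs give "true".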
When the OBNE condition is observed true for a VG, that VG becomes a packet
source. The class, destination VID, and user byte fields for the source are voted to
determine the requested packet exchange primitive to be executed and the destination of the
packet. The voted destination VID field is used to look up the status of the IBNF for the
destination VG. If the IBNF condition is not true, the destination's input buffers are full,
causing flow control to be asserted. If the IBNF condition for the destination VG is true,
the destination has room to receive at least one packet. In the latter case, the scoreboard
specifies the NE to execute the requested exchange primitive as determined by the class
field, and deliver the resulting packet to the virtual group specified by the destination VID
field. The contents of the user byte are voted (but not used) by the scoreboard and delivered
to the destination.
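The source and destination matching described above can be sketched as follows. The report does not specify here how the class, destination VID, and user byte fields are voted, so a strict-majority vote is assumed; the tuple layout, the return values, and the function names are likewise hypothetical.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority of members, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def schedule(source_requests, obne_true, ibnf_true):
    """Decide what to do with one source VG's request.

    source_requests: per-member (class, destination VID, user byte) tuples.
    obne_true: voted OBNE condition of the source VG.
    ibnf_true: dict mapping a VID to the voted IBNF condition of that VG.
    """
    if not obne_true:
        return None                                # nothing to send
    exchange_class = majority([r[0] for r in source_requests])
    dest_vid = majority([r[1] for r in source_requests])
    user_byte = majority([r[2] for r in source_requests])
    if exchange_class is None or dest_vid is None:
        return None                                # no agreed request (assumed handling)
    if not ibnf_true.get(dest_vid, False):
        return ("flow control", dest_vid)          # destination input buffers full
    return ("exchange", exchange_class, dest_vid, user_byte)
```

The user byte is voted and passed through unchanged, mirroring the statement that the scoreboard votes but does not use it.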
4.5.2.2. Network Element Physical Characteristics
The following table outlines the characteristics of the AFTA brassboard Network Ele-
ment. Since the NE is currently only a conceptual design, these characteristics are estimates
derived from experience and preliminary design parameters.
Weight                        1.5 pounds (estimated)
Power                         7 A @ 5 VDC
Volume                        Height      9.187 inches
                              Depth       6.299 inches
                              Thickness   0.063 inches (board)
                                          0.800 inches (front panel)
Failure rate                  See Section 9.
Device technology             TTL/CMOS
Operating Temperature Range   0 to 55°C at inlet to cooling fans
Storage Temperature Range     -40 to +85°C
Relative humidity             5% to 90%, non-condensing
Cooling Requirements          10 SCFM air flow over nominal operating temperature range
Table 4-4. Characteristics of AFTA Network Element
4.5.2.2.1. Circuit Board Layout
Figure 4-39 shows a preliminary board layout for the AFTA brassboard Network Element.
Figure 4-39. Network Element Brassboard Layout
4.5.2.2.2. Military Qualification of Baseline Network Element
The AFTA is targeted for application in hostile military embedded environments. Thus,
the availability of the components of the AFTA in military-qualified versions is an impor-
tant factor in the acceptance of the AFTA for such applications. Since the NE is the only
hardware specific to the AFTA, the availability of the individual integrated circuits of the
NE in military-qualified form is necessary.
Throughout the conceptual design, two major military publications, MIL-STD-883C
and MIL-M-38510, were used as criteria for the qualification level of components for the
AFTA NE. The goal is to ensure that all backplane-independent components (i.e. anything
not in the VMEbus interface subsection of the NE) are available qualified to MIL-STD-
883C, Class B and specified by either a MIL-M-38510 "slash sheet" or on a standard mili-
tary drawing (SMD). Alternatively, equivalent functionality in a similar part is desired.
Table 4-5 summarizes the military availability of devices for the AFTA.
Many of the parts do not have exact equivalents in military-qualified versions. How-
ever, with a few exceptions, the NE functionality can be captured in devices that are very
similar to those used in the brassboard. The brassboard does not use these similar parts in
the first place because of space, performance, and cost considerations. The following
paragraphs summarize the status of the parts which do not have exact military-qualified
equivalents.
The status of the scoreboard is indeterminate since, at this time, the design and
implementation of the scoreboard have not been finalized. However, a "Baseline Network
Element" has been identified which consists of the NDI devices listed in the table below and a
single Application-Specific Integrated Circuit (ASIC) which implements the Scoreboard
functionality. It is believed that this level of integration of the Scoreboard is necessary for
the NE to fit onto a single MIL-STD-344 LRM. Subsequent calculations of AFTA failure
rate, weight, power consumption, and volume will refer to this version of the NE.
Manufacturer   Part           Mil-Std
CSDL           scoreboard     n/a
IDT            7202           883B(C?)
IDT            6116           883B(C?)
IDT            7134           883B(C?)
IDT            72402          883B(C?)
IDT            7006           883B(C?)
IDT            39C10          883B(C?)
IDT            71502          883B(C?)
Signetics      SCB68172       n/a
Altera         EPM5032        883B(C?)
Altera         EPM5064        883B(C?)
Altera         EPM5128        883B(C?)
Altera         EP910          883B(C?)
Lattice        GAL16V8        883C
Lattice        GAL18V10       883C
Lattice        GAL22V10       883C
Lattice        GAL26CV12      883C
TI             ALS245         54ALS
TI             ALS646         54ALS
AMD            7968           883C
AMD            7969           883C
Dallas         1232           n/a
Cypress        7C245A         883C
AT&T           ODL Xmit       883C
AT&T           ODL Receive    883C

SMD Number and Notes, in table order:
n/a
5962-8753101 (7201LA30) 3
5962-8866904 (7203S40) 3
5962-8874002 (6116LA25) 2
5962-8700201 (7132SA-45) 3
5962-8700205 (7132LA-45) 3
5962-8684604 (72404L35) 3
5962-8684603 (72404-25) 3
n/a
5962-8770803 (39C10C) 4
n/a
5962-8770501
5962-90611 (CY7C344) 1
n/a
5962-89468 (CY7C342) 1
5962-89839032A
n/a
5962-8984103LA
5962-8867001
5962-8872401
5962-8872401
n/a
5962-8403001
(C22V10, Wind.) 1
(C22V10, Opaq.) 1
(C22V10L, Opaq.) 1
5962-8995601
DH27023
DH27025
DH27024
DH27026
(LCC Temp Waiver)
(LCC Voltage Waiver)
(LCC Temp Waiver)
(LCC Voltage Waiver)
n/a
5962-89815
5962-88735
(Wind., pnd. June)
(Opaque)

Notes:
1 Second source part.
2 Speed variation or technology variation from specified part.
3 Similar part to specified part.
4 Different or specified revision of part.
Table 4-5. Military Device Availability for Network Element
Several devices from IDT are not available in a military-qualified form. Some, such as
the FIFOs and the small dual-port RAMs, have similar devices with less internal memory.
For the FIFOs, this is no problem. For the DPRAMs, more devices can be used in parallel
to provide the same amount of memory. Other devices, including the large dual-port RAMs
(IDT 7006) and the registered RAMs (IDT 71502), are not available in any similar part.
However, the large DPRAMs are part of the VMEbus-specific section, and the registered
RAMs can be easily replaced with Cypress registered PROMs (CY 7C245A), which are
available in an SMD form. The functionality of the registered PROMs and the registered
RAMs is very similar, as are their timing characteristics. The GC is designed to function
with either of these two forms of control store.
The Altera EPM5064 is not currently available in an SMD form, although it is qualified
to MIL-STD-883B. However, the complete functionality of the EPM5064 can be captured
in an EPM5128, which is available in an SMD form. Since the EPM5128 comes in a larger
package, board space must be sacrificed to accommodate EPM5064 designs in an
EPM5128.
The Lattice 22V10 is a very common PAL architecture and is available in an SMD form
from many vendors, including Lattice. The 18V10 and 26CV12, variations to the 22V10
architecture, are not available on the SMD list. However, the functionality of an 18V10 can
be implemented in a 22V10, and the functionality of a 26CV12 can be implemented in two
22V10s with the sacrifice of board space.
The AMD 7968 and 7969 devices, the TAXI transmitter/receiver pair, are only available
in a waivered form, either temperature or voltage waivered. While waivered parts are to be
avoided, these are the only devices that perform the necessary function.
The Dallas device is not available in a military-qualified form. However, this is not a
problem since the function of the Dallas device, a watchdog timer, is not essential to the
AFTA functionality. If the functionality of a watchdog device is desired (which it probably
is, for common-mode fault recovery), the same functionality can be implemented using
several discrete components with the sacrifice of board space.
The AT&T devices are the fiber optic data links that convert between electrical and opti-
cal signals. Since they are hybrid devices, they are not covered by the SMD list. However,
surface-mount versions of the device, qualified to MIL-STD-883C, have been announced
by AT&T [APS90].
4.5.2.3. Effect of Implementation Technology on Network Element Physical Characteristics
The technology used to implement the NE determines its performance, power con-
sumption, weight, volume, and failure rate. The Baseline design outlined above uses one
ASIC to implement the Scoreboard functionality. Because of the cost involved in fabricat-
ing an ASIC, two additional implementation options for the Scoreboard have been consid-
ered for the NE. The first relates to the implementation of the Scoreboard function using a
high speed Reduced Instruction Set Computer (RISC) processor and the second relates to
the Scoreboard implementation using a number of Field Programmable Gate Arrays
(FPGAs). For the reasons listed below, neither of these implementation options appears
sufficiently attractive to warrant their continued development.
4.5.2.3. I. RISC Processor Scoreboard
Presented below is a listing of the pin counts, total gates, and power consumption of
the AFTA Network Element Scoreboard implemented as an AMD29000 processor. This
information is intended for comparison purposes only, not as any recommendation for this
course of action. Below is also a paragraph explaining some of the penalties associated
with this approach.
The main drawback to this approach is the massive performance penalty. The code
which gets executed most often is the voting code. Using optimized assembly language,
this code comes out to 99 instructions. This section is executed at least once for every VG
in the system. Assuming an all triplex configuration (13 VGs) this means 99 total
instructions. Using a 40 MHz processor (25 ns per instruction), the scoreboard emulator will take
99 µs just to vote the OBNE bits of the VGs. When the overhead of performing timeouts
and voting the remainder of the SERP information is added, the scoreboard emulator will
be too slow to support real-time tasks with iteration rates of 100 Hz. Thus, this method will
not meet the performance requirements of the AFTA.
The total gate count provided is intended to convey the number of transistors present
rather than a size estimate. For this purpose the RAM counts are doubled. This is due to the
fact that two 'gates' are used for each RAM cell. A gate is defined as four transistors (two
N-channel and two P-channel).
ScoreBoard Emulator
AMD29000 RISC Processor
Power Consumption (Watts) 4.125
Pin Count 169
Dual Port RAM (IDT 7134 or equivalent) [4K x 8]
Power Consumption (Watts) 1.5
Pin Count 48
Gate Count 70K
Processor RAM (IDT 7186 or equivalent) 2x[4K x 16]
Power Consumption (Watts) 1.5
Pin Count 88
Gate Count 262K
Miscellaneous Glue Logic (Altera 5064 or equivalent)
Power Consumption (Watts) 1.5
Pin Count 44
Gate Count 5K
Functional Scoreboard Emulator totals
Power Consumption (Watts) 8.625
Pin Count 349
Gate Count 337K + Processor complexity
4.5.2.3.2. FPGA Implementation
Presented below is a listing of the pin counts, total gates, and power consumption of an
AFTA Network Element Scoreboard implemented in FPGAs. This information is intended
for comparison purposes only, not as any recommendation for this course of action. Below
is also a paragraph explaining some of the penalties associated with this approach.
There are three major reasons for not taking this approach. The first is the complexity.
A student at CSDL recently completed two FPGA designs for his thesis, one of which was
a voter. The voter consumed an entire FPGA by itself, and the scoreboard contains a voter
plus other custom hardware. The second reason is that the existing VHDL models would
be invalid for the new design. Although some companies have promised VHDL support for their
FPGA design systems, such a capability is at least a year and a half away. With all the effort
put into the VHDL modeling, it would be wasteful to throw it all away. Closely coupled
to this is the third reason, verification. With the VHDL model the verification occurs
with the transformation to the gate level. Each new step could be verified before continu-
ing. With the FPGA approach, each FPGA would be a segment of the scoreboard algo-
rithm, causing problems with verifying it individually.
The total gate count provided is intended to convey the number of transistors present
rather than a size estimate. For this purpose the RAM counts are doubled. This is due to the
fact that two 'gates' are used for each RAM cell. A gate is defined as four transistors (two
N-channel and two P-channel).
ScoreBoard Emulator (FPGA)
2 Dual Port RAMs (IDT 7134 or equivalent) [4K x 8]
Power Consumption (Watts) 3.0
Pin Count 96
Gate Count 140K
4 Altera 5128 FPGAs
Power Consumption (Watts) 5.0
Pin Count 272
Gate Count 32K
Functional Scoreboard Emulator totals
Power Consumption (Watts) 8.0
Pin Count 368
Gate Count 172K
4.5.2.3.3. High End Network Element
A third implementation study performed under the Conceptual Study relates to the ag-
gressive use of VHSIC/VLSI packaging to construct the Scoreboard, Global Controller,
Voter, and VMEbus Controller in four ASICs. This design is denoted the "High End Net-
work Element."
Presented below is a listing of the pin counts, total gates, and power consumption of an
AFTA Network Element consisting of four ASICs and 16K x 32bits of DPRAM. The
DPRAM was not included into the ASIC design since it was so large.
Each ASIC is shown below followed by its clock speed. Shown below the name are
the parts which were used to estimate the gate counts. When there is a number in square
brackets, [], it indicates a part which would have been used had the ASIC approach not
been taken. Following this number is a calculation showing how the equivalent number of
gates was arrived at. For example, the FTC section of the Voter ASIC contains the follow-
ing : FTC [5128] (256x48x0.8). This means that the FTC would have been implemented in
an Altera 5128, containing 256 internal Macro cells. Each Macro cell is estimated to contain
48 gates of functionality and the chip is considered to be at 80% utilization. Thus 9.8K
gates is reached. Any multiplier before the chip type is the number which would have been
used. For RAM the information is the number of chips used, their depth, and their width.
The total gate count provided in the summary is intended to convey the number of transistors
present rather than a size estimate. For this purpose the RAM count was doubled
and added to the logic count. This is due to the fact that two 'gates' are used for each RAM
cell. A gate is defined as four transistors (two N and two P-channel). For a size estimate,
the RAM cells take about 43 percent of the space required for the logic cells, so the original
RAM count should be multiplied by 0.43 and added to the logic count to receive a total
'soft gates' equivalent. Since the voter has minimal RAM requirements it may be imple-
mented in an LSI LCA package without using separately diffused RAM.
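The estimating rules stated above can be written out as a short sketch. The constants (48 gates per macrocell, 80% utilization, two gates per RAM cell, and the 0.43 area factor) are taken from the text; the function names are hypothetical.

```python
def macrocell_gates(macrocells, gates_per_cell=48, utilization=0.8):
    """Equivalent gates for logic that would otherwise occupy a PLD."""
    return macrocells * gates_per_cell * utilization


def transistor_gate_total(logic_gates, ram_cells):
    """Total used in the summaries: RAM counted at two gates per cell."""
    return logic_gates + 2 * ram_cells


def soft_gate_total(logic_gates, ram_cells, ram_area_factor=0.43):
    """Size-oriented estimate: a RAM cell takes about 43% of a logic cell's area."""
    return logic_gates + ram_area_factor * ram_cells
```

For a 256-macrocell part, `macrocell_gates(256)` gives about 9,830 gates, which rounds to the 9.8K quoted for the FTC. With 50K logic gates and 40K RAM cells, the transistor-oriented total is 130,000 gates and the size-oriented total is about 67,200 soft gates.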
Any errors in the data below should be on the conservative side. All calculations
presented were done in the most conservative manner possible. One assumption made was that
25% of the gates could be active on any clock edge. LSI has stated that typically only
15-20% will be active. Another was that the total number of gates used for RAM was included
in the gate count. Most likely the RAM usage will come from the raw gate count due to its
being diffused into the silicon directly and not being implemented in the 'soft gates' the
logic will use. It will therefore take less space than the 'soft gates' and not impact as much
of the usable area on the die.
Scoreboard (25 MHz)                  Logic    RAM
Random Logic                         50K
RAM                                           40K
Totals                               50K      40K
Logic Gate Power Consumption: (25% x 5.5 µW/gate/MHz)
0.25 x 5.5 x 10^-6 x 50K x 25 = 1.719 Watts
Pin Count
Data (4x32)      128
Address           10
Control            7
Total            145
Pin Power Consumption: (25 µW/pin/MHz/pF)
25 x 10^-6 x 145 x 25 x 5 = 0.453 Watts
Summary:
Pin Count: 145
Total Gates: 130,000
Total Power (Watts): 2.172
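The two power formulas used in these summaries can be sketched as follows. The coefficients (25% activity, 5.5 µW/gate/MHz, 25 µW/pin/MHz/pF) come from the listing above; reading the trailing factor of 5 as a 5 pF load per pin is an assumption based on the stated units, and the function names are hypothetical.

```python
def logic_power_watts(gates, clock_mhz, active_fraction=0.25, uw_per_gate_mhz=5.5):
    """Logic power: active fraction x µW/gate/MHz x gates x MHz."""
    return active_fraction * uw_per_gate_mhz * 1e-6 * gates * clock_mhz


def pin_power_watts(pins, clock_mhz, load_pf=5, uw_per_pin_mhz_pf=25):
    """Pin power: µW/pin/MHz/pF x pins x MHz x load (assumed 5 pF per pin)."""
    return uw_per_pin_mhz_pf * 1e-6 * pins * clock_mhz * load_pf


# Scoreboard ASIC figures from the listing above: 50K logic gates, 145 pins, 25 MHz.
scoreboard_total = logic_power_watts(50_000, 25) + pin_power_watts(145, 25)
```

With these inputs the logic term is about 1.719 W and the pin term about 0.453 W, which sum to the 2.172 W shown in the summary.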
Global Controller (12.5 MHz)         Logic    RAM
Controller                           2K
Misc Logic (Mux, etc.)               0.5K
Control Store                                 328K
Totals                               2.5K     328K
Logic Gate Power Consumption: (25% x 5.5 µW/gate/MHz)
0.25 x 5.5 x 10^-6 x 0.5K x 25 = .05 Watts + .5 Watts = .55 Watts
Pin Count
Control           80
Pin Power Consumption: (25 µW/pin/MHz/pF)
25 x 10^-6 x 80 x 25 x 5 = 0.125 Watts
Summary:
Pin Count: 80
Total Gates: 657,860
Total Power (Watts): 0.675
Voter (12.5 MHz)                     Logic    RAM
FIFOs (5x64x32)                      0.5K     10K
Debug Router (4x32x2:1)              0.3K
VDAT FIFO (64x32)                    0.1K     2K
Voter [5128] (256x48x0.8)            9.8K
FTC [5128] (256x48x0.8)              9.8K
VReset REC [5064] (128x48x0.8)       4.9K
VReset XMIT [5032] (64x48x0.8)       2.5K
Isync Proc. [5032] (64x48x0.8)       2.5K
Totals                               30.4K    12K
Logic Gate Power Consumption: (25% x 5.5 µW/gate/MHz)
0.25 x 5.5 x 10^-6 x 30.4K x 25 = 0.142 Watts
Pin Count
Data (5x32)      160
FTC                5
DS                 5
Clock              1
Buffer Cntl.       6
M/I               16
Total            193
Pin Power Consumption: (25 µW/pin/MHz/pF)
25 x 10^-6 x 193 x 25 x 5 = 0.302 Watts
Summary:
Pin Count: 193
Total Gates: 54,976
Total Power (Watts): 1.37
VME Controller (25 MHz)              Logic    RAM
VME Buffer 4x[245]                   0.4K
VME Addr. Buffer 2x[646]             1.0K
VME DS gen [26V12]                   0.8K
VME Ctlr. [5064] (128x48x0.8)        4.9K
VME Addr. Decode [22V10]             0.5K
DPRAM Addr. Reg. [5064]              4.9K
INT FIFO (64x5)                      1.5K     0.3K
INT Ctl. PROM 3x[7C245] (2048 x 8)   0.1K     49K
RBM [5128] (256x48x0.8)              9.8K
RBM RAM [6116] (2048x8)                       16.4K
Totals                               23.9K    65.9K
Logic Gate Power Consumption: (25% x 5.5 µW/gate/MHz)
0.25 x 5.5 x 10^-6 x 23.9K x 25 = 0.821 Watts
Pin Count
Data (2x32)       64
Address           46
INT & Control     21
Total            131
Pin Power Consumption: (25 µW/pin/MHz/pF)
25 x 10^-6 x 131 x 25 x 5 = 0.205 Watts
Summary:
Pin Count: 131
Total Gates: 155,619
Total Power (Watts): 1.026
Dual Port RAM
VME DPRAM (IDT 7006 or equivalent) 4x[16K x 8]
Summary:
Pin Count: 272
Total Gates: 1,048,576
Total Power (Watts): 2.5
4.5.3. Input/Output Controller (IOC) Characteristics
A host of mission-specific input and output devices are connected to the AFTA by in-
put/output controllers (IOCs). An IOC is an LRM that contains one or more I/O devices.
Examples of IOCs include interfaces to IMUs, GPS receivers, CNI gear, displays, status
panels, AHARS, air data sensors, altitude, image, range, Doppler navigation, engine
sensors and controls, switches and annunciators, and data buses such as MIL-STD-1553.
A number of IOC types may reside in the AFTA; an IOC plugs into the FCR backplane.
PEs communicate with the IOCs either over the FCR backplane bus or over a dedicated I/O
bus; the advantage of the latter approach is that the I/O traffic does not interfere with the
inter-VG traffic between the PEs and the NEs, which goes over the FCR backplane bus. The
exact definition of the I/O suite is mission-specific. The AFTA architecture is designed to
admit any IOC that is compatible with the selected backplane bus. The IOC may simply be
a device-specific memory-mapped controller, or it may be an interface to a dedicated I/O
network.
"Dumb" IOCs are controlled by a PE (either a simplex VG or one member of a redun-
dant VG) and never communicate directly with the NE. "Smart" IOCs, for all intents and
purposes, act like PEs and may be grouped into redundant VGs; dumb IOCs can not.
The following I/O devices are candidates for implementation in the AFTA; this list will
be refined as detailed data regarding the mission's requisite I/O suite become available.
Instrumentation buses: MIL-STD-1553/1773 (SAVA 1553M)
LANs: FDDI, SAFENET II, Ethernet
Memory-mapped I/O: A/D, D/A, discretes, SAVA ADM
Expansion memory: non-volatile RAM, EEPROM
Mass memory: disk, tape, CD-ROM, SAVA MM
Fault Tolerant Data Bus/Authentication Module
4.5.4. Power Conditioner (PC) Characteristics
The power conditioners (PCs) supply the power required by the AFTA. At least one
PC is used for each AFTA FCR; to eliminate single-point failures, a PC can not be allowed
to drive more than one FCR. The PCs not only regulate and filter the power delivered to an
AFTA FCR, they also maintain uninterrupted AFTA operation in the presence of momentary
dropouts on the main vehicle power buses. A simplified architecture of a PC is shown
in Figure 4-40.
A PC provides regulated power conversion from the main vehicle direct-current (DC)
power buses, labeled "A" and "B" in Figure 4-40, to the voltage level(s) necessary to
power the AFTA LRMs. One or more sense signals are fed back from designated points in
the FCR to allow closed-loop feedback of the regulated power as FCR load varies. The PC
may contain multiple regulators for various voltage levels or to power isolated sections of
the FCR. Optional PC capabilities, not shown in Figure 4-40, include selective activation
and deactivation of a PC and remote monitoring of the PC output voltage and current.
To tolerate the indefinite loss of either of the two main vehicle buses, each PC is
connected to both main buses. Diodes CR1 and CR2 prevent current from flowing from an
operational to an inactive main bus. The PC is designed to provide full rated power to the
FCR using only one bus; thus the AFTA can be used in vehicles which possess only one
main power bus. The PC must also tolerate the temporary outage of both buses, such as
during the switch-over from helicopter APU to engine generators, during which a 0.5 s
power dropout occurs on both vehicle buses. This function is provided by a battery in each
PC which must be sized to provide full FCR power for the duration of any anticipated
outage of both vehicle main buses. The PC must also protect itself and its FCR from
overvoltages on the main buses; this protection is provided by the units labeled OV1 and OV2 in
the figure. Metal-oxide varistors (MOVs) are included in the OV1 and OV2 circuitry to
handle high-frequency high-voltage input spikes which are too fast for conventional input
overvoltage protection circuitry. Finally, fuses (or circuit breakers), labeled F1 and F2 in
Figure 4-40, limit input current from either main vehicle bus. These fuses must be sized to
permit power-on current surges in excess of the expected steady-state power dissipation.
[Block diagram: Main Vehicle Power Buses A and B, OV1, CR1, Power Monitor, Voltage Regulator, Battery, Power-Fail Interrupt, Sense, CR3, F3, Ground, FCR Power]
Figure 4-40. Functional Block Diagram of the AFTA Power Conditioner
The FCR is protected from excessively high PC output voltage and high-voltage
high-frequency spikes by the output overvoltage protector labeled OV3; this circuitry also
protects the PC and main vehicle buses from back-propagated overvoltages and spikes
emanating from the FCR. Diode CR3 prevents back currents generated within the FCR from
flowing into the PC and possibly onto the vehicle main buses. The fuse (or circuit breaker)
F3 provides current-limiting protection for the FCR. This fuse must be designed to permit
the expected power-on current surge.
If an output undervoltage condition persists for a time interval which exceeds the
battery's amp-hour rating, or if the voltage regulator itself fails, the PC's power monitor
asserts a power-fail interrupt to the FCR to allow the FCR to attempt emergency power-down
functions before the PC output drops below a minimal level. Since the AFTA is Byzantine
resilient, the sudden loss of any single FCR to an undervoltage condition or PC failure is
tolerated; however, the power fail interrupt provides a mechanism for a more graceful
degradation of the AFTA system.
4.5.5. Cooling System
All AFTA components utilize a thermal management layer for conductive cooling from
the LRM interior to the LRM edge. The edges of each LRM are thermally connected to the
LRU side walls using wedge locks. The LRU side walls are in thermal contact with the
LRU cold plate, which transfers heat away from the LRU. Depending on the installation,
the LRU cold plate may be cooled by ambient air, forced air, or forced liquid.
5. AFTA Software Architecture
5.1. Overview
The AFTA is designed to provide a highly reliable parallel processing system. Processing
is distributed among the parallel processing sites by task, and intertask communication
is provided by message passing. High reliability is provided by redundantly executing
the tasks on replicated processors. The AFTA hardware and software have been designed
to hide the hardware redundancy, hardware faults, and the parallel processing details
from the applications programmer. The functional structure of the system software is
shown in Figure 5-1. Each VG in the system executes the rate group tasking paradigm
which provides the execution environment for application and system service tasks. FDIR
is a system service which provides fault detection, isolation, and reconfiguration. The I/O
services provide the interface to input/output devices.
[Diagram: Rate Group Tasking Paradigm]
Figure 5-1. AFTA System Software Organization
A desired initial system configuration must be specified prior to beginning system op-
eration. The configuration must specify the mapping between tasks and VGs and that be-
tween VGs and processors. This mapping is maintained by the operating system and is
used to isolate the applications programmer from the underlying redundancy and parallel
processing mapping. System initialization uses the above mapping to test and ready the
hardware components of the system and evaluate whether there are sufficient resources to
perform the mission. System initialization must also ready the software components for
execution. The system specification and an overview of system initialization are described in
Section 5.2.
The rate group tasking paradigm provides the framework for executing system and
application tasks in the AFTA. It is composed of the rate group tasking services, the time
management services, and the communication services. The rate group tasking services
define the tasking framework for the application and system service tasks and are described
in Section 5.3. The time management services are used to generate the rate group frame
boundaries and provide periodic resynchronization of the VG. They are described in Sec-
tion 5.4. The communication services are used for intertask communication and are de-
scribed in Section 5.5.
FDIR provides fault detection, isolation, and reconfiguration of Processing Elements in
the system. It is composed of local FDIR which executes on each VG and system FDIR
which executes on a specially designated VG. Local FDIR has the responsibility for detect-
ing and isolating hardware faults in the processor elements of its VG and disabling their
outputs using the interlock hardware. In addition, local FDIR reports all link and Network
Element (NE) faults to system FDIR and responds to its reconfiguration commands. It is
also responsible for transient PE hardware fault detection and for running low priority PE
self tests to detect latent PE faults. It is transparent, autonomous and non-intrusive to the
application. System FDIR is responsible for collecting status from the local FDIR and for
detecting, isolating, and masking Network Element faults and link faults. It resolves
conflicting local fault isolation decisions, isolates unresolved faults, correlates
transient faults, and handles VG failures. Local and
system FDIR are both described in Section 5.6.
The I/O services provide efficient and reliable communication between the application
program and external devices (sensors and actuators). They execute on any VG which is
responsible for I/O and provide source congruency on all input data and voting of all output
data. The I/O services provide the user with the ability to group I/O transactions into chains
and I/O requests. They also give the user the flexibility of scheduling both preemptive and
non-preemptive I/O. The I/O services are described in Section 5.7.
5.2. System Specification and Initialization
The AFTA is composed of a set of processing and Network Elements. During system
initialization these components are individually initialized and tested. Non-faulty compo-
nents are then grouped into redundant units which can provide fault tolerant operation. The
Processing Elements are grouped into a set of virtual groups or VGs and the network ele-
ments are grouped into the Network Element aggregate. These units are then initialized and
tested. System initialization is completed by distributing the application load among the
available VGs and beginning cooperative execution of the assigned tasks.
The initialization is based on a VG configuration table and a task configuration table.
The VG configuration table maps VGs to Processing Elements. The task configuration
table maps rate group tasks to VGs. An initial configuration must be specified for both ta-
bles and compiled into the AFTA operating system. The VG configuration table and asso-
ciated initialization are discussed in Section 5.2.1. The task configuration table and asso-
ciated initialization are discussed in Section 5.2.2. Detailed descriptions of the initialization
procedures are discussed in their associated sections.
5.2.1. Virtual Group Configuration
The Network Elements provide message passing between VGs. To provide the
communication, the Network Elements must map each VG to its corresponding set of
Processing Elements. This mapping is maintained in the Network Element's VG configu-
ration table. The VGs also maintain a copy of the configuration table to alter the VG to
processing element mapping when operational requirements or resource availability
changes. The processing element is specified by its hosting Network Element and the
Network Element port it is using for communication. Because the processors contend for
the available communication ports during initialization, the above mapping does not provide a
unique external physical identification of the corresponding processor hardware. The
physical identifier of each processor in a VG is maintained in the software version of the
table for identification of faulty elements. An example AFTA configuration showing the in-
formation used in the configuration table is shown in Figure 5-2.
[Figure: Network Elements, each with communication ports 0 through 7, and the Processing Elements on those ports grouped into VGs.]
Figure 5-2. Example AFTA Configuration
The example system has five Network Elements, with each Network Element having
eight communication ports and hosting eight Processing Elements. VG 1 is a quadruplex
composed of Processing Elements (B,6), (C,4), (D,3), and (E,5), where the first element
of the pair is the Processing Element's Network Element id and the second element is
its port id. This is the information required by the Network Elements for delivery of the
communication service. The OS also maintains the physical identifier of each processor in
the VG to externally identify faulty hardware. The corresponding physical identifiers of the
processors in VG 1 are (B,S4), (C,S6), (D,S3), and (E,S2), where the first element of
the pair is the Processing Element's Network Element id and the second element is an
externally visible identifier unique among the processors hosted by that Network Element.
An initial VG configuration must be specified for the system. It must be compiled into
the operating system and loaded into the Network Elements. Both the NE and OS versions
of the configuration specify each VG's redundancy level, its Processing Elements, a fault
mask, and a communication timeout value. The processor class of the VG, its rate group
phasing, and the physical identifier of each of its members are also included in the OS
configuration. The processor class field is used to identify the ports associated with each
processor class and limit the contention for these ports to processors of the appropriate class
during initialization. This prevents different processor types from becoming members of
the same VG. The rate group phasing specifies the phase of the VG's rate group
frames relative to the frames on other VGs. The physical identifier of each Processing Element is
the only field which is not specified in the initial configuration. It is determined after the
Processing Element has successfully contended for a communication port and is a member
of the VG mapped to that port.
The rate group phasing describes the relationship between the rate group frames on
each VG in the system. Within the task configuration table described in the next section,
each task is assigned to execute in some rate group. The rate group determines the fre-
quency at which the task will be executed and the resulting rate group frame delimits the
execution cycle of the task. Tasks assigned to the same rate group will execute at the same
frequency regardless of their hosting VG, but there may be a time difference between the
start of their first and each subsequent rate group frame if the tasks are executing on differ-
ent VGs. This phasing would be caused by the completion of system initialization at dif-
ferent times on different VGs. An example phasing of the frames for tasks in a given rate
group on multiple VGs is shown in Figure 5-3.
[Figure: timelines of the rate group frames on VG1 through VG4 relative to the base time, showing each VG's phase delay and the message transfers at the frame boundaries.]
Figure 5-3. Rate Group Frame Phasing
In the example, the first rate group frame on VG1 starts at the base time and the first
rate group frames on the remaining VGs are delayed. The interval between the base
time and the start of the first rate group frame is the VG's phase delay. The phase delay is
important because it determines the relationship between the frame in which messages are
sent and the frame in which they are received. This is also affected by the message passing
restrictions in the rate group tasking paradigm. In the paradigm, a task's queued messages
are only sent and its received messages are only made available at its corresponding rate
group frame boundary. This is indicated in the figure by the arrows at the frame
boundaries. An example from the figure is the messages transmitted after the first frame on VG1.
They will be received at the start of the first frame on VG2 and VG4, but will not be
received until the start of the second frame on VG3. This relationship of sending frame to
receiving frame will remain constant for subsequent frames if the phasing does not change.
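The relationship between the sending frame and the receiving frame can be illustrated with a small calculation. The sketch below is not part of the AFTA software. The procedure name Phase_Demo, the integer tick units, the frame period, and the phase delays are assumptions chosen for illustration, and frames are counted from 1 on each VG. It models only the rule that messages are sent and made available at frame boundaries; transit time and float are ignored.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Phase_Demo is
   --  All times are hypothetical integer ticks relative to the base time.
   Frame_Period : constant Natural := 100;

   --  Time at which frame number Frame (1-based) ends on the sending VG,
   --  which is when its queued messages are sent.
   function Send_Time (Phase_Delay : Natural;
                       Frame       : Positive) return Natural is
   begin
      return Phase_Delay + Frame * Frame_Period;
   end Send_Time;

   --  Number (1-based) of the first frame on the receiving VG whose
   --  starting boundary falls at or after time Sent.
   function Receive_Frame (Phase_Delay : Natural;
                           Sent        : Natural) return Positive is
   begin
      if Sent <= Phase_Delay then
         return 1;
      end if;
      --  Round (Sent - Phase_Delay) up to a whole number of frame periods.
      return (Sent - Phase_Delay + Frame_Period - 1) / Frame_Period + 1;
   end Receive_Frame;

   Sent : constant Natural := Send_Time (Phase_Delay => 0, Frame => 1);
begin
   Put_Line ("Sent at tick" & Natural'Image (Sent));
   Put_Line ("Receiver with phase delay 100 reads in frame"
             & Positive'Image (Receive_Frame (100, Sent)));
   Put_Line ("Receiver with phase delay 25 reads in frame"
             & Positive'Image (Receive_Frame (25, Sent)));
end Phase_Demo;
```

With these assumed values, a receiver whose first frame starts exactly when the sender's first frame ends sees the messages in its first frame, while a receiver with a shorter phase delay has already started its first frame and sees them in its second.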
The phasing will change if the start of subsequent rate group frames on different VGs
is allowed to float with respect to each other. The time management service has been
designed to minimize this float by locking the phase to the system time maintained by the
Network Element. There still remains inherent float because of the variability of the
interval from the start of the frame to when any given message will be sent or read. This float is
increased when VGs which share a Network Element have the same phase delay or their
delays differ by an integer number of minor frames. This is because the VGs are then
forced to compete for access to the Network Element to send and read their messages at
their frame boundary. For this reason the simplest phasing of a zero phase delay for all
VGs is not recommended. The phase field in the VG configuration table is provided to
specify the desired phase delay for each VG. The declaration of the OS version of the
configuration table and its associated data types are shown in Figure 5-4.
type redundancy_level_type is range 0..4;
type processor_class_type is (sbc, intelligent_io);
type minor_frame_period_unit is delta 0.01 range 0.00 .. 8.00;
subtype ne_id_type is message_exchange_type range A..E;
type port_id_type is range 0..7;
type processor_id_record is record
   ne_id    : ne_id_type;
   port_id  : port_id_type;
   board_id : integer;
end record;
type vg_configuration_record is record
   redundancy : redundancy_level_type;
   class      : processor_class_type;
   phase      : minor_frame_period_unit;
   id         : array (natural range 1..redundancy_level_type'last) of
                   processor_id_record;
   mask       : mask_type;
   timeout    : timeout_type;
end record;
type vg_id_type is range 0..63;
vg_configuration : array (vg_id_type) of vg_configuration_record;
Figure 5-4. VG Configuration Table
vg_id_type defines the allowed set of VG ids and the configuration table has an entry
for each VG id. redundancy specifies the redundancy of the VG. Zero indicates there are
no members in the VG. One, three, and four indicate a simplex, triplex, and quadruplex
respectively. Duplexes and quints are not supported. class is the processor class of the
VG and corresponds to a type of single board computer or intelligent I/O device. phase is
the phase of the rate group frames on the VG with respect to the frames of other VGs in the
system. It is specified in minor frame period units and the maximum phase is one major
frame. id defines the Network Element and port assignment of each member. It also
contains each member's board id for physical identification of the corresponding processor.
The board id is determined during initialization based on the results of the port contention
process. mask and timeout specify the fault mask and communication timeout for the VG.
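As an illustration of how one table entry might be filled in, the following sketch builds the entry for VG 1 of the example configuration. It is a simplified, compilable variant of the Figure 5-4 declarations rather than the AFTA source. The named array type Member_Array, the Unknown_Board placeholder, and the phase value of 0.25 are assumptions, ne_id_type is declared directly as an enumeration, and the mask and timeout fields are omitted because their types are not given in this section.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure VG_Config_Demo is
   type Redundancy_Level_Type is range 0 .. 4;
   type Processor_Class_Type is (SBC, Intelligent_IO);
   type Minor_Frame_Period_Unit is delta 0.01 range 0.00 .. 8.00;
   type NE_Id_Type is (A, B, C, D, E);
   type Port_Id_Type is range 0 .. 7;

   type Processor_Id_Record is record
      NE_Id    : NE_Id_Type;
      Port_Id  : Port_Id_Type;
      Board_Id : Integer;
   end record;

   --  Assumed named array type: a record component needs a named type.
   type Member_Array is array (1 .. 4) of Processor_Id_Record;

   type VG_Configuration_Record is record
      Redundancy : Redundancy_Level_Type;
      Class      : Processor_Class_Type;
      Phase      : Minor_Frame_Period_Unit;
      Id         : Member_Array;
   end record;

   --  Hypothetical marker for a board id not yet determined.
   Unknown_Board : constant Integer := -1;

   --  VG 1 of the example: a quadruplex on (B,6), (C,4), (D,3), (E,5).
   VG_1 : VG_Configuration_Record :=
     (Redundancy => 4,
      Class      => SBC,
      Phase      => 0.25,
      Id         => ((B, 6, Unknown_Board), (C, 4, Unknown_Board),
                     (D, 3, Unknown_Board), (E, 5, Unknown_Board)));
begin
   --  After port contention the board ids become known, e.g. (B,S4).
   VG_1.Id (1).Board_Id := 4;
   Put_Line ("VG 1 member 1: NE " & NE_Id_Type'Image (VG_1.Id (1).NE_Id)
             & ", port" & Port_Id_Type'Image (VG_1.Id (1).Port_Id)
             & ", board" & Integer'Image (VG_1.Id (1).Board_Id));
end VG_Config_Demo;
```

The board id is left at the placeholder in the initial aggregate, which mirrors the statement that it is the one field not specified in the initial configuration.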
The information in the VG configuration table is sufficient to complete the first phases
of system initialization. These phases are processor initialization, Ada elaboration, port
contention, FCR initialization, and Network Element synchronization. Processor initialization
is performed by each processor at power up to initialize the local hardware devices and
perform self tests. The test results are written to an assigned area of mass memory.
Processors which deem themselves healthy perform Ada elaboration and begin execution of the
main task. The remaining processors attempt to remain passive throughout the remainder
of the mission. Within main each processor determines its processor class and its hosting
Network Element. It then evaluates the configuration table to determine the Network Element
ports assigned to processors of its class on this Network Element. All the processors
in each class then contend for the assigned ports in ascending order. Based on the acquired
port and the hosting Network Element, each processor can determine its VG id from the
configuration table.
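The final step, determining the VG id from the acquired port and the hosting Network Element, amounts to a search of the configuration table. A minimal sketch of such a search follows. The procedure Find_VG, the reduced record layout, and the two-entry table are hypothetical and serve only to show the lookup; they are not taken from the AFTA operating system.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure VG_Lookup_Demo is
   type NE_Id_Type is (A, B, C, D, E);
   type Port_Id_Type is range 0 .. 7;
   type VG_Id_Type is range 0 .. 63;

   type Member is record
      NE   : NE_Id_Type;
      Port : Port_Id_Type;
   end record;
   type Member_Array is array (1 .. 4) of Member;

   type VG_Entry is record
      Redundancy : Natural range 0 .. 4;
      Members    : Member_Array;
   end record;

   Unused : constant Member := (A, 0);  --  filler for unused member slots

   --  Hypothetical two-entry table; the real table has one entry per VG id.
   Table : constant array (VG_Id_Type range 1 .. 2) of VG_Entry :=
     (1 => (4, ((B, 6), (C, 4), (D, 3), (E, 5))),
      2 => (1, ((B, 2), Unused, Unused, Unused)));

   --  Search the table for the VG that owns the acquired (NE, port) pair.
   procedure Find_VG (NE    : NE_Id_Type;
                      Port  : Port_Id_Type;
                      Found : out Boolean;
                      VG    : out VG_Id_Type) is
   begin
      Found := False;
      VG    := VG_Id_Type'First;
      for V in Table'Range loop
         for M in 1 .. Table (V).Redundancy loop
            if Table (V).Members (M).NE = NE
              and then Table (V).Members (M).Port = Port
            then
               Found := True;
               VG    := V;
               return;
            end if;
         end loop;
      end loop;
   end Find_VG;

   Found : Boolean;
   VG    : VG_Id_Type;
begin
   Find_VG (C, 4, Found, VG);
   if Found then
      Put_Line ("Port 4 on NE C belongs to VG" & VG_Id_Type'Image (VG));
   end if;
end VG_Lookup_Demo;
```

Only the first Redundancy member slots of each entry are examined, so filler slots in a simplex or triplex entry never produce a match.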
Port 0 on each Network Element is a specially designated port. The processors which
acquire these ports are responsible for testing and initializing the shared components of
their FCR and any dumb components within their FCR. These test results are also written
to mass memory. When the testing is complete, these processors direct the network ele-
ments to attempt initial synchronization. A Network Element can also be programmed to
spontaneously attempt synchronization. This is useful if the NE does not host any proces-
sors. The processors which did not acquire port 0 remain passive waiting for directives
from the system manager VG. When synchronization is successful, the Network Elements
start system time keeping and are capable of providing inter-VG communication.
5.2.2. Rate Group Task Configuration
The remainder of system initialization requires the use of the task configuration table.
The task configuration table maps tasks to VGs and specifies the task's rate group assignment
and message buffering requirements. Each task is assigned a communication id or
CID and has a corresponding entry in the task configuration table. The CID is used as the
task's logical address for intertask communication. An initial configuration must be speci-
fied for each task in the system and compiled into the OS. The structure of the task config-
uration table is shown in Figure 5-5.
type location_type is (no_vg, one_vg, all_vg);
type rg_type is (RG1, RG2, RG3, RG4);
type precedence_type is range 0..15;
type task_configuration_record is record
   location      : location_type;
   vg_id         : vg_id_type;
   rg            : rg_type;
   precedence    : precedence_type;
   task_id       : task_id_type;
   max_xmit_size : natural;
   max_xmit_num  : natural;
   max_rcve_size : natural;
   max_rcve_num  : natural;
end record;
type communication_id_type is (rg_dispatcher,
                               fdi,
                               system_fdi,
                               ...
                               appl1,
                               appl2);
task_configuration : array (communication_id_type) of
                        task_configuration_record;
Figure 5-5. Task Configuration Table
rg_dispatcher, fdi, system_fdi, appl1, and appl2 are tasks in the system and each task
has an associated configuration record. location is used to define whether the associated
task is not executing, executing on one VG, or executing on all VGs. If the task is not
executing, then the remaining fields are invalid. If the task is executing on only one VG, then
vg_id defines which VG. rg defines in which rate group the task is executing. precedence
is the task's precedence among the tasks executing in the same rate group on the same
virtual group. It is used to determine their execution order within the rate group frame. The
highest precedence corresponds to 15. task_id is the VG's local identifier for the task.
Unlike the communication ids, the task ids are not guaranteed to be unique throughout the
system. The maximum number and maximum size of messages that the task will have
queued for transmission are indicated in max_xmit_num and max_xmit_size. Their
max_rcve_num and max_rcve_size counterparts are used for messages that it will have
queued waiting to be read.
An initialization must be specified for each entry in the table within the enclosing package
body. A partial initialization of the above example is shown in Figure 5-6. In this example
the rate group dispatcher has been installed to execute on all VGs as an RG4 task
with the highest precedence and an application rate group task has been installed to execute
on VG 4 as an RG3 task with high precedence. It is important that the rate group dispatcher
is the first element in the communication_id_type, executes as an RG4 task, and
has the highest precedence. This guarantees that it will execute at the beginning of each
minor frame and be able to dispatch the other rate group tasks.
task_configuration(rg_dispatcher) := (
   location      => all_vg,
   rg            => RG4,
   precedence    => precedence_type'last,
   task_id       => rg_dispatcher_id,
   max_xmit_size => 40,
   max_xmit_num  => 10,
   max_rcve_size => 40,
   max_rcve_num  => 10);
task_configuration(appl1) := (
   location      => one_vg,
   vg_id         => 4,
   rg            => RG3,
   precedence    => 12,
   task_id       => appl1_id,
   max_xmit_size => 50,
   max_xmit_num  => 5,
   max_rcve_size => 10,
   max_rcve_num  => 2);
Figure 5-6. Task Configuration Table Initialization
The VG which is hosting system_fdi is designated the system manager and is responsible
for directing the remaining phases of system initialization. These phases are system
manager alignment, resource evaluation and reconfiguration, VG initialization, and
system start. System manager alignment is performed by the system manager to align the
memory and devices of its members. The system manager then performs resource evaluation
by testing the Network Elements and polling each VG in the system for its members'
test results and physical identifiers. The physical identifiers are recorded in the VG
configuration table and the test results are analyzed to determine which VGs have faulty members.
Reconfiguration is performed if there are unutilized simplex VGs which can be used to re-
place the faulty members. If there are sufficient resources after reconfiguration to meet the
mission requirements then the system manager begins VG initialization.
VG initialization consists of aligning the memory and devices of each VG and initializ-
ing their rate group tasking services, communication services, and time management ser-
vices. The system manager directs each VG to perform the above initialization and waits
for confirmation that the initialization was completed. Rate group tasking initialization uses
the task configuration table to determine the locally executing tasks and installs them into
the local rate group tasking suite. The communication services initialization allocates the
packet queues required by the local tasks and enables message based communication. Time
management initialization starts rate group tasking with the phase specified in the VG con-
figuration table when the directive to start operational execution is received from the system
manager. This directive is a broadcast to all VGs and its timestamp defines the reference
time for the rate group phasing. System initialization is completed when each VG receives
the directive and begins rate group tasking.
5.3. Rate Group Tasking Services
Within a VG of the AFTA, multiple tasks require the use of the message passing resource.
These include both application tasks and timer based preemptive Ada Run Time
System (RTS) services. In order to maintain congruent use of this resource across the
members of a VG, it is necessary to ensure there is no competition for its use. This is done
by limiting the preemption allowed in the system and by limiting the use of the message
passing resource. A rate group tasking paradigm was developed to fulfil these requirements
for the AFTA. The paradigm consists of the AFTA rate group tasking services, the
AFTA time management services, and the AFTA communication services.
Within the rate group tasking services, tasks are assigned to execute as either RG1,
RG2, RG3, or RG4 tasks and a rate group dispatcher is provided to control their execu-
tion. The tasks in each rate group must be cyclic and execute one complete iteration within
their rate group frame. The time management services only allow task preemption at the
fastest rate group's frame boundaries. At these boundaries, the rate group dispatcher pre-
empts the executing task and starts the execution of tasks in faster rate groups. The com-
munication services are provided to prevent preemptible tasks from using the message
passing resource directly. Instead, their messages are buffered on queues controlled by the
rate group dispatcher. This removes the possibility of contention for the message passing
resource.
The rate group tasking initialization and associated interaction with the time management
and communication service initialization are described in Section 5.3.1. The rate
group dispatcher is described in Section 5.3.2 and the structure of rate group tasks is
described in Section 5.3.3. The time management services will be described in Section 5.4
and the communication services will be described in Section 5.5.
5.3.1. Rate Group Tasking Initialization
Every task installed in the task configuration table must be present on each VG and will
start execution during elaboration. The tasks must suspend themselves until processor and
Network Element initialization are completed and the local VG id has been determined.
Based upon the VG id, a list of the tasks executing locally is created and these tasks are
scheduled for execution. The tasks not executing locally will not be scheduled and will
remain suspended throughout the mission. After the indicated tasks are scheduled, the
communication services and time management services are initialized. When the initialization is
complete, the rate group dispatcher will begin execution of the first rate group frame and
trigger the execution of the appropriate rate group tasks.
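The selection of locally executing tasks from the task configuration table can be sketched as follows. This is an illustration only: the function Runs_Locally, the reduced record layout, and the table contents are assumptions, and the real table carries the additional fields shown in Figure 5-5.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Local_Tasks_Demo is
   type Location_Type is (No_VG, One_VG, All_VG);
   type VG_Id_Type is range 0 .. 63;
   type Cid_Type is (RG_Dispatcher, System_FDI, Appl1);

   type Task_Entry is record
      Location : Location_Type;
      VG_Id    : VG_Id_Type;   --  meaningful only when Location = One_VG
   end record;

   --  Hypothetical table contents.
   Table : constant array (Cid_Type) of Task_Entry :=
     (RG_Dispatcher => (All_VG, 0),
      System_FDI    => (One_VG, 1),
      Appl1         => (One_VG, 4));

   --  True when the task identified by Cid should be scheduled on Local_VG.
   function Runs_Locally (Cid      : Cid_Type;
                          Local_VG : VG_Id_Type) return Boolean is
   begin
      case Table (Cid).Location is
         when No_VG  => return False;
         when All_VG => return True;
         when One_VG => return Table (Cid).VG_Id = Local_VG;
      end case;
   end Runs_Locally;
begin
   for Cid in Cid_Type loop
      if Runs_Locally (Cid, Local_VG => 4) then
         Put_Line (Cid_Type'Image (Cid) & " is scheduled on VG 4");
      end if;
   end loop;
end Local_Tasks_Demo;
```

Tasks for which the function returns False correspond to the tasks that remain suspended throughout the mission on that VG.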
The list of tasks executing locally is created from the task configuration table and is
maintained as a separate linked list of the tasks in each rate group. The head of each list is
stored in the rg_task_lists structure. rg_task_lists is used during initialization to set up the
scheduling parameters for the tasks and to allocate packet buffers for the locally executing
tasks. After initialization, it is used at each rate group frame boundary by the rate group
dispatcher to check the overrun status of the tasks and by the communication services to
transmit their messages and to update their frame markers. The declaration of the rate
group task lists is shown in Figure 5-7.
type rg_task_record;
type access_rg_task_record is access rg_task_record;
type rg_task_record is record
   task_id   : task_id_type;
   cid       : communication_id_type;
   next_task : access_rg_task_record;
end record;
type rg_task_list_array is array (rg_type) of access_rg_task_record;
rg_task_lists : rg_task_list_array;
Figure 5-7. Rate Group Task Lists
task_id is the local run time system identifier for the task. It is used for scheduling and
checking the execution status of the tasks in each list. cid is the communication id for the
task and is used to allocate the task's packet buffers and maintain its queues. next_task is a
pointer to the next entry in the list. next_task is null in the last list entry of each rate group.
The init_rate_group_tasking procedure creates the rate group task lists and sets the
scheduling parameters for the rate group tasks. It is called by the main task after the local
VG id has been determined. The order in which tasks are placed in the lists determines
their execution order within the rate group frame. The precedence field in the task
configuration entry is used to determine this order. Tasks with higher precedence will execute
before tasks with lower precedence. If tasks have equal precedence, then the task first
declared in the communication_id_type will execute first. The init_rate_group_tasking
procedure declaration is shown in Figure 5-8.
procedure init_rate_group_tasking;
Figure 5-8. Initialize Rate Group Tasking Procedure
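The ordering rule for the task lists (higher precedence first, ties broken by declaration order in the communication id type) can be obtained with an ordered insertion into a linked list. The sketch below is one possible way to do this and is not the body of init_rate_group_tasking, which this section does not show. The type names, the Insert procedure, and the precedence values are assumptions.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Task_List_Demo is
   type Cid_Type is (RG_Dispatcher, FDI, Appl1, Appl2);
   type Precedence_Type is range 0 .. 15;

   --  Hypothetical precedence values for tasks in one rate group.
   Precedence : constant array (Cid_Type) of Precedence_Type :=
     (RG_Dispatcher => 15, FDI => 12, Appl1 => 12, Appl2 => 14);

   type Node;
   type Node_Access is access Node;
   type Node is record
      Cid  : Cid_Type;
      Next : Node_Access;
   end record;

   Head : Node_Access := null;

   --  Insert so that higher precedence comes first.  Among equal
   --  precedences the earlier-declared cid stays ahead, provided the
   --  cids are inserted in declaration order.
   procedure Insert (Cid : Cid_Type) is
      New_Node : constant Node_Access := new Node'(Cid, null);
      Prev     : Node_Access := null;
      Cur      : Node_Access := Head;
   begin
      while Cur /= null
        and then Precedence (Cur.Cid) >= Precedence (Cid)
      loop
         Prev := Cur;
         Cur  := Cur.Next;
      end loop;
      New_Node.Next := Cur;
      if Prev = null then
         Head := New_Node;
      else
         Prev.Next := New_Node;
      end if;
   end Insert;

   Cur : Node_Access;
begin
   for C in Cid_Type loop
      Insert (C);
   end loop;
   Cur := Head;
   while Cur /= null loop
      Put_Line (Cid_Type'Image (Cur.Cid));
      Cur := Cur.Next;
   end loop;
end Task_List_Demo;
```

With the assumed precedences the resulting list order is RG_Dispatcher, Appl2, FDI, Appl1: FDI and Appl1 share a precedence of 12, and FDI comes first because it is declared first.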
After init_rate_group_tasking creates the task lists it then initializes the scheduling
parameters for each task in the lists. The rate group dispatcher will be the first task in the
RG4 list. Its execution priority is set to rg_dispatcher_priority and it is set to start
execution when the start_rg_tasking event is set and to thereafter resume execution after every
minor frame period interval. The execution priorities of the remaining tasks in the RG4 list
are set to rg4_priority and they are set to resume execution whenever the rg4_event is set.
The priorities of the tasks in the RG3, RG2, and RG1 lists are set to rg3_priority,
rg2_priority, and rg1_priority respectively and they are likewise set to resume execution
whenever the rg3_event, rg2_event, and rg1_event are respectively set.
The task priority is used to provide preferred execution of the rate group dispatcher and
tasks in faster rate groups when tasks in multiple rate groups are executable. The priority
declaration is shown in Figure 5-9. Lowest priority is 0 and highest priority is 15.
main_priority          : constant := 10;
rg_dispatcher_priority : constant := 9;
rg4_priority           : constant := 7;
rg3_priority           : constant := 6;
rg2_priority           : constant := 5;
rg1_priority           : constant := 4;
Figure 5-9. Task Priority
After the return from init_rate_group_tasking, main calls init_communication and
init_timekeeping. init_communication initializes the packet queues used by the communication
services and is described in Section 5.5.2. init_timekeeping sets up the local VG's rate
group frame phase and returns when the time management service has been started. It is
described in Section 5.4.1.
The start_rg_tasking event is set by main after the return from init_timekeeping. All the
rate group tasks have suspended themselves during elaboration and are waiting to be resumed
based upon their scheduling parameters set up in init_rate_group_tasking. When
start_rg_tasking is set, the rate group dispatcher is placed on the RTS ready queue and will
begin execution when the higher priority main task completes. At the event, the rate group
dispatcher is also placed in the RTS delay queue and will be resumed by the RTS every mi-
nor frame period thereafter. After setting the event, main is suspended, the rate group dis-
patcher resumes execution, and rate group tasking has started.
5.3.2. Rate Group Dispatcher
The rate group dispatcher is a special RG4 task that is responsible for controlling the
execution of the rate group tasks and providing reliable communication between rate group
tasks throughout the system. It executes at the start of each minor frame and based upon
the minor frame index determines the corresponding rate group frame boundaries. It
checks that the tasks in these rate groups have completed an iteration of their execution cycle
and uses the communication services to transmit the messages queued by these tasks
and to update the set of messages available for their retrieval. It then sets the events to trigger
the next execution cycle of these tasks and suspends itself. These rate group tasks and
any slower rate group tasks which have an execution cycle still in progress then resume ex-
ecution based upon their assigned rate group and precedence within the rate group. The
mapping of rate group frames to minor frames is shown in Figure 5-10.
[Figure: rate group frames, including the RG3 and RG2 frames, laid out against the minor frame index.]
Figure 5-10. Mapping of RG Frames to Minor Frames
During elaboration the rate group dispatcher suspends itself and will not be resumed
until the start_rg_tasking event is set. When the start_rg_tasking event is set and the higher
priority main task suspends itself, the rate group dispatcher will begin execution of its first
cycle and will repeat its execution cycle every minor frame period thereafter. At the start of
each cycle, the rate group dispatcher records a congruent value of the current time. It then
determines the slowest rate group whose frame boundary corresponds to the start of this
minor frame. Because of the mapping of rate group frames to minor frames, all faster rate
groups will also be at a frame boundary and the identifier of the slowest rate group is used
to indicate the entire set of rate groups at a frame boundary.
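The determination of the slowest rate group at a frame boundary can be sketched as a simple test on the minor frame index. The frame lengths used below (RG4 = 1, RG3 = 2, RG2 = 4, and RG1 = 8 minor frames) are an assumption made for illustration; the authoritative mapping is the one given in Figure 5-10. The function name Slowest_At_Boundary is likewise hypothetical.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Boundary_Demo is
   type RG_Type is (RG1, RG2, RG3, RG4);
   subtype Minor_Frame_Index is Natural range 0 .. 7;

   --  ASSUMED frame lengths in minor frames; see Figure 5-10 for the
   --  actual mapping of rate group frames to minor frames.
   Frame_Length : constant array (RG_Type) of Positive :=
     (RG1 => 8, RG2 => 4, RG3 => 2, RG4 => 1);

   --  Return the slowest rate group whose frame starts at this minor
   --  frame.  All faster rate groups are then also at a boundary.
   function Slowest_At_Boundary (Index : Minor_Frame_Index)
      return RG_Type is
   begin
      for RG in RG_Type loop          --  RG1 (slowest) is checked first
         if Index mod Frame_Length (RG) = 0 then
            return RG;
         end if;
      end loop;
      return RG4;   --  not reached: every index is an RG4 boundary
   end Slowest_At_Boundary;
begin
   for Index in Minor_Frame_Index loop
      Put_Line ("minor frame" & Natural'Image (Index) & ": "
                & RG_Type'Image (Slowest_At_Boundary (Index)));
   end loop;
end Boundary_Demo;
```

Because each assumed frame length divides the next slower one, returning a single identifier is enough to name the whole set of rate groups at a boundary, as the text describes.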
The send_queue and update_frame_marker communication services are then called and
passed the identifier of the slowest rate group at the frame boundary. send_queue transmits
all the messages enqueued by the tasks of the corresponding rate groups in their previous
frame. update_frame_marker updates the communication service pointers to provide a
congruent set of received messages and free buffers to the rate group tasks throughout
their frame. send_queue and update_frame_marker are described in Section 5.4.
frame_start is then called and passed the slowest rate group identifier and the time
recorded by the rate group dispatcher at the start of this execution cycle. It uses the time
value to update the time latch for each of the corresponding rate groups. A time latch is
provided for each rate group and is used to latch the time of the start of the rate group
frame. It is the only value of time which is guaranteed to be congruent during the task's
execution and the only value of time which should be used by rate group tasks. frame_start
then uses the rate group task lists to determine the tasks executing in the indicated rate
groups and checks the overrun condition of each of the tasks. If a task has overrun, the
condition is logged in the rate group dispatcher log. The log can be examined from the
terminal display. It then sets the appropriate rg4_event, rg3_event, rg2_event, or
rg1_event to ready the tasks in the indicated rate groups for their next execution cycle. The
frame_start procedure declaration is shown in Figure 5-11.
procedure frame_start(
   slowest_rg     : in rg_type;
   congruent_time : in time);
Figure 5-11. Frame Start Procedure
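The steps attributed to frame_start (latch the time, check for overruns, set the events) can be sketched as follows. This is a simplified illustration and not the AFTA implementation: it tracks one completion flag per rate group instead of walking the per-task lists, and the Time type, the state arrays, and the console log are hypothetical stand-ins.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Frame_Start_Demo is
   type RG_Type is (RG1, RG2, RG3, RG4);
   type Time is new Natural;   --  stand-in for the system time type

   --  Hypothetical per-rate-group state.
   Time_Latch : array (RG_Type) of Time    := (others => 0);
   Cycle_Done : array (RG_Type) of Boolean := (others => True);
   Event_Set  : array (RG_Type) of Boolean := (others => False);

   procedure Frame_Start (Slowest_RG     : in RG_Type;
                          Congruent_Time : in Time) is
   begin
      --  Slowest_RG and every faster rate group are at a frame boundary.
      for RG in Slowest_RG .. RG_Type'Last loop
         Time_Latch (RG) := Congruent_Time;
         if not Cycle_Done (RG) then
            Put_Line ("overrun in " & RG_Type'Image (RG));
         end if;
         Cycle_Done (RG) := False;   --  the new cycle has not completed
         Event_Set (RG)  := True;    --  ready the tasks for the next cycle
      end loop;
   end Frame_Start;
begin
   Cycle_Done (RG3) := False;        --  pretend an RG3 task did not finish
   Frame_Start (RG3, 42);
   Put_Line ("RG4 latch:" & Time'Image (Time_Latch (RG4)));
   Put_Line ("RG2 event set: " & Boolean'Image (Event_Set (RG2)));
end Frame_Start_Demo;
```

In this run only RG3 and RG4 are at a boundary, so the RG2 event is left untouched and the overrun is reported for RG3 alone.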
After the frame_start procedure is executed, the rate group dispatcher then increments
the minor frame index and suspends itself until the start of the next minor frame. This
allows the lower priority tasks which were previously executing or which were readied for
execution by setting the rate group events to begin execution based upon their priority and
precedence.
The RG1 frame boundary is a special condition. Because all rate groups have a frame
boundary at the RG1 frame boundary, this point defines where the memory used by the
rate group tasks will be congruent on all members of the VG and can be successfully
aligned. It is only at these points that the rate group manager will attempt to recover a failed
processor by invoking lost_soul. If lost_soul is required, then it will be called by the
dispatcher after update_frame_marker at the start of the first minor frame. If no channel is
recovered in lost_soul then the remainder of the frame should proceed normally. If a channel
is recovered then some frames may be slipped because of the recovery process. A detailed
description of processor recovery is contained in Section 5.6.
5.3.3. Rate Group Tasks
Rate group tasks must be uniquely associated with a communication id and a corre-
sponding task configuration table entry as described in Section 5.2.2. The table entry
must be initialized to specify whether the task is executing on one VG or executing on all
VGs. System service tasks normally execute on all VGs. If a task executes on all VGs
then broadcast messages can be used to send a message to all instantiations of the task.
Otherwise the task instantiation must be identified by specifying the hosting VG id. If a
task will execute on only one VG then that VG must be specified in the table and the task's
communication id is sufficient to uniquely identify the task. The task's rate group and
precedence within the rate group must also be specified. This determines how often the
task will execute and the order in which tasks in the same rate group and on the same VG
will execute. The local RTS identifier must also be specified to provide the link between
the logical communication identifier and the actual task.
The maximum number and maximum size of messages that each task will queue for
transmission and that may be queued for its reception must be specified. These values are
used to allocate packet buffers for the task's messages. Each task has private and separate
outgoing and incoming message queues. A given task's queue operations (including over-
flows) have no effect on the state of other tasks' queues. If an executing task attempts to
enqueue a message to a full outgoing message queue, an error indication is immediately
returned to that task, with the outgoing queue and message to be enqueued being left un-
changed. In the event of incoming queue overflows, the AFTA operating system indicates
the number of incoming messages that have been discarded to the task which would have
received the messages had the incoming queue overflow not occurred. Tasks should be
designed to check this indication of discarded incoming messages and perform appropriate
application-specific recovery from this error condition. An example of such a recovery
policy would be to utilize stale input data instead of input data derived from the discarded
input message. Note that all members of a redundant VG have an identical view of both
outgoing and incoming message queue overflow conditions. In addition, tasks are never
presented with a message queue containing partial messages: the AFTA operating system
ensures that complete messages are delivered from one task to another in the absence of
queue overflows, or no message whatsoever is transmitted.
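The two overflow behaviors described above differ, and a small model makes the difference concrete. The sketch below tracks only message counts, not message contents, and its names (Message_Queue, Enqueue_Outgoing, Deliver_Incoming, Capacity) are assumptions rather than AFTA communication service interfaces.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Queue_Demo is
   --  Plays the role of max_xmit_num or max_rcve_num for one task.
   Capacity : constant := 2;

   type Message_Queue is record
      Count     : Natural range 0 .. Capacity := 0;
      Discarded : Natural := 0;
   end record;

   --  Outgoing queue: a full queue rejects the message, reports the
   --  error to the caller, and is left unchanged.
   procedure Enqueue_Outgoing (Q  : in out Message_Queue;
                               Ok : out Boolean) is
   begin
      Ok := Q.Count < Capacity;
      if Ok then
         Q.Count := Q.Count + 1;
      end if;
   end Enqueue_Outgoing;

   --  Incoming queue: an overflowing message is dropped and counted so
   --  that the receiving task can be told how many messages it lost.
   procedure Deliver_Incoming (Q : in out Message_Queue) is
   begin
      if Q.Count < Capacity then
         Q.Count := Q.Count + 1;
      else
         Q.Discarded := Q.Discarded + 1;
      end if;
   end Deliver_Incoming;

   Outgoing, Incoming : Message_Queue;
   Ok : Boolean;
begin
   for I in 1 .. 3 loop
      Enqueue_Outgoing (Outgoing, Ok);
      Deliver_Incoming (Incoming);
   end loop;
   Put_Line ("third enqueue accepted: " & Boolean'Image (Ok));
   Put_Line ("incoming discarded:" & Natural'Image (Incoming.Discarded));
end Queue_Demo;
```

A task following the recovery policy mentioned in the text would check the discarded count at the start of its frame and fall back to stale input data when the count is nonzero.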
The task itself must have a well defined cyclic execution behavior. The task and all the
other tasks specified to execute on the VG must complete their execution cycle within their
rate group frame. If they do not, then the rate group dispatcher will detect an overrun
condition for those tasks which did not complete within their frame. These tasks are not
necessarily the ones that caused the overrun condition to occur. Rate group tasks may use
the queue_message and retrieve_message procedures to communicate with other tasks. Both the
reception and transmission of these communications are based on the rate group frame boundary as
described in Section 5.5. This must be accounted for in determining the communication
timing and the message allocation.
An example of a minimal task is shown in Figure 5-12. Associated with the task declaration
is the task id declaration used in the task configuration table initialization. The task
defines my_cid to be its associated communication id. This is used in the communication
and rate group tasking procedure calls to identify the calling task. num_deleted is used to
indicate how many of the task's messages were deleted in the previous rate group frame
because of insufficient free packet buffers. The allocation specified in the task configuration
table initialization should be used to ensure that messages are not deleted. frame_time
maintains a congruent value of the time at which the current rate group frame was started. It should
be the only value of current time that is used by the task.
task appl1_task is
end task;
appl1_id : task_id_type := id(appl1_task'address);
task body appl1_task is
   my_cid : constant communication_id_type := appl1;
   num_deleted : natural := 0;
   frame_time : time := startup_time;
begin
   loop
      wait_for_next_frame(my_cid, num_deleted, frame_time);
   end loop;
end task;
Figure 5-12. Example Rate Group Task
The task begins execution during elaboration and may perform data initialization. The
hosting VG id is not known at this time and the initialization will occur on all VGs even if
the task will not be executing on a particular VG. It must suspend itself using
wait_for_next_frame to end its task elaboration. Based on the local VG id and the task
configuration table, the task instantiations are then selectively resumed when rate group
tasking begins. When the task is resumed it returns from wait_for_next_frame with the
num_deleted and frame_time values updated. The task should then begin its cyclic execution.
At the end of its cycle it must again call wait_for_next_frame to perform its self
suspension. The wait_for_next_frame procedure declaration is shown in Figure 5-13.
procedure wait_for_next_frame(
cid : communication_id_type;
deleted_messages : out natural;
frame_time : out time);
Figure 5-13. Wait for Next Frame Procedure
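The elaborate, suspend, and resume life cycle of a rate group task can be mimicked with a Python generator. This is only an analogy under stated assumptions: each yield stands in for a call to wait_for_next_frame, the send call stands in for the dispatcher setting the rate group event, and the log list and the time values are hypothetical.

```python
def appl1_task(log):
    # Elaboration: one-time initialization, then suspend at the first wait.
    cycles = 0
    num_deleted, frame_time = yield      # stands in for wait_for_next_frame
    while True:
        # One execution cycle of the task.
        cycles += 1
        log.append((cycles, num_deleted, frame_time))
        num_deleted, frame_time = yield  # self-suspension at the end of the cycle

log = []
task = appl1_task(log)
next(task)            # run elaboration up to the first suspension
task.send((0, 100))   # resume for the frame that started at time 100
task.send((2, 110))   # next frame: two messages were deleted in the previous one
print(log)
```

Each resumption hands the task fresh num_deleted and frame_time values, which matches the out parameters of the declaration in Figure 5-13.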
wait_for_next_frame uses cid to identify the rate group of the caller and access its
cid_status_record maintained by the communication services. When wait_for_next_frame
is executed it suspends the calling task. To resume execution, the rate group dispatcher
must set the appropriate rate group event. Prior to setting the event, the dispatcher can use
the RTS to examine the execution status of the task. If it is not suspended then an overrun
condition exists. When the event is set the task resumes execution inside the
wait_for_next_frame procedure. It uses the congruent time latches maintained by the rate
group dispatcher to update frame_time and the communication service's cid_status_record
to update deleted_messages. wait_for_next_frame then returns to the calling task to begin
the next execution cycle.
5.4. Time Management
The time management service is executed on each processor in the AFTA to maintain
congruent execution between the members of a VG in the presence of timer based events
and to provide a consistent system time for all the VGs. The system time is maintained by
the Network Element aggregate as the elapsed time since system start up. When a packet is
received by a Network Element the system time is written to the packet's corresponding de-
scriptor field. This time information is used by the time management service to define ab-
solute time. Each processor in the AFTA locally maintains a timer to measure the elapsed
time since the absolute time was updated. This timer is used to generate periodic interrupts
to define the start of each minor rate group frame and to trigger the update of the absolute
time. The interrupt causes preemption of the currently executing task by the rate group dis-
patcher. The execution state of the system must be well defined at these points to maintain
congruent execution. This is provided by the rate group tasking implementation.
The system time maintained by the Network Element and its timestamp of delivered
packets has been discussed previously. Section 5.4.1 will discuss the initialization of local
time management on each processor of the AFTA. Section 5.4.2 will discuss the timing
model used by the RTS and the operation of the time management service.
5.4.1. Time Management Initialization
System time keeping is started by the Network Elements during Network Element ini-
tialization and is maintained by the Network Elements throughout system operation. The
time management service on each processor of the AFTA is closely coupled with their exe-
cution of the rate group tasking paradigm and must not be started until the VG is ready to
begin operational execution of the paradigm. Prior to that time each VG is waiting for ini-
tialization directives from the system manager VG. These messages do not go through the
communication services and are read synchronously by the main task. The timestamps associated
with these messages are used to update the local value of absolute time, but the local
timers are not active and no timer based interrupts are generated.
The VGs will receive the system manager directive to begin operational execution after
all other VG initialization has been completed. This message will be a broadcast to all VGs
and the timestamp of the message will be used as the base time for starting rate group tasking
on all VGs with the phasing specified in the rate group phase table described in Section
5.3.1. When the message is received, each member of a VG sets its local timer to generate
an interrupt based on the VG's assigned phase. When the interrupt is generated, the first
frame of rate group tasking is started and subsequently an interrupt will be generated every
minor frame period. The time management service now begins operation and is responsible
for coordinating the local timer with the Network Element system time and maintaining
the phasing specified in the phase table. Its operation is described in the next section.
init_timekeeping is the procedure responsible for starting the time management service with
the appropriate phase. Its declaration is shown in Figure 5-14.
procedure init_timekeeping(phase : in time);
Figure 5-14. Initialize Time Keeping Procedure
init_timekeeping is called from the main_task after all other VG initialization has been
completed and the directive from the system manager to begin operational execution has
been received. The timestamp of this message is the reference time for the phasing of rate
group frames across all VGs. The reading of the message has also updated the local value
of absolute time to this value. init_timekeeping is called in response to the directive and is
passed the delay corresponding to the phase specified for the VG in the rate group phase
table. It sets the local timer to generate an interrupt when the delay has expired and enables
generation of the receive message interrupt from the Network Element. Messages can now
be read asynchronously and will be processed by the communication services.
init_timekeeping then suspends itself until after the interrupt. When the interrupt is generated
the time management procedures described in the next section begin execution.
Normally the rate group dispatcher would be resumed immediately after the interrupt.
At the first timer interrupt only, the main task is still executable and will be resumed after
the interrupt. init_timekeeping will return to main, which will set the start_rg_tasking event
to start execution of the rate group dispatcher. main will then terminate its execution and
allow the lower priority rate group dispatcher to execute. It is necessary to coordinate the
start of time management and the execution of the rate group dispatcher to ensure that the
dispatcher and the rate group tasks have a full minor frame period to execute. Otherwise,
our task completion guarantees are not valid and a frame overrun condition may result.
5.4.2. Time Management Operation
After the time management service has been started, a chiming model is used to maintain
the local value of system time on each processor of the AFTA. The local timer is used
to generate the chime every minor frame period on each member of a VG. When the chime
is generated, the VG members resynchronize and congruently update the local value of
system time with the system time maintained by the Network Element. The updated time
may not agree with the expected time of the chime and this difference is used to adjust the
next chime interval and maintain a constant frame phasing among the VGs in the system.
The local value of system time is only updated at the chime interrupt and is the only time
value provided to the remainder of the RTS.
Time management is provided by the chime interrupt handler. It will execute at each
minor frame boundary and the rate group tasking paradigm must guarantee that no packet
transmissions are in progress on any of the VG members when the chime is generated.
Packet receptions may be in progress because of the asynchronous packet reception pro-
vided by the communication service. The packet reception interrupt has higher priority than
the chime interrupt and will execute to completion prior to the chime interrupt being ser-
viced. When the chime interrupt handler is executed, it disables the packet reception inter-
rupt and sends itself a synchronization packet to flush all received packets from its Network
Element buffers. It reads and handles packets in the same manner as the packet reception
handler until the synchronization packet is found. It then sends itself an additional syn-
chronization packet through the flushed network and uses the timestamp of this packet to
update the local copy of system time.
The chime interrupt handler then determines the expected time of the next chime inter-
rupt by adding the minor frame period interval to the expected time of the current chime. It
records this value and sets the next chime to be generated after an interval corresponding to the
difference between this value and the system time last read from the Network Element.
This will maintain the rate group phase relationship between the VGs. The chime interrupt
handler then enables the Network Element packet reception interrupt and returns to the
RTS. The RTS reevaluates the schedulable tasks based on the updated time. The rate
group dispatcher will now be placed on the ready queue and resume operation because it is
the highest priority ready task.
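The chime interval adjustment can be worked through with a short calculation. The following Python sketch assumes time values in microseconds and a 10 ms minor frame; both are illustrative choices, and the function name next_chime is hypothetical.

```python
def next_chime(expected_now, system_time, minor_frame):
    """Return (expected time of the next chime, timer interval to program)."""
    # Expected time of the next chime: current expected time plus one minor frame.
    expected_next = expected_now + minor_frame
    # Interval: difference between that value and the system time just read
    # from the Network Element.
    interval = expected_next - system_time
    return expected_next, interval

# The chime was expected at 50 000 us, but the system time read is 50 120 us.
expected, interval = next_chime(expected_now=50_000,
                                system_time=50_120,
                                minor_frame=10_000)
print(expected, interval)
```

Because the interval is measured from the system time actually read, a late reading shortens the next interval and the frame phasing does not drift.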
5.5. Communication Services
The communication services are used to communicate between rate group tasks. Each
rate group task has a global communication id which can be used as its logical address.
Other tasks in the system can send messages to this address and the communication ser-
vices will map the logical address to the VG executing the task. The communication is in
the form of messages enqueued by the sender for transmission at the start of the next rate
group frame and dequeued for reading by the recipient task within the next rate group frame
after it is received. Messages are delivered in the same order at all common destinations
and are delivered in the order in which they were sent.
The communication services are composed of message based interface procedures used
by rate group tasks and lower level primitives used by the rate group dispatcher. The
communication service primitives manipulate the messages as sets of queued packets.
The message and corresponding packet structures are described in Section 5.5.1. The
communication service initialization and associated control structures are described in Sec-
tion 5.5.2. The message transmission procedures are described in Section 5.5.3 and the
message reception procedures are described in Section 5.5.4.
5.5.1. Message and Packet Structure
The rate group tasks have a message based interface to the communication services.
The message itself is a contiguous block of data that is transferred from the sender to the
receiver. The block must be no larger than the maximum message size defined for the sys-
tem. Associated with the message are descriptor fields describing the sender, receiver, type
of message, and how the message is to be exchanged. The message and message descriptor
fields are supplied to the communication services by the task wishing to send a message.
The communication services then perform the exchange and deliver the message and
descriptor information to the receiving task when it requests delivery of its messages.
Internally, the communication services store and manipulate the message as a set of
fixed size packets. A packet is the exchange unit used by the Network Elements. The mes-
sage descriptor fields are mapped to packet descriptor fields and a message header. The
packet descriptors are sent with each packet and the message header is prepended to the
message data and sent only in the first packet of the message. The message and packet
structure for task-to-task communication is shown in Figure 5-15.
The message descriptors consist of the destination VG id, the destination communica-
tion id, the source VG id, the source communication id, the message class, and the size of
the message data. The destination VG id is supplied by the sending task and used in the
packet exchange, but is not delivered to the receiving task. The source VG id is not sup-
plied by the sending task, but is provided by the Network Element with each delivered
packet. It is provided as a message descriptor to the receiving task.
The VG id is used to specify a virtual group and the communication id is used to spec-
ify a task executing on the VG. The message size specifies the size of the message data in
bytes. The message class is used to specify whether the message should be broadcast to all
VGs, whether the message is task-to-task data or task-to-ne data, and whether the message
should be a voted or a single source exchange.
Broadcast messages are only useful when the destination task has an instantiation on
all VGs. In addition, they monopolize bandwidth and can cause flow control problems.
For this reason it is the intent to limit the use of broadcast messages to system service
tasks. Task-to-ne data is sent by system services to update the NE configuration table,
generate voted resets, and perform initial synchronization. All other tasks must send only
task-to-task data.
The packet descriptors consist of the destination VG id, the destination communication
id, the source VG id, the message class, and a boolean indicating whether this is the last
packet of the message. All except the last packet boolean and the source VG id are copied
directly from the message descriptor. The packet descriptors are included with every packet
and are always voted by the Network Elements. This is true even for single source ex-
changes where the corresponding packet data is not voted, but congruently replicated from
the single source. This guarantees that a single member of a redundant VG cannot cause
this information to be corrupted on the delivered packet.
Message from Sending Task: vgid of destination, cid of source, cid of destination, class of
message, size of message.
Queued Packets Exchanged by NE, first packet: class of message, vgid of destination/source,
cid of destination, last packet of message, cid of source, size of data in last packet, first
block of message data.
Queued Packets Exchanged by NE, subsequent packets: class of message, vgid of
destination/source, cid of destination, last packet of message, next block of message data.
Message to Receiving Task: vgid of source, cid of source, cid of destination, class of
message, size of message, message data.
Figure 5-15. Task-to-Task Message and Packet Formats
The source VG id and the last packet boolean are used by the receiving VG's communi-
cation services to link the packets of a message together. Within the AFTA, the communi-
cation services of each VG will send all the packets of a message before it starts sending the
next message. Therefore, all the packets from a VG will belong to the same message until
the last packet boolean is true. Then the next packet will be the start of a new message.
The destination cid is used by the receiving communication services to determine the correct
queue for the packet. The source VG id is supplied by the NE with each delivered packet
and is copied to the receive message descriptor.
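The linking rule described above can be sketched in Python. The tuple layout (source VG id, destination cid, last packet boolean, data) and the function name reassemble are hypothetical; the only behavior taken from the text is that packets from one source VG belong to the same message until the last packet boolean is true.

```python
def reassemble(packets):
    """Group packets into messages per source VG using the last-packet flag.

    packets: iterable of (source_vg, dest_cid, last_packet, data) tuples.
    Returns a list of (source_vg, dest_cid, message_data) in completion order.
    """
    partial = {}    # source_vg -> data blocks of the message in progress
    complete = []
    for source_vg, dest_cid, last_packet, data in packets:
        partial.setdefault(source_vg, []).append(data)
        if last_packet:
            # The next packet from this VG starts a new message.
            blocks = partial.pop(source_vg)
            complete.append((source_vg, dest_cid, b"".join(blocks)))
    return complete

# Packets from VG 1 and VG 2 arrive interleaved.
stream = [
    (1, 7, False, b"AB"),
    (2, 9, False, b"xy"),
    (1, 7, True, b"C"),
    (2, 9, True, b"z"),
]
print(reassemble(stream))
```

Interleaving between different source VGs does not matter, because each VG's packets are tracked separately.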
In addition to the packet descriptors described above, a packet syndrome and packet
timestamp are also provided by the NE for each delivered packet. These fields are maintained
in the queue of received packets and are used by system services, but they are not
propagated to the receiving task.
The message header is only prepended to the message data for task-to-task data and is
sent in the first packet of the message. It is not included when task-to-ne data is to be sent.
The message header consists of the source communication id and the size of the data in the
last packet. The source communication id is used only as information to the receiving task.
The size of the data in the last packet is used with the number of packets in the message to
determine the message size. This is copied to the receive message descriptor.
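The message size calculation can be illustrated as follows. This Python sketch rests on assumptions that the text does not state: a packet data size of 64 bytes, a two-byte message header in the first packet, and a last-packet size that counts message data only. The names PACKET_SIZE, HEADER_SIZE, packetize, and message_size are hypothetical.

```python
PACKET_SIZE = 64   # assumed size of the packet data area in bytes
HEADER_SIZE = 2    # assumed header size: source cid and last-packet size

def packetize(data):
    """Split message data into blocks; the first block leaves room for the header."""
    first_capacity = PACKET_SIZE - HEADER_SIZE
    blocks = [data[:first_capacity]]
    rest = data[first_capacity:]
    blocks += [rest[i:i + PACKET_SIZE] for i in range(0, len(rest), PACKET_SIZE)]
    return blocks

def message_size(num_packets, last_packet_size):
    """Recover the message size from the packet count and the last-packet size."""
    if num_packets == 1:
        return last_packet_size
    first = PACKET_SIZE - HEADER_SIZE          # data carried by the first packet
    middle = (num_packets - 2) * PACKET_SIZE   # full packets between first and last
    return first + middle + last_packet_size

blocks = packetize(bytes(150))
print(len(blocks), len(blocks[-1]), message_size(len(blocks), len(blocks[-1])))
```

Carrying only the last-packet size in the header is enough because every other packet of the message is full.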
The header information is included in the packet data and will not be voted during sin-
gle source exchanges. A single member of a VG may therefore cause this information to be
faulty in the delivered message. Corrupted data will not cause loss of system service, but it
may confuse the receiving task. Tasks should be written to tolerate this condition if they
expect to receive single source messages.
5.5.2. Communication Services Initialization
A transmit packet queue and a receive packet queue are maintained for each cid. They
are the buffers between the underlying packet based communication primitives which di-
rectly access the Network Elements and the message based communication services which
are used by the rate group tasks. The transmit queues are used to guarantee that the packets
written to the NEs by the members of a VG have a consistent ordering. The receive queues
are used to guarantee that rate group tasks see a consistent set of available messages. Both
these conditions are necessary to guarantee that the members of a VG do not diverge.
Each queue is partitioned into a set of active packets followed by a set of free packets.
The active transmit packets contain data waiting to be written to the NE. The active receive
packets contain data waiting to be read by a task. During initialization all the packets allo-
cated for a task are placed in the free portion of the respective transmit or receive queue.
The allocation is based on the values specified for the corresponding task in the task com-
munication table described in Section 5.2. The queues are maintained as linked lists with
pointers to the entry at the head of the active portion, to the entry at the head of the free
portion, and to the entry at the tail of the free portion.
In the transmit queues, entries are moved from the free portion to the active portion
when a message is enqueued by a task. This transition is performed by using the entries at
the free head to store the packetized message and then making the free head point to the
next free entry in the queue. Entries are removed from the active portion and replaced in
the free portion when the stored packets are written to the Network Element. Only the entry
at the active head is written. After it is written, it is removed from the active head and
replaced at the free tail and the active head and free tail are updated. The transmit queues
are maintained as singly linked lists. An example of the transmit and receive queues is
shown in Figure 5-16.
Transmit Queue pointers: active head, free head, free tail.
Receive Queue pointers: active head, active frame, free head, free frame, free tail.
Figure 5-16. Transmit and Receive Queues
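The movement of entries between the free and active portions of a transmit queue can be sketched in Python. For brevity this sketch models the two portions as two deques rather than as one singly linked list with head and tail pointers, so it shows the ordering behavior only; the class and method names are hypothetical.

```python
from collections import deque

class TransmitQueue:
    def __init__(self, num_entries):
        self.active = deque()                   # packets waiting to be written
        self.free = deque(range(num_entries))   # ids of reusable entries

    def enqueue_message(self, packets):
        # A message is accepted only if every one of its packets fits.
        if len(packets) > len(self.free):
            return False
        for packet in packets:
            entry = self.free.popleft()         # take the entry at the free head
            self.active.append((entry, packet))
        return True

    def write_active_head(self, network_element):
        # Only the entry at the active head is written.
        entry, packet = self.active.popleft()
        network_element.append(packet)
        self.free.append(entry)                 # replaced at the free tail

    def send_all(self, network_element):
        while self.active:
            self.write_active_head(network_element)

ne = []
q = TransmitQueue(3)
print(q.enqueue_message(["p1", "p2"]))
print(q.enqueue_message(["p3", "p4"]))
q.send_all(ne)
print(ne, list(q.free))
```

Packets leave in the order they were enqueued, so every member of a VG that enqueues the same messages writes the same packet sequence.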
In the receive queues, entries are moved from the free portion to the active portion
when a packet is read from the Network Element and the transition is the same as in the
transmit queues. Entries are removed from the active portion and replaced in the free por-
tion when a task retrieves a message. Because the packets of messages may be interleaved,
entries may be removed from anywhere within the active portion. The entries in this portion
of the queue are doubly linked so the linked list can be maintained when entries other
than those at the active head must be removed. The removed entries are replaced at the free
tail and the active head and free tail are updated as necessary.
Maintaining separate queues for each cid guarantees congruent ordering of input and
output messages for the corresponding task, but it does not guarantee the timing of the
packet events within the rate group frame with respect to other members of the VG. The
only guarantee is that at the rate group frame boundary an identical set of events will have
occurred on all members of that rate group. Controlled access to these queues within the
rate group frame is necessary to prevent divergence of the VG members. Access to the
transmit queues is controlled within the communication service primitives by only writing
packets to the Network Element at the rate group frame boundary and then writing all enqueued
packets. This guarantees that all the queue entries will be in the free portion of the
queue at the start of the frame. The completion of the task within its rate group frame guarantees
that each member of a VG will have a congruent set of packets in the active portion
of its transmit queue when the packets are sent at the end of a frame.
Unlike the transmit queues, the receive queues may be updated by the communication
primitives throughout a rate group frame. This is done whenever the packet reception inter-
rupt is generated by the Network Element. In order to maintain congruent operation, a
frame marker is provided for the active portion of each receive queue to indicate the packets
which were read from the NE prior to the start of the current frame. The packets between
the active head and the active frame marker are guaranteed to be present on all members of
the VG and are made available for reading by the task if they compose a complete message.
The free portion of each receive queue must also have a frame marker. This is necessary to
ensure that a consistent set of free entries is available within the rate group frame for new
packets read during the frame's execution. Otherwise, some members of a VG may have
no free entries for a packet to a given task, while others do have a free entry and continue
normal execution. The entries between the free head and the free frame marker are guaran-
teed to be free on all members of the VG and are usable to store received packets. The
frame markers for a given queue are only updated at the rate group frame boundary for the
corresponding task.
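The effect of the two frame markers can be seen in a small Python model. It is a sketch under simplifying assumptions: the queue is modeled with counters and a list instead of linked entries, message completeness is ignored, and all names are hypothetical. It shows that two VG members which see a packet arrival and a task read in opposite orders still end the frame in the same state.

```python
class ReceiveQueue:
    def __init__(self, num_entries):
        self.active = []                  # received packets, oldest first
        self.visible = 0                  # active frame marker: active[:visible] readable
        self.free_total = num_entries     # all free entries
        self.usable_free = num_entries    # free frame marker: entries usable this frame
        self.discarded = 0

    def packet_arrives(self, packet):
        # Only entries in front of the free frame marker may be used.
        if self.usable_free == 0:
            self.discarded += 1
            return False
        self.usable_free -= 1
        self.free_total -= 1
        self.active.append(packet)
        return True

    def task_reads(self):
        # Only packets in front of the active frame marker are readable.
        taken = self.active[:self.visible]
        del self.active[:self.visible]
        self.free_total += len(taken)     # freed entries land behind the free marker
        self.visible = 0
        return taken

    def frame_boundary(self):
        # Markers are updated only at the rate group frame boundary.
        self.visible = len(self.active)
        self.usable_free = self.free_total

def run(order):
    q = ReceiveQueue(2)
    q.packet_arrives("p1")
    q.packet_arrives("p2")
    q.frame_boundary()
    read = None
    for step in order:
        if step == "read":
            read = q.task_reads()
        else:
            q.packet_arrives("p3")
    return read, q.active, q.discarded

print(run(["read", "arrive"]))
print(run(["arrive", "read"]))
```

Without the free frame marker, the member that read first would have had room for p3 while the other member would have discarded it.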
type xmit_pkt_record;
type access_xmit_pkt_record is access xmit_pkt_record;
type xmit_pkt_record is record
   next_in_queue : access_xmit_pkt_record;
   message_class : message_class_type;
   to_vg : vg_id_type;
   to_cid : communication_id_type;
   last_packet : boolean;
   packet : message_data_record;
end record;
type rcve_pkt_record;
type access_rcve_pkt_record is access rcve_pkt_record;
type rcve_pkt_record is record
   next_in_queue : access_rcve_pkt_record;
   previous_in_queue : access_rcve_pkt_record;
   first_in_message : access_rcve_pkt_record;
   next_in_message : access_rcve_pkt_record;
   message_class : message_class_type;
   from_vg : vg_id_type;
   to_cid : communication_id_type;
   last_packet : boolean;
   syndrome : syndrome_type;
   timestamp : timestamp_type;
   packet : message_data_record;
end record;
Figure 5-17. Transmit and Receive Packet Queue Entries
The transmit and receive queue entry declarations are shown in Figure 5-17. The transmit
and receive queues use next_in_queue as their forward link to the next entry in the list;
if the entry is in the free portion, this is the entry's only valid field. Entries in the active
portion of the receive queue use previous_in_queue as their backward link to remove entries
from the middle of the queue and use first_in_message and next_in_message to reconstruct
packetized messages. first_in_message is null except in the last packet of the corresponding
message. There it is set to point to the first packet of the message. This information
is used to search the queue for the first completed message. next_in_message is
used to reconstruct the message once a completed message is found. The previous_in_queue,
first_in_message, and next_in_message fields are not necessary for the
transmit queues because the messages in the transmit queues are stored in contiguous
packets.
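The search for the first completed message can be sketched in Python using the same link names as Figure 5-17. The class RcveEntry and the helper link are hypothetical scaffolding for the example, and the sketch ignores the active frame marker; only the use of first_in_message and next_in_message follows the text.

```python
class RcveEntry:
    def __init__(self, from_vg, data, last_packet):
        self.from_vg = from_vg
        self.data = data
        self.last_packet = last_packet
        self.next_in_queue = None
        self.previous_in_queue = None
        self.first_in_message = None   # set only in the last packet of a message
        self.next_in_message = None

def link(entries):
    """Set queue and message links for entries given in arrival order."""
    open_msgs = {}   # from_vg -> (first entry, most recent entry) of open message
    prev = None
    for e in entries:
        e.previous_in_queue = prev
        if prev is not None:
            prev.next_in_queue = e
        prev = e
        if e.from_vg in open_msgs:
            first, latest = open_msgs[e.from_vg]
            latest.next_in_message = e
        else:
            first = e
        if e.last_packet:
            e.first_in_message = first
            open_msgs.pop(e.from_vg, None)
        else:
            open_msgs[e.from_vg] = (first, e)
    return entries[0] if entries else None

def first_complete_message(active_head):
    """Scan the queue for the first entry that closes a message, then rebuild it."""
    e = active_head
    while e is not None:
        if e.first_in_message is not None:
            blocks, p = [], e.first_in_message
            while p is not None:
                blocks.append(p.data)
                p = p.next_in_message
            return b"".join(blocks)
        e = e.next_in_queue
    return None

head = link([RcveEntry(1, b"AB", False),
             RcveEntry(2, b"xyz", True),
             RcveEntry(1, b"C", True)])
print(first_complete_message(head))
```

The message from VG 2 is found first because its last packet precedes the last packet of the VG 1 message in the queue.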
The remaining fields are part of the packet descriptor and the packet data as described in
the previous section. The message class and message data declarations are shown in Figure
5-18. message_data_type indicates whether the data is being sent task-to-task
(task_data) or task-to-ne (ct_update, voted_reset, or init_sync).
message_exchange_type indicates whether the data is to have a voted exchange or a single
source exchange. message_data_record is a discriminant record that can contain the message
header for the first packet in a task-to-task message or contain only message data for
subsequent packets in the message or for task-to-ne messages.
type message_data_type is (task_data, ct_update, voted_reset, init_sync);
type message_exchange_type is (sync, vote, A, B, C, D, E);
type message_class_type is record
   broadcast : boolean;
   data_type : message_data_type;
   exchange_type : message_exchange_type;
end record;
type message_data_record(header : boolean) is record
   case header is
      when TRUE =>
         from_cid : communication_id_type;
         last_packet_size : natural;
         data : array (3 .. packet_size) of unsigned_byte;
      when FALSE =>
         data : array (1 .. packet_size) of unsigned_byte;
   end case;
end record;
Figure 5-18. Message Class and Message Data Structure
The pointers used to access each queue are maintained in a queue table. The table is
referenced by cid and is initialized by the init_communication procedure. The table declaration
is shown in Figure 5-19.
type cid_queue_record is record
   xmit_active_head : access_xmit_pkt_record;
   xmit_free_head : access_xmit_pkt_record;
   xmit_free_tail : access_xmit_pkt_record;
   rcve_active_head : access_rcve_pkt_record;
   rcve_active_frame : access_rcve_pkt_record;
   rcve_free_head : access_rcve_pkt_record;
   rcve_free_frame : access_rcve_pkt_record;
   rcve_free_tail : access_rcve_pkt_record;
end record;
type cid_queue_array is array (communication_id_type) of
   cid_queue_record;
cid_queue_table : cid_queue_array;
Figure 5-19. CID Queue Table
init_communication uses the task configuration table and rate group task lists described
in Section 5.3 to allocate packet entries for the tasks which will be executed locally. For
each of these tasks, entries are allocated from memory based upon their transmit and receive
memory requirements specified in the task configuration table. These entries are
placed in their respective transmit and receive queues and the remaining queue pointers are
initialized. A warning will be generated if there is insufficient memory to allocate the required
number of packets. init_rate_group_tasking must have been called previously to
initialize the rate group task lists. After init_communication is called, the communication
services are ready to begin operation, but the packet reception interrupt from the Network
Element has not yet been enabled. The interrupt is enabled when the directive to begin rate
group tasking is received from the system manager VG and init_timekeeping is called.
When the interrupt is disabled the communication services can be bypassed and this is the
operational mode during system initialization. The init_communication declaration is
shown in Figure 5-20.
procedure init_communication;
Figure 5-20. Initialize Communication Procedure
5.5.3. Message Transmission
Procedures are provided to immediately transmit a message or to enqueue the message
for transmission at the end of a rate group frame. Immediate transmission may only be performed
by the rate group dispatcher or RG4 tasks and is done using send_message. Enqueued
message transmission must be done by RG3, RG2, and RG1 tasks. It may also be
done by RG4 tasks and the rate group dispatcher. The message is enqueued using
queue_message and is transmitted by the rate group dispatcher using send_queue.
send_message bypasses the transmit queue and directly accesses the Network Element
to transmit a message. It must not be preempted, otherwise members of the VG may write
different data to the NE and diverge. For this reason only the rate group dispatcher and
RG4 tasks are allowed to use send_message. These tasks are guaranteed to complete their
iteration every minor frame and will therefore not have pending calls of send_message
when the frame expires. send_message should be used only if it is absolutely necessary
and under well defined operating conditions. It is especially dangerous if hardware flow
control gets asserted because the message transmission (and hence the transmitting task)
will be stalled in a busy wait until the flow control condition is cleared. Stalling a nonpre-
emptive RG4 task for an excessive amount of time could result in a cascade of frame over-
runs. The send_message declaration is shown in Figure 5-21.
type send_error_flag_type is (no_errors, illegal_message,
   illegal_destination, inactive_destination);
procedure send_message(
   source_cid : in communication_id_type;
   destination_cid : in communication_id_type;
   destination_vg : in vg_id_type;
   message_class : in message_class_type;
   message_address : in address;
   message_size : in natural;
   error_flag : out send_error_flag_type);
Figure 5-21. Send Message Procedure
source_cid is the communication id of the caller. It is included for possible use by the
receiving task. destination_cid is the communication id of the recipient. It is used to access the
task configuration table and determine where the destination task is executing. If the task is
executing on all VGs then destination_vg is used to determine which instantiation of the
task should be sent the message. message_class is used to determine the type of message
which is to be sent. message_address is the starting address of the data to be sent. message_size
is the size of the data in bytes. The procedure returns the completion status in error_flag.
no_errors indicates the operation was performed. illegal_message indicates the
data size was not acceptable or the address could not be accessed. illegal_destination indicates
an illegal combination of destination cid, destination VG, and/or message class.
inactive_destination indicates the destination cid was not active.
queue_message is used by rate group tasks to queue messages for transmission by the
rate group dispatcher at the end of their frame. When it is called by a task the source cid is
examined to determine where to queue the message. The message size is then examined to
determine if there are enough free transmit packet buffers to enqueue the message. If there
are, then the packet descriptors and a message header are constructed and the message is
parsed and written into the free packet buffers. These packets are then removed from the
free list and placed on the task's transmit active queue. If there are insufficient buffers then
a failure condition is returned. The queue_message declaration is shown in Figure 5-22.
type queue_error_flag_type is (no_errors, illegal_message,
   illegal_destination, inactive_destination, insufficient_free_entries);
procedure queue_message (
   source_cid      : in communication_id_type;
   destination_cid : in communication_id_type;
   destination_vg  : in vg_id_type;
   message_class   : in message_class_type;
   message_address : in address;
   message_size    : in natural;
   error_flag      : out queue_error_flag_type);
Figure 5-22. Queue Message Procedure
The parameters of queue_message are the same as those of send_message. source_cid is
now also used to determine in which queue the message belongs. An
insufficient_free_entries flag is provided to indicate there are not enough free packet
buffers to enqueue the message.
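The buffer accounting described for queue_message can be sketched as follows. This is a minimal illustration, not the AFTA implementation: the packet payload size, the number of free buffers, the helper packets_needed, and the treatment of a zero-length message as illegal are all assumptions made for the example, and addressing and destination checks are omitted.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure queue_message_sketch is

   --  Assumed values for illustration only; the report does not state them.
   packet_payload_bytes : constant := 64;

   type queue_error_flag_type is
     (no_errors, illegal_message, illegal_destination,
      inactive_destination, insufficient_free_entries);

   free_buffer_count : Natural := 4;   --  free transmit packet buffers
   flag              : queue_error_flag_type;

   --  Hypothetical helper: packets needed to carry message_size bytes,
   --  rounded up.
   function packets_needed (message_size : Natural) return Natural is
   begin
      return (message_size + packet_payload_bytes - 1) / packet_payload_bytes;
   end packets_needed;

   --  Models only the buffer check: a message is accepted whole or
   --  rejected whole, never partially enqueued.
   procedure queue_message
     (message_size : in  Natural;
      error_flag   : out queue_error_flag_type)
   is
      needed : constant Natural := packets_needed (message_size);
   begin
      if message_size = 0 then
         error_flag := illegal_message;        --  assumed size rule
      elsif needed > free_buffer_count then
         error_flag := insufficient_free_entries;
      else
         --  Buffers leave the free list for the transmit active queue.
         free_buffer_count := free_buffer_count - needed;
         error_flag := no_errors;
      end if;
   end queue_message;

begin
   queue_message (message_size => 200, error_flag => flag);  --  takes 4 buffers
   Put_Line (queue_error_flag_type'Image (flag));
   queue_message (message_size => 10, error_flag => flag);   --  none left
   Put_Line (queue_error_flag_type'Image (flag));
end queue_message_sketch;
```

With these assumed sizes the first call consumes all four buffers and the second is rejected with insufficient_free_entries, which is the failure condition a rate group task would have to handle itself.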
At the end of each frame, the rate group dispatcher determines the corresponding rate
group frame boundaries as described in Section 5.3.2. The dispatcher calls send_queue
with the slowest rate group at a frame boundary as its parameter. Because of the mapping
of rate group frames to minor frames, all faster rate groups are also at their frame bound-
ary. send_queue examines the cid_queue_record for all the tasks in the indicated rate
groups and transmits all their enqueued messages. Any incomplete messages in the queue
are flushed. This indicates a task overrun or other error condition and is logged. The
send_queue declaration is shown in Figure 5-23.
procedure send_queue (slowest_rg : in rg_type);
Figure 5-23. Send Queue Procedure
5.5.4. Message Reception
Messages may be read by the receiving tasks using read_message or retrieve_message.
The rate group dispatcher and RG4 tasks may use read_message. It is a blocking call that
returns the next message for the caller. The message may already be in the receive queue or
it may require waiting for additional packets to read from the NE. RG3, RG2, and RG1
tasks must use retrieve_message. It returns the next message for the task if it exists in the
receive active queue between its head and the frame marker. Otherwise, it returns a failure
status. retrieve_message is also usable by RG4 tasks and the rate group dispatcher.
update_frame_marker is a specialized procedure provided to the rate group dispatcher to
update the receive queue frame markers for the tasks in the indicated rate groups.
Packets are asynchronously read from the Network Element throughout the VG's execution
in response to an NE generated packet ready interrupt. As each packet is received, the
communication services read the associated from_vg descriptor and examine the associated
message pending table entry. The message pending table indicates whether there is a
message in progress from the sending VG. If there is, then this packet must belong to that
message and it is linked to the previous packets of the message. Otherwise it is the first
packet of a new message and the to_cid descriptor is examined to place it on the appropriate
queue. The message pending table declaration is shown below.
type message_pending_record is record
   first_packet    : access packet_record;
   previous_packet : access packet_record;
   message_deleted : boolean;
end record;
message_pending_table : array (vg_id_type) of
   message_pending_record;
Figure 5-24. Message Pending Table
If a message is in progress from the sending VG then first_packet points to the first
packet of the message. If the last_packet descriptor field indicates this is the last packet of
the message, then the first_in_message field for this packet is set to first_packet. This is
used to search the queue for completed messages. previous_packet points to the previous
packet of the message and is used to link this packet with the previous packet of the
message. message_deleted is a boolean indicating whether the pending message was flushed
because of insufficient free packet buffers. If message_deleted is true then all subsequent
packets of the message are also flushed.
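The use of these fields during packet reception can be sketched as below. This is an illustrative model under stated assumptions, not the AFTA code: array indices stand in for the access values of Figure 5-24, the pool size, the next link, and the have_buffer parameter are hypothetical, and the reclaiming of packets already stored for a flushed message is omitted.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure message_pending_sketch is

   --  Hypothetical sizes; index 0 plays the role of a null access value.
   type vg_id_type is range 1 .. 4;
   max_packets : constant := 8;
   subtype packet_index is Natural range 0 .. max_packets;

   type packet_record is record
      from_vg          : vg_id_type   := 1;
      last_packet      : Boolean      := False;
      next             : packet_index := 0;  --  link to next packet of message
      first_in_message : packet_index := 0;  --  set only on the last packet
   end record;

   type message_pending_record is record
      first_packet    : packet_index := 0;
      previous_packet : packet_index := 0;
      message_deleted : Boolean      := False;
   end record;

   packets : array (1 .. max_packets) of packet_record;
   used    : packet_index := 0;
   message_pending_table : array (vg_id_type) of message_pending_record;

   --  Handle one received packet. have_buffer = False models an
   --  exhausted receive free queue.
   procedure receive_packet
     (from_vg     : in vg_id_type;
      last_packet : in Boolean;
      have_buffer : in Boolean := True)
   is
      pending : message_pending_record renames message_pending_table (from_vg);
   begin
      if not have_buffer then
         pending.message_deleted := True;   --  flush the rest of the message
      end if;

      if not pending.message_deleted then
         used := used + 1;
         packets (used).from_vg := from_vg;
         packets (used).last_packet := last_packet;
         if pending.first_packet = 0 then
            pending.first_packet := used;                    --  new message
         else
            packets (pending.previous_packet).next := used;  --  link to it
         end if;
         pending.previous_packet := used;
         if last_packet then
            packets (used).first_in_message := pending.first_packet;
         end if;
      end if;

      if last_packet then
         pending := (0, 0, False);   --  ready for the next message
      end if;
   end receive_packet;

begin
   --  A two-packet message from VG 2 is stored and linked.
   receive_packet (from_vg => 2, last_packet => False);
   receive_packet (from_vg => 2, last_packet => True);
   Put_Line ("first_in_message of packet 2 ="
             & packet_index'Image (packets (2).first_in_message));

   --  A message from VG 3 loses its first packet, so all of it is dropped.
   receive_packet (from_vg => 3, last_packet => False, have_buffer => False);
   receive_packet (from_vg => 3, last_packet => True);
   Put_Line ("packets stored =" & packet_index'Image (used));
end message_pending_sketch;
```

Keeping one pending record per sending VG is what lets packets of messages from different VGs interleave on the same receive queue while each message is still linked or flushed as a unit.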
If there are no available buffers to store a received packet then that packet and all the
other packets of the message are flushed. The available buffers are those in the recipient
task's receive free queue between its head and frame marker. Limiting the task's message
storage to its allocated set of buffers controls propagation of the task's flow control
problem. Limiting the use of free entries to those between the head and frame marker
guarantees congruent behavior between the members of the VG.
When a packet must be flushed, first_packet in the message pending table is used to
determine if previous packets of the message have been read. If they have, then the
associated queue entries are removed from the active portion of the queue and placed at the
head of the free portion. This violates the queue paradigm described previously, but
minimizes the number of deleted messages by allowing reuse of the buffers during the current
frame. Otherwise the buffers would be placed at the tail of the free queue and would not be
usable until the next frame update.
A counter of the messages deleted in the current frame is then incremented. If this is
not the last packet of the message, then message_deleted is set to true. All subsequently
read packets of the message are deleted and when the last packet of the message is read the
message pending table is readied for the start of the next message. The messages deleted
counter is maintained in the cid status table shown in Figure 5-25.
type cid_status_record is record
   deleted_in_last    : natural;
   deleted_in_current : natural;
end record;
type cid_status_array is array (communication_id_type) of
   cid_status_record;
cid_status_table : cid_status_array;
Figure 5-25. CID Status Table
The information made available to the tasks about the number of messages deleted must
also be based on their rate group frame to guarantee congruent operation. When a message
is deleted the deleted_in_current field is incremented. At the next frame boundary, this
information is transferred to deleted_in_last and made available to the task whose messages
were deleted.
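The two-counter scheme can be sketched as follows. This is a minimal illustration built on the Figure 5-25 declarations; the cid range and the procedure names note_deleted and roll_over are hypothetical.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure cid_status_sketch is

   type communication_id_type is range 1 .. 3;   --  hypothetical cid range

   type cid_status_record is record
      deleted_in_last    : Natural := 0;
      deleted_in_current : Natural := 0;
   end record;

   type cid_status_array is array (communication_id_type) of cid_status_record;

   cid_status_table : cid_status_array;

   --  Called when a message addressed to cid is flushed.
   procedure note_deleted (cid : in communication_id_type) is
   begin
      cid_status_table (cid).deleted_in_current :=
        cid_status_table (cid).deleted_in_current + 1;
   end note_deleted;

   --  Called at the rate group frame boundary of the task that owns cid.
   procedure roll_over (cid : in communication_id_type) is
   begin
      cid_status_table (cid).deleted_in_last :=
        cid_status_table (cid).deleted_in_current;
      cid_status_table (cid).deleted_in_current := 0;
   end roll_over;

begin
   note_deleted (2);
   note_deleted (2);
   --  The task-visible count does not change in mid frame.
   Put_Line (Natural'Image (cid_status_table (2).deleted_in_last));
   roll_over (2);
   --  After the frame boundary the task sees both deletions.
   Put_Line (Natural'Image (cid_status_table (2).deleted_in_last));
end cid_status_sketch;
```

Because the task only ever reads deleted_in_last, every member of the VG sees the same value regardless of when within the frame its copy of the packets was flushed.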
The frame markers make the asynchronous reception transparent to the rate group tasks
except for loss of execution time in the frame and a possible increase in execution skew.
The interrupt is disabled when the synchronization packet used by update_frame_marker is
read. This is necessary to guarantee a consistent state when update_frame_marker updates
the control structures. After the control structures are updated the interrupt is enabled and
the corresponding interrupt handler is again used to read packets during the remainder of
the frame.
read_message returns the next available message to the calling task. It must be called
only by the rate group dispatcher or RG4 tasks. It first looks for a completed message to
the calling task starting from the head of its receive active queue. If a completed message is
not found between the queue head and the queue frame marker, then read_message may
use packets beyond the frame marker. If a completed message is not found in the queue
then it waits for packets to be added to the queue by the asynchronous receive packet inter-
rupt handler until a completed message is found. If the last packet of the completed mes-
sage is past the frame marker, then the frame marker for that queue is updated to the last
packet. This is allowed because all the members of the VG are guaranteed to read the same
set of packets prior to reading the last packet. Because of the asynchronous packet recep-
tion, no conclusion can be determined about packets after the last packet of the message or
packets in other queues. The asynchronous reading of packets also disallows updating the
task's receive queue frame marker. read_message may take an indefinite amount of time
and should only be used if absolutely necessary and under well defined operating condi-
tions. The declaration of the read_message procedure is shown in Figure 5-26.
type read_error_flag_type is (no_errors, buffer_too_small);
procedure read_message (
   source_cid      : out communication_id_type;
   source_vg       : out vg_id_type;
   destination_cid : in communication_id_type;
   message_class   : out message_class_type;
   message_address : in address;
   message_size    : in out natural;
   error_flag      : out read_error_flag_type);
Figure 5-26. Read Message Procedure
read_message is passed destination_cid, message_address, and message_size.
destination_cid specifies which receive queue to examine. message_address and
message_size describe the buffer into which the task wants its message to be copied. If the
buffer is not large enough for the message then buffer_too_small is returned in error_flag
and the actual message size is returned in message_size. source_cid, source_vg,
message_class, and message_size are copied from the received message descriptors.
update_frame_marker is called by the rate group dispatcher at a rate group frame
boundary. It is used to congruently update the set of packets usable by the tasks in that
rate group within their next frame. When it is called it sets the receive free queue frame
marker to the receive free queue tail for all the tasks in the rate group. Because the tasks in
this rate group have completed their execution at the frame boundary this provides the same
set of free buffers for use when reading packets within the frame. It then sends itself a
class 0 synchronization packet and waits until this packet is written to its receive queue.
When the synchronization packet is read by the receive interrupt handler, the handler
disables the packet reception interrupt. This ensures that the same set of packets will have
been read on all members of the VG when update_frame_marker resumes execution and
sees the synchronization packet in its receive queue. update_frame_marker then sets the
receive active queue frame marker to the receive active queue tail for all the tasks in the rate
group, and the messages deleted information in the cid_status_record for the corresponding
tasks is also updated. The deleted_in_current field of the cid_status_record for the
corresponding tasks is copied to deleted_in_last and deleted_in_current is reset to zero.
deleted_in_last is then propagated to each task as a return value from wait_for_next_frame
when it resumes execution. When the data structures have been congruently updated
update_frame_marker re-enables the packet receive interrupt. The update_frame_marker
procedure declaration is shown in Figure 5-27.
procedure update_frame_marker (slowest_rg : in rg_type);
Figure 5-27. Update Frame Marker Procedure
slowest_rg is the rate group frame boundary corresponding to this call of
update_frame_marker. Because of the dispatching cycle used by the rate group dispatcher,
the frame boundary for any given rate group will also be a boundary for all faster rate
groups. This property is used by the dispatcher and update_frame_marker to remove the
need for multiple calls of update_frame_marker at a given boundary. Instead the dispatcher
calls update_frame_marker with an rg of the slowest rate group at the boundary and
update_frame_marker uses one synchronization packet to update the frame markers of the
tasks in that rate group and all faster rate groups.
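The nesting property can be illustrated with a small sketch. It assumes, purely for the example, four rate groups in which rg4 is the fastest and each slower group's frame is twice as long, measured in minor frames; the actual mapping is the one defined in Section 5.3.2. The function slowest_at_boundary is hypothetical, and update_frame_marker here is a stand-in that only prints which rate groups one call would cover.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure frame_boundary_sketch is

   --  Assumption: rg1 is the slowest rate group and rg4 the fastest.
   type rg_type is (rg1, rg2, rg3, rg4);

   --  Assumed frame lengths in minor frames.
   frame_length : constant array (rg_type) of Positive :=
     (rg1 => 8, rg2 => 4, rg3 => 2, rg4 => 1);

   --  Slowest rate group whose frame ends with the given minor frame.
   function slowest_at_boundary (minor_frame : Positive) return rg_type is
   begin
      for rg in rg_type loop                 --  slowest first
         if minor_frame mod frame_length (rg) = 0 then
            return rg;
         end if;
      end loop;
      return rg4;   --  not reached: every minor frame ends an rg4 frame
   end slowest_at_boundary;

   --  Stand-in for update_frame_marker: a single call covers slowest_rg
   --  and every faster rate group.
   procedure update_frame_marker (slowest_rg : in rg_type) is
   begin
      for rg in slowest_rg .. rg_type'Last loop
         Put (" " & rg_type'Image (rg));
      end loop;
      New_Line;
   end update_frame_marker;

begin
   for minor_frame in 1 .. 8 loop
      Put ("minor frame" & Integer'Image (minor_frame) & ":");
      update_frame_marker (slowest_at_boundary (minor_frame));
   end loop;
end frame_boundary_sketch;
```

Under these assumed frame lengths the dispatcher makes exactly one call per minor frame, and at minor frame 8 that one call covers all four rate groups.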
retrieve_message returns the next available message to the calling task which has been
read prior to the last frame marker. It can be called by the rate group dispatcher or by any
of the rate group tasks. retrieve_message looks for a completed message to the indicated
cid from the head of its receive active queue to its frame marker. If the message is found it
is unpacketized and reconstructed at the message address specified in the retrieve_message
call. The message descriptor fields are then updated and the freed buffers are placed at the
tail of the receive free queue. Otherwise, an error condition is returned. The declaration of
the retrieve_message procedure is shown in Figure 5-28.
type retrieve_error_flag_type is (no_errors, buffer_too_small, no_message);
procedure retrieve_message (
   source_cid      : out communication_id_type;
   source_vg       : out vg_id_type;
   destination_cid : in communication_id_type;
   message_class   : out message_class_type;
   message_address : in address;
   message_size    : in out natural;
   error_flag      : out retrieve_error_flag_type);
Figure 5-28. Retrieve Message Procedure
The retrieve_message parameters are the same as read_message except for the addition
of a no_message error flag. This is used to indicate no completed message was found in
the queue.
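The bounded search performed by retrieve_message can be sketched as follows. This is a simplified model under stated assumptions: each queue entry stands for a whole message rather than a packet, the queue contents and frame marker position are invented for the example, and leaving a message in the queue when the caller's buffer is too small is an assumption, since the text does not say what happens to it.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure retrieve_message_sketch is

   type retrieve_error_flag_type is (no_errors, buffer_too_small, no_message);

   --  Simplified entry: one per message instead of one per packet.
   type queue_entry is record
      complete : Boolean := False;
      size     : Natural := 0;
   end record;

   --  Hypothetical receive active queue. Entry 4 arrived after the last
   --  frame boundary, so it lies beyond the frame marker.
   queue : array (1 .. 4) of queue_entry :=
     (1 => (complete => False, size => 16),
      2 => (complete => True,  size => 32),
      3 => (complete => True,  size => 8),
      4 => (complete => True,  size => 8));
   frame_marker : constant Natural := 3;

   size : Natural;
   flag : retrieve_error_flag_type;

   --  Search from the queue head up to the frame marker only.
   procedure retrieve_message
     (buffer_size  : in  Natural;
      message_size : out Natural;
      error_flag   : out retrieve_error_flag_type) is
   begin
      message_size := 0;
      error_flag := no_message;
      for i in queue'First .. frame_marker loop
         if queue (i).complete then
            message_size := queue (i).size;
            if queue (i).size > buffer_size then
               error_flag := buffer_too_small;   --  message stays (assumed)
            else
               error_flag := no_errors;
               queue (i).complete := False;      --  consumed
            end if;
            return;
         end if;
      end loop;
   end retrieve_message;

begin
   for call in 1 .. 3 loop
      retrieve_message (buffer_size => 64, message_size => size,
                        error_flag => flag);
      Put_Line (retrieve_error_flag_type'Image (flag) & Natural'Image (size));
   end loop;
end retrieve_message_sketch;
```

The third call reports no_message even though entry 4 is complete, because that entry lies past the frame marker and becomes visible only after the next update_frame_marker.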
5.6. Fault Detection, Identification and Recovery
The AFTA uses hardware redundancy with fault detection and masking capabilities to
provide fault tolerance. This inherent fault detection capability is supplemented with
traditional self test methods to increase AFTA's coverage of faults.
The fault tolerance provided by the hardware is enhanced by the Fault Detection,
Identification and Recovery (FDIR) functions which are part of the AFTA operating sys-
tem. While the hardware alone in the AFTA could sustain one fault, the FDIR software
allows it to sustain multiple successive faults by identifying a faulty component and masking
it from system operations. Consequently, the primary purpose of FDIR is to maintain
correct operation in the presence of hardware faults. To achieve this, FDIR has four main
functions:
• testing of AFTA components, i.e., initiating various test procedures in order to
uncover hardware failures.
• identifying a failed component, i.e., detecting a fault, isolating it to a single com-
ponent and disabling the faulty component.
• performing a remedial operation, i.e., initiating a recovery operation commensu-
rate with system requirements.
• performing transient fault analysis, i.e., determining whether the error was due to
a transient fault.
5.6.1. System and Test Modes
Each of the 4 primary functions of FDIR has various alternatives which arise because
the system operating conditions vary. Since the FDIR functions must be commensurate
with these conditions, numerous options are posed to match these requirements.
Consequently, much of the subsequent discussion on FDIR functions will occur within the
framework of system modes and test modes.
FDIR functions occur at all stages of the AFTA's operations. As the computing system
proceeds through the various system modes from an initial power-on state through a
standby mode to a fully operational mode, the testing methodology also evolves through
various modes of testing commensurate with the operational constraints. During each test-
ing mode, suites of tests are activated to exercise the AFTA components both individually
and systematically as comprehensively as possible. Specifically, there are three test modes
(initial built-in test (I-BIT), maintenance built-in test (M-BIT), and continuous built-in test
(C-BIT)) and three system modes (power-on, standby, operational or mission critical).
Figure 5-29 depicts the interaction of the system modes and the test modes.
[Diagram not recoverable from the source scan; legible labels: power on, mission completed.]
Figure 5-29. System mode and test mode interactions
The I-BIT and the M-BIT are automated sequences of tests which are executed to test
the functionality of the AFTA components. In the I-BIT mode a basic set of tests is exe-
cuted where the primary goal is to ensure the correct operation of all components which are
configured into the operational system. Although the M-BIT mode is somewhat similar,
philosophically the intent is to extensively test for line-maintenance reasons. The I-BIT is
initiated automatically at power-on whereas the M-BIT is commanded by an operator. Be-
cause the power on sequence is constrained by time the set of tests comprising the I-BIT
suite is a subset of the M-BIT test suite. The C-BIT tests are a set of low-overhead tests
which execute during mission critical operations to identify and disable faulty components
and to uncover latent faults.
5.6.2. Off-Line Fault Detection, Isolation and Recovery
In actuality, the functions encompassing fault detection, isolation and recovery are di-
vided into 3 groups - those diagnostic functions performed by the individual components
(Off-Line FDIR), those functions performed by a single virtual group in monitoring itself
(Local FDIR) and those functions of the system manager which monitor the system com-
ponents globally (System FDIR).
After a system reset occurs or power is applied to the AFTA components, all compo-
nents operate individually rather than systematically as a fault tolerant computer. During
this phase of operation the Off-Line FDIR exercises a sequence of diagnostic tests of the
individual components to determine which components shall be incorporated into the initial
configuration of the AFTA.
5.6.3. Local Fault Detection, Isolation and Recovery
After the Network Elements have synchronized with each other the AFTA operates as a
fault tolerant system which provides fault tolerant communication mechanisms to process-
ing entities referred to as virtual groups. Using these communication mechanisms each
virtual group will exercise some level of fault detection and identification (FDI) capabilities
for identification of failures among its processors. Simplex virtual groups may perform
only processor self testing. Fault masking groups, which are virtual groups consisting of 3
or 4 members, can not only perform various levels of testing (unlike simplexes) but can also
unequivocally diagnose a failure in a constituent processor. The fault masking virtual
group maintains correct operation even when one of its members has failed. Furthermore,
it may initiate certain recovery options.
5.6.4. System Fault Detection, Isolation and Recovery
The local FDIR function executing in a virtual group monitors itself and performs some
recovery operation which directly affects itself. However, in order to monitor the AFTA
system globally and also to determine the health of shared components such as the network
elements, a system FDIR is necessary. The system FDIR executes on a single fault mask-
ing group and is responsible for high level testing of the AFTA such as a poll of all virtual
groups within the system. This is particularly important when a simplex virtual group ex-
hibits faulty behavior. Since a simplex cannot mask itself out of the system configuration
via configuration table updates, the system FDIR assumes this responsibility. In addition,
some recovery options require global information regarding system resources; this
information is unavailable to the local FDIR functions.
The system FDIR function is only one of many system-wide functions of the system
manager.
5.6.5. Operational Modes
The AFTA operations are characterized by 2 distinct modes of operations. When the
AFTA components are initially powered on or when a reset occurs, all AFTA components
are operating independently. The processors on an FCR backplane bus can only
communicate with other devices which occupy slots on this bus. The Network Elements in the
AFTA are not synchronized with each other; nor are they performing fault tolerant message
exchanges. During this mode of operation the AFTA is capable of performing only non-
fault tolerant operations. On the other hand, when the Network Elements become syn-
chronized and are capable of performing fault tolerant message exchanges, the AFTA is
transformed into a fault tolerant system.
[Diagram not recoverable from the source scan; legible labels: non-fault tolerant operations, network element, fault tolerant operations.]
Figure 5-30. Operational modes
The off-line FDIR task is solely responsible for all testing activity while the AFTA is
operating as a non-fault tolerant computing system. During fault tolerant operation, both
the local FDIR and the system FDIR tasks share responsibility for execution of all testing
and recovery functions.
5.6.6. Fault Detection Mechanisms
The AFTA is a highly reliable system which achieves its reliability by exploiting the
testing capabilities available in both modes of operation. During non-fault tolerant opera-
tions the AFTA executes device self tests which extensively test the functional subcompo-
nents of the device. These tests directly exercise the functionality of a component. If the
component behavior disagrees with the expected result the tested component is identified as
faulty. These tests are intended to identify faults in a line replaceable module (LRM) with
the emphasis on isolating the fault to a chip-level component. This goal can be achieved
using on-board diagnostic mechanisms or functionally equivalent tests.
While the AFTA is operating in a fault tolerant state, the repertoire of tests changes to
include tests utilizing the inherent fault detection mechanisms. In addition, during this latter
operational mode the operating system characteristics also change; a rate group task
scheduling mechanism is activated. Consequently, certain other mechanisms become avail-
able for exploitation.
Although these 2 modes appear to require disjoint sets of tests, this is not the case.
When operating in the non-fault tolerant mode, only the device self tests may be exercised.
However, during fault tolerant operations, system tests are exercisable and some of the
device self tests may be executed provided that they do not violate the operational
requirements of the AFTA operating system. These constraints will be discussed in subsequent
sections.
5.6.6.1. Enumeration of Mechanisms
The various tests are able to identify faults in the following AFTA components:
processors, Network Elements, I/O devices, FCR backplane bus, power conditioners and
mass memory devices.
5.6.6.1.1. Processor Self Tests
The processor self test suite will exercise various components of each processing ele-
ment. Specifically, the tests will exercise the CPU, cache, memory, real-time clock, mem-
ory management unit, floating point coprocessor as well as any on-board I/O functions.
The following tests define the test suite for the Motorola MVME147 single board micro-
computer which will probably be used in the AFTA Brassboard. If a different processor is
selected for incorporation into the AFTA a functionally similar set of tests would comprise
the processor self tests.
The following set of tests is executed by a processor on its own constituent components.
5.6.6.1.1.1. CPU Tests
Register - The register test performs a thorough test of all registers.
Instruction Set - This test performs various data movements, integer arith-
metic, logical, shift and bit manipulation functions.
Addressing Modes - This tests the various addressing modes.
Exception Processing - This tests many of the exception processing functions.
5.6.6.1.1.2. Cache Tests
Basic Data Caching - This tests the gross functionality of the data cache.
D Cache Tag RAM - This tests the tag RAM by causing accesses to locations
generating a variety of tags.
D Cache Data RAM - This tests the data RAM by causing various values to be
written and read from the data cache.
D Cache Valid Flags - This test verifies that the valid flags are properly set
when the associated entry is valid and cleared when the cache is flushed or
the individual entry is cleared.
D Cache Burst Fill - This tests the burst fill mechanism.
Basic Instruction Caching - This tests the basic functionality of the instruction
cache.
Unlike Instruction Function Codes - This tests the ability of the cache to rec-
ognize instruction function codes.
I Cache Disable - This tests the ability to enable/disable the instruction cache.
I Cache Invalidate - This tests the ability to invalidate cache entries.
5.6.6.1.1.3. Memory Tests
Marching Address - This tests the address lines for "stuck high" or "stuck
low" conditions.
Marching One - This tests each RAM location's ability to maintain a single bit
in all bit positions.
Refresh - This tests the refresh mechanism by writing a pattern into RAM and
checking it after a time period has elapsed.
Random Byte - This tests byte data transfer and comparison operations on
RAM locations.
Program - This tests the RAM's ability to execute a self test program in RAM.
TAS - This tests the Test and Set operation.
Brief Parity - This tests the parity checking ability on longwords.
Extended Parity - This tests the parity checking ability on bytes.
5.6.6.1.1.4. MMU Tests
Root Pointer Register - This tests the root pointer register with a marching bit
test.
Translation Control Register - This tests the translation control register by
clearing and then setting the Initial Shift field.
Super Prog Space - This test enables the MMU and initiates a table access in
supervisor program space.
Super_Data Space - This test enables the MMU and initiates an access in su-
pervisor data space.
Write/Mapped-Read Pages - This tests the ability of the MMU to read data
which had been written while the MMU was disabled.
Read Mapped ROM - This tests some of the upper MMU address lines by at-
tempting to access ROM.
Fully Filled ATC - This tests the address translation cache by verifying that all
entries in the translation cache can hold a page descriptor.
User_Prog Space - This tests the function code signal lines into the MMU by
accessing user program space.
User_Data Space - This tests the function code signal lines into the MMU by
accessing user data space.
Indirect Page - This tests the ability of the MMU to handle an indirect descrip-
tor.
Page-Desc Used-Bit - This tests the ability of the MMU to set the Used bit in a
page descriptor when the page is accessed.
Page-Desc Modify-Bit - This tests the ability of the MMU to set the Modify bit
in a page descriptor when the page is written.
Segment-Desc Used-Bit - This tests the ability of the MMU to set the Used bit
in a segment descriptor when the corresponding segment is accessed.
Invalid Page - This tests the ability of the MMU to detect an invalid page and
generate a bus error when access is attempted to that page.
Invalid Segment - This tests the ability of the MMU to detect an invalid seg-
ment and generate a bus error when access is attempted to that segment.
Write-Protect Page - This tests the page write protect mechanism in the MMU.
Write-Protect Segment - This tests the segment write protect mechanism in the
MMU.
Upper-Limit Violation - This tests the capability of the MMU to detect when a
logical address exceeds the upper limit of a segment.
Lower-Limit Violation - This tests the capability of the MMU to detect when a
logical address exceeds the lower limit of a segment.
Prefetch on Invalid-Page Boundary - This test determines if the MC68030
rightfully ignores a bus error that occurs as a result of a prefetch into an
invalid page.
Modify-Bit and Index - This tests the capability of the MMU to set the Modify
bit in a page descriptor of a page which has an index field greater than 0
when the page is written.
Sixteen-Bit User-Program Space - This tests the capability of the MMU to ac-
cess user program space in 16-bit mode.
Sixteen-Bit Page-Desc Modify-Bit - This tests the ability of the MMU to set
the Modify bit in a page descriptor when the page is written in 16-bit
mode.
Sixteen-Bit Indirect Page - This tests the ability of the MMU to handle an indi-
rect descriptor in 16-bit mode.
RMW Cycle - This test performs the Test-and-Set instruction in 3 modes to
verify that the MMU functions correctly during read/modify/write cycles.
5.6.6.1.1.5. I/O Tests
Ethernet LANCE Chip - This performs an initialization and both internal and
external loopback tests on the local area network components.
Z8530 Serial I/O Chip - This tests the functionality of the Z8530 chips for se-
rial transmission and reception.
Interval & Watchdog Timers - This tests the functionality of the interval and
watchdog timers.
DMA Controller - This tests the functionality of the DMA device registers.
Power Fail & Bus Error Interrupt Enables - This test writes and reads the AC
fail interrupt control and Bus error interrupt control registers.
VMEBus Interface - This tests the VME gate array registers by reading and
writing from the local processor bus.
5.6.6.1.1.6. Miscellaneous Tests
Real-Time Clock/BBRAM Test - This tests the real time clock functionality
and the battery backed-up RAM.
Bus Time-out Error Test - This tests the local bus time-out and global bus time-
out error conditions.
Floating Point Coprocessor Test - This tests the functionality of the floating
point coprocessor.
5.6.6.1.2. Network Element Self Tests
The following tests are executed by a processor communicating with the tested
Network Element via the FCR backplane bus. These tests exercise various components of
the network element:
5.6.6.1.2.1. Processor-Network Element Interface
Dual port RAM - This tests the ability of the Dual port RAM to be written to
and read from the processor.
Ring buffer management - This tests the activation of packet transfers and tests
the ability of the Ring Buffer Manager to access the proper input and out-
put ring buffers and to check the proper assertion of Output Buffer Full
(OBF) and Input Buffer Empty (IBE).
Packet receive interrupt - This tests the functionality of this interrupt mecha-
nism.
5.6.6.1.2.2. Network Element Data Paths
Class 1 data path FIFO test - This tests the functionality of the data paths
through the data path FIFOs using the voting rules for class 1 exchanges.
Class 2 data path FIFO test - This tests the functionality of the data paths
through the data path FIFOs using the voting rules for class 2 exchanges.
Voter error detection capability - This tests the error detection capability of the
voter.
Message reflection multiplexer - This tests the special data paths involved in
source congruent exchanges.
5.6.6.1.2.3. Network Element Global Controller
Global controller - This tests the functionality of the global controller.
ISYNC test - This tests the ability of the global controller to achieve synchro-
nization with the other channels using the debug wrap mode.
Transient NE recovery test - This tests the ability of the global controller to
resynchronize with the other channels and update the configuration table.
5.6.6.1.2.4. Scoreboard
Message class test - This tests the operation of the scoreboard in sending pack-
ets of every allowable class.
Configuration Table Updates - This tests the ability to regenerate the system
configuration and to reset all timeouts.
OBNE Timeout detection - This tests the detection of the OBNE condition and
the generation of the OBNE timeout syndrome.
IBNF Timeout detection - This tests the detection of the IBNF condition and
the generation of the IBNF timeout syndrome.
Scoreboard vote error detection - This tests the detection of a scoreboard vote
condition and the generation of the scoreboard vote syndrome.
5.6.6.1.2.5. Inter-Fault Set Communication Links
Optical data links and TAXIs - This tests the correct operation of the devices
used in the optical communication network.
5.6.6.1.2.6. Voted Reset
Voted reset - This tests the ability to detect a system reset sent by a majority of
other Network Elements and to issue a system reset of its own FCR.
5.6.6.1.2.7. Fault Tolerant Clock
Fault tolerant clock - This tests the ability to detect a self-ahead or self-behind
condition and to compensate correctly for this clock skew.
5.6.6.1.3. FCR Backplane Bus Self Tests
The FCR backplane bus will be tested using a standardized suite of self tests to exercise
such functions as bus arbitration, bus master control, etc.
5.6.6.1.4. Input/Output Device Self Tests
Input/Output devices may range from a simple "dumb" I/O device to an intelligent de-
vice which behaves as a processor. In the former case, a processor on the FCR backplane
bus will exercise a suite of tests to evaluate its functionality; in the latter case, the I/O device
itself may be capable of executing processor-like self tests.
The tests to exercise the I/O device functions will be determined as I/O devices are
identified in subsequent phases of the AFTA program.
5.6.6.1.5. Power Conditioner Self Tests
The power conditioners in each fault containment region will be nominally tested via the
on-board set of tests of an intelligent power conditioner.
5.6.6.1.6. Mass Memory Self Tests
The mass memory device is a memory unit with error detection and correction capabil-
ity consisting of both non-volatile RAM and ROM. It is accessible by all components in
the fault containment region via the FCR backplane bus.
The mass memory devices in each fault containment region will be nominally tested via
a suite of tests intended to ensure that the memory contents are correct and that the memory
addressability is operating properly. In fact, the same memory tests described for the pro-
cessor on-board memory may be executable on the processor but access the mass memory
device if the mass memory exhibits the appropriate characteristics (for example, support for
parity). Consequently, the mass memory tests would include the marching address,
marching one, refresh, random byte, test-and-set, brief parity and extended parity tests.
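The marching one and marching address tests named above can be illustrated with a short sketch. It assumes the common meaning of these test names (a single 1 bit walked through every bit position, and each cell loaded with its own address); the report's own definitions appear with the processor memory tests and may differ in detail. The function names and the StuckAtZeroMemory fault model are hypothetical and serve only to show a detection. Both tests overwrite memory, so in this form they suit only the non-fault tolerant mode.

```python
class StuckAtZeroMemory:
    """Hypothetical faulty RAM model: one bit of one cell always reads as 0."""

    def __init__(self, size, bad_addr, bad_bit):
        self.cells = [0] * size
        self.bad_addr = bad_addr
        self.bad_bit = bad_bit

    def __len__(self):
        return len(self.cells)

    def __getitem__(self, addr):
        value = self.cells[addr]
        if addr == self.bad_addr:
            value &= ~(1 << self.bad_bit)  # force the stuck bit low on read
        return value

    def __setitem__(self, addr, value):
        self.cells[addr] = value


def marching_one(memory, width=8):
    """March a single 1 bit through every bit position of every cell.

    Returns the (address, pattern) pairs that failed to read back.
    Destructive: the original contents are overwritten.
    """
    failures = []
    for addr in range(len(memory)):
        for bit in range(width):
            pattern = 1 << bit
            memory[addr] = pattern
            if memory[addr] != pattern:
                failures.append((addr, pattern))
    return failures


def marching_address(memory):
    """Store each cell's own address (low 8 bits), then verify in a second pass.

    Returns the addresses whose contents no longer match.
    """
    for addr in range(len(memory)):
        memory[addr] = addr & 0xFF
    return [addr for addr in range(len(memory)) if memory[addr] != (addr & 0xFF)]
```

Running either function against a healthy bytearray returns an empty list, while the stuck-bit model reports the affected address.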
5.6.6.1.7. System Tests
The tests discussed previously exercise the functionality of the individual line replace-
able modules. Conversely, the system tests exercise functions requiring multiple compo-
nents operating in tandem to effectively test the system. Because the AFTA is designed as
a fault tolerant system, fault detection mechanisms are built into the specially designed in-
terconnection network and are exercised at every message exchange to provide high cover-
age of faults with low fault latency. The goal of the system self tests is to test the AFTA as
an operating entity exercising these fault tolerant mechanisms.
Fault tolerance in the AFTA is implemented using hardware redundancy. A specially
designed set of Network Elements operate in tight synchrony to implement fault tolerant
message exchanges among processors grouped into redundant virtual groups. The
constituent processors in a virtual group communicate with the other members of their
virtual group and with other virtual groups by synchronously sending messages via the
Network Elements. The Network Elements perform fault tolerant specific operations on messages and
deliver voted messages to all members of the destination virtual group. The voting process
generates a consistent voted copy of the message as well as error syndrome data which are
reported with the delivered message. This error syndrome information can be used to iden-
tify faulty components.
A number of tests can be constructed based upon these fault tolerant capabilities as well
as upon the characteristics of the operating environment:
1) The presence test is a means of polling various components to determine if each is
active and synchronized. Within the AFTA the presence tests can be employed at 2 levels
of abstraction: presence test on members of a virtual group (intra-virtual group) and pres-
ence tests on each virtual group in the AFTA (inter-virtual group). The failure of either
type of presence test implies that the tested entity is not synchronized.
2) The error syndrome data described in Section 4 indicates the Network Element de-
tected an erroneous condition. The analysis of this syndrome data can identify either a pro-
cessor or a Network Element as faulty.
3) Because a voted exchange of information generates a consistent message, the voted
message mechanism can be used to create a consistent voted copy of memory. In this test
(called RAM scrub) the contents of RAM locations known to have congruent data are
compared across all channels by voting the contents of memory. If there is a discrepancy,
the fault is logged and the correct value is written to the faulty RAM location.
4) In the PROM check test the contents of PROM are verified by summing all loca-
tions and comparing the results against a voted value.
5) The voter test will test the Network Element voting mechanism by seeding
non-congruent values selectively on each channel of a fault masking group. This tests
not only the composite data but also the syndrome generation.
6) The class test will test the Network Element voting mechanism by requesting a
non-congruent message exchange class selectively on each channel of a fault masking group.
This tests the SERP processing and the timeout syndrome generation.
7) The Network Element presence test checks the status of each Network Element.
Specifically, it determines the synchronization of all Network Elements. This test is used
only during the I-BIT mode to determine the initial system configuration.
8) In order to overcome possible software or hardware errors which disable the
system, the watch dog timer is implemented. A timer is reset by software at regular
intervals. This timer is decremented periodically by an interrupt process. A timer
decremented to 0 is indicative of an error, since the timer had not been reset within the
predefined time.
9) Exception handlers are provided to handle undesirable events such as a divide by
zero exception, an illegal instruction or an overflow. In some cases these events are ex-
pected by an application and the application should provide a means to account for this sit-
uation. However, for those unforeseen situations where a fault causes the trap invocation,
a handler will be provided which will initiate remedial action to recover from the fault.
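The RAM scrub of item 3 can be sketched in software as follows. In the AFTA the vote is performed by the Network Element hardware during a voted message exchange; this sketch replaces that with a simple software majority vote over per-channel memory images, which is an assumption made purely for illustration. The function name, the log format, and the handling of the no-majority case are likewise hypothetical.

```python
from collections import Counter


def ram_scrub(channels, addresses):
    """Vote each address across channels; log and repair disagreeing copies.

    channels:  list of mutable sequences, one RAM image per channel.
    addresses: locations known to hold congruent data.
    Returns a log of (address, channel_index, bad_value, voted_value) entries.
    """
    log = []
    for addr in addresses:
        values = [ram[addr] for ram in channels]
        voted, count = Counter(values).most_common(1)[0]
        if count <= len(channels) // 2:
            # No strict majority: leave memory untouched and report the address.
            log.append((addr, None, None, None))
            continue
        for idx, ram in enumerate(channels):
            if ram[addr] != voted:
                log.append((addr, idx, ram[addr], voted))
                ram[addr] = voted  # write the voted value to the faulty location
    return log
```

With three channels holding [1, 2, 3], [1, 9, 3] and [1, 2, 3], the scrub logs the discrepancy at address 1 on the second channel and restores the value 2 there.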
5.6.6.2. Operational Constraints of Fault Detection Mechanisms
During each system mode the time constraints, the requirements on maintenance of
mission critical information and even the system configuration differ. Consequently, the
tests executed during each of the test modes vary based upon these factors. Because the
power on sequence describes the transitions among these operational environments, a brief
description of the power on sequence with the emphasis on testing follows (refer to Figure
5-31):
1) Upon initiation of power or manual system reset, the AFTA system is essentially
established in an initial, unsynchronized state where each component is operating indepen-
dently. While in this unsynchronized state, the primary emphasis is to execute as many
device self tests as possible. The processors test themselves; subsequently, a single pro-
cessor is selected which exercises I-BIT self tests of the FCR backplane bus, the Network
Element, power conditioner, mass memory and I/O devices.
2) The initial synchronization of the Network Elements is a process whereby the net-
work elements synchronize and commence fault tolerant message exchanges. Each net-
work element, operating independently, can be directed to synchronize upon direction of a
processor within the fault containment region or by its own global controller. Subsequent
to the initial synchronization, the processors (now members of virtual groups) are capable
of performing fault tolerant message exchanges with each other. When the initial synchro-
nization phase terminates a system configuration has been established. This configuration
will consist of fault masking virtual groups commensurate with the minimum dispatch
complement.
3) After the initial synchronization, a single redundant virtual group will assume the
task as the system manager. This system manager virtual group will request that all virtual
groups (whether simplex, triplex, or quadruplex) transfer their diagnostic test results to the
system manager. The system manager evaluates the results and reconfigures those redun-
dant virtual groups which contain a faulty component with the intent of achieving the mini-
mum dispatch complement for computing resources at the required reliability level.
4) After the redundant system configuration has been established, the system manager
commands each virtual group to initiate the real-time scheduler and to commence the system
tests. The minimal set of I-BIT system tests will be exercised.
5) When it has been determined that the minimum dispatch complement for the current
mission has been established, the AFTA system will be established in a state referred to as
"operational standby". During this state the comprehensive suite of M-BIT system tests
will be exercised until the mission is activated.
6) When the mission is activated, the system is established in the "mission critical"
mode where the C-BIT tests are executed concurrent with the mission functions. As
indicated in Figure 5-31, the AFTA operating system prevents execution of any BIT other than
C-BIT until a reliable indication is given that the mission is over and it is safe to enter other
diagnostic modes. This could be a composite indication from mutually corroborative
sources such as the weight-on-wheels switch, rotor RPM, vehicle INS and rate gyro sys-
tem, propulsion status, and pilot discrete(s).
7) Alternatively, from the standby state an operator may command the M-BIT
sequence of tests, which executes similarly to the I-BIT tests.
Figure 5-31. Test Mode Sequences
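The power on sequence described in steps 1 through 7 can be read as a small state machine. The sketch below is one possible encoding, not the AFTA operating system's implementation: the mode and event names are hypothetical, the return to operational standby after M-BIT or after the mission is an assumption (the text does not state the destination), and the mission-over guard simply requires every corroborating source to agree.

```python
from enum import Enum, auto


class Mode(Enum):
    UNSYNCHRONIZED = auto()       # step 1: device self tests (I-BIT)
    INITIAL_SYNC = auto()         # step 2: Network Elements synchronize
    CONFIGURATION = auto()        # step 3: system manager reconfigures groups
    IBIT_SYSTEM_TESTS = auto()    # step 4: minimal I-BIT system tests
    OPERATIONAL_STANDBY = auto()  # step 5: M-BIT system tests until mission start
    MISSION_CRITICAL = auto()     # step 6: C-BIT only
    MBIT = auto()                 # step 7: operator-commanded M-BIT


TRANSITIONS = {
    (Mode.UNSYNCHRONIZED, "self_tests_done"): Mode.INITIAL_SYNC,
    (Mode.INITIAL_SYNC, "synchronized"): Mode.CONFIGURATION,
    (Mode.CONFIGURATION, "configured"): Mode.IBIT_SYSTEM_TESTS,
    (Mode.IBIT_SYSTEM_TESTS, "dispatch_complement_met"): Mode.OPERATIONAL_STANDBY,
    (Mode.OPERATIONAL_STANDBY, "mission_activated"): Mode.MISSION_CRITICAL,
    (Mode.OPERATIONAL_STANDBY, "operator_mbit"): Mode.MBIT,
    (Mode.MBIT, "mbit_done"): Mode.OPERATIONAL_STANDBY,  # assumed destination
}


def next_mode(mode, event, mission_over_sources=None):
    """Return the mode that follows `mode` when `event` occurs.

    mission_over_sources: dict of source name -> bool, for example the
    weight-on-wheels switch or rotor RPM. All must agree (an assumption).
    """
    if event == "reset":
        return Mode.UNSYNCHRONIZED  # power on or manual system reset
    if mode is Mode.MISSION_CRITICAL:
        # Only a corroborated mission-over indication unlocks other BIT modes.
        corroborated = bool(mission_over_sources) and all(mission_over_sources.values())
        if event == "mission_over" and corroborated:
            return Mode.OPERATIONAL_STANDBY  # assumed destination
        return Mode.MISSION_CRITICAL
    return TRANSITIONS.get((mode, event), mode)
```

An operator M-BIT request received in the mission critical mode leaves the mode unchanged, which mirrors the rule that only C-BIT may execute until the mission is reliably known to be over.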
5.6.6.2.1. I-BIT Mode Self Tests
The I-BIT test mode is automatically initiated when power is applied to all AFTA com-
ponents. However, the I-BIT test mode is constrained by a requirement that this mode be
active for only seconds. Presumably, shortly after initiation, it is desirable that the vehicle
be mission ready. Because of this time limitation only a subset of the self tests employed in
the M-BIT mode shall be executed. The emphasis in the selection of I-BIT tests is primarily
to exercise the functionality of all LRMs in the AFTA and secondly to test those components
as comprehensively as time permits.
The component tests will be sequenced such that components deemed to be non-faulty
will exercise subsequently tested components. This methodology requires that the initial
component test itself. Although it cannot be guaranteed that a faulty initial component will
correctly conclude its own health, subsequent system testing can detect faulty behavior and
that faulty component will be eliminated.
During the non-fault tolerant operational mode, the AFTA components are not
synchronized or operating as a fault tolerant computer. Consequently, there is no critical
information which must be maintained aside from test results and there are no synchronization
constraints. The tests can destructively change memory locations on the processors,
test the Network Elements in a debug wrapback mode, or change the bus master on the
FCR backplane bus. In addition, since the real-time operations have not commenced, the
scheduling constraints are relaxed.
5.6.6.2.2. M-BIT Mode Self Tests
The M-BIT test mode is initiated by an operator in an operational environment with the
expressed purpose of extensively testing all components of the AFTA. Because of the lack
of a severe time constraint, the suite of tests could conceivably be the identical set as that
executed in the depot test mode. However, there are two primary differences between
these test modes - the operator interface and the automation of the test sequence. The site
of the M-BIT execution is on-board a hosting vehicle, which naturally implies that the
operator is either the vehicle operator or a line maintenance crew. In addition, the M-BIT tests
exercise all components of the AFTA as an automated sequence. In contrast, the depot
tests are conducted at a remote repair facility by a repair technician exercising a set of tests
on a single LRM.
The other operational constraints are the same as those of the I-BIT test mode.
5.6.6.2.3. I-BIT Mode System Tests
The I-BIT test mode is a bridge between a system reset condition and a fault tolerant
operational state. Because the time constraint to transition from an initial state to a fully
operational environment (that is, operational standby) is so severe, only a minimal set of the
system tests is executed. In fact, only an NE presence test is incorporated during this mode
to ensure that all Network Elements are operational and synchronized.
5.6.6.2.4. M-BIT Mode System Tests
Because of the lack of a severe time constraint (as with the I-BIT) the suite of tests
which can comprise the M-BIT test mode can be very extensive. However, as currently
envisioned, the M-BIT system tests will comprise the same suite of tests as the C-BIT.
5.6.6.2.5. C-BIT Mode Tests
During the C-BIT mode the AFTA is operating in a state where: 1) real-time scheduling
is enforced, 2) mission critical operations occur and 3) redundant virtual groups exist.
These constraints require that the tests enacted during this mode be unobtrusive.
During the C-BIT mode, the AFTA system performs mission critical tasks within the
confines of a real-time scheduler. These constraints pose two requirements for C-BIT
testing - 1) information must be preserved and 2) the operation of these tests must be
unobtrusive. Consequently, a minimum of computing resources must be consumed in
analyzing the inherent fault detection mechanisms and data integrity must be maintained.
Any device self test implemented as a C-BIT must ensure that it does not modify the
mode of operation of any AFTA component in an unrecoverable way. Examples include
tests which change the processor status register or activate the memory management unit,
tests which cause the Network Elements to desynchronize, or tests which alter bus arbitra-
tion on the FCR backplane bus. Because of the operational requirements of the system and
because on-board diagnostics typically do not preserve system state, many of the manufac-
turer supplied set of tests are inoperable in a real-time operational environment. For in-
stance, memory tests typically modify, read, and check memory locations without preserv-
ing the information. Therefore, the implementation of the self tests for the C-BIT mode
will be different than those for the other test modes.
In addition, because the system configuration could conceivably be a mixed redundancy
system consisting not only of fault masking groups but of simplexes as well, the system
tests must ensure that all virtual groups are active and that all are operating properly.
5.6.6.3. Mapping of Fault Detection Mechanisms to Test Modes
Because the goal of the AFTA design is to create a digital computing system of the
highest possible fault coverage, it is imperative that a comprehensive set of tests be exe-
cuted during all test modes. Since the time constraints and system configuration vary for
each test mode, the suite of tests for each test mode will differ. In fact, the suite of tests
will be a composite of both the self tests and the system tests whenever practical. For in-
stance, it is impossible to execute a system test when the hardware is operating in non-fault
tolerant mode. On the other hand, a self test may require altering some datum (for
example, a status register) which allows for the possibility of a catastrophic change of state
which jeopardizes mission critical information.
The following series of tables delineates the AFTA tests. Because some tests are
executable only in one of the operational modes whereas others can be executed in both
modes, the individual tests are marked for each test mode as follows:
1 test is executable in non-fault tolerant mode only
2 test is executable in fault tolerant mode only
3 test is executable in both modes
For those tests which are executable in both modes the actual implementations of the
test could be different although they functionally perform the same operations.
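The marking scheme above amounts to a simple lookup. A minimal sketch, assuming a blank table cell means the test is not run in that test mode; the constant and function names are hypothetical:

```python
NON_FT_ONLY, FT_ONLY, BOTH = 1, 2, 3  # the table marks defined above


def is_executable(mark, fault_tolerant):
    """Return True if a test carrying `mark` may run in the current mode.

    mark:           1, 2, 3, or None for a blank table cell.
    fault_tolerant: True when the hardware is operating in fault tolerant mode.
    """
    if mark == BOTH:
        return True
    if mark == FT_ONLY:
        return fault_tolerant
    if mark == NON_FT_ONLY:
        return not fault_tolerant
    return False  # blank cell: test not scheduled in this test mode
```

For example, a test marked 2 (such as the system tests) is rejected while the hardware is still unsynchronized and accepted once fault tolerant operation has begun.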
5.6.6.3.1. Processor Self Tests
CPU Tests:
Register
Instruction Set
Addressing Modes
Exception Processing
depot
1
1
1
1
I-BIT M-BIT C-BIT
3
3
3
1
Cache Tests:
Basic Data Caching
D Cache Tag RAM
D Cache Data RAM
D Cache Valid Flags
D Cache Burst Fill
Basic Instruction Caching
Unlike Instruction Function Codes
I Cache Disable
I Cache Invalidate
depot
1
1
1
1
1
1
1
1
1
I-BIT M-BIT C-BIT
Memory Tests:
Marching Address
Marching One
Refresh
Random Byte
depot I-BIT M-BIT C-BIT
1
1
1
1
3
3
1 3
Program
TAS
Brief Parity
Extended Parity
MMU Tests:
Root Pointer Register
Translation Control Register
Super_Prog Space
Super_Data Space
Write/Mapped-Read
Read Mapped ROM
Fully Filled ATC
User_Prog Space
User_Data Space
Indirect Page
Page-Desc Used-Bit
Page-Desc Modify-Bit
Segment-Desc Used-Bit
Invalid Page
Invalid Segment
Write-Protect Page
Write-Protect Segment
Upper-Limit Violation
Lower-Limit Violation
Prefetch on Invalid-Page Boundary
Modify-Bit and Index
Sixteen-Bit User-program Space
Sixteen-Bit Page-Desc Modify-Bit
Sixteen-Bit Indirect Page
RMW Cycle
depot
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
I-BIT
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
M-BIT C-BIT
1
1
1
I/O Tests:
Ethernet LANCE Chip
Z8530 SIO Chip
Interval & Watchdog Timers
DMA Controller
depot
1
1
1
1
I-BIT C-BIT
PowerFail & Bus Error Interrupt Enables
VMEBus Interface 1 1
Miscellaneous Tests:
Real-Time Clock/BBRAM Test
Bus Timeout Error Test
Floating Point Coprocessor Test
5.6.6.3.2. Network Element Self Tests
Processor-Network element interface:
Dual port RAM
Ring buffer management
Packet receive interrupt
depot
1
1
1
I-BIT M-BIT
1
1
3
C-BIT
depot
1
1
1
I-BIT M-BIT C-BIT
Network element data paths:
Class 1 data path FIFO test
Class 2 data path FIFO test
Voter error detection capability
Message reflection multiplexer
depot
1
1
1
1
I-BIT M-BIT C-BIT
Network element global controller:
Global controller
ISYNC test
Transient NE recovery test
depot
1
1
1
I-BIT
1
1
1
M-BIT C-BIT
Scoreboard:
Message class test
Configuration table updates
OBNE timeout detection
IBNE timeout detection
Scoreboard vote error detection
depot
1
1
1
1
1
I-BIT M-BIT C-BIT
1
1
1
1
1
Inter-fault set communication links: I-BIT M-BIT C-BIT
Optical data links and TAXIs
depot
1
Voted reset: depot
1
I-BIT M-BIT C-BIT
Voted reset 1
Fault tolerant clock: depot
Fault tolerant clock 1
I-BIT
1
M-BIT C-BIT
1
5.6.6.3.3. FCR Backplane Bus Self Tests
depot
TBD 1
TBD 1
TBD 1
I-BIT M-BIT C-BIT
5.6.6.3.4. Input/Output Device Self Tests
TBD
TBD
TBD
5.6.6.3.5. Power Conditioner Self Tests
depot
1
1
1
I-BIT
1
1
1
M-BIT
3
3
3
C-BIT
3
3
3
TBD
TBD
TBD
5.6.6.3.6. Mass Memory Self Tests
depot
1
1
1
I-BIT
1
1
1
M-BIT C-BIT
Memory Tests:
Marching Address
Marching One
Refresh
Random Byte
TAS
Brief Parity
Extended Parity
depot
1
1
1
1
1
1
1
I-BIT
1 3
1 3
1
1 3
1 3
1 1
1 1
M-BIT C-BIT
3
3
3
3
5.6.6.3.7. System Tests
Intra-virtual group presence test
Inter-virtual group presence test
Syndrome analysis
RAM scrub
PROM check
Voter test
Class test
NE presence test
Watchdog timer
Exception handlers
depot I-BIT M-BIT C-BIT
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
5.6.7. Fault Diagnosis
During both non-fault tolerant and fault tolerant operations the AFTA system performs
various levels of testing commensurate with the operational constraints. Because of the
operational environment and the ultimate goal of comprehensively testing the AFTA during all
operations, the tasks of testing and result analysis are divided among the three FDIR
functions - Off-Line FDIR, Local FDIR and System FDIR. This section describes the overall
methodology used by each task and the self and system tests implemented by each.
5.6.7.1. Non-Fault Tolerant Operations
During the non-fault tolerant mode the emphasis is on ensuring that the constituent
components of the AFTA are operating correctly. This is accomplished by exercising
functional components of each LRM in the AFTA with a series of diagnostic level tests.
Off-Line FDIR is initiated when the system is reset at which time the AFTA compo-
nents are operating independently. In other words, the Network Elements are not synchro-
nized, the processors act as individual processors rather than as members of a virtual group
and the I/O devices are in an initial state.
Off-Line FDIR systematically sequences through a series of diagnostic level self tests to
exercise all AFTA components. There are 2 distinct series of tests corresponding to the I-
BIT and M-BIT modes. These sets were necessitated because of the constraints regarding
the amount of time allotted to perform testing in these modes. The enumeration of the tests
comprising each test mode can be deduced from the tables in the fault detection mechanisms
section.
Procedurally, Off-line FDIR initiates a series of self tests of each processor in the sys-
tem. As each processor completes its suite of tests and is determined to be non-faulty, it
claims an area of the dual ported RAM as its interface with the Network Element. (Section
4 describes the memory map and the functions of this RAM.) The first non-faulty proces-
sor in each fault containment region shall be responsible for testing the network element,
FCR backplane bus, power conditioner, mass memory, and I/O devices. All fault infor-
mation will be saved by each processor in the FCR's non-volatile mass memory for later
dissemination.
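The Off-Line FDIR procedure for one fault containment region can be sketched as below. This is an illustrative reading of the paragraph above, not the AFTA software: the function and argument names are hypothetical, the self tests are passed in as callables, and the fault log is a plain list standing in for the FCR's non-volatile mass memory.

```python
def offline_fdir(processors, self_test, shared_devices, device_test):
    """Run Off-Line FDIR for one fault containment region (FCR).

    processors:     processor identifiers in the FCR, in test order.
    self_test:      callable(processor) -> True if the processor passes.
    shared_devices: devices tested by the first non-faulty processor.
    device_test:    callable(tester, device) -> True if the device passes.
    Returns (tester, fault_log); tester is None if every processor failed.
    """
    fault_log = []
    tester = None
    for proc in processors:
        if self_test(proc):
            if tester is None:
                tester = proc  # first non-faulty processor tests shared devices
        else:
            fault_log.append(("processor", proc))
    if tester is not None:
        for device in shared_devices:
            if not device_test(tester, device):
                fault_log.append(("device", device))
    return tester, fault_log
```

If the first processor fails its self tests, the next processor that passes takes over testing of the Network Element, FCR backplane bus, power conditioner, mass memory and I/O devices.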
Figure 5-32. Off-Line FDI Overview
Off-Line FDIR uses the device self tests exclusively for diagnosis of faulty compo-
nents. Because these tests directly exercise the functionality of LRMs the diagnosis of a
specific LRM as faulty is obviously trivial. However, since the self tests exercise the func-
tionality of an LRM, the failure of a test identifies not only the failed LRM but also the
failed LRM function which maps to a chip or set of chips. This chip level diagnostic in-
formation is retained for dissemination to a maintenance crew.
5.6.7.2. Fault Tolerant Operations
When the AFTA system hardware is operating synchronously with fault tolerant mes-
sage exchanges, it is capable of exercising the system tests which employ the inherent fault
detection mechanisms to provide fault tolerance. In addition, the system is also capable of
executing some of the self tests. Because of the requirement to execute unobtrusively dur-
ing fault tolerant operations, only a subset of the entire suite of self tests can be executed.
In particular, only some of the processor self tests can be performed.
Although all test modes execute system tests, only the M-BIT and C-BIT modes utilize
the full capabilities of these tests. The system test capability of the I-BIT is minimal.
During fault tolerant operations the duties of fault detection, isolation and recovery are
shared between local FDIR and system FDIR. Local FDIR executes on each virtual group
and is able to diagnose faults in the constituent processors of that virtual group. System
FDIR, on the other hand, executes on a single fault masking group; it diagnoses failures in
all other AFTA components.
5.6.7.2.1. Local Fault Detection and Isolation
Each redundant fault masking virtual group executes the intra-virtual group presence
test, syndrome analysis, RAM scrub, PROM check, watch dog timer, and provides excep-
tion handlers for certain unusual conditions. Of these tests the intra-virtual group presence
test and the syndrome analysis are invoked on an iterative basis to check for an unsyn-
chronized channel, a failure in the Network Element hardware, and processor failures.
This synchronous test methodology is depicted in Figure 5-33.
Figure 5-33. Synchronous FDI Overview
Although the local FDI can detect failures in components other than the constituent
processors of its virtual group, it is responsible only for the diagnosis and disabling of its
member processors. Other component failures, such as failures in a Network Element
or an I/O device, are analyzed and disabled by system FDI.
When a processor is identified as being faulty, local FDI disables the faulty processor
so that it does not adversely affect operation. Specifically, FDI disables the voted outputs
from the faulty processor, reports the failure to the FDI system manager and initiates the
selected recovery option.
5.6.7.2.1.1. Intra-Virtual Group Presence Test
An unsynchronized processor is detected by means of the intra-virtual group presence
test. This test detects an unsynchronized processor by sending a unique pattern from each
member of the virtual group via source congruent message exchanges through the network.
If the result received is not the expected pattern, the processor originating the exchange is
judged not present and, therefore, desynchronized from the other channels. When the syn-
chronized channels detect the loss of synchronization of a processor, the synchronized
members of the virtual group disable the faulty channel.
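A minimal sketch of the comparison step follows. It assumes the source congruent exchanges have already delivered one received value per member to every synchronized channel; the function name and the pattern values are hypothetical.

```python
def intra_vg_presence(expected, received):
    """Return the virtual group members judged not present.

    expected: dict member -> unique pattern that member should originate.
    received: dict member -> value delivered by the source congruent exchange
              (a missing entry is treated as no response).
    """
    return sorted(m for m, pattern in expected.items() if received.get(m) != pattern)
```

For a triplex whose members are expected to send 0xA5, 0x5A and 0x3C, a delivered value of 0x00 from the second member marks that member as desynchronized, and the remaining members would then disable it.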
5.6.7.2.1.2. Syndrome Analysis
Failures in the processors of a virtual group or Network Elements are detected by
analyzing the error syndrome delivered with message packets by the Network Element
hardware. The error syndrome defines both vote errors and link errors generated during
handling of the packet by the Network Elements. Vote syndromes are generated during a
message exchange if a miscompare is detected by the Network Element hardware during the
voting of data received from the redundant channels. Link errors are generated if the re-
ceiver fails to detect the transmitter/receiver synchronization pattern. Since the error syn-
drome is delivered with the message exchanges, the message handling primitives extract
this information on a message class basis for analysis by FDI.
The Network Element generates the syndrome just prior to delivery of the message to
the processor. Therefore, the syndrome data are non-congruent with the other members of
its virtual group. In order to prevent divergence of the synchronous channels which must
operate on identical inputs to maintain synchrony, the channels must participate in a series
of source congruent exchanges of this syndrome data. Upon completion of the syndrome
exchange process, each channel has a copy of all channels' syndrome data.
The syndrome analysis identifies a fault in either (1) a Network Element (including the
transmitter-to-receiver link), or (2) a processor which generated incorrect voted data. The
analysis occurs in a 3 step process:
1) Analyze the vote syndrome,
2) Analyze the link syndrome, and
3) Correlate vote and link results to identify a faulty component.
The vote syndrome analysis compares the pattern across channels for each message
class with known fault patterns. In each case a hypothesis testing methodology is used
where a channel is assumed faulty, that channel is masked out and the resultant pattern
compared with the known pattern. This analysis results in the indication of a faulty
processor associated with a message class.
The link syndrome analysis identifies either a transmitter or receiver as faulty.
However, rather than analyzing all channels' syndromes against a specific pattern, each channel's
link syndrome is analyzed individually. Essentially, each analysis generates 2 hypotheses
- one indicative of the transmitter indicated by the syndrome and the other representing the
detecting channel's receiver. The subsequent count of the errors detected verifies one
of these hypotheses. Multiple channels detecting the same link error indicates a transmitter
fault; a single channel detecting the link error implies a receiver fault.
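The counting rule for link syndromes can be sketched as follows. The data layout is an assumption: each detecting channel is taken to report the set of transmitting channels on which it saw a link error, which is a simplification of the syndrome format described in Section 4. The function name is hypothetical.

```python
def analyze_link_syndrome(link_errors):
    """Diagnose link errors as transmitter or receiver faults.

    link_errors: dict detecting_channel -> set of transmitting channels
                 that the detecting channel flagged with a link error.
    Returns a list of ("transmitter", channel) or ("receiver", channel).
    """
    detectors = {}
    for rx, flagged in link_errors.items():
        for tx in flagged:
            detectors.setdefault(tx, []).append(rx)
    diagnoses = []
    for tx in sorted(detectors):
        rxs = sorted(detectors[tx])
        if len(rxs) > 1:
            # Several channels saw the same error: the transmitter hypothesis holds.
            diagnoses.append(("transmitter", tx))
        else:
            # Only one channel saw it: the detecting channel's receiver is suspect.
            diagnoses.append(("receiver", rxs[0]))
    return diagnoses
```

If channels A, B and D all flag transmitter C, the diagnosis is a transmitter fault on C; if only A flags transmitter B, the diagnosis is a receiver fault on A.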
Because link faults can generate vote errors, it is important to identify the source of a
vote error. In the absence of other error syndromes, processor faults are identified as vote
errors on voted messages. Any combination of errors which includes a link error is
attributable to either a transmitter or receiver fault in the appropriate Network Element.
Although Network Element errors may manifest themselves as a vote syndrome on either
voted messages or source congruent messages, the diagnosis of a Network Element fault is
identified as a vote error on a source congruent message.
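The correlation rules in this paragraph can be written as a short decision function. This is a sketch of the stated precedence only; the function name, the boolean inputs, and the returned labels are hypothetical.

```python
def correlate(vote_error_on_voted, vote_error_on_source_congruent, link_error):
    """Attribute a set of syndromes to a component class.

    Any link error points at a Network Element transmitter or receiver;
    otherwise a vote error on a source congruent message points at a
    Network Element, and a vote error on voted messages alone points at
    a processor.
    """
    if link_error:
        return "network element link"
    if vote_error_on_source_congruent:
        return "network element"
    if vote_error_on_voted:
        return "processor"
    return "no fault"
```

A vote error on a voted message accompanied by a link error is therefore attributed to the link rather than to a processor.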
The syndrome data analyzed by a virtual group is that data delivered with messages
which the virtual group addressed to itself. Consequently, the members of a virtual group
diagnose its constituent processors. The virtual group also disables its own faulty proces-
sor. Although a redundant virtual group can also diagnose Network Elements as faulty, it
merely reports these diagnoses to a system FDI function which performs further analysis
and, if necessary, Network Element recovery.
5.6.7.2.1.3. Self Tests
Because the AFTA is designed to withstand a specific number of simultaneous faults, it
is imperative that the number of faults be confined to a minimum. Exceeding the number of
simultaneous faults could result in total system failure. For this reason, background self
tests were devised to minimize the possibility of simultaneous failures by checking for
latent faults in the AFTA processors. When a faulty component is uncovered, that
component can be eliminated from the system configuration, thereby reducing the likelihood of
simultaneous faults.
While the majority of the system tests are effective for the identification of processor
faults in fault masking groups, the testing (and subsequent diagnosis) of simplex virtual
groups is primarily provided by self tests.
The background self tests exercise processor components and comprise as comprehensive
a set of tests as is feasible; they are executed as low frequency background tests.
These tests include memory tests (for example, RAM pattern tests) and CPU tests (register,
instruction, addressing modes).
Failures detected using these tests will adhere to the transient fault analysis and recov-
ery policies defined in a subsequent section.
5.6.7.2.2. System Fault Detection and Isolation
System FDI is responsible for the coordination of system status and fault information
as well as for testing and analysis of shared components. Specifically, system FDI will be
responsible for
1) maintaining the current status of every system component,
2) initiating Network Element tests and analyzing test results,
3) initiating I/O device tests and analyzing test results,
4) evaluating the fault diagnosis information of other fault masking groups with regard
to Network Element failures,
5) analyzing syndrome data indicative of Byzantine faults, and
6) collecting and reporting of fault information logged in the mass memory devices in
each fault containment region.
System FDI will evaluate the status of every system component by executing the inter-
virtual group presence test as well as by accepting update status from each system compo-
nent indicating the faulty component. The inter-virtual group presence test is essentially a
poll of all virtual groups within the system. Failure of a virtual group to respond to the
message within a specific time period is indicative of a fault. This is especially important
for the analysis of simplex processors which may have failed without communicating that
information to the system manager.
The system Network Element tests include a voter test and a class test. Each of these
tests exercises the Network Element by seeding either non-congruent data or non-congruent
class information into the Network Element. This testing will systematically exercise each
Network Element. However, because the AFTA system configuration could consist of up
to 5 Network Elements whereas the system manager fault masking group may consist of a
maximum of 4 members, the system manager itself is incapable of testing all network ele-
ments. Instead, it may assign some portion of the test task to another fault masking group.
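One way to picture the delegation of Network Element tests is sketched below in Python. The sketch assumes, as a reading of the text rather than a stated rule, that a fault masking group can test only those Network Elements that host one of its members. All names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def assign_ne_tests(all_nes: List[str],
                    manager_nes: Set[str],
                    helper_groups: Dict[str, Set[str]]) -> Dict[str, Optional[str]]:
    """Map each Network Element to the group that will test it.

    manager_nes: NEs hosting a member of the system manager group.
    helper_groups: other fault masking groups and the NEs hosting them.
    An NE that no group can reach is mapped to None.
    """
    assignment: Dict[str, Optional[str]] = {}
    for ne in all_nes:
        if ne in manager_nes:
            assignment[ne] = "system_manager"
            continue
        # First helper group (in insertion order) with a member on this NE.
        assignment[ne] = next(
            (name for name, nes in helper_groups.items() if ne in nes), None)
    return assignment


nes = ["NE1", "NE2", "NE3", "NE4", "NE5"]
plan = assign_ne_tests(nes, {"NE1", "NE2", "NE3", "NE4"}, {"VG7": {"NE3", "NE5"}})
print(plan["NE5"])  # VG7
```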
Although the local FDI functions can detect and identify a Network Element failure, they
cannot formally diagnose a Network Element as faulty. This information is sent to the system
FDI, which may perform additional analysis and actually perform some remedial action.
There may be situations where the syndrome information maintained by the local FDI
on a virtual group is inconsistent. Either not all members of the virtual group concur on
the identification of a component or not all members agree that a fault even exists. This
type of syndrome information is indicative of a Byzantine fault, where the faulty component
maliciously communicates some information to one fault containment region and some other
data to another fault containment region. Faults of this type would generally be indicative
of a faulty Network Element and, hence, must be handled by the system FDI.
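The consistency check implied by this paragraph can be sketched as follows. This is an illustration under stated assumptions, not the AFTA syndrome format: each member of a virtual group is assumed to report either the identifier of the component it accuses or `None` when it sees no fault, and the result labels are hypothetical.

```python
from typing import Dict, Optional


def classify_syndromes(reports: Dict[str, Optional[str]]) -> str:
    """Classify the per-member fault reports of one virtual group.

    reports maps a member ID to the component it accuses (None = no fault).
    "inconsistent" marks the case that would be forwarded to system FDI.
    """
    if not reports:
        raise ValueError("at least one member report is required")
    accused = set(reports.values())
    if accused == {None}:
        return "no_fault"
    if len(accused) == 1:
        return "consistent_fault"
    # Members disagree on the culprit, or on whether a fault exists at all.
    return "inconsistent"


print(classify_syndromes({"A": "P2", "B": "P3", "C": "P2"}))  # inconsistent
print(classify_syndromes({"A": "P2", "B": None, "C": "P2"}))  # inconsistent
```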
5.6.8. Recovery Options
When a component has been diagnosed as faulty it will be disabled. The network ele-
ment hardware has masking capabilities to mask a failed component. For instance, there is
a processor mask which can disable a faulty processor's participation in a voted message
exchange. Furthermore, a faulted processor can be excluded from a virtual group thereby
preventing it from communicating with other virtual groups. In addition, a network ele-
ment can be masked causing the other Network Elements to ignore data and clock signals
from the disabled Network Element. Although these masking capabilities can prevent a
faulty component from corrupting information in the other fault containment regions, it is
desirable to inhibit the faulty component from affecting other components which share the
FCR backplane bus. For this reason the component will be disabled from both a compo-
nent and a system perspective.
The response to failure section defines the actions of the individual components to limit
faulty behavior to the most confining failure envelope. The subsequent section describes
from the system perspective the methods to recover from a component failure while
efficiently utilizing system resources.
5.6.8.1. Response to Failure of Test
When a component fails it is highly desirable to contain the faulty behavior to the small-
est possible extent. Although this faulty behavior is contained within the FCR boundaries,
the component could corrupt other devices which share the same FCR backplane bus. The
following actions attempt to reduce the failure envelope to include only that device itself.
Failure to adequately limit the damage by use of these measures will ultimately lead to the
failure of the FCR which is unavoidable under these circumstances.
1) Processor - During non-fault tolerant operations a processor may be able to detect
itself as faulty. The processor will attempt to log the failure in a mass memory device and
to disable itself by executing a reset. During fault tolerant operations the processor itself
may diagnose itself as faulty via a processor self test; it may reset itself. Alternatively, the
virtual group may diagnose the fault. Depending upon the system recovery strategy and the
transient analysis policy the virtual group may generate a voted reset.
2) Network element - If a Network Element exhibits faulty behavior, the only reme-
dial action the testing processor can perform during non-fault tolerant operations is to issue
a Network Element reset to prevent that Network Element from initially synchronizing with
the other Network Elements. During fault tolerant operations, the system manager virtual
group which diagnoses Network Element failures may permanently disable the faulty net-
work elements via the Network Element mask and may perform a voted reset.
3) I/O device - If an I/O device is declared faulty during non-fault tolerant operations,
the testing processor can reset that device and log the fault in the mass memory. The moni-
tor interlock is asserted to disable that I/O device (Refer to Section 4). Although the I/O
device may be reset and disabled via a monitor interlock during fault tolerant operation, the
specific mechanisms have yet to be defined.
4) FCR backplane bus - Since all communication between the Network Element and
the attached devices (that is, processors, I/O devices, mass memory) occurs via the FCR
backplane bus the only appropriate response to a FCR backplane bus failure is to reset the
Network Element. Masking the Network Element disables the faulty component totally.
A failure of the FCR backplane bus during fault tolerant mode would be attributable to ei-
ther a processor or a Network Element because the bus is not directly tested by any test in
this mode. Nonetheless, those components exhibiting the faulty behavior will be identified
and disabled via a reset.
5) Power conditioner - Because a power conditioner regulates the voltage to the en-
tire fault containment region, its failure could generate spurious signals to any component
within the fault containment region. Failure to adequately mask this failure could result in
later system failure. Consequently, in order to prevent the possibility of a system failure it
is imperative that the Network Element be reset and masked.
6) Mass memory - Because the mass memory device is a somewhat passive compo-
nent, the manner in which it is disabled depends upon the nature of the fault. If a memory
location is faulty as in a "stuck-at-one" condition, then the mass memory device can be
ignored or the faulty locations bypassed. However, if the FCR backplane interface with the
mass memory device is inoperative, the mass memory could disrupt communications
across this bus and would require disabling the entire fault containment region via a net-
work element reset.
5.6.8.2. System Recovery
During fault tolerant operations the system tests and the processor self tests may indi-
cate a component as failed. At that time some remedial action must be taken in order to re-
move that faulty component from the operational system. Because of the dynamic recon-
figurability of the AFTA architecture many recovery options are possible ranging from
merely masking the faulty component such as a Network Element to integration of a spare
processor as a replacement for a faulted one. These recovery actions are different for each
type of failed component (that is, processor, network element, or I/O device). Furthermore,
when recovery from a failure is initiated, the current system mode is an important
factor because it defines the time constraints for execution of a recovery strategy.
As the recovery strategies are discussed the operational requirements of each strategy
are addressed. These constraints may or may not be commensurate with the operational re-
quirements of the mission critical environment. Hence, an appropriate recovery strategy
should be selected based upon the mission and its system mode. Figure 5-34 depicts rather
qualitatively the appropriateness of a recovery methodology given the system mode (that is,
power-on, standby or operational).
Recovery method                     power-on                   standby                   operational
----------------------------------  -------------------------  ------------------------  ---------------------
graceful degradation                fairly practical           minimally practical       highly practical
processor resynchronization         highly practical if        highly practical if       highly practical if
                                    power-on time sufficient   standby time sufficient   large minor frame
processor reintegration             highly practical if        highly practical if       highly practical if
                                    power-on time sufficient   standby time sufficient   large minor frame
processor replacement               highly practical if        highly practical if       highly practical if
                                    power-on time sufficient   standby time sufficient   large minor frame
processor replacement with          highly practical if        highly practical if       fairly practical if
initialization                      power-on time sufficient   standby time sufficient   task restart possible
task migration                      highly practical           highly practical          highly practical if
                                                                                         task restart possible
network element resynchronization   highly practical           highly practical          highly practical
network element masking             highly practical           highly practical          highly practical
Figure 5-34. Qualitative evaluation of recovery methods
There are two primary criteria for the selection of a recovery option - 1) the operational
environment of the system when the fault was uncovered and 2) the type of faulty compo-
nent. The operational environment criterion defines the system mode of operation and is
indicative of the system constraints. These constraints may require that the reconfiguration
process complete within a minor frame (for example, 10ms) or may allow a significantly
greater recovery time. They may also require that mission critical information be
maintained. Obviously, the recovery option is also contingent upon the type of failed
component. A reconfiguration policy to recover a failed processor is drastically different from that
for a failed Network Element.
These recovery options are oriented at system goals as a means of dealing with a failure
in a component. Some of the recovery options attempt to reintegrate the failed component
to determine if the fault was transient. Other recovery options discard the failed component
without an attempt to "recover" the diagnosed component.
The options discussed recover the system from failures in processors and network ele-
ments. The options for recovery from an input/output device fault are the same as those for
processors if the I/O device has the processor-like functionality to interface to the Network
Element as a member of a redundant virtual group. The specification of recovery options
for I/O devices will be addressed as I/O devices are selected for incorporation into the
AFTA.
5.6.8.2.1. Recovery from Processor Failure
If a processor failed, numerous strategies exist for system recovery. Although it is
highly desirable to recover a channel, it may not always be possible because of mission
critical constraints. For this reason a number of possible recovery options are posed which
have various operating characteristics. Depending upon the mission mode, these
characteristics may make a recovery option feasible to execute without irreparable harm to the
mission.
5.6.8.2.1.1. Graceful Degradation
During mission critical operations a redundant virtual group tests itself using such tests
as the intra-virtual group presence test or syndrome analysis. In the absence of a common
mode fault, the virtual group can correctly deduce that a member has failed and can initiate
corrective action. Specifically, a virtual group can gracefully degrade its redundancy level
by issuing a configuration table update message which eliminates the faulted channel. The
CT update message can reconfigure a redundant group and create a simplex atomically,
eliminating the requirement that each virtual group in the system be cognizant of an upcom-
ing system reconfiguration. This has the net effect of initiating and terminating a reconfigu-
ration within a minor frame.
One disadvantage of this alternative is that the redundancy of the virtual group is
decreased. A quadruply redundant group would degrade to a triplex; a triply redundant
virtual group would become a degraded triplex, which is essentially a triply redundant group
with a single channel's voted messages masked out. The faulted channel's data cannot
contribute to any voted message. Operating as a degraded triplex is undesirable because
its performance may be significantly penalized if the faulted channel fails to respond
to message requests and timeout penalties are sustained by the degraded triplex. A simplex,
of course, cannot be affected by this recovery technique.
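The degradation rules just described (quadruplex to triplex, triplex to degraded triplex, simplex unaffected) can be captured in a small Python sketch. The data representation is an assumption made for illustration: a group is modeled as a set of member channels plus a set of masked channels, and cases the text does not describe are rejected rather than guessed.

```python
from typing import FrozenSet, Tuple

Group = Tuple[FrozenSet[str], FrozenSet[str]]  # (members, masked channels)


def degrade(members: FrozenSet[str], masked: FrozenSet[str], faulty: str) -> Group:
    """Return the group after gracefully degrading around one faulty channel."""
    if faulty not in members or faulty in masked:
        raise ValueError("channel is not an active member of this group")
    if len(members) == 4:
        # Quadruplex degrades to a triplex: the channel is removed outright.
        return members - {faulty}, masked
    if len(members) == 3 and not masked:
        # Triplex becomes a degraded triplex: the channel stays a member
        # but its voted messages are masked out.
        return members, masked | {faulty}
    raise ValueError("no further degradation is described for this group")


quad = frozenset("ABCD")
triplex = degrade(quad, frozenset(), "C")
print(sorted(triplex[0]), sorted(triplex[1]))    # ['A', 'B', 'D'] []
degraded = degrade(*triplex, "A")
print(sorted(degraded[0]), sorted(degraded[1]))  # ['A', 'B', 'D'] ['A']
```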
5.6.8.2.1.2. Processor Resynchronization
When a virtual group member is judged to have lost synchronization with the other
channels of its virtual group, the resynchronization recovery strategy attempts to resyn-
chronize that lost channel and reintegrate it in order to maintain the redundancy level of the
virtual group.
A processor which fails the intra-virtual group presence test is deemed to have lost syn-
chronization and attempts to resynchronize itself with the other channels. The failed chan-
nel itself detects the failure, primarily via a watchdog timer mechanism; the synchronized
channels detect the failure when the channel fails to respond to its presence test. When
resynchronization (also referred to as lost channel synchronization) has been achieved, the
state of the failed processor (now resynchronized) must be made congruent with the other
synchronized processors. This is accomplished by an alignment process, whereby
the machine state of all members of a virtual group becomes congruent by voting
all congruent memory, registers, and timers.
There are essentially two participants in the resynchronization process - the lost channel
and the synchronized channels. When the lost channel detects loss of synchronization with
the other members, it immediately invokes a lost channel synchronization procedure. The
synchronized channels periodically invoke this procedure when the transient analysis func-
tion deems it appropriate to attempt recovery of a failed channel.
The resynchronization function consists of two control streams - one for the lost chan-
nel and the other for the synchronized channels. These are depicted in Figure 5-35. The
lost channel executes a pickup_sync routine which essentially listens to incoming messages
for the specific pickup message. When it detects this message, the lost channel participates
in the resynchronization presence test. Conversely, the synchronized channels perform a
voted message exchange of the pickup message. Subsequently, these channels execute the
resynchronization presence test which, like the presence test, consists of a series of source
congruent message exchanges. These message exchanges send source specific patterns
which differ from those used in the presence test. For each exchange, a comparison is
made of the pattern received for the given exchange against the pattern expected for a suc-
cessful exchange. A match indicates that the given channel is operating in synchronism
with the channel sourcing the exchange; a mismatch indicates a lack of synchronism. The
lost channel returns to pickup_sync if it failed to synchronize; the synchronized channels
return to the scheduler if the lost channel failed to resynchronize.
Because the lost channel had been desynchronized from the other channels for some
time period, the processor state in the lost channel is most likely different from the processor
state of the synchronized channels. Once synchronous operation among all channels is
established, it is imperative that the processor state be made congruent across channels
so that synchronous operation will continue. This ensures that the control flows in
each channel are synchronous. Consequently, before normal scheduling resumes, the
channels must
1) align their congruent memory and
2) align their clocks.
The memory alignment is a process whereby RAM and registers in each processor
become congruent. In this alignment process each processor within the channel transmits
and receives voted messages representing blocks of its congruent memory areas. Voting
this series of messages via the Network Element voted message mechanisms creates bit-
wise voted copies of the memory across all channels. Because the goal is to have identical
state across all channels, the alignment process also includes the processor registers.
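The effect of bitwise voting on memory blocks can be shown with a minimal Python sketch for three channels. It models only the voting arithmetic, not the Network Element message mechanisms, and the function names and sample words are hypothetical.

```python
from typing import List


def vote3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three words: each bit takes the value held by two or more."""
    return (a & b) | (a & c) | (b & c)


def align_memory(channels: List[List[int]]) -> List[List[int]]:
    """Return the aligned memory images of three channels.

    Every channel ends up with the same bitwise-voted copy of each word.
    """
    if len(channels) != 3 or len({len(c) for c in channels}) != 1:
        raise ValueError("expected three memory images of equal length")
    a, b, c = channels
    voted = [vote3(x, y, z) for x, y, z in zip(a, b, c)]
    return [list(voted) for _ in channels]


good = [0x12, 0xFF, 0x00]
lost = [0x13, 0x0F, 0x80]          # state of the resynchronized channel
aligned = align_memory([good, list(good), lost])
print(aligned[2] == good)          # True
```

Because two of the three channels agree on every word, the voted image equals the state of the synchronized channels, and the formerly lost channel is overwritten with it.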
In each channel the local interrupt timer is responsible for generating the minor frame
interrupt, typically every 10ms. Because synchronous channels reset their local interrupt
timers in tandem, each channel congruently maintains system time. When a channel is
desynchronized it does not participate in the time synchronization; its system time tends to
drift. The time alignment process restarts congruent time keeping by resetting and activating
the local clock after a lost channel has been resynchronized and its state aligned. Because
the memory alignment process (through message exchanges) has tightly synchronized
each channel, the clock activation minimizes the skew among the timer interrupts.
[Figure 5-35 flowchart. Unsynchronized member: pickup_sync. Synchronized members: exchange
pickup message. Both: resynchronization presence test; if the members are synchronized,
align memory, align time, and return to scheduler.]
Figure 5-35. Lost Channel Synchronization
Although this recovery option is an attractive alternative because it maintains the redun-
dancy level of the virtual group, it is not suitable to all operational situations because the
alignment process requires exclusive control of the virtual group for a significant time pe-
riod (on the order of 1-2 seconds for 1M RAM). During this time period the virtual group
can neither schedule any real-time tasks nor respond to any interrupting
devices.
Like the graceful degradation strategy this recovery technique can be managed entirely
by the fault masking group which diagnosed itself; it is not necessary to activate the system
manager to control the recovery activity. However, other virtual groups communicating
with the recovering virtual group must be able to tolerate a 1-2 second dropout. This
constitutes an application-specific decision.
5.6.8.2.1.3. Processor Reintegration
In some cases a channel failure will manifest itself as a syndrome error indicating that
the processor presented bad data for voting without that channel losing synchronization. In
this case the failure could be attributable either to a bad RAM location, which could be
rectified by a memory realignment without resynchronizing a lost channel, or merely to a
glitch in an outgoing message. To correct a recurrent syndrome error in a processor, it is
necessary to realign the memory (that is, machine state) of the processor with the other
channels. This strategy has temporal characteristics similar to those of processor
resynchronization.
5.6.8.2.1.4. Processor Replacement
The intent of the processor replacement strategy is to replace the faulty processor with a
spare processor known to be fault free in order to maintain the redundancy level of the virtual
group. In this strategy, a spare processor must be located and configured as a member of
the redundant virtual group, the faulty processor must be configured as a simplex, and the
memory of the redundant virtual group must be aligned. This option also has temporal
characteristics similar to those of processor resynchronization.
In this scenario it is necessary that a system-wide task control this process rather than
the diagnosed virtual group itself. Because multiple virtual groups are involved in the
reconfiguration process, these virtual groups must be coordinated globally. The system
manager maintains knowledge of the health and configuration of all AFTA components.
Consequently, it maintains the information required to optimally decide the updated
configuration.
5.6.8.2.1.5. Processor Replacement with Initialization
Because the processor replacement strategy suffers a significant time penalty from
the alignment process, an alternate strategy is posed which closely parallels that strategy.
The processor replacement with initialization alternative replaces a faulted processor with a
spare but, rather than align the virtual group, it initializes all tasks in the virtual group. Task
initialization is expected to require significantly less time than the alignment process.
5.6.8.2.1.6. Task Migration
The strategies specified previously have concentrated on the redundancy constraint -
either its relaxation (degrading the virtual group) or its maintenance. There may be other
constraints on a recovery policy, such as maintenance of communication between a redundant
virtual group and an I/O device. For example, if a processor of a redundant virtual group
is assigned communication over a FCR backplane bus with a specific I/O device and if that
processor fails, the I/O task could be transferred to another virtual group with a member in
that fault containment region.
This alternative is essentially a single task migration rather than a total transfer of all
tasks to a spare processor as would be the case for a memory alignment. The migration of
a single active task is very complicated, requiring not only the transfer of the task's stack
space but also its global variables which may be scattered throughout memory and, of
course, would likely be intermingled with variables of other tasks. Consequently, a task
migration could be a feasible recovery alternative only in circumstances where the
migrated task could be transferred and initialized.
5.6.8.2.2. Recovery from Network Element Failure
The failure of a Network Element significantly reduces the reliability of the system.
Its failure increases the probability of a system failure not only because of the loss of
this shared resource but also because any processors attached to that failed Network
Element are disabled as well, regardless of the health of those processors. If the processors are
members of redundant virtual groups, the redundancy of their associated virtual groups is
decreased.
Because of the criticality of the Network Elements for Byzantine resilient communica-
tions, it is important that recovery of a failed Network Element be attempted. In fact, the
recovery of a Network Element is implicit in the design of the network element architecture
so that the recovery of only the failed NE is much less disruptive of system operations than
recovery of a failed processor. This strategy is described as the Network Element resyn-
chronization option. However, because of the nature of the failure, it is not always possi-
ble to recover a failed Network Element. For the latter situation, a Network Element
masking option is presented.
5.6.8.2.2.1. Network Element Resynchronization
The reintegration of a Network Element is a multiple step process which can include
reintegration of the processors into their respective redundant virtual groups:
1) The Network Element must be reset to initialize its internal state.
2) The Network Element must resynchronize itself with the other Network Elements.
3) Processors communicating directly with the resynchronized Network Element can
be reintegrated with the other members of their corresponding virtual groups. This could
be accomplished using one of the processor recovery strategies described above.
Because the system manager assumes responsibility for the diagnosis of a Network
Element, it would also reset the faulted Network Element via a voted reset (See Section 4).
This automatically initiates a synchronization methodology within the Network Element
which attempts to perform an initial synchronization (ISYNC). When the Network Element fails to
detect an initial synchronization from the other Network Elements, it initiates a
resynchronization phase in which it assumes that the other Network Elements are synchronized and it
itself is desynchronized. When the resynchronization has succeeded, the Network
Elements align the configuration tables so that each Network Element has a consistent view
of the system configuration. In this system configuration each processor on this failed
Network Element has assumed a new status as a simplex. If they had been members of re-
dundant virtual groups, they can be resynchronized and realigned using a processor recov-
ery strategy.
5.6.8.2.2.2. Network Element Masking
In some cases a Network Element may not behave correctly even after repeated attempts
to recover that failed Network Element. It may fail to respond to a voted reset or to syn-
chronize with the other Network Elements or it may exhibit faulty behavior shortly after
reintegration. In these cases, it is necessary to permanently disable that Network Element
via a configuration update masking out the failed Network Element. This message (issued
by the system manager) will cause the other Network Elements to disable the faulted
Network Element's data and clock inputs.
5.6.9. Transient Fault Analysis
When a component exhibits faulty behavior, it is important to determine if this failure
resulted from a transient condition or from a permanent malfunction. If the failure can be
deemed a transient failure then system resources can be utilized most efficiently because the
component can be reintegrated into the functioning system. Since transient failures are
assumed to be caused by some temporary environmental condition (e.g., a power surge),
they are expected to disappear with time. Permanent malfunctions, on the other hand, are
caused by breakdowns of the AFTA hardware that must be physically repaired.
A transient fault strategy will be implemented which resets the component and repeats
the test suite. A subsequent failure is indicative of a permanent failure and the component
would be disabled.
Transient fault analysis can be implemented by one of the following basic strategies:
1) A transient recovery policy
2) A wait-and-see policy, or
3) A no transient fault analysis approach.
5.6.9.1. Transient Recovery
The transient recovery policy would immediately disable the faulty component and im-
plement a recovery policy to reintegrate the faulty component into the system. After suc-
cessful integration, if the component did not fail again during a probationary period, it is
deemed to have suffered a transient fault. Figure 5-36 depicts the algorithm for transient
recovery.
[Flowchart: initiate; component recovery strategy initiated; component monitored for
probationary time period; component either enabled or permanently disabled.]
Figure 5-36. Transient Recovery Algorithm
The distinction between transient and hard failures defines the two functions of the
transient recovery option:
• It decides when it is appropriate to attempt component recovery.
• Once a component has been reintegrated, it monitors its health for a brief proba-
tion period before declaring it fully recovered.
Using the transient recovery method, a component recovery is periodically attempted.
If indeed a component encountered a transient fault, it is desirable to recover the component
quickly. Conversely, if the component suffered a permanent failure, it is highly desirable
to avoid expending excessive computing resources to revive this component. Transient recovery
balances these two requirements by initially assuming that any particular fault is transient (it
has been observed that 50 to 80 percent of all faults in computer systems are transient) and
automatically attempting a recovery. As time passes without the component being recov-
ered, it becomes more likely that the fault is a hard failure rather than a transient, and tran-
sient analysis makes the recovery attempt less often. After a certain period it can be reason-
ably assumed that the failure is a hard failure; therefore, the transient analysis function per-
manently disables the component.
Additionally, it has been noted that permanent failures tend to manifest themselves spo-
radically. A component may be recovered according to the above criteria, but may imme-
diately fail again. Transient fault analysis attempts to prevent this situation by regarding a
recovered component as recovered only on a trial basis. If the component passes its trial
period without further errors, it is regarded as fully recovered and can be incorporated into
the AFTA configuration. On the other hand, if the component fails during the probationary
period, the component is permanently disabled.
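The behavior described in the preceding paragraphs (retry with decreasing frequency, give up after a limit, then confirm recovery with a probationary period) can be sketched in Python. The doubling of the retry interval, the attempt limit, and all names are assumptions chosen for illustration; the report does not specify them. The sketch records the delays a scheduler would wait rather than actually sleeping.

```python
from typing import Callable, List, Tuple


def transient_recovery(try_recover: Callable[[], bool],
                       survives_probation: Callable[[], bool],
                       max_attempts: int = 4,
                       first_delay_s: float = 1.0,
                       backoff: float = 2.0) -> Tuple[str, List[float]]:
    """Return the verdict and the list of delays waited between attempts.

    try_recover: attempts reintegration, True on success (hypothetical hook).
    survives_probation: True if no fault recurs during probation (hypothetical hook).
    """
    delays: List[float] = []
    delay = first_delay_s
    for _ in range(max_attempts):
        if try_recover():
            # Recovery is only provisional until probation is passed.
            if survives_probation():
                return "recovered", delays
            return "permanently_disabled", delays
        delays.append(delay)   # wait longer before each further attempt
        delay *= backoff
    # Too much time has passed: assume a hard failure.
    return "permanently_disabled", delays


outcomes = iter([False, True])
print(transient_recovery(lambda: next(outcomes), lambda: True))  # ('recovered', [1.0])
```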
If transient recovery fails to reintegrate the faulted component, an alternate recovery
strategy can be invoked. This might be the case if a processor whose fault initiated the
processor reintegration strategy subsequently suffered another fault. It may be appropriate that
the processor be permanently eliminated from the virtual group via graceful degradation or
processor replacement.
A transient fault can cause a state change which may not disappear with time. Using
the transient recovery methodology, the diagnosed component is essentially initialized,
reintegrated into the operational system and, after the trial period, is exonerated of being
faulty. Because the transient recovery policy attempts to return the component to an
operational state, it is generally the preferred option.
5.6.9.1.1. Processor Recovery
When a processor is diagnosed as faulty, it can be disabled yet still maintain
its identity as a member of a virtual group. Such processors can be recovered in two basic
ways, depending upon the manifestation of the processor fault: the processor
resynchronization strategy or the processor reintegration strategy. As indicated previously, both
recovery options require a significant amount of time because of the memory alignment
process integral to these recovery strategies.
5.6.9.1.2. Network Element Recovery
When a Network Element has exhibited faulty behavior and is immediately disabled,
that Network Element is reset causing it to lose synchronization with the other network ele-
ments. Since the Network Element communicates with some number of processors which
may be members of redundant virtual groups, each of those virtual groups consequently
loses a channel. Because the Network Elements are integral to Byzantine resilient
communications, it is highly desirable that a Network Element which has suffered a transient fault be
reintegrated. A recovery policy should include the reintegration of the Network Element
and, if possible, recovery of all processors as well.
5.6.9.2. Wait and See Transient Analysis Option
The transient recovery option performs the analysis of a transient failure condition by
attempting reintegration of the diagnosed component and analyzing the results. However,
any recovery option to reintegrate a failed component is time-consuming and may be in conflict with
the mission requirements at the time the recovery is attempted. For these reasons a more
conservative approach is posed as a substitute.
The wait-and-see transient analysis policy does not disable a component until the fault
condition has existed for a prescribed period of time. If the fault persists, then the component
is judged to have endured a permanent fault. If the fault disappears, then it is assumed that
the component suffered a transient fault.
[Flowchart: initiate; a decision whose label is not recoverable, with a "no" branch;
disable component; completed.]
Figure 5-37. Wait and See Transient Fault Analysis algorithm
This policy is particularly attractive when the mission constraints do not permit the
implementation of a component recovery strategy and, hence, a transient recovery option.
This option may be used in conjunction with a graceful degradation strategy for processor
failures.
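A minimal Python sketch of the wait-and-see decision follows. It assumes, purely for illustration, that the "prescribed period of time" is expressed as a number of consecutive periodic checks in which the fault is still observed; the names and the third "undetermined" outcome are hypothetical.

```python
from typing import Iterable


def wait_and_see(observations: Iterable[bool], required_consecutive: int) -> str:
    """Classify a fault from successive checks (True = fault still present).

    The component is left enabled while the checks are being collected.
    """
    streak = 0
    for fault_present in observations:
        if not fault_present:
            return "transient"          # fault disappeared on its own
        streak += 1
        if streak >= required_consecutive:
            return "permanent"          # persisted for the prescribed period
    return "undetermined"               # observation window not yet complete


print(wait_and_see([True, True, True], 3))   # permanent
print(wait_and_see([True, False, True], 3))  # transient
```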
5.6.9.3. No Transient Fault Analysis Option
A no transient fault analysis approach may be selected which immediately disables the
faulty component and does not attempt to reintegrate the component diagnosed as failed.
[Flowchart: initiate; a single component action whose label is only partly recoverable.]
Figure 5-38. No Transient Fault Analysis algorithm
5.6.9.4. Hybrid Transient Fault Analysis Option
When a Network Element fails it is highly desirable to attempt to recover at least the
Network Element in order to maintain a system which is resilient to Byzantine failures.
However, because disabling the Network Element actually desynchronizes it from the other
Network Elements, it is imperative to establish with a high degree of certainty that a failure
exists in the Network Element. For these reasons, a hybrid transient analysis is presented which
combines the functionality of the transient recovery and the wait-and-see strategies. This is
depicted in Figure 5-39.
[Flowchart: initiate; a decision with a "no" branch leading to completion; initiate component
recovery strategy; if successful, monitor component for probationary time period; on "yes",
enable component and complete; otherwise permanently disable component.]
Figure 5-39. Hybrid Transient Fault Analysis algorithm
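One possible reading of the hybrid policy is sketched below in Python. The exact branch structure of the flowchart is not recoverable from the text, so the ordering used here (wait-and-see first, then recovery, then probation) is an assumption, as are the function and parameter names.

```python
from typing import Callable


def hybrid_transient_analysis(fault_persisted: bool,
                              attempt_recovery: Callable[[], bool],
                              passes_probation: Callable[[], bool]) -> str:
    """Combine the wait-and-see and transient recovery strategies.

    fault_persisted: outcome of the wait-and-see stage (assumed to run first).
    attempt_recovery: hypothetical callback, True if reintegration succeeded.
    passes_probation: hypothetical callback, True if the component stayed
    fault-free for the probationary time period.
    """
    if not fault_persisted:
        # The fault vanished on its own; the component is never disabled.
        return "transient"
    if not attempt_recovery():
        return "permanently disabled"
    if passes_probation():
        return "enabled"
    return "permanently disabled"
```

Because the recovery callback is only invoked once the fault has persisted, the Network Element is not desynchronized on the strength of a single fault observation.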
5.6.9.5. Intermittent Fault Analysis
Typically, a component which is failing exhibits faulty behavior sporadically for some time
before its failure becomes unquestionably permanent. If a fault in a component recurs
within a predefined time threshold, then the fault, originally classified as a transient,
will be reclassified as an intermittent. This intermittent failure interval is assumed to be
greater than the probationary period.
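The reclassification rule can be illustrated with a short Python sketch. The function name, the use of floating-point timestamps, and the treatment of a gap exactly equal to the threshold as "within" it are assumptions made for illustration.

```python
from typing import Optional


def classify_fault(previous_fault_time: Optional[float],
                   current_fault_time: float,
                   intermittent_interval: float,
                   probationary_period: float) -> str:
    """Return "transient" or "intermittent" for a newly observed fault.

    Times are in arbitrary but consistent units (assumption).
    """
    # The text assumes the intermittent interval exceeds the probationary period.
    if intermittent_interval <= probationary_period:
        raise ValueError("intermittent interval must exceed the probationary period")
    if previous_fault_time is None:
        # First recorded fault for this component.
        return "transient"
    if current_fault_time - previous_fault_time <= intermittent_interval:
        # The fault recurred within the threshold: reclassify.
        return "intermittent"
    return "transient"
```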
5.6.9.6. Transient Fault Analysis Option and System Modes
It may be desirable to maintain multiple recovery options for each component which are a
function of the system mode. The system requirements during power-on or standby may
be much less stringent than during the operational mode. Figure 5-40 presents a possible
correlation of transient fault analysis methodologies for processor and Network Element
failures in each system mode.
                            System mode
Component type              power-on             standby              operational
processor failures          transient recovery   transient recovery   wait-and-see transient analysis
network element failures    transient recovery   transient recovery   hybrid transient recovery
Figure 5-40. Possible Mapping of Transient Analysis Options to System Modes
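The mapping in Figure 5-40 amounts to a lookup keyed on system mode and component type. A minimal Python sketch follows; the dictionary and function names are hypothetical, and the entries simply restate the figure.

```python
# Keys are (system mode, component type); values are the policy names
# given in Figure 5-40. The identifiers themselves are illustrative.
TRANSIENT_POLICY = {
    ("power-on", "processor"): "transient recovery",
    ("power-on", "network element"): "transient recovery",
    ("standby", "processor"): "transient recovery",
    ("standby", "network element"): "transient recovery",
    ("operational", "processor"): "wait-and-see transient analysis",
    ("operational", "network element"): "hybrid transient recovery",
}


def select_policy(system_mode: str, component_type: str) -> str:
    """Return the transient fault analysis option for a mode and component type."""
    try:
        return TRANSIENT_POLICY[(system_mode, component_type)]
    except KeyError:
        raise ValueError(f"no policy for {system_mode!r}, {component_type!r}") from None
```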
5.6.10. Fault Logging
In order to support automatic fault logging and maintenance recording, each fault con-
tainment region will be equipped with a mass memory device. This memory device will
contain non-volatile RAM to ensure that neither maintenance records for each component
nor faults detected during the operational modes disappear when power is lost.
Information regarding the component identifier, number of power-on cycles, time since
power-on, Greenwich Mean Time, fault description, and system configuration will be
maintained in the data log for retrieval at a later time by line maintenance personnel. In
addition, this repository will also keep maintenance records such as when each component
was installed or serviced and when the M-BIT test suite last exercised each component.
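A fault log entry carrying the fields listed above might be represented as in the following Python sketch. The record layout, field names, and units are assumptions; the report does not specify a storage format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class FaultLogEntry:
    """One fault record; field names and types are illustrative only."""
    component_id: str
    power_on_cycles: int
    seconds_since_power_on: float   # unit is an assumption
    gmt: datetime                   # Greenwich Mean Time of the fault
    fault_description: str
    system_configuration: str


def faults_for_component(log: List[FaultLogEntry], component_id: str) -> List[FaultLogEntry]:
    """Return the entries for one component, in logged order."""
    return [entry for entry in log if entry.component_id == component_id]
```

A query of this kind is what line maintenance personnel would need when retrieving the history of a single LRM.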
Any processor within a fault containment region may access the mass memory via the
FCR backplane bus. During I-BIT and M-BIT non-fault tolerant mode testing all processors
may access the mass memory to log results of their self tests. In order to facilitate later
fault reporting, the fault information stored during the I-BIT and M-BIT self tests will be
disseminated during fault tolerant operations so that all mass memory devices will maintain
identical copies of the fault status of each component. Even after the distribution of the fault
information, however, it is still possible that this data will be non-congruent, especially if
the communication medium is disrupted between the diagnosing entity and the mass memory,
as would be the situation when a Network Element, mass memory device or FCR
backplane bus fails.
During fault tolerant operations the diagnosing virtual group will transfer fault informa-
tion to the system manager which coordinates and distributes this information for storage in
all mass memory devices.
Because the fault log maintains information concerning all faults which have occurred
irrespective of the test mode or the transient analysis policy active at the time of the fault
detection, there may be logged faults which are indicative of a transient fault and,
consequently, are not reproducible (that is, cannot duplicate). The M-BIT tests are useful to
discern whether or not a logged fault was a transient by extensively exercising all functional
components of the LRM identified as faulty.
5.6.11. Fault Reporting
All fault information will be maintained in the mass memory devices in each FCR.
These will be the primary repository for all fault information. During non-fault tolerant op-
erations a processor in each FCR will be capable of extracting this information for transmit-
tal to a fault reporting device attached to the FCR backplane bus. During fault tolerant op-
erations a single processor in the fault containment region will be responsible for extracting
fault information from its local mass memory and for communicating that information to the
system manager via fault tolerant message exchanges. The system manager will format the
information for the displays.
Fault status will be reported on three different types of displays: a cockpit display unit
(CDU), a portable intelligent maintenance aid (PIMA) and a fault annunciator panel (FAP).
5.6.11.1. Cockpit Display Unit
The CDU is a CRT display with a small screen located in the cockpit for display of
system status to the vehicle operator. This display may have three levels of detail with re-
gard to the identification of a faulty component:
1) AFTA system status,
2) LRU level status or
3) LRM level status.
The AFTA system status is merely an indicator representing the GO/NO GO status of
the AFTA system. The AFTA status represents the availability of system components to
achieve the minimum dispatch complement (MDC) required for the mission critical opera-
tion.
Figure 5-41. AFTA System Level Display
The LRU level displays the status of each fault containment region. This status is es-
sentially dependent upon the availability of shared components within the fault containment
region such as the FCR backplane bus or the Network Element. The failure of any of these
components renders a NO GO indication. Because the ability of the AFTA to achieve MDC
is not wholly dependent upon the availability of the LRU but depends upon the availability
of processors and I/O devices as well, the display of the LRU status should also include the
display of the AFTA system status.
[Display layout: AFTA system status, with the status of LRU 1, LRU 2, LRU 3, and LRU 4 below it]
Figure 5-42. LRU Level Display
The LRM level is the most detailed because it represents not only the status of the Network
Element but all other individual components (that is, processors, I/O devices, power
conditioner, mass memory device, and FCR backplane bus) as well. This display option
can precisely depict the cause of some failures. For instance, if the AFTA system status
indicated a NO GO state, the LRM level depicts precisely which component failures generated
this state. It may have been caused by the failure of a single crucial component such as
a Network Element or it may have resulted from attrition through any combination of I/O
device or processor failures. A possible representation for the LRM display is depicted in
Figure 5-43.
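The relationship between the display levels can be sketched in Python as below. Representing the minimum dispatch complement as a required count per component type is an assumption made for illustration, as are the function names.

```python
from typing import Dict


def lru_status(shared_component_ok: Dict[str, bool]) -> str:
    """LRU level: NO GO if any shared component (for example the FCR
    backplane bus or the Network Element) has failed."""
    return "GO" if all(shared_component_ok.values()) else "NO GO"


def system_status(available: Dict[str, int],
                  minimum_dispatch_complement: Dict[str, int]) -> str:
    """System level: GO only if every component type meets the MDC.

    Both arguments map a component type to a count (assumed representation).
    """
    meets_mdc = all(available.get(kind, 0) >= needed
                    for kind, needed in minimum_dispatch_complement.items())
    return "GO" if meets_mdc else "NO GO"
```

The second function shows why a NO GO at system level can arise from attrition alone: every LRU may report GO while the processor count has fallen below the MDC.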
[Display layout: AFTA System Status above the individual LRM statuses of each LRU]
Figure 5-43. LRM Level Display
The CDU will only be updated while the system is in either a standby or operational
mode. Communication with the CDU requires that the AFTA be operating synchronously
with fault tolerant message exchanges.
5.6.11.2. Portable Intelligent Maintenance Aid
The PIMA is a unit specifically dedicated to aid in maintenance diagnostics. Ideally, it
would resemble a laptop computer with a display, keyboard or buttons, and a printer. It is
employed to initiate maintenance diagnostic testing (that is, M-BIT), to interrogate the
AFTA for detailed fault information logged during operations for display or printing pur-
poses and to extract maintenance records for each component.
The PIMA is plugged into a socket which is located on the outside of each LRU as well
as at other strategic vehicle locations. However, for flexibility in performing maintenance
operations, there are essentially two connection options - AFTA system level and LRU
level. The distinction in these options revolves around the nature of the status of the
AFTA. With the AFTA system level option, a maintenance officer would plug the PIMA
into any socket and would be able to extract fault and maintenance information system-wide
(that is, from all fault containment regions). Alternatively, with the LRU level option the
PIMA would require plugging into the specific LRU from which fault and maintenance
information was desired. The former option would require that the AFTA hardware be in
standby mode in order to communicate all system information to the selected socket. The
option is selectable from the PIMA.
5.6.11.3. Fault Annunciator Panel
The FAP is a panel displaying the "go/no go" status of each component in a fault con-
tainment region. The panels are physically located in each LRU in close proximity to the
LRMs. It is implemented as a series of mechanical switches which are controlled by the
AFTA and which maintain their status when power is turned off. Each switch corresponds
to a single LRM in an easily identifiable pattern so that an LRM which is designated as a
faulty component can be readily identified and extracted for maintenance or replacement.
5.7. I/O Services
The Army Fault Tolerant Architecture I/O Services provide efficient and reliable
communication between the user and external I/O devices (sensors and actuators). The Services
are logically segmented into two functional modules: the I/O User Interface and the I/O
Communication Manager (illustrated in Figure 5-44). Applications engineers use the I/O User
Interface to define the required I/O activity during the specification phase. During the
execution phase, the I/O Communication Manager controls the processing of the I/O requests.
The I/O User Interface and I/O Communication Manager are dependent processes, as
depicted in Figure 5-45. The Interface interacts with the application tasks to create an I/O
request database. Further, the Interface and Communication Manager exchange control and
status information; the output data and control commands are destined for the I/O devices
while the input and status data are sent to the application tasks. Additionally, the
Communication Manager retrieves information from the I/O request and I/O device databases and
interchanges data with the I/O devices.
[Diagram: Applications Engineer and I/O User Interface (Specification); AFTA I/O
Communication Manager and I/O Devices (Execution)]
Figure 5-44. The AFTA I/O Services
Figure 5-45. The I/O User Interface and I/O Communication Manager
The preliminary design of the AFTA I/O User Interface is detailed in Section 5.7.1
while the I/O Communication Manager is described in Section 5.7.2. Sections 5.7.3 and
5.7.4 use examples to clarify the description of the AFTA I/O Services: Section 5.7.3
highlights the dependence between I/O requests' execution and processing times.
5.7.1. The AFTA I/O User Interface
The discussion of the I/O User Interface is separated into three sections: I/O User
View, I/O Request Construction, and I/O Data Access.
5.7.2. Input/Output User View
There are three desired characteristics of the AFTA I/O process. First, the load mod-
ules of different members of a redundant VG must be identical, even if only a subset of the
members actually execute the I/O operation. The second requirement is that the control
flows of redundant VGs executing I/O must be similar if not identical, even if only a subset
of the members actually execute the I/O operation; heterogeneous I/O must not be allowed
to induce sufficient skew to force the desynchronization of a redundant VG. Finally, when
redundant I/O is accessed, it is important that the copies of the I/O device be accessed at
very close to the same time.
In addition, it is currently planned that all I/O activity will be synchronized with frame
boundaries. That is, even though I/O requests may be completed at any time within the mi-
nor frames, their data will only be exchanged at the beginning and/or end of the frames.
5.7.3. I/O Request Construction
The AFTA I/O User Interface is a flexible framework which is easily tailorable to meet
the reliability and performance requirements of avionics applications that access external
devices. The Interface is easy to use; the applications designer can specify the I/O activity
in a straightforward manner. In addition, the Interface provides all the tools necessary to
meet AFTA's I/O needs.
The AFTA I/O Services can either communicate directly or indirectly with I/O devices
(sensors and actuators). Direct communication is achieved by sending data and command
information immediately to the device. Indirect communication utilizes an I/O controller to
access a device. This intervening mechanism accepts data and control commands from the
VG and then manages the I/O operation.
The AFTA I/O Services support two general types of I/O activity: sequential and
concurrent. Sequential I/O requires that the VG completely supervise the activity; that is, it
must block itself until the I/O operation has finished. Accordingly, the VG and the I/O
devices are tightly synchronized during the I/O activity. This is necessary to communicate
with I/O controllers or devices that have limited processing capabilities such as A/D
converters or "dumb terminals".
Alternatively, concurrent I/O allows the VG to perform other tasks while the I/O is
being processed. The VG downloads data to the controller, sends a "start" command, and
then executes another process. After the I/O has completed, the VG collects the resultant
input data. The concurrent I/O capability is provided to maximize AFTA's processing
throughput. To permit this parallel I/O and VG processing, smart hardware such as an
Ethernet or 1553 controller is necessary.
The applications engineer defines the required I/O activity. This is accomplished by
specifying one or more I/O requests. The I/O specifications are constructed in a hierarchical
manner, beginning with transactions, continuing with chains, and ending with the I/O
requests. Figure 5-46 depicts these components and how they are related.
5.7.4. I/O Transactions
A transaction is an autonomous command (or command and response) sequence,
permitting interaction between a VG and an I/O device (IOD). In general, an application
can create three types of transactions:
• An input transaction which is a sequence of instructions that waits for information
from an IOD.
• An output transaction which consists of a sequence of instructions that sends
information to an IOD and does not expect a response.
• An input/output transaction which involves a sequence of instructions that requests
information from an IOD followed by a sequence of instructions that
waits for the IOD response.
If the transaction directly accesses an I/O device or controller that has limited processing
capability, then the VG must perform a sequential I/O operation. In contrast, if the device
or controller is relatively intelligent, then the VG can execute a concurrent I/O transaction.
The parameters necessary to specify input, output, and input/output transactions, together
with examples of their initialization, are illustrated in Figures 5-47, 5-48, and 5-49,
respectively. These fields are defined as follows:
• Transaction_Type. This parameter indicates whether the transaction is an input,
output, or input/output operation.
[Diagram: the Applications Engineer defines I/O transactions, which are grouped into I/O
chains, which are grouped into I/O requests]
Figure 5-46. I/O Transactions, I/O Chains, and I/O Requests
• IOD_Identifier. This is the I/O device identifier. It is used to indicate the
corresponding device driver and determine the device's address.
• Num_Input_Bytes. The number of input data bytes that are expected by the
transaction.
• Num_Output_Bytes. This parameter specifies the number of output data bytes
that will be transmitted by the transaction.
• Dynamic_or_Static. This field indicates whether the output data for this
transaction is dynamic (changes with time) or static (time-invariant).
• Time_Out. This is the worst case time that is waited by the VG or an I/O
controller for the arrival of an incoming data byte (in microseconds).
• Input_Buffer. Address of the buffer on the VG in which the input data will be
stored.
• Output_Buffer. Address of the buffer on the VG from which the output data
will be transmitted.
• SC_Transactions. An array of transaction identifiers that depicts the transactions
that are involved in this transaction's I/O source congruency (SC) algorithm.
• SC_Source. This parameter identifies the source congruency algorithm that will
be used. The options are: From_A, From_B, From_C, From_D, and Voted.
INPUT_TRANSACTION_INFORMATION :=
    ( TRANSACTION_TYPE  => INPUT;
      IOD_IDENTIFIER    => GUIDANCE_COMMAND;
      NUM_INPUT_BYTES   => 12;
      TIME_OUT          => 128;
      INPUT_BUFFER      => BUFFER'ADDRESS;
      SC_TRANSACTIONS   => SC_TRANS_ARRAY;
      SC_SOURCE         => FROM_A);
Figure 5-47. An Input Transaction Record
OUTPUT_TRANSACTION_INFORMATION :=
    ( TRANSACTION_TYPE   => OUTPUT;
      IOD_IDENTIFIER     => ENGINE_ACTUATOR;
      NUM_OUTPUT_BYTES   => 6;
      DYNAMIC_OR_STATIC  => DYNAMIC;
      OUTPUT_BUFFER      => BUFFER'ADDRESS);
Figure 5-48. An Output Transaction Record
INPUT_OUTPUT_TRANSACTION_INFORMATION :=
    ( TRANSACTION_TYPE   => INPUT_OUTPUT;
      IOD_IDENTIFIER     => ALTITUDE_SENSOR;
      NUM_INPUT_BYTES    => 16;
      NUM_OUTPUT_BYTES   => 2;
      DYNAMIC_OR_STATIC  => STATIC;
      TIME_OUT           => 255;
      INPUT_BUFFER       => IN_BUFFER'ADDRESS;
      OUTPUT_BUFFER      => OUT_BUFFER'ADDRESS;
      SC_TRANSACTIONS    => SC_TRANS_ARRAY;
      SC_SOURCE          => VOTED);
Figure 5-49. An Input/Output Transaction Record
CREATE_TRANSACTION ( TRANSACTION_ID,
                     TRANSACTION_INFORMATION );
Figure 5-50. The Create_Transaction Procedure
After the transaction's parameters have been defined, the Create_Transaction procedure
(shown in Figure 5-50) must be invoked to inform the AFTA I/O Services of the
transaction's existence. This call requires two parameters:
• Transaction_ID. An identifier that is returned to the application by the Interface,
allowing the transaction to be easily referenced in the future.
• Transaction_Information. The record, described above, which depicts the values
of the transaction's fields.
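The record-then-register pattern described above can be sketched in Python. This is not the AFTA interface: the class and field names are hypothetical, the buffer address and source congruency fields are omitted for brevity, and the validation rules are assumptions that follow from the field definitions.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Optional


@dataclass
class TransactionInfo:
    """Illustrative subset of the transaction record fields."""
    transaction_type: str                 # "INPUT", "OUTPUT", or "INPUT_OUTPUT"
    iod_identifier: str
    num_input_bytes: int = 0
    num_output_bytes: int = 0
    dynamic_or_static: Optional[str] = None
    time_out_us: Optional[int] = None     # microseconds, as in the text


class IOUserInterface:
    """Hypothetical registry standing in for the AFTA I/O User Interface."""

    def __init__(self) -> None:
        self._next_id = count(1)
        self.transactions: Dict[int, TransactionInfo] = {}

    def create_transaction(self, info: TransactionInfo) -> int:
        """Register a transaction and return its identifier."""
        if info.transaction_type not in ("INPUT", "OUTPUT", "INPUT_OUTPUT"):
            raise ValueError("unknown transaction type")
        # Assumed rule: a transaction that reads must expect input bytes.
        if info.transaction_type != "OUTPUT" and info.num_input_bytes <= 0:
            raise ValueError("input byte count required")
        # Assumed rule: a transaction that writes must supply output bytes.
        if info.transaction_type != "INPUT" and info.num_output_bytes <= 0:
            raise ValueError("output byte count required")
        transaction_id = next(self._next_id)
        self.transactions[transaction_id] = info
        return transaction_id
```

The returned identifier plays the role of Transaction_ID: it is produced by the interface rather than chosen by the application.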
5.7.5. I/O Chains
A chain is a set of transactions, grouped to efficiently utilize the communications
bandwidth. Chains are executed as autonomous units. The transactions of a chain are
executed serially, and all are either sequential or concurrent I/O. If the chain is a concurrent
I/O operation, then all transactions in the chain must utilize the same I/O controller.
CREATE_CHAIN ( CHAIN_ID,
               CHAIN_I/O_TYPE     => CONCURRENT;
               TRANSACTION_ARRAY  => TRANSACTION_ID_ARRAY );
Figure 5-51. The Create_Chain Procedure
After a chain's transactions have been specified, the Create_Chain procedure, depicted
in Figure 5-51, should be executed to link the chain to its transactions. It has three
parameters:
• Chain_ID. An identifier that is returned to the application allowing the chain to
be easily referenced.
• Chain_I/O_Type. A field that indicates whether the I/O is sequential or concurrent.
• Transaction_Array. The array of transactions that comprises the chain.
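The chain constraints stated above (one I/O type per chain, and a single I/O controller for a concurrent chain) can be checked as in this Python sketch. The Transaction type, its io_controller field, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Transaction:
    transaction_id: int
    io_controller: Optional[str]   # None for a directly accessed device (assumption)


def create_chain(chain_io_type: str, transactions: List[Transaction]) -> List[int]:
    """Validate a chain and return its transaction IDs in serial execution order."""
    if chain_io_type not in ("SEQUENTIAL", "CONCURRENT"):
        raise ValueError("chain I/O type must be SEQUENTIAL or CONCURRENT")
    if not transactions:
        raise ValueError("a chain needs at least one transaction")
    if chain_io_type == "CONCURRENT":
        controllers = {t.io_controller for t in transactions}
        # All transactions of a concurrent chain must share one controller.
        if len(controllers) != 1 or None in controllers:
            raise ValueError("a concurrent chain must use a single I/O controller")
    return [t.transaction_id for t in transactions]
```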
5.7.6. I/O Requests
An I/O request (IOR) is one or more chains of transactions. All I/O is structured as I/O
requests.
Each chain of an I/O request is executed nearly simultaneously. This requirement is
desirable because it allows two or more chains to request redundant information. If all
chains are executed within a negligible skew, then data from redundant sensors can be re-
trieved at nearly the same time and the information can be subsequently compared to mask
invalid data.
CREATE_I/O_REQUEST ( IOR_ID,
                     RATE_OF_IOR       => R2;
                     EXECUTION_FRAMES  => (0, 4);
                     CHAIN_ARRAY       => CHAIN_ID_ARRAY;
                     EXECUTION_TIME    => IOR_EXEC_TIME );
Figure 5-52. The Create_I/O_Request Procedure
After an I/O request's chains have been specified, the Create_I/O_Request procedure,
illustrated in Figure 5-52, should be invoked. This call has five parameters:
• IOR_ID. An identifier that is returned to the application allowing the I/O
request to be easily referenced.
• Rate_of_IOR. A parameter that indicates the rate group of the I/O request,
either RG4 (100 Hz.), RG3 (50 Hz.), RG2 (25 Hz.), or RG1 (12.5 Hz.).
• Execution_Frames. The frames in which this I/O request will be executed
(started in the beginning of the frame). It will be some combination of numbers
ranging from (-1) through 7. If the field is set equal to (-1), then the I/O request
is started in the beginning of the frames in which it is processed. Strict control
over the execution and processing frames is desirable to efficiently manage
combinations of I/O requests that stress the 10 ms. throughput bounds.
• Chain_Array. The suite of chains that comprises the I/O request.
• Execution_Time. The worst case execution time for the I/O request. It is equal
to the I/O request time out; that is, the total amount of time the VG waits before
presuming that the I/O request has completed execution.
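The relationship between rate group and execution frames can be worked out arithmetically. The Python sketch below assumes a 10 ms minor frame and a cycle of eight frames numbered 0 through 7; the function names are hypothetical. A 25 Hz request has a 40 ms period, which is four minor frames, so it executes in frames 0 and 4. This agrees with the (0, 4) shown in Figure 5-52 if R2 in that figure denotes the 25 Hz rate group, which is an assumption.

```python
from typing import List

MINOR_FRAME_MS = 10.0     # minor frame length stated in the text
FRAMES_PER_CYCLE = 8      # frames are numbered 0 through 7
RATE_HZ = {"RG4": 100.0, "RG3": 50.0, "RG2": 25.0, "RG1": 12.5}


def period_in_frames(rate_group: str) -> int:
    """Number of minor frames between successive executions."""
    period_ms = 1000.0 / RATE_HZ[rate_group]
    return round(period_ms / MINOR_FRAME_MS)


def execution_frames(rate_group: str, first_frame: int = 0) -> List[int]:
    """Frames within one eight-frame cycle in which the request is started."""
    period = period_in_frames(rate_group)
    if not 0 <= first_frame < period:
        raise ValueError("first frame must lie within one period")
    return list(range(first_frame, FRAMES_PER_CYCLE, period))
```

Choosing a nonzero first frame staggers requests of the same rate group across the cycle, which is one way to keep any single 10 ms frame from being overloaded.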
5.7.7. I/O Data Access
The applications user perceives all I/O devices to be memory mapped even though
many devices may be connected to the VG through I/O controllers. The appearance of
memory mapped I/O is attained by establishing data sections in VG memory to emulate the
data regions in the I/O devices or controllers. These memory buffers are allocated by the
application and their addresses are passed to the I/O Services when the transactions are
defined. The applications engineer communicates to I/O devices by simply writing to and
reading from the corresponding local memory regions.
The reading/writing protocols between the VG and the I/O device are transparent to the
user. When the I/O request is executed, the I/O Communication Manager reads data from
the request's output buffers and sends it, along with command information, to the
destination I/O devices. After the I/O request has been processed, the data returned by the devices
is stored into the application's input buffers.
Since the application tasks and the I/O Communication Manager share I/O data buffers,
both processes could simultaneously update the same memory area. For example, the
following scenario could occur: (1) an application task begins to modify a data buffer; (2) it is
preempted by the I/O Communication Manager before it completes the update; and (3) the
I/O Communication Manager begins to read or write into the buffer. Such simultaneous
modification of an I/O request buffer could cause inconsistent or corrupted data to be sent
to the I/O devices or read by the application. To prevent (or detect) this condition, the I/O
Interface provides procedures that allow the application to lock and unlock the data buffers.
In addition to input data, the I/O Communication Manager receives status information
from the I/O devices. This status is recorded in the VG's memory after the I/O request has
been executed and processed. (Buffers are allocated by the I/O Interface to store the I/O
status when each request is created.) The I/O User Interface provides routines to allow the
application to read this data.
The buffer control and status retrieval procedures are discussed in more detail in
Sections 5.7.1.3.1 and 5.7.1.3.2, respectively.
5.7.8. The Buffer Control Procedures
To ensure that data buffers for an I/O request are accessed mutually exclusively, the I/O
Interface allows the user to lock the request's data regions. Prior to reading or writing
data, the application must invoke the Lock_I/O_Request_Buffer procedure (shown in
Figure 5-53) to reserve the memory.
LOCK_I/O_REQUEST_BUFFER ( IOR_ID,
                          IN_USE );
Figure 5-53. The Lock_I/O_Request_Buffer Procedure
The procedure requires two parameters, one provided by the user and the other returned
by the Interface. The IOR_ID field is specified by the application to identify the request.
The In_Use field is a boolean returned by the procedure indicating whether or not the
request's buffers are currently being accessed by the Communication Manager.
If an I/O request's buffers are not "in use" when Lock_I/O_Request_Buffer is executed,
then they are reserved by the procedure. After the application data has been read or
written, the buffer must be unlocked. This is done by calling the
Unlock_I/O_Request_Buffer procedure. This routine frees the buffers and thus allows the I/O
Communication Manager or another application task to modify the memory region. The
procedure, which is illustrated in Figure 5-54, requires that the I/O request ID be provided
by the user to identify the buffer.
UNLOCK_I/O_REQUEST_BUFFER ( IOR_ID );
Figure 5-54. The Unlock_I/O_Request_Buffer Procedure
If the application tasks and I/O Communication Manager contend for one or more
buffers (i.e., In_Use is true), then a fatal scheduling error has occurred. If a collision
occurs, the user has to redesign the application tasks, the I/O requests, or both, because
contention recovery mechanisms are not provided by the I/O Services.
As discussed, the Lock_I/O_Request_Buffer procedure informs the user of one of two
possible contention scenarios: the I/O Communication Manager is using an I/O request's
buffers when an application task wants to access them. The second scenario is the reverse
case: the I/O Communication Manager wants to read or store information but an application
task has locked the buffers. This is also a fatal scheduling error of which the user must be
informed. If this type of fault occurs, an exception, I/O_Buffer_Contention, is raised by
the I/O Services and passed to the application. To recognize the error, the user must write
an exception handler in each application task. As with In_Use, I/O_Buffer_Contention
requires that either the application tasks or the I/O requests be modified. (Figure 5-55
depicts the declaration of the exception and an example exception handler.)
Exception Declaration:
    I/O_BUFFER_CONTENTION : EXCEPTION;
Exception Handler:
    WHEN I/O_BUFFER_CONTENTION => WRITE_ERROR_STATUS;
                                  STOP_PROGRAM_EXECUTION;
Figure 5-55. The I/O_Buffer_Contention Exception and Error Handler
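The two contention scenarios can be modeled with a small Python sketch. The class and method names are hypothetical, and a single ownership field stands in for whatever bookkeeping the I/O Services actually use; only the two reported outcomes (an In_Use result for the application, an exception when the Communication Manager is blocked) come from the text.

```python
from typing import Optional


class IOBufferContention(Exception):
    """Stands in for the I/O_Buffer_Contention exception."""


class RequestBuffers:
    """Tracks which party currently holds one I/O request's buffers."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None   # None, "application", or "manager"

    def lock_io_request_buffer(self) -> bool:
        """Application-side lock; returns In_Use."""
        if self.owner == "manager":
            return True                    # first scenario: manager is using the buffers
        self.owner = "application"
        return False

    def unlock_io_request_buffer(self) -> None:
        if self.owner == "application":
            self.owner = None

    def manager_acquire(self) -> None:
        """Communication Manager access; raises in the reverse scenario."""
        if self.owner == "application":
            raise IOBufferContention("application task holds the buffers")
        self.owner = "manager"

    def manager_release(self) -> None:
        if self.owner == "manager":
            self.owner = None
```

Neither outcome is recovered from at run time; as the text notes, both indicate a scheduling error that must be designed out.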
As mentioned earlier, after an application's I/O requests have been executed, the input
data from the I/O devices is deposited into the application's input buffers by the I/O
Communication Manager. The data, however, is not necessarily error-free. The user must
access the I/O request status information, which is also recorded by the Communication
Manager, to determine if any of the data is faulty.
The I/O User Interface allows the application tasks to retrieve status information on an
I/O request, chain, or transaction basis. The procedures that enable this access are
illustrated in Figure 5-56.
CHECK_I/O_REQUEST_FOR_ERRORS ( IOR_ID,
                               IOR_HAD_ERROR );
CHECK_CHAIN_FOR_ERRORS ( CHAIN_ID,
                         CHAIN_HAD_ERROR,
                         ALL_TRANSACTIONS_ARE_BAD,
                         CHAIN_DID_NOT_COMPLETE );
CHECK_TRANSACTION_FOR_ERRORS ( TRANSACTION_ID,
                               TRANSACTION_HAD_ERROR,
                               I/O_DEVICE_STATUS );
Figure 5-56. The I/O User Interface Status Retrieval Procedures
The Check_I/O_Request_For_Errors procedure indicates whether or not any of the I/O
request's chains encountered an error during their execution. The IOR_Had_Error boolean
is set to true if an error has occurred.
The Check_Chain_For_Errors procedure indicates whether a portion of the chain's data is
faulty; it sets the Chain_Had_Error field to inform the application. It also returns other
status information: (1) All_Transactions_Are_Bad, a flag that indicates whether or not all
transactions in this chain have errors; and (2) Chain_Did_Not_Complete, a boolean which
informs the user that, for some reason, some of the chain's transactions were not executed.
The Check_Transaction_For_Errors procedure returns the Transaction_Had_Error
parameter to inform the user whether or not a transaction has an error. In addition, it
provides the I/O_Device_Status field to indicate the status of the I/O device when the
transaction was executed (either active or failed).
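The way transaction status rolls up to chain and request status can be sketched in Python. The encoding of a transaction result as True (error), False (no error), or None (not executed) is an assumption, as is the choice not to count an unexecuted transaction as an error.

```python
from typing import List, Optional, Tuple

# One entry per transaction: True = had error, False = no error,
# None = not executed (assumed encoding).
ChainResults = List[Optional[bool]]


def check_chain_for_errors(results: ChainResults) -> Tuple[bool, bool, bool]:
    """Return (chain_had_error, all_transactions_are_bad, chain_did_not_complete)."""
    chain_did_not_complete = any(r is None for r in results)
    chain_had_error = any(r is True for r in results)
    all_transactions_are_bad = bool(results) and all(r is True for r in results)
    return chain_had_error, all_transactions_are_bad, chain_did_not_complete


def check_io_request_for_errors(chains: List[ChainResults]) -> bool:
    """IOR_Had_Error: true if any chain of the request had an error."""
    return any(check_chain_for_errors(chain)[0] for chain in chains)
```

An application would typically check at request level first and descend to chain and transaction level only when an error is reported.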
5.7.9. The AFTA I/O Communication Manager
The AFTA I/O Communication Manager supervises the execution and processing of the
I/O requests. It involves two key components: the Nonpreemptable I/O Dispatcher and the
I/O Request Tasks. These processes are illustrated in Figure 5-57.
The Nonpreemptable I/O Dispatcher manages the execution of the I/O requests whereas
the I/O Request Tasks perform the error detection processing and return the data and status
information to the application tasks. These processes are discussed in Sections 5.7.9.1 and
5.7.9.2, respectively. In addition, the scheduling of the I/O Dispatcher and the Request
Tasks is outlined in Section 5.7.9.3.
The Main Components of the I/O Communication Manager
NONPREEMPTABLE I/O DISPATCHER              I/O REQUEST TASKS (Preemptable)
• R4 Task (100 Hz.).                        • Differing Rate Groups.
• Executes Concurrent I/O.                  • Reads and Processes Concurrent I/O.
• Executes and Reads Sequential I/O.        • Processes Sequential I/O.
Figure 5-57. The I/O Communication Manager
5.7.9.1. The Nonpreemptable I/O Dispatcher
The Nonpreemptable I/O Dispatcher is a task on the VG that manages the execution of
the I/O instructions that cannot be interrupted. For the AFTA, two types of nonpreempt-
able instruction sequences exist: (1) the execution and reading of sequential I/O; and (2) the
execution of concurrent I/O.
Sequential I/O must be carefully controlled by the VG, because the associated destination
I/O devices have limited processing and storage abilities. Furthermore, applications
that utilize sequential I/O often require that data be sent or received quickly and in
autonomous batches. If the VG is interrupted, then the I/O operation could be delayed
considerably. Thus, the execution of sequential I/O cannot be preempted. Additionally, since
these I/O devices have minimal memory capabilities, the input and status data for each
transaction must be read before a subsequent transaction can be executed. Therefore, the
reading of sequential I/O also cannot be interrupted.
In contrast to sequential I/O, concurrent I/O is managed by an intelligent I/O controller (IOC),
permitting the IOC and VG to run in parallel. The VG, however, must initiate the I/O
activity by sending a sequence of "start" instructions to the IOC. This sequence cannot be
interrupted if the I/O requests are to execute correctly. Accordingly, the Nonpreemptable
I/O Dispatcher must initiate all concurrent I/O.
To ensure nonpreemption, the I/O Dispatcher must complete in less than 10 ms.,
which is the minor frame. Thus, the application must design and organize its nonpreemptable
I/O activity such that the I/O Dispatcher does not exceed this constraint. In addition,
the Dispatcher cannot be interrupted by other I/O activity (because it would be delayed and
then possibly preempted); thus, it must have the highest priority of the I/O tasks.
The control flow of the Nonpreemptable I/O Dispatcher Task is illustrated in Figure 5-
58. The task is scheduled by the Rate Group Dispatcher every 10 ms. Since the type and
amount of I/O activity typically varies with each frame, the minor frame number must be
determined every time the task is executed. Frame tracking is accomplished by maintaining
a modulo 8 counter.
Once the frame number is identified, the Nonpreemptable I/O Task executes the associated
I/O requests. The concurrent I/O is executed before the sequential I/O. This allows
the VG to execute and process the sequential I/O while the associated I/O controllers are
processing the concurrent requests. Some I/O requests may be comprised of both sequential
and concurrent I/O chains (referred to as "mixed I/O requests"). They are executed by
the Nonpreemptable I/O Task after the concurrent I/O requests but before the sequential I/O
requests. This allows the mixed I/O chains to be executed nearly simultaneously while not
blocking the execution of the concurrent I/O requests. For clarity, the execution of mixed
I/O requests is not explicitly shown in Figure 5-58.
task NONPREEMPTABLE_I/O_DISPATCHER is
   FRAME_COUNTER : Integer := 0;
begin
   loop
      -- Wait for the I/O Dispatcher to be scheduled by the Rate Group
      -- Dispatcher.
      WAIT_FOR_SCHEDULE;
      case FRAME_COUNTER
         0 =>
            for I/O_cnt in Concurrent_I/O_Frame_0
            loop
               Execute_Concurrent_I/O (I/O_cnt);
            end loop;
            for I/O_cnt in Sequential_I/O_Frame_0
            loop
               Execute_Sequential_I/O (I/O_cnt);
            end loop;
            FRAME_COUNTER := 1;
         1 =>
            for I/O_cnt in Concurrent_I/O_Frame_1
            loop
               Execute_Concurrent_I/O (I/O_cnt);
            end loop;
            for I/O_cnt in Sequential_I/O_Frame_1
            loop
               Execute_Sequential_I/O (I/O_cnt);
            end loop;
            FRAME_COUNTER := 2;
         -- Frames 2 through 6 follow the same pattern.
         7 =>
            for I/O_cnt in Concurrent_I/O_Frame_7
            loop
               Execute_Concurrent_I/O (I/O_cnt);
            end loop;
            for I/O_cnt in Sequential_I/O_Frame_7
            loop
               Execute_Sequential_I/O (I/O_cnt);
            end loop;
            FRAME_COUNTER := 0;
      end case;
   end loop;
end NONPREEMPTABLE_I/O_DISPATCHER;
Figure 5-58. The Nonpreemptable I/O Dispatcher
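The control flow of Figure 5-58 can also be sketched in Python. This is an illustrative model only, not the AFTA implementation: the class name, the per-frame request tables, and the execute callback are hypothetical. The mixed I/O requests, which the figure omits, are placed between the concurrent and the sequential requests as the text describes.

```python
class NonpreemptableIODispatcher:
    """Illustrative model of the dispatcher in Figure 5-58 (names are hypothetical)."""

    FRAMES = 8  # minor frames per major frame (the modulo 8 counter)

    def __init__(self, concurrent, mixed, sequential):
        # Each argument is a list of 8 lists: the requests assigned to each minor frame.
        self.tables = (("concurrent", concurrent),
                       ("mixed", mixed),
                       ("sequential", sequential))
        self.frame_counter = 0

    def run_frame(self, execute):
        """Run once per 10 ms schedule event; return the frame number serviced."""
        frame = self.frame_counter
        # Concurrent requests go first so the IOCs work while the VG
        # processes the sequential I/O.
        for kind, table in self.tables:
            for request in table[frame]:
                execute(kind, request)
        self.frame_counter = (frame + 1) % self.FRAMES
        return frame


log = []
dispatcher = NonpreemptableIODispatcher(
    concurrent=[["c0"]] + [[] for _ in range(7)],
    mixed=[["m0"]] + [[] for _ in range(7)],
    sequential=[["s0"]] + [[] for _ in range(7)],
)
frames = [dispatcher.run_frame(lambda kind, req: log.append((kind, req)))
          for _ in range(9)]
```

After nine schedule events the counter has wrapped once, so the frame 0 requests have been executed twice, each time in the order concurrent, mixed, sequential.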
5.7.9.1.1. The I/O Request Tasks
I/O Request Tasks are primarily responsible for the two functions not completed by the
Nonpreemptable I/O Dispatcher: (1) processing sequential I/O; and (2) reading the concur-
rent I/O data and processing concurrent requests. Since these operations can be time-con-
suming and preempted by higher priority tasks, they are not performed by the I/O Dis-
patcher. (As mentioned earlier, the Dispatcher must complete its execution in less than 10
ms.)
To process sequential I/O, the I/O Request Task executes the error detection routines
and returns the input data and status information to the application; the redundancy man-
agement functions are not performed because the sequential input data has been previously
read and distributed to the VG by the Nonpreemptable I/O Dispatcher. In contrast, the I/O
Dispatcher does not read the concurrent input data; therefore, the concurrent I/O processing
invokes the redundancy management routines as well as executes error detection proce-
dures and stores the input data and status information.
An I/O Request Task is spawned for each I/O request that is created by the user. The
Tasks are scheduled by the Rate Group Dispatcher and each falls into one of four rate
groups:
• RG4 - 100 Hz. tasks.
• RG3 - 50 Hz. tasks.
• RG2 - 25 Hz. tasks.
• RG1 - 12.5 Hz. tasks.
As described in Section 5.3.1, the RG4 group has the highest priority of the preemptable
I/O; RG1 has the lowest. Low priority I/O processes are interrupted by higher priority
tasks at the 10 ms. minor frame boundaries. These lower priority tasks are resumed after
all higher priority processes have completed.
5.7.9.1.2. Dispatching
The Rate Group Dispatcher uses events to trigger the execution of the Nonpreemptable I/O
Dispatcher and the Preemptable I/O tasks as well as the other system and application tasks.
The Nonpreemptable I/O Dispatcher is the first I/O task to be scheduled in each frame to
ensure that it is not interrupted. It is, however, scheduled second overall. The Fault De-
tection, Identification and Reconfiguration (FDIR) task is executed before the Nonpreempt-
able I/O Task, because the configuration of the AFTA is important to the I/O Services and
FDIR's processing time is short and deterministic.
The execution order of the I/O Request Tasks depends on the request's rate group and
inter-group precedence. As discussed in Sections 5.7.2.2 and 5.3.1, RG4 groups are executed
before the RG3, RG2, and RG1 groups; RG3 prior to RG2 and RG1; etc. Moreover,
within each rate group, precedences determine the scheduling order; that is, tasks
with higher precedences execute before those with lower ones.
The AFTA Dispatcher and rate group scheduling paradigm are discussed in detail in
Sections 5.2 and 5.3.
5.7.9.2. AFTA Input/Output Services: Examples
Two examples are presented to illustrate the interdependence between an I/O request's
execution time and the frame in which the I/O request processing is performed. In each
example, four I/O requests have been created, one per rate group. For the illustration, a set of
parameters was assigned for each request. The Execution_Frames and Chain_Array
parameters were selected arbitrarily and do not affect the example. On the other hand, the
Execution_Time parameter, which specifies the amount of time necessary for the I/O request
to be executed and processed, was chosen carefully and greatly affects the scheduling of
the I/O requests. This field is varied to illustrate the I/O requests' execution dependence.
5.7.9.2.1. Example #1: All I/O Requests can be Completed in 10 ms.
In Example #1, all I/O tasks are begun and completed within one frame (FDIR is ne-
glected to clarify the example).
• Frames 0 - 7: The Nonpreemptable I/O Dispatcher and the RG4 I/O request are
executed and processed.
• Frames 1, 3, 5, 7: The RG3 request is executed and processed after the I/O
Dispatcher and RG4 tasks have completed.
• Frames 3, 7: The RG2 I/O request is executed and processed after the I/O
Dispatcher, RG4, and RG3 tasks have completed.
• Frame 7: The RG1 request is executed and processed after all other tasks have
completed.
This example, which is specified in Figure 5-59 and depicted in Figure 5-60, comprises
the baseline illustration of I/O scheduling in the rate group paradigm.
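The frame assignment in Example #1 can be expressed as a simple rule on the modulo 8 frame counter. The sketch below is inferred from the frame lists above; the function name is hypothetical, and in the actual system the assignment is governed by the parameters of each I/O request rather than by this fixed rule.

```python
def rate_groups_released(frame):
    """Rate groups whose I/O requests run in the given minor frame (0-7)."""
    groups = ["RG4"]              # 100 Hz: every minor frame
    if frame % 2 == 1:
        groups.append("RG3")      # 50 Hz: frames 1, 3, 5, 7
    if frame % 4 == 3:
        groups.append("RG2")      # 25 Hz: frames 3, 7
    if frame % 8 == 7:
        groups.append("RG1")      # 12.5 Hz: frame 7
    return groups


schedule = {frame: rate_groups_released(frame) for frame in range(8)}
```

Frame 7 is the most heavily loaded frame, since all four rate groups follow the Nonpreemptable I/O Dispatcher in it.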
5.7.9.2.2. Example #2: All I/O Requests cannot be Completed in 10 ms.
In Example #2, the Execution_Time parameters, which are presented in Figure 5-61,
are larger than in Example #1.
• I/O Request #1: Execution_Time equals 4.0 ms.; it was 1.3 ms.
• I/O Request #2: Execution_Time equals 3.2 ms.; it was 2.2 ms.
• I/O Request #3: Execution_Time equals 3.7 ms.; it was 1.9 ms.
• I/O Request #4: Execution_Time equals 3.6 ms.; it was 2.1 ms.
The longer execution times cause I/O request #1's execution/processing to be delayed
and I/O request #2 to be interrupted. Specifically,
I/O Request #1 (RG1): This request was supposed to execute in frame #7.
However, it was executed and processed in frame #0, because the processing
requirements for the I/O Dispatcher, RG4, RG3, and RG2 tasks consumed all
of frame #7.
I/O Request #2 (RG2): This request was started in frame #3 but did not finish.
As a result, it was preempted in the beginning of frame #4 by the higher priority
tasks, the Nonpreemptable I/O Dispatcher and RG4. After these processes
completed, I/O request #2 was resumed and completed. Similarly, I/O request
#2 did not complete in frame #7; it was preempted and subsequently completed
in frame #0.
Even though I/O requests #1 and #2 are not completed in their designated minor frame,
the major frame requirements are met. That is, I/O requests #1, #2, #3, and #4 fulfill their
respective 12.5 Hz., 25 Hz., 50 Hz., and 100 Hz. I/O requirements. This scheduling
example is illustrated in Figure 5-62.
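The behavior of the two examples can be reproduced with a small frame-by-frame simulation. The sketch below is an illustration under stated assumptions, not the AFTA scheduler: the function and table names are hypothetical, the execution time of the Nonpreemptable I/O Dispatcher is assumed to be 0.5 ms. (the report gives no value), and FDIR is neglected as in the examples. Times are integers in units of 0.1 ms., and frame 8 of the simulation corresponds to frame #0 of the next major frame.

```python
FRAME = 100  # minor frame length in units of 0.1 ms (10 ms)

# (task, priority (0 = highest), minor frames in which it is released)
RELEASES = [
    ("DISPATCHER", 0, range(8)),
    ("RG4", 1, range(8)),
    ("RG3", 2, (1, 3, 5, 7)),
    ("RG2", 3, (3, 7)),
    ("RG1", 4, (7,)),
]

# Execution times in units of 0.1 ms. The request values come from Figures 5-59
# and 5-61; the 0.5 ms dispatcher time is an ASSUMPTION.
EXAMPLE_1 = {"DISPATCHER": 5, "RG4": 21, "RG3": 19, "RG2": 22, "RG1": 13}
EXAMPLE_2 = {"DISPATCHER": 5, "RG4": 36, "RG3": 37, "RG2": 32, "RG1": 40}


def late_jobs(exec_time, frames=9):
    """Return (task, release_frame, completion_frame) for jobs that overran."""
    pending = []  # each job: [priority, release_frame, task, remaining_time]
    late = []
    for frame in range(frames):
        for task, priority, release_frames in RELEASES:
            if frame % 8 in release_frames:
                pending.append([priority, frame, task, exec_time[task]])
        pending.sort()  # highest priority first, then oldest release
        budget = FRAME
        for job in pending:
            ran = min(budget, job[3])
            job[3] -= ran
            budget -= ran
            if job[3] == 0 and job[1] != frame:
                late.append((job[2], job[1], frame))
        pending = [job for job in pending if job[3] > 0]
    return late
```

With the Example #1 times no job overruns its frame. With the Example #2 times the RG2 request released in frame 3 completes in frame 4, and the RG2 and RG1 requests released in frame 7 complete in the following frame 0, as described above.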
CREATE_I/O_REQUEST( IOR #1,
                    RATE_OF_IOR      => R1,
                    EXECUTION_FRAMES => -1,
                    CHAIN_ARRAY,
                    EXECUTION_TIME   => 1.3 ms );
CREATE_I/O_REQUEST( IOR #2,
                    RATE_OF_IOR      => R2,
                    EXECUTION_FRAMES => -1,
                    CHAIN_ARRAY,
                    EXECUTION_TIME   => 2.2 ms );
CREATE_I/O_REQUEST( IOR #3,
                    RATE_OF_IOR      => R3,
                    EXECUTION_FRAMES => -1,
                    CHAIN_ARRAY,
                    EXECUTION_TIME   => 1.9 ms );
CREATE_I/O_REQUEST( IOR #4,
                    RATE_OF_IOR      => R4,
                    EXECUTION_FRAMES => -1,
                    CHAIN_ARRAY,
                    EXECUTION_TIME   => 2.1 ms );
Figure 5-59. I/O Requests for Example #1
Figure 5-60. Example #1 (timeline of 10 ms. frames showing Nonpreemptable I/O, RG3, RG2, and RG1 processing)
CREATE_I/O_REQUEST( IOR #1,
                    RATE_OF_IOR      => R1,
                    EXECUTION_FRAMES => -1,
                    CHAIN_ARRAY,
                    EXECUTION_TIME   => 4.0 ms );
CREATE_I/O_REQUEST( IOR #2,
                    RATE_OF_IOR      => R2,
                    EXECUTION_FRAMES => -1,
                    CHAIN_ARRAY,
                    EXECUTION_TIME   => 3.2 ms );
CREATE_I/O_REQUEST( IOR #3,
                    RATE_OF_IOR      => R3,
                    EXECUTION_FRAMES => -1,
                    CHAIN_ARRAY,
                    EXECUTION_TIME   => 3.7 ms );
CREATE_I/O_REQUEST( IOR #4,
                    RATE_OF_IOR      => R4,
                    EXECUTION_FRAMES => -1,
                    CHAIN_ARRAY,
                    EXECUTION_TIME   => 3.6 ms );
Figure 5-61. I/O Requests for Example #2
Figure 5-62. Example #2 (timeline of Frames 2 through 7; annotation: "Processing of RG2, resumed and completed")
6. Fault-Tolerant Data Bus
The fault-tolerant data bus (FTDB) is a local-area network designed around the same
principles of Byzantine resilience as the AFTA. The FTDB is a highly reliable end-to-end
communication system interconnecting the AFTA, other fault-tolerant computers, the
Silicon Graphics display processor, the Merit Technologies MT-1 VME system, the
real-time AI system, sensor and image processors, and flight and engine controls.
6.1. Objective and Approach
The objective of the fault tolerant data bus is to provide an optimal internetworking
system between simplex and redundant processing sites. The approach taken in this report
to develop such a system is to first identify requirements to which the FTDB should
conform. Next, architectural options for the FTDB are described and evaluated with respect
to figures of merit. Promising options are described in greater detail as a proposal for an
FTDB architecture. Finally, a development plan for execution under the Detailed Design
and Brassboard Fabrication phases of the AFTA program is described.
The following FTDB architectural options are investigated.
broadcast buses
token rings
circuit switched network
packet switched network
fiber optic networks
authentication protocols
The FTDB architectural options are evaluated according to the following figures of
merit.
bandwidth
latency
determinism
compatibility with applicable standards
fault tolerance
topological flexibility
complexity
development cost/effort/risk
The following standards are investigated and considered for use in the FTDB.
AIPS Inter-computer network
JIAWG high-speed data bus
SAVA high-speed data bus
FDDI
SAFENET II
A design for the FTDB architecture is presented based on the findings of the
architecture survey. The conceptual design describes an end-to-end communication system
for use between the AFTA, other fault-tolerant architectures, and simplex sites. A plan
showing the development of a prototype FTDB for the AVRADA hotbench is described.
6.2. Fault-Tolerant Data Bus Requirements
The anticipated requirements for the fault-tolerant data bus are given below. Potential
FTDB implementations are evaluated based on their adherence to these requirements. The
requirements given below represent goals based on the anticipated needs of critical real-time
systems. Some requirements may not be achievable due to time constraints, money
constraints, or standards restrictions.
6.2.1. Packet Requirements
The requirements described in this section detail restrictions on packets in the FTDB.
6.2.1.1. Word Length
The FTDB shall incorporate a basic word length of 8 bits.
6.2.1.2. Packet Length
The FTDB shall allow packets to be any length between 1 basic data word and 2048
basic data words.
6.2.2. Network Control Requirements
This section details requirements for media access control, station addressing, and flow
control.
6.2.2.1. Access Control Modes
Access to the FTDB shall be controlled using a distributed and symmetric access
protocol. A station on the FTDB shall be able to obtain access to the network within an a
priori determined latency.
6.2.2.2. Address Modes
The FTDB shall support the following addressing modes:
Physical addressing-The FTDB shall support physical addressing of up to 65535
stations.
Logical addressing-The FTDB shall support logical addressing of up to 65535 virtual
station groups.
Broadcast addressing-The FTDB shall support broadcast addressing of all stations
attached to the same physical FTDB.
6.2.2.3. Uncontrolled Transmit Inhibit
Each station on the FTDB shall provide a mechanism to prevent uncontrolled
transmission (babbling). Transmission shall be inhibited if a continuous transmission
exceeds 1.1 times the maximum message length. Transmission shall also be inhibited if a
station exceeds its allotted transmission frame.
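The inhibit rule can be written as a simple predicate. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes that the "maximum message length" equals the 2048-word maximum packet length of Section 6.2.1.2, which the requirement does not state explicitly.

```python
MAX_PACKET_WORDS = 2048  # Section 6.2.1.2; ASSUMED to be the maximum message length


def must_inhibit(words_sent_continuously, within_allotted_frame):
    """Return True if the transmitter should be inhibited under Section 6.2.2.3."""
    # Integer form of: words > 1.1 * MAX_PACKET_WORDS (avoids float rounding).
    too_long = 10 * words_sent_continuously > 11 * MAX_PACKET_WORDS
    return too_long or not within_allotted_frame
```

Under this assumption the threshold falls between 2252 and 2253 words of continuous transmission.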
6.2.2.4. Flow Control
The FTDB shall implement a flow control mechanism so that a station with full receive
buffers will prevent packets from being delivered to that station. Packets must not be lost
due to a receive buffer full condition.
6.2.3. Network Function Requirement
This section describes the functions that the FTDB provides to subscribing stations.
6.2.3.1. Broadcast and Multicast Functions
The FTDB shall support broadcast and multicast functions. Only stations connected to
the same physical medium as the source of the broadcast or multicast are required to receive
the packet.
6.2.3.2. Periodic and Aperiodic Transfers
The FTDB shall provide both periodic and aperiodic packet transfers. Periodic transfers
will request a fixed bandwidth allocation. The FTDB shall guarantee the bandwidth
allocation to all periodic transmitting entities.
6.2.3.3. Packet Ordering
The FTDB shall deliver packets to station members, and to members of a multicast or
broadcast group in the same relative order that the packets were transmitted.
6.2.3.4. Station Identification
It shall be possible for any station on the FTDB to determine if an expected station is
active.
6.2.4. Topology and Architecture Requirements
This section describes requirements on the topology and architecture of the FTDB.
6.2.4.1. Growth
The FTDB shall permit the addition or deletion of stations to an existing network. This
addition or deletion shall not require modifications to either hardware or software of any
station which does not communicate with the station in question.
6.2.4.2. Topology
The topology of the FTDB shall support from 2 to 100 stations on a single physical
medium. The topology shall not restrict the physical or logical location of a station.
6.2.4.3. Station Insertion and Removal
The insertion or removal of a station shall not disrupt network traffic on an active
network for longer than 1 second.
6.2.4.4. Bridges for Interconnected Buses
The architecture of the FTDB shall not preclude the use of bridges or gateways between
FTDBs. Bridges and/or gateways do not need to support all traffic between FTDBs. In
particular, the following are NOT required of bridges/gateways:
Broadcasts and multicasts.
Periodic data transfers.
Deterministic latency.
6.2.5. Physical Requirements
This section describes requirements of the physical elements interconnecting stations on
the FTDB.
6.2.5.1. Serial Transmission
The FTDB shall be implemented using a serial bus. All information, including data,
clock, address, and control signals, must be capable of being placed on a single
transmission medium (e.g., twisted pair, coaxial cable, fiber optic cable, etc.).
6.2.5.2. Media Support
The FTDB shall be compatible with any serial transmission media, including, but not
limited to, fiber optic cables, twisted pair wire, or coaxial cable.
6.2.5.3. Electrical Isolation
There shall be no DC coupling between any two stations on the FTDB. The FTDB shall
provide for at least 1000 volts of common-mode voltage rejection between stations.
6.2.5.4. Station Separation
The FTDB shall provide for separation of up to 1000 meters between stations on the
same physical medium.
6.2.6. Fault Tolerance Requirements
This section describes requirements on the FTDB necessary to ensure the high
reliability of the system.
6.2.6.1. Packet Delivery
Delivery of packets from a source to a destination in the FTDB shall be reliable. Packets
shall not be lost due to single network faults, flow control, or collisions. The FTDB shall
not require a retry mechanism to ensure packet delivery through an unreliable medium.
6.2.6.2. Synchronization
The FTDB shall provide a mechanism for stations to synchronize with other stations on
the FTDB.
6.2.6.3. Source Congruency
The FTDB shall provide a mechanism for delivering packets transmitted by a simplex
computing site to a redundant computing site such that the members of the recipient
computing site receive bitwise identical copies of the packet. The FTDB shall correctly
implement transmission between stations of simplex, triplex, and quadruplex redundancy
levels to support Byzantine resilience.
6.2.6.4. Connectivity
An FTDB implementation shall provide sufficient connectivity so that multiple
independent paths are provided between any two stations on the network. These
independent paths shall have no common element. This connectivity may be provided with
either multiple media layers or with a sufficiently interwoven network.
6.2.6.5. Station to Network Interface
The network interface unit (NIU) which connects a station to the FTDB network shall
not permit a single station member to disrupt a path on the network.
6.2.6.6. Redundancy
The FTDB shall provide sufficient redundancy to allow reconfiguration around any
fault, given that the fault is detected.
6.2.6.7. Station Redundancy
The FTDB shall support station redundancy levels of simplex, triplex, and quadruplex.
The FTDB shall deliver data from a simplex to a fault-masking group (triplex or
quadruplex) such that each member of the fault-masking group receives bitwise identical
copies of the data.
6.2.6.8. Error Detection
A receiving station shall be capable of detecting errors in transmission. Upon error
detection, it must be possible for the receiving station to unambiguously select a correct
copy of the packet from a set of multiple copies without requesting retransmission of the
packet. The error detection mechanism must be designed so that the likelihood of an
undetected error is sufficiently small.
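One way to meet this requirement is to attach a check code to every copy and accept the first copy whose code verifies. The sketch below assumes a CRC-32 trailer, which is an illustrative choice and not something the requirement specifies; the function names and the 4-byte big-endian trailer layout are likewise hypothetical.

```python
import zlib


def frame_packet(payload: bytes) -> bytes:
    """Append a CRC-32 trailer (ASSUMED format: 4 bytes, big-endian)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def select_correct_copy(copies):
    """Return the payload of the first copy whose CRC verifies, else None."""
    for copy in copies:
        if len(copy) < 4:
            continue  # too short to hold a trailer
        payload, trailer = copy[:-4], copy[-4:]
        if zlib.crc32(payload).to_bytes(4, "big") == trailer:
            return payload
    return None


good = frame_packet(b"sensor")
bad = bytes([good[0] ^ 0x01]) + good[1:]  # copy with a single-bit error
```

The receiver selects a correct copy locally, so no retransmission request is needed as long as at least one copy arrives intact.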
6.2.6.9. Diagnosability
The FTDB shall provide the capability to monitor the network and to detect any single
fault.
6.2.6.10. Self-Test
The network interface unit (NIU) must provide sufficient independent self-test
capability to ascertain the level of NIU functionality. The built-in self test capability shall
have a coverage of at least 90%.
6.2.6.11. Byzantine Resilience
A single, arbitrarily behaved media, station member, or NIU fault shall not affect the
overall operation of the FTDB. All packets must be delivered to their destinations in the
presence of a single undetected, unreconfigured fault. After successful detection and
reconfiguration of a fault, the reliability of the network must be returned to its original state,
with the exception of the station(s) directly affected by the fault.
6.2.6.12. Fault Isolation and Containment
A fault in a NIU, station member, or interconnecting link shall not cause a fault in any
other NIU, station member, or interconnecting link.
6.2.7. Performance Requirements
The following section describes the FTDB performance requirements.
6.2.7.1. Message Priorities
The FTDB shall provide for the following message priorities:
Priority S - Synchronous data exchange. The latency of a synchronous message must
be guaranteed to be less than an a priori determined value. Synchronous data exchanges are
the highest priority data. Any synchronous data messages enqueued in a transmitting
station shall be transmitted before messages of any other priority level.
The FTDB shall also support four additional levels of message priority for normal data
exchanges, named Priority 1, Priority 2, Priority 3, and Priority 4. The lowest numbered
priority level, Priority 1, is the highest priority of the normal data exchanges. Normal data
exchanges are lower priority than Priority S.
A transmitting station shall transmit all messages of a given priority before messages of
a lower priority are transmitted.
Only normal data exchanges must be supported across bridges or gateways.
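The ordering rule for a transmitting station can be sketched as a priority queue. The class and method names below are hypothetical; the sketch assumes that messages of equal priority are sent in the order they were enqueued, which is consistent with the packet ordering requirement of Section 6.2.3.3 but is not stated here.

```python
import heapq
import itertools

# Priority S outranks the four normal priorities; Priority 1 is the highest normal level.
PRIORITY_RANK = {"S": 0, "1": 1, "2": 2, "3": 3, "4": 4}


class TransmitQueue:
    """Illustrative transmit queue for one station."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # preserves FIFO order within a priority

    def enqueue(self, priority, message):
        rank = PRIORITY_RANK[priority]
        heapq.heappush(self._heap, (rank, next(self._sequence), message))

    def next_message(self):
        """Return the next message to transmit, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = TransmitQueue()
for priority, message in [("3", "a"), ("S", "b"), ("1", "c"), ("S", "d"), ("1", "e")]:
    queue.enqueue(priority, message)
order = [queue.next_message() for _ in range(5)]
```

Both synchronous messages leave first, followed by the Priority 1 messages and finally the Priority 3 message.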
6.2.7.2. Network Bandwidth
The FTDB shall provide a useable data transmission rate of at least 100 Mbits/second.
6.2.7.3. Initialization Time
The FTDB shall be available for packet transmission between the first two active
stations on the network within 1 second of activation of the second station.
6.3. FTDB Architecture Study
This section discusses some of the architectural options for the FTDB. These options
include topology, media technology, media access control protocols, and reliability
enhancements. This section is not meant to be a comprehensive study of all possible
options for the FTDB. Instead, only options which were deemed to have merit for
applications in critical real-time environments were considered.
6.3.1. Broadcast Buses
A common type of physical network topology is the broadcast bus. An example of the
broadcast bus topology is shown in Figure 6-1. Many examples of the broadcast bus exist,
including IEEE 802.3, IEEE 802.4, and MIL-STD-1553. A common characteristic of
these systems is that every node on the bus receives all data transmitted over the network.
Each node selectively records data based on an address. If the address presented does not
match the address the node is programmed to look for, the node ignores the data.
Figure 6-1. Broadcast Bus Topology (legend: station, bus, terminator)
Only one node at a time is allowed to transmit data on the bus. Media access control is
necessary to prevent the presence of two active transmitters. Many schemes exist for media
access control on buses, including carrier sense, token passing, and centralized arbiters.
The IEEE 802.3 LAN [IEEE8023] is an example of a network that uses carrier sense
for media access control. The exact protocol used by 802.3 (and Ethernet, on which 802.3
is based) is carrier-sense multiple access with collision detection (CSMA/CD). The process
of transmitting data on the bus begins with the station checking the bus to make sure the
media is clear. If so, the station begins transmitting. During transmission, the station
monitors the bus, and if a collision with another transmitting station is detected, the station
retries the broadcast. The time to retry is determined by a pseudo-random number to
minimize the possibility of repeated collisions between stations. A variation of the
CSMA/CD protocol is carrier-sense multiple access with collision avoidance (CSMA/CA).
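In IEEE 802.3 the pseudo-random retry delay follows a truncated binary exponential backoff: after the n-th collision the station waits a random number of slot times between 0 and 2^k - 1, where k is the smaller of n and 10, and the frame is abandoned after 16 attempts. A minimal sketch follows; the function and constant names are hypothetical.

```python
import random

BACKOFF_LIMIT = 10   # the exponent stops growing after 10 collisions
ATTEMPT_LIMIT = 16   # the frame is abandoned after 16 failed attempts


def backoff_slots(collisions, rng=random):
    """Slot times to wait after the given number of successive collisions."""
    if collisions < 1:
        raise ValueError("backoff applies only after a collision")
    if collisions >= ATTEMPT_LIMIT:
        raise RuntimeError("excessive collisions: transmission abandoned")
    k = min(collisions, BACKOFF_LIMIT)
    return rng.randrange(2 ** k)  # uniform over 0 .. 2**k - 1


rng = random.Random(1)
first_retry = [backoff_slots(1, rng) for _ in range(50)]
later_retry = [backoff_slots(12, rng) for _ in range(50)]
```

Because the delay is random and its range grows with load, the time at which a given station obtains the bus cannot be bounded in advance, which is the non-determinism discussed in the next paragraph.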
While the carrier sense media access protocols are very common, they have some
limitations. First is non-determinism. Slight differences between network implementations
may favor one transmitter over another in a particular arbitration instance. Also, the
bandwidth available to a transmitter may vary widely depending on other network traffic.
In a system with varying bursts of data, it may be difficult for a transmitter to obtain the
network regularly. Since regular, periodic data transfers are characteristic of real-time
systems, carrier sense protocols may not be appropriate for a real-time system. Finally, a
babbling station can monopolize the bus, preventing other stations from communicating.
A second common media access control is the token passing protocol. Buses which
implement a token passing protocol are usually referred to as linear token passing buses
(LTPB), an example of which is IEEE 802.4 [IEEE8024]. Each station on an LTPB is
arranged in a virtual ring. A token is passed between stations on this virtual ring. A station
is only allowed to transmit when it possesses the token.
The token passing protocol solves many of the deficiencies of carrier sense protocols
related to real-time systems. Most token passing protocols use a token rotation timer to
ensure that the token is delivered to each station within a fixed period of time. This token
rotation time can be tailored to the periodic data transfer characteristics of the real-time
system, thus ensuring that the deterministic data transfer needs of the system are met.
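The bound on token delivery can be illustrated with a short calculation. The sketch below assumes a simplified model in which each station may hold the token for at most a fixed time and each token pass takes a fixed time; the function names and the numbers in the usage lines are hypothetical, and real token passing standards define their timers in more detail.

```python
def max_token_rotation_us(hold_limits_us, pass_time_us):
    """Worst-case time for the token to visit every station once (microseconds)."""
    return sum(hold_limits_us) + len(hold_limits_us) * pass_time_us


def meets_period(hold_limits_us, pass_time_us, period_us):
    """True if every station is guaranteed the token once per period."""
    return max_token_rotation_us(hold_limits_us, pass_time_us) <= period_us


# Three stations with hold limits of 2.0, 1.0, and 1.5 ms and a 50 us token pass.
rotation = max_token_rotation_us([2000, 1000, 1500], 50)
```

In this model a 10 ms. periodic transfer is supported whenever the worst-case rotation time does not exceed 10 000 microseconds.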
The MIL-STD-1553 [MIL-STD-1553] data bus uses a centralized bus arbiter to control
access to the bus. A 1553 network has a single controller device. Initially, only the
controller is allowed to transmit on the bus. The controller selects other nodes on the
network to allow them to drive the bus for fixed periods of time. When the remote node is
finished transmitting, bus ownership is returned to the bus controller, which then selects
another node to drive the bus.
Centralized media access control has the advantage of simplicity over most other media
access protocols. However, it also has many limitations. Since the media access control is
centralized in the bus controller, the bus controller becomes a single point of failure. Also,
interrupt delivery from a remote node to the bus controller is difficult. Thus, each node on
the network must be polled by the controller to determine if it has data to transmit. This
polling could be significant for high iteration rate control systems, reducing the available
bandwidth for actual data transfers.
Broadcast buses are typically built using electrical components, such as twisted pair
wire or coaxial cable, with transformer coupling for isolation. Optical devices do not lend
themselves to the construction of broadcast buses. Fiber optics are inherently unidirectional
devices, whereas electrical wires are easily made bidirectional. Typically, multiple fiber
optic splitters/mixers are required to build a broadcast bus with fiber optics. The optical
losses associated with the splitters/mixers become significant for a very small number of
network connections. One of the advantages of the broadcast bus topology is the simplicity
of connecting a station to the network. This advantage is lost when fiber optics are used in
place of electrical components.
The fault tolerance of a simplex broadcast bus is not good. A single station or link fault
can disrupt the entire network. There are few remedies for these situations except to
physically remove the faulty station or fix the broken link.
Some broadcast buses have been designed for fault-tolerance. These designs
incorporate redundancy in the form of spare links, redundant media layers, or both. A bus
with spare links built into the network is reconfigured by switching in a spare link to
replace a faulty link, regrowing the original bus topology. Redundant media layers are
usually used in an active/standby mode. One layer is used until a fault is detected, at which
point all stations switch over to the secondary layer. The MIL-STD-1553 bus is an example
of a bus with active/standby layers. Alternatively, three or more redundant layers can be
used to transmit redundant copies of data, which are voted upon reception. The AIPS
intercomputer network [CSDL9214] uses both redundant, voted media layers and spare
links to achieve high reliability.
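Voting of redundant copies on reception can be illustrated with a bitwise majority function over three copies. This is a generic sketch with a hypothetical function name; it does not describe the voter actually used in the AIPS intercomputer network.

```python
def vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Bitwise 2-of-3 majority vote over three equal-length copies."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("copies must have the same length")
    # A result bit is 1 when at least two of the three input bits are 1.
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


voted = vote(b"\x0f\xf0", b"\x0f\xf0", b"\xff\x00")
```

A single corrupted copy is outvoted by the two correct copies, so the fault is masked without switching layers.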
6.3.2. Token Rings
Token ring networks are another common network topology. Token rings are
constructed using electrical components or fiber optics. The inter-station links on a token
ring network are unidirectional, making fiber optics more viable for a token ring system
than for a broadcast bus. A diagram of a simplex token ring network is shown in Figure 6-
2.
Figure 6-2. Token Ring Topology (legend: station, unidirectional link)
Token ring networks use a token passing protocol for media access control. The
deterministic nature of the token passing protocol makes it suitable for use in real-time
systems.
A single ring network is highly susceptible to interruption from faults. Any single
station or link failure will disrupt the network. Even passive station faults, such as loss of
power, are not tolerated since such failures disrupt the passing of the token. Most token
rings have mechanisms for regenerating a lost token. However, full network functionality
can be regained only by fixing the broken station or link.
Token passing protocols assume the ring is contiguous, so ring topologies must be
reconfigured around faults to enable the tokens and data to make a complete
circumnavigation of the ring. Ring regeneration options include chordal rings, dual
counter-rotating rings, and station bypass.
The chordal ring approach requires redundant links which bypass nodes and links on
the primary ring. A fully braided ring, such as that shown in Figure 6-3, can be
implemented for arbitrary reconfigurability, or bypasses can be placed only across nodes
which are expected to fail most often or which are considered non-essential. When a failure
is detected, a redundant link is switched in to replace the broken node or link. The
redundant link re-forms the ring, and network traffic can proceed around the ring
uninhibited.
Figure 6-3. Fully Braided Chordal Ring (legend: station, primary link, secondary link)
A redundant counter-rotating ring, shown in Figure 6-4, is built with bidirectional links
between each station and the two adjacent stations. The links traversing the ring in one
direction are used as the primary ring, and the links in the other direction make up the
secondary ring. A fault on the primary ring causes all stations to switch over to the
secondary ring. The network can reconfigure around a fault on the secondary ring by
building a loopback ring with pieces of both the primary and the secondary ring. However,
depending on the location of the fault, certain stations may be isolated from other stations
on a loopback ring.
Figure 6-4. Dual Counter-rotating Ring (legend: station, primary link, secondary link)
Station bypass is a mechanism for reconfiguring around passive station faults that
prevent forwarding of tokens or data. Station bypass allows a station to voluntarily bypass
its connection on the ring. The station bypass switch is usually designed so that if a station
is powered down, the network inputs to the station are shunted to the outputs. Station
bypass only works for passive faults since a maliciously failed station will refuse to shunt
the station bypass switch.
6.3.3. Circuit Switched Network
A circuit switched network provides guaranteed bandwidth between any two stations.
Bandwidth is allocated by establishing a connection between two interacting stations. Part
of the connection establishment primitive is a bandwidth request. A fixed percentage of the
bandwidth of an internode network link is allocated, either by time or frequency domain
multiplexing, when a connection is established. The network assumes that the transmitting
station will use all of its available bandwidth. If the connection is under-utilized, the unused
bandwidth is wasted.
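The allocation behavior can be illustrated with a toy model of a single internode link. The class and method names are hypothetical, and bandwidth is expressed as integer Mbit/s for simplicity; an actual circuit switched network would reserve time or frequency slots along every link of the path.

```python
class Link:
    """Illustrative model of bandwidth reservation on one internode link."""

    def __init__(self, capacity_mbps):
        self.capacity = capacity_mbps
        self.connections = {}  # connection id -> reserved Mbit/s

    def connect(self, conn_id, requested_mbps):
        """Reserve bandwidth; refuse the connection if the link cannot carry it."""
        if sum(self.connections.values()) + requested_mbps > self.capacity:
            return False
        self.connections[conn_id] = requested_mbps
        return True

    def disconnect(self, conn_id):
        self.connections.pop(conn_id, None)

    def wasted(self, actual_use_mbps):
        """Reserved bandwidth that the connections are not actually using."""
        return sum(reserved - min(actual_use_mbps.get(conn_id, 0), reserved)
                   for conn_id, reserved in self.connections.items())


link = Link(100)
accepted = [link.connect("a", 60), link.connect("b", 50), link.connect("c", 40)]
unused = link.wasted({"a": 45, "c": 40})
```

Connection "b" is refused because the link is already committed to "a", even though "a" is using only part of its reservation; that unused part cannot be given to another station.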
A possible topology for a circuit switched network is shown in Figure 6-5. Virtually
any interconnection topology is possible. The number and connectivity of links in the
network varies depending on the expected traffic between each possible pair of nodes.
Figure 6-5. Example Circuit Switched Topology
The latency of circuit switched networks is very good. Typically, once a connection is
established, the latency imposed by intermediate nodes in the path is minimal. As an
example, a cross-country telephone connection typically has a latency of less than 30 ms
[Tan88].
All communications on a circuit switched network must occur within the allotted
bandwidth slot. Thus, a connection must be established before a station can communicate
with any other station. Circuit-switched networks work best with connection-oriented
communication protocols; they do not work well with datagram protocols. A station could
establish connections with all other stations with which it might want to communicate.
However, this action results in a large amount of wasted bandwidth unless communications
are very regular. Alternatively, a station could establish a connection for each desired
communication, transmit the message, and destroy the connection. This procedure would
minimize wasted bandwidth; however, message communication would be delayed by the
overhead of repeated connection/disconnection operations. Also, broadcast and multicast
communications are not possible on a circuit switched network.
Circuit switched networks are typically used in situations where a steady stream of data
must be delivered between distinct points on the network. An example is a signal
distribution network. Digital or analog signals typically have constant bandwidth
requirements, therefore a connection can be established for transmission of a signal without
severe bandwidth wasting. A system with highly irregular bandwidth requirements, such
as a typical data processing system, is unsuitable for circuit switched networks. While a
real-time system has constant components to the network utilization model, these
components are a small part of the total data communication.
The circuit switched network is easily built with redundant links for enhanced
reliability. Also, the network can be designed so that a node will only affect nodes and
links to which it is directly connected. A node cannot affect another node which is more
than one network hop away. Thus, mutually exclusive paths between nodes can be
constructed with a circuit switched network without the necessity of redundant media
layers.
6.3.4. Packet Switched Network
An alternative to static bandwidth allocation in a circuit switched network is to
dynamically allocate bandwidth based on instantaneous network usage. A network using
this method of bandwidth allocation is called a packet switched network. Each packet
arriving at a node is either delivered to a station within the node or forwarded to an output
port on the node. The choice of output port is made based on the eventual destination of the
message and the current utilization of the output ports. The network utilization may vary
widely, so each node must be able to buffer messages until the network utilization declines
enough to allow the message to be sent. Because of this property, packet switched
networks are sometimes called store-and-forward networks.
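The forwarding decision described above can be sketched in a few lines. This is a minimal illustration, not a design from this report: the `Node` class, its routing table layout, and the "shortest queue" measure of port utilization are all assumptions made for the example.

```python
from collections import deque


class Node:
    """Hypothetical store-and-forward node with one FIFO buffer per output port."""

    def __init__(self, name, routes):
        # routes maps a destination name to the candidate output ports
        # that lead toward it (assumed to be known in advance).
        self.name = name
        self.routes = routes
        self.queues = {port: deque() for ports in routes.values() for port in ports}
        self.delivered = []

    def receive(self, packet):
        """Deliver locally or buffer on the least-loaded candidate port."""
        destination, payload = packet
        if destination == self.name:
            self.delivered.append(payload)
            return None
        # Utilization is approximated by queue length (an assumption).
        port = min(self.routes[destination], key=lambda p: len(self.queues[p]))
        self.queues[port].append(packet)
        return port

    def transmit(self, port):
        """Forward the oldest buffered packet on a port, if any."""
        return self.queues[port].popleft() if self.queues[port] else None
```

A node with two candidate ports toward the same destination alternates between them as their buffers fill, which is the dynamic allocation of bandwidth that distinguishes this scheme from a circuit switched network.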
Packet switched networks are typically used for wide-area networks (WAN) across
countries or even the entire globe. Wide-area packet switched networks usually provide
interconnections of local-area networks of other topologies such as buses or rings. The
Internet is one well-known example of this type of packet switched network.
The packet switched network has some limitations in local area network applications.
Packet switched networks usually require complex distributed routing algorithms, and each
node must apply the algorithm to each incoming packet, increasing packet latency.
Broadcasts and multicasts over packet switched networks are difficult, since there is no
convenient way to determine whether or not a node has received a broadcast or multicast
message. Packet ordering is also a problem if the network uses datagram routing. Because
two packets may follow different routes, the packets may arrive at the destination in the
reverse of the order in which they were transmitted. Virtual circuit routing solves the packet ordering
problem, since all packets are forced to follow the same route. However, virtual circuit
routing requires the establishment of a connection before communication can take place.
Finally, because of the complexity of a packet switched network, the validation of the
network to guarantee deadlock-free operation is difficult.
Packet switched networks can be built with the same topological freedom afforded
circuit switched networks. Thus, highly reliable packet switched networks can be built with
only a single media layer, provided that there are multiple mutually exclusive paths between
any two nodes on the network.
6.3.5. Fiber Optic Networks
The use of fiber optics for computer networks is becoming increasingly widespread.
Fiber optics provide many advantages over traditional electrical interconnect such as coaxial
cable or twisted pair wire.
The advantages of fiber optics over copper media are numerous. Fiber is not
susceptible to electromagnetic interference (EMI). Also, because the fiber does not radiate
any EMI of its own, it is more resistant to eavesdropping. Fiber provides excellent
electrical isolation between systems, an important property for fault-tolerant systems. The
bandwidth capacity of fiber is considerably higher than coaxial cable, particularly over long
distances.
Most of the disadvantages of fiber optics are related to cost. A point to point
communication link using fiber optics can cost 10 to 50 times as much as a comparable
system using coaxial cable. Building a broadcast bus using fiber optics is difficult because
many fiber optic splitters are required. A simple 4 x 4 fiber optic splitter costs about $500,
whereas a splitter using coaxial cable, a few T-splitters, and BNC connectors costs about
$10. An electrical connection can be shunted using a simple relay from Radio Shack, but an
optical bypass switch is very expensive.
6.3.6. Authentication Protocols
The solution to the Byzantine Generals Problem [LSP82] prescribes a method, called
source congruency, for guaranteeing that data delivered to each member of a redundant
processing site, or fault-masking group (FMG), is always in agreement, even in the
presence of a single random fault of arbitrary behavior. Additionally, the source
congruency guarantees that, if the original source of the data is non-faulty, the data
delivered to each member of the FMG is valid. Traditional implementations of the source
congruency algorithm use at least three unsigned copies of a message exchanged in a two
round exchange pattern to provide sufficient redundancy in the event of a single random
fault.
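The two-round exchange can be sketched as follows. This is a minimal illustration assuming one source and a three-member FMG; the function names, the `DEFAULT` value adopted when no majority exists, and the way a faulty member is modeled are assumptions for this sketch, not details taken from this report.

```python
from collections import Counter

DEFAULT = None  # assumed fallback when no strict majority exists


def majority(values):
    """Return the strict-majority value, or DEFAULT if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else DEFAULT


def two_round_exchange(sent, faulty_member=None, lie=lambda dest: "garbage"):
    """Simulate source congruency for one source and len(sent) FMG members.

    sent[i] is the copy the source delivered to member i in round one
    (a faulty source may deliver different copies). In round two every
    member relays its copy to the others; a faulty member relays
    lie(dest) instead. Returns the value each non-faulty member adopts.
    """
    adopted = {}
    for me in range(len(sent)):
        if me == faulty_member:
            continue  # what a faulty member adopts is irrelevant
        copies = [sent[me]]
        for other in range(len(sent)):
            if other == me:
                continue
            copies.append(lie(me) if other == faulty_member else sent[other])
        adopted[me] = majority(copies)
    return adopted
```

With a faulty source that delivers `["a", "b", "a"]`, every member adopts `"a"`, so the members agree. With a non-faulty source and one lying member, the remaining members still adopt the value the source sent, so the delivered data is valid.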
An alternative to the source congruency algorithm using triplicated unsigned messages
is to attach a signature to each message. For a source congruency algorithm using signed
messages, the following conditions must be satisfied:
• A faulty node cannot interfere with the communication between two other nodes,
unless the communication path involves the faulty node.
• The receiver of a message knows who sent it.
• The absence of a message can be detected.
• Any alteration of the message can be detected.
• The signature of a functioning station cannot be forged.
• Any node can verify the authenticity of a signature.
When a node receives a message, it checks the authenticity of the signature. If the
signature is valid, the node assumes that the contents of the message are the contents the
sender intended. Thus, interference by intermediate nodes is ruled out. If an intermediate
node corrupts the content of the message, the receiving node would detect the corruption
and declare the signature invalid.
The message signature can either be message specific or non-message specific.
Traditional signed paper documents, such as checks, wills, or contracts, use a non-
message specific signature. Non-message specific signatures assume that the exact pattern
of the signature is difficult to duplicate. Any attempt to duplicate the signature by tracing,
photocopying, etc. is easily detected. However, in a computer system, any bit stream is
easily copied. A system need only contain enough memory to store the signature stream to
be able to reproduce the signature pattern. The nature of computer systems prevents this
signature copy from being distinguished from the original.
The solution to the signature copying problem is to use message specific signatures. A
signature is calculated as a function of the message. Thus, each message has a different
signature attached to it. To verify the signature, the receiving node applies another function
to the message to determine if the signature is valid.
The functions used to generate and authenticate the signature must be chosen to
minimize the likelihood of a successful forgery attempt. Ideally, no node except the sending
node should know how the signature was generated. The receiving node should be able to
test the authenticity of the signature without requiring knowledge of the signature
generation.
Functions with the properties described above can be found in the field of public key
cryptography. Public key cryptography uses two functions, E() and D(). E() is a public
encryption function, of which everyone has knowledge. The decryption function, D(), is
private and known only to the receiver. The function D() cannot be easily deduced from
knowledge of E(). Anyone can encrypt a message using E(), but only the intended receiver
can decrypt an encrypted message. Thus, an intercepted message cannot be decoded by an
unauthorized station. The encryption/decryption function pairs have the property:
D(E(M)) = M
For authentication protocols, a slightly different operation is needed. The source station
keeps the private key to generate signatures, and the destination stations use the public key
to test the authenticity of the signature. The source must keep the private key so that no
other station can forge a signature, and the destinations must have the public key so that
any one of them can verify the authenticity of a signature. Note that the message itself is not
necessarily encrypted for authentication protocols; if such encryption is desired, another
private/public key pair can be used as described above. However, the keys for encryption
and the keys for authentication are distinct.
Since the private key is applied before the public key, only isomorphic function pairs,
with the following property, are suitable for use in authentication protocols:
E(D(M)) = M
The signature for a message is generated by applying D(), the private key function, to
the original message. Then the message and the signature are transmitted to the receiving
node. The receiver applies E(), the public key function, to the signature, and if the result
matches the original message, the signature is valid.
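The property E(D(M)) = M can be illustrated with textbook RSA, one well-known function pair in which the private function may be applied first. RSA is used here purely as an illustration and is not the scheme selected in this report; the tiny key values below are a standard classroom example and offer no security.

```python
# Textbook RSA parameters (tiny and insecure, for illustration only).
N = 61 * 53      # public modulus, 3233
E_EXP = 17       # public exponent
D_EXP = 2753     # private exponent, chosen so that 17 * 2753 = 1 (mod 780)


def D(m):
    """Private key function, known only to the signing station."""
    return pow(m, D_EXP, N)


def E(s):
    """Public key function, known to every receiving station."""
    return pow(s, E_EXP, N)


def sign(message):
    """Signature is the private function applied to the message itself."""
    return D(message)


def verify(message, signature):
    """Valid if the public function recovers the original message."""
    return E(signature) == message
```

Here a message is any integer smaller than `N`. A receiver that holds only `E` can check a signature but cannot produce one, which is the asymmetry the authentication protocol relies on.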
In practice, a signature calculated from the message using the procedure outlined above
is extremely unwieldy. The signature, D(M), is at least as long as the original message, M,
since M can be regenerated from it. Thus, half of the communication bandwidth is
consumed by the signature transmission. However, the signature does not have to be the
same length as the message. For the AFTA project, a fixed length signature of 64 bits is
sufficient to meet the reliability requirements [Gal90]. A system using fixed length
signatures requires a common function, C(), to be applied to the message. The sender
generates the signature by applying the following:
S = D(C(M))
The receiver validates the signature by applying the following test:
C(M) = E(S)
A possible implementation of the common function is the cyclic redundancy check
(CRC) function. The CRC is widely used as an error detecting code for data
communication. The detection of random, arbitrarily behaved errors with CRCs is very
good, thus an authentication protocol that uses CRCs for error detection is reasonably
secure against forgery. In addition, the CRC function can be tailored to detect some
expected error patterns, such as burst errors or double-bit errors.
One possibility for the public/private key functions is a system using modular inverses.
Modular inverses have the following property:
M ⊗ N = 1
where ⊗ denotes modulo multiplication and M and N are both 64 bit integers. Modular inverses are similar to multiplicative
inverses in the set of real numbers, except that modulo multiplication is used instead of
arithmetic multiplication. The modular inverse scheme is not cryptographically secure,
since the calculation of modular inverses is relatively straightforward for a human
cryptographic expert. However, the reasonable assumption is made that a station will not
fail in a manner that turns it into a cryptographic expert with the ability to calculate modular
inverses.
The signature generation and authentication procedures would be as follows: the sender
multiplies the 64 bit CRC by the 64 bit private key integer using 64 bit modular (modulo 2^64) multiplication,
yielding another 64 bit number which becomes the signature. The receiver regenerates the
CRC by multiplying the 64 bit signature by the public key, the 64 bit modular inverse of
the private key. This CRC is compared by the receiver with a locally generated CRC on the
received message to test the authenticity of the signature.
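The procedure S = D(C(M)) with the test C(M) = E(S) can be sketched as follows. Several details are assumptions made for this example rather than choices stated in this report: the CRC polynomial (the ECMA-182 CRC-64 polynomial), the zero initial CRC value, the specific private key, and the reading of the multiplication as arithmetic modulo 2^64. Note that an inverse modulo 2^64 exists only when the private key is odd.

```python
MASK64 = (1 << 64) - 1
CRC64_POLY = 0x42F0E1EBA9EA3693  # ECMA-182 polynomial (assumed choice)


def crc64(data: bytes) -> int:
    """Bitwise CRC-64, most significant bit first, zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte << 56
        for _ in range(8):
            if crc & (1 << 63):
                crc = ((crc << 1) ^ CRC64_POLY) & MASK64
            else:
                crc = (crc << 1) & MASK64
    return crc


# Hypothetical key pair: any odd 64 bit integer and its inverse modulo 2^64.
PRIVATE_KEY = 0x9E3779B97F4A7C15
PUBLIC_KEY = pow(PRIVATE_KEY, -1, 1 << 64)  # requires Python 3.8 or later


def sign(message: bytes) -> int:
    """S = D(C(M)): multiply the CRC by the private key modulo 2^64."""
    return (crc64(message) * PRIVATE_KEY) & MASK64


def verify(message: bytes, signature: int) -> bool:
    """Check C(M) = E(S): multiply the signature by the public key."""
    return ((signature * PUBLIC_KEY) & MASK64) == crc64(message)
```

The signature stays 64 bits long regardless of message length, and a single flipped bit in the message changes the locally computed CRC, so the comparison in `verify` fails.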
Message specific signatures, such as those described above, prevent a node from
saving another node's signature for use in forging a new message. However, an
intermediate node can save and retransmit the message itself, complete with valid signature.
The receiving node must be able to distinguish between bogus copies of a message from a
broken intermediate node and legitimate repetitions of the message from the original source.
A solution to this problem is to force the message to contain a known varying component.
Even a repeated message from the same node will never be exactly identical to previous
copies. An example of a varying component is a sequence number. A receiving node
knows what sequence number should be contained in the next message from a source
node. If a message arrives from a node with an incorrect sequence number, the receiving
node rejects the message as erroneous.
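The sequence number check can be sketched as below. The class name, the starting value of zero, the increment of one, and the absence of wraparound handling are assumptions for this illustration; in a real system the sequence number would also need to be covered by the message signature.

```python
class SequenceChecker:
    """Tracks the next expected sequence number for each source node."""

    def __init__(self):
        self.expected = {}  # source name -> next expected sequence number

    def accept(self, source, sequence_number):
        """Accept a message only if it carries the expected number."""
        expected = self.expected.get(source, 0)
        if sequence_number != expected:
            return False  # replayed, skipped, or otherwise erroneous
        self.expected[source] = expected + 1
        return True
```

A replayed copy of an already accepted message carries a stale number and is rejected, while a legitimate repetition from the source carries the next number and is accepted.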
6.4. Existing and Proposed Standards
This section describes several networking standards, both existing and proposed, that
are of interest. Many of these standards are designed for application in real-time military
systems. Since the FTDB is targeted for military systems, these standards may have a great
influence on the acceptance of the FTDB.
Each standard defines certain aspects of a local-area network for real-time applications.
The scope of each standard varies. Some standards only define the physical and data link
layers. Other standards define a complete end-to-end communication system with a fully
specified protocol stack.
None of the standards presented below, with the exception of the AIPS IC network,
will survive Byzantine faults without modification. However, most of these standards were
developed with real-time system applications in mind. Thus, some of the techniques
specified in these standards can be applied to the fault-tolerant data bus design.
6.4.1. AIPS Intercomputer Network
The Advanced Information Processing System (AIPS) developed at the Charles Stark
Draper Laboratory [CSDL9214] includes the definition of an intercomputer, or IC,
network. The IC network is designed to interconnect simplex, duplex, triplex, and
quadruplex AIPS processing sites. The fault-masking processing sites and the IC network
are designed to be Byzantine resilient.
A diagram of the AIPS IC network is shown in Figure 6-6. The network consists of
three identical layers. Each layer is totally isolated from the other layers. Each member of a
processing site only transmits on one layer, although it receives from all three layers. There
is no cross-strapping between the layers except within a redundant processing site and on
delivery of data from the network to a processing site (the latter is not shown in Figure 6-6
for clarity). Each layer forms a broadcast bus with media access controlled by a
deterministic bus arbitration mechanism.
[Figure legend: primary link; secondary link; network node; processing site member]
Figure 6-6. AIPS IC Network
Although the IC network acts as a broadcast bus, the network is actually implemented
as a set of point to point links. Each layer contains more than enough links to form a bus to
interconnect all nodes. The network manager selects a subset of the interconnecting links to
form a bus. The remaining links in the layer are unused until a fault is detected, at which
time the network manager switches in a redundant link to replace the failed link. The
network configuration established by the network manager is not changed unless a fault is
detected.
The AIPS IC network uses a media access scheme called the Laning poll [CSDL9214]
to control access to the bus. The Laning poll guarantees that all members of a station
contending for the network deterministically obtain or relinquish the bus. Each station is
assigned a unique priority number which is used during bus arbitration to request access to
the bus. The requesting station with the highest priority number obtains the bus.
6.4.2. SAVA High-Speed Data Bus
The Standard Army Vetronics Architecture (SAVA) defines a high-speed data bus
(HSDB, not to be confused with JIAWG HSDB) for interconnection of computing sites
within the SAVA architecture. The SAVA architecture is designed for implementation
inside ground vehicles, particularly battlefield vehicles such as tanks.
The current draft standard for the SAVA HSDB is [MIL-STD-344]. At this time,
[MIL-STD-344] is still under development. Some of the characteristics discussed in this section
were inferred from data in the draft standard and may be incorrect. The SAVA HSDB draft
specification defines only the physical and data link layers of the ISO/OSI protocol model.
No mention is made in [MIL-STD-344] of any layers above the data link layer.
The SAVA HSDB allows interconnection of up to 32 nodes on a single physical LAN.
The network uses transformer coupling to a twinaxial cable bused throughout the vehicle.
A 12 MHz signaling rate using Manchester encoding is used to transmit data on the bus,
thus an effective data exchange rate of 6 Mbits/sec is realized.
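The halving of the rate follows from the structure of Manchester encoding, which spends two signal elements on every data bit. The sketch below assumes one common polarity convention (0 as high-low, 1 as low-high); the convention actually used by the SAVA HSDB is not stated here, and the function names are hypothetical.

```python
def manchester_encode(bits):
    """Encode each bit as two half-bit signal levels with a mid-bit transition."""
    symbols = []
    for bit in bits:
        # Assumed convention: 1 -> low then high, 0 -> high then low.
        symbols.extend((0, 1) if bit else (1, 0))
    return symbols


def manchester_decode(symbols):
    """Recover the data bits, rejecting pairs without a mid-bit transition."""
    if len(symbols) % 2:
        raise ValueError("Manchester stream must contain an even number of symbols")
    bits = []
    for first, second in zip(symbols[0::2], symbols[1::2]):
        if first == second:
            raise ValueError("missing mid-bit transition")
        bits.append(1 if (first, second) == (0, 1) else 0)
    return bits


def data_rate_mbps(signaling_rate_mhz):
    """Two signal elements per bit, so the data rate is half the signaling rate."""
    return signaling_rate_mhz / 2
```

Every encoded stream is exactly twice as long as the data it carries, so `data_rate_mbps(12)` gives the 6 Mbits/sec figure quoted above.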
Access to the media is controlled by a token passing protocol. Tokens are not really
captured for any length of time. Instead, a station can transmit on the bus only after a token
is received, and the token is transmitted immediately following the message. Thus, each
station is only allowed to transmit one packet per token reception.
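The one-packet-per-token rule can be illustrated with a small simulation. This is only a sketch: the function name, the fixed token order, and the use of per-station queues are assumptions, and timing, token loss, and fault handling are not modeled.

```python
from collections import deque


def token_rotation(queues, rotations):
    """Circulate the token and log (station, packet) for each transmission.

    queues holds one deque of pending packets per station, in token order.
    A station holding the token sends at most one packet, then passes the
    token on immediately.
    """
    log = []
    for _ in range(rotations):
        for station, queue in enumerate(queues):
            if queue:
                log.append((station, queue.popleft()))
            # The token moves to the next station whether or not a packet was sent.
    return log
```

A station with a backlog cannot monopolize the bus under this rule: it sends one packet per rotation, and stations with nothing to send simply forward the token.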
Network fault-tolerance is specified by a dual-layer network. Each station is connected
to both layers, thus the SAVA HSDB will not tolerate Byzantine faults without additional
isolation between media layers. Stations are equipped with media bypass so that passive
station faults can be tolerated.
6.4.3. JIAWG High-Speed Data Bus
The Joint Integrated Avionics Working Group (JIAWG) defines a high-speed data bus
(HSDB) for interconnecting modules within the advanced avionics architecture (A3). The
A3 architecture is targeted for application in the advanced tactical fighter (ATF), the
advanced tactical aircraft (ATA) (cancelled), and the light helicopter (LHX) [J8701].
The current draft standard for the JIAWG HSDB is [J88N2]. At this time, [J88N2] is
still under development. Some of the characteristics discussed in this section were inferred
from data in the draft standard and may be incorrect. The JIAWG HSDB draft specification
defines only the physical and data link layers of the ISO/OSI protocol model. No mention
is made in [J88N2] of any layers above the data link layer.
The JIAWG HSDB uses an optical bus topology with token passing media access
control. A virtual ring network is superimposed on the physical bus topology. The network
uses Manchester encoding, so a maximum data transfer rate of 50Mbits/sec is obtained
using a 100MHz signaling rate.
The JIAWG HSDB provides for network fault-tolerance by specifying dual redundant
buses. While the HSDB as defined in [J88N2] does not describe the network topology, it
is highly likely that the dual buses are not sufficiently isolated for Byzantine resilience. Past
experience indicates that unless a design specifically addresses the issue of Byzantine
resilience (which JIAWG does not), the design cannot survive Byzantine faults without
additional redundancy.
The JIAWG specification [J88N2] does not describe how to construct a broadcast bus
using fiber optics.
6.4.4. Fiber Distributed Data Interface (FDDI)
The fiber distributed data interface (FDDI) was developed by the American National
Standards Institute (ANSI) to satisfy increasing demands for high bandwidth local-area
networks. The FDDI standard is gaining momentum as the next generation local-area
network topology, complementing the current popular choice, Ethernet. FDDI is also
recognized by the International Standards Organization (ISO) as a protocol for open system
interconnect (OSI). FDDI is currently defined by three American National Standards,
[ANSI139], [ANSI148], [ANSI166], and one draft standard, [X3T95]. Most FDDI
implementations also use the logical link control specified by [IEEE8022].
FDDI defines two counter-rotating rings, a primary ring and a secondary ring. The
primary ring is used unless a fault is detected, in which case the secondary ring may carry
all or part of the network traffic. The two rings both connect to a single network interface.
The dual ring design of FDDI allows reconfiguration around detected faults. However,
because the two rings share a network interface, additional redundancy is required for
Byzantine resilience.
The media access control in FDDI is managed by a token passing protocol. The token
passing system guarantees a bounded latency for interstation communications. The FDDI
token passing system defines synchronous and asynchronous bandwidth allocation.
Synchronous bandwidth allocation is guaranteed to each station. Asynchronous bandwidth
is taken from whatever is left over after all synchronous messages have been transmitted.
The inclusion of synchronous bandwidth makes FDDI ideal for real-time systems where
bounded, deterministic network response is necessary.
The specification of FDDI is divided into four major sections, corresponding to each of
the four accepted and draft standards. These sections are the physical layer protocol, the
physical medium-dependent layer, media access control, and station management.
The physical layer protocol (PHY) and physical medium-dependent layer (PMD) make
up the ISO/OSI physical layer. The physical medium-dependent layer defines the hardware
of the FDDI network, such as light wavelength, fiber diameter, cable plant dimensions, and
optical transceiver characteristics. Physical characteristics not directly affecting
interoperability, such as fiber sheathing, are not defined by PMD. The physical layer
protocol defines the medium independent characteristics of the FDDI network. The scope
of the PHY standard includes coding, symbol set, signaling protocol, and clock
synchronization.
The media access control (MAC) and station management (SMT) reside in the data link
layer of the ISO/OSI model. The media access control specification defines the token
passing protocol with synchronous and asynchronous bandwidth, station physical, logical,
and broadcast addressing, packet formats, and network initialization. The station
management includes provisions for network configuration management, fault isolation and
recovery, ring scheduling procedures, and station initialization.
An FDDI implementation also requires a logical link control (LLC) protocol. The LLC
is responsible for delivering packets, or Protocol Data Units (PDU), to the appropriate
higher level protocol stack. The current definition of FDDI in the four ANSI standards does
not specify a logical link control. However, by convention, most FDDI implementations
use the IEEE LLC [IEEE8022]. This LLC provides for delivery of packets between service
access points (SAP). A source LLC specifies a SAP to which the LLC on the destination is
to deliver the packet. The destination SAP usually specifies a protocol stack.
Administration of SAPs is global, so that all systems using the IEEE LLC will properly
recognize or ignore packets as they arrive at the station.
FDDI has been studied for use in other real-time systems, with the conclusion that real-
time systems requirements can be met [Coh87].
6.4.5. SAFENET II
The Survivable Adaptable Fiber Optic Embedded Network, or SAFENET, is being
developed by the Navy to satisfy intercomputer communications requirements for
shipboard, aircraft, and ground-based systems. The current draft standard for SAFENET II
is [MIL-HDBK-0036]. SAFENET II is intended to meet the needs of both interactive and
real-time systems. SAFENET II defines a complete end-to-end local-area network. The
entire ISO/OSI protocol stack, from the physical layer to the presentation layer, is defined
by SAFENET II.
The design of SAFENET II is based on the FDDI specification. SAFENET II defines a
new physical medium dependent (PMD) layer using militarized components. Each station is
connected to the medium by a trunk coupling unit (TCU). The TCU contains an optical
bypass switch with which a station can voluntarily bypass itself on the network. The TCUs
are interconnected by fiber optic cable. All fiber optic cable junctions are constructed using
fiber optic splices, except for station connections. A network station is allowed to connect
to its TCU using a militarized fiber optic connector.
A station may connect to either one (single attach) or two (dual attach) of the SAFENET
II rings. Since a station, consisting of one fault-containment region, may connect to both
rings, the dual ring design is not sufficiently isolated to tolerate Byzantine faults. A dual
attach station may listen on only one of the rings at a time. One ring is designated the
primary ring, and is used until a fault is detected. Upon fault detection, all dual attach
stations switch over to the secondary ring. If a fault is detected in the secondary ring, a new
ring may be constructed using segments of the two original rings. All station bypass and
ring reconfiguration operations assume that no Byzantine faults have occurred.
The SAFENET II specification also defines the higher layers of the protocol stack. Two
protocol suites are specified: an OSI compliant protocol suite based on the Manufacturing
Automation Protocol (MAP) and a lightweight protocol suite based on the Xpress Transfer
Protocol ® (XTP ®) [PEI90120]. A SAFENET II station may implement either or both of
these protocol suites.
6.4.6. Summary
Table 6-1 below summarizes the differences between the various physical and data link
protocols discussed above. Several interesting conclusions arise from this table. First, most
of the systems outlined use token passing media access control, suggesting that such a
media access scheme is optimal for real-time systems. Also, all systems consider
redundancy to enhance system reliability. However, only the AIPS IC network defines
sufficiently isolated redundant network layers to tolerate Byzantine faults. Finally, the
FDDI-based systems provide the highest bandwidth of any of the systems investigated.
The nearest competitor is the JIAWG HSDB, which has only 50% of the FDDI bandwidth.
Characteristic             FDDI                SAFENET II          JIAWG HSDB          SAVA HSDB           AIPS IC
topology                   dual ring           dual ring           dual bus            dual bus            triplex bus
media access control       token passing       token passing       token passing       token passing       Laning poll
signaling rate             125 MHz             125 MHz             100 MHz             12 MHz              2 MHz
data rate                  100 Mbits/sec       100 Mbits/sec       50 Mbits/sec        6 Mbits/sec         2 Mbits/sec
signaling method           4B/5B/NRZI          4B/5B/NRZI          Manchester          Manchester          HDLC
physical medium            fiber optics        fiber optics        fiber optics        twinax cable        coaxial cable
reconfiguration mechanism  dual rings, bypass  dual rings, bypass  dual buses          dual buses, bypass  redundant links
Table 6-1. Comparison of Standards
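The data rates in Table 6-1 follow from each signaling rate and the efficiency of the line code: 4B/5B carries four data bits in five signal elements, and Manchester carries one data bit in two. The short calculation below reproduces the table entries for the four token passing systems; the dictionary and function names are hypothetical, and the AIPS IC entry is omitted because its line code is not described here.

```python
# Data bits carried per group of signal elements for each line code.
CODE_RATIO = {
    "4B/5B": (4, 5),
    "Manchester": (1, 2),
}


def data_rate_mbps(signaling_rate_mhz, code):
    """Data rate in Mbits/sec for an integer signaling rate in MHz."""
    data_bits, signal_elements = CODE_RATIO[code]
    return signaling_rate_mhz * data_bits // signal_elements


rates = {
    "FDDI": data_rate_mbps(125, "4B/5B"),
    "SAFENET II": data_rate_mbps(125, "4B/5B"),
    "JIAWG HSDB": data_rate_mbps(100, "Manchester"),
    "SAVA HSDB": data_rate_mbps(12, "Manchester"),
}
```

The ratio `rates["JIAWG HSDB"] / rates["FDDI"]` is 0.5, which is the 50% figure cited for the nearest competitor.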
These conclusions direct attention toward FDDI as an excellent candidate for use in the
FTDB. The throughput of FDDI is higher than any other system investigated. Since FDDI
uses fiber optics, inter-node isolation is excellent. The token passing media access for
FDDI ensures deterministic bus access for real-time tasks. While FDDI does not provide
sufficient isolation between its dual rings, additional redundancy in the form of multiple
FDDI networks can be used to provide the necessary isolation. Finally, the FDDI standard
itself is very stable. Several of the other standards, particularly JIAWG HSDB and SAVA
HSDB, are very preliminary and implementation details are very sketchy. Neither of these
systems is adequately specified in the current version of the standard to construct a working
system. The FDDI standards, on the other hand, have been accepted by both ANSI and
ISO (with the exception of the station management standard, which is in the final stages of
acceptance), and working systems employing FDDI are available as off-the-shelf items.
6.5. FTDB Brassboard Design Proposal
This section describes the conceptual design for the AFTA fault-tolerant data bus. The
brassboard FTDB design will be constructed as described by the FTDB brassboard
development plan.
The FTDB design is based on the ISO/OSI model for data communications [Bla91].
The ISO/OSI model of the FTDB is shown in Figure 6-7. Only the lower four layers are
described by the conceptual design; specifications for the upper layers are currently beyond
the scope of the FTDB.
The physical and data link layers are based on the FDDI physical and data link layers.
The network layer protocols are designed to handle the redundancy management and
message authentication necessary to support Byzantine resilience. The transport layer
protocols provide several different data models for inter-processor communication. The
transport protocols are integrated with the AFTA Ada run-time system. All redundancy
management issues are hidden from the user by the transport and network protocols.
Level 6-Presentation
Level 5-Session
Level 4-Transport
Level 3-Network
Level 2-Data Link
Level 1-Physical
[Diagram blocks, upper layers: STP, Periodic Datagram Protocol, NDP, Asynchronous Datagram Protocol (ADP), NDSP, and a Network Protocol. Lower layers: FDDI Media Access Control (MAC), FDDI Physical Layer (PHY), FDDI Station Management, and an FDDI OTS or SAFENET II physical medium.]
Figure 6-7. ISO/OSI Model of FTDB
The physical and data link layers are designed around the ANSI specifications for FDDI
and the IEEE logical link control standard. The FDDI standard is rapidly gaining
momentum as the next generation local-area network. Many existing network standards,
such as MIL-STD-1553 and Ethernet, are widely used and well established, but the
technological state-of-the-art is making these systems obsolete. Exotic technologies like
gallium-arsenide (GaAs) make very high throughput (above 1 Gbit/sec) communication
possible; however, such technology is very new and standards based on this technology
have not yet been established. The FDDI standards, at 100 Mbits/sec, represent an
appropriate balance between high technology and standardization.
FDDI contains many features that are useful for real-time fault-tolerant systems,
including high raw bandwidth, low latency, deterministic token passing media access,
synchronous and asynchronous bandwidth, and network fault detection and recovery.
The 125 MHz signaling rate and the 4B/5B code with NRZI signaling of FDDI
provide a raw bandwidth of 100 Mbits/sec, higher than other established standards in
either the commercial or the military sector. A maximally configured FDDI network, with
1000 stations and 200 km of cabling, has a ring latency of 1.617 ms. A network of
reasonable size for embedded applications, with 100 stations and 10 km of cabling, has a
ring latency of 0.111 ms, slightly over 1% of the nominal AFTA run-time system iteration
rate. Token acquisition latency for synchronous data will be no larger than 8.0 ms; a
smaller value may be established at ring initialization time.
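Both ring latency figures are consistent with a simple additive model: a fixed delay per station plus a propagation delay per kilometer of cable. The constants below (600 ns per station and 5085 ns per km) are assumptions chosen because they reproduce the two quoted values; they are not stated in this report.

```python
STATION_DELAY_NS = 600         # assumed delay added by each station
PROPAGATION_NS_PER_KM = 5085   # assumed propagation delay in the cable


def ring_latency_ns(stations, cable_km):
    """Ring latency as total station delay plus total cable delay."""
    return stations * STATION_DELAY_NS + cable_km * PROPAGATION_NS_PER_KM


maximal_ns = ring_latency_ns(1000, 200)   # maximally configured network
embedded_ns = ring_latency_ns(100, 10)    # network sized for embedded use
```

Under these assumptions the maximal configuration gives 1 617 000 ns (1.617 ms) and the embedded configuration gives 110 850 ns (about 0.111 ms). In the smaller network the station delay and the cable delay contribute in roughly equal measure.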
The token passing media access control guarantees access to the network by each
station within a predetermined time period. Each station uses a token holding timer (THT)
to limit the amount of time the station transmits on the network. A properly designed station
will release the token when the THT expires. A station which does not relinquish the token
after the THT expires is faulty.
The token passing protocol in FDDI is also designed to handle synchronous and
asynchronous bandwidth. Synchronous bandwidth is allocated statically for each station
and is guaranteed. Allocation of synchronous bandwidth is done in a manner to ensure that
the total synchronous bandwidth does not exceed the maximum practical bandwidth
capacity of the physical link. After all stations have transmitted synchronous data, any
bandwidth left over is available for asynchronous bandwidth.
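The static allocation rule can be sketched as an admission check followed by a leftover calculation. This is a deliberately simplified model: the function names, the single `overhead_ms` term standing in for ring latency and token overhead, and the example numbers are all hypothetical, and the actual FDDI allocation rules are more detailed.

```python
def sync_allocation_ok(allocations_ms, target_rotation_ms, overhead_ms=0.0):
    """True if the static synchronous allocations fit in one token rotation."""
    return sum(allocations_ms) + overhead_ms <= target_rotation_ms


def async_budget_ms(sync_used_ms, target_rotation_ms, overhead_ms=0.0):
    """Time left for asynchronous traffic after synchronous data is sent."""
    return max(0.0, target_rotation_ms - overhead_ms - sum(sync_used_ms))


# Hypothetical example: three stations sharing a 4.0 ms target rotation.
allocations = [1.0, 1.5, 0.5]
fits = sync_allocation_ok(allocations, 4.0, overhead_ms=0.25)
leftover = async_budget_ms(allocations, 4.0, overhead_ms=0.25)
```

In this example the allocations fit and 0.75 ms per rotation remains for asynchronous traffic. Raising the first two allocations to 2.0 ms each would exceed the rotation time, and the check would reject the configuration.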
An FDDI network is constructed using either a single ring or a dual, counter-rotating
ring configuration. The dual ring design provides some degree of fault-tolerance to the
network. One ring in the dual ring configuration is deemed the primary ring and the other
ring is the secondary ring. The primary ring is used unless a fault is detected, at which
point all stations switch over to the secondary ring. An additional fault on the secondary
ring can sometimes be tolerated by connecting segments of the two rings into a new
configuration.
Each network interface unit on an FDDI network is either a single attach or a dual attach
station. Single attach stations are only connected to the primary ring, whereas dual attach
stations connect to both rings. Dual attach stations only listen on one ring at a time. Since
dual attach stations connect to both rings, a single malicious fault in a network interface unit
can interrupt communications on both rings simultaneously. Thus, the dual ring FDDI
configuration is not sufficient alone to provide Byzantine resilience.
An FTDB implementation utilizing FDDI requires at least two distinct FDDI networks
for Byzantine resilience. Each network, henceforth referred to as a media layer, is used to
transmit a packet copy between stations. Stations are connected to the dual media layers in a
manner that prevents any single fault within a fault containment region from disrupting
more than one media layer. Each media layer is either a single or a dual FDDI ring.
An FTDB implementation using dual, counter-rotating FDDI rings uses the secondary
rings to reconfigure around diagnosed passive faults. Note that even an implementation
using single FDDI rings is Byzantine resilient.
A diagram of the FTDB architecture is shown in Figure 6-8. The FTDB supports
stations of simplex and fault-masking (triplex or quadruplex) redundancy levels. The
FTDB protocols guarantee agreement on data transmitted from a simplex to a fault-masking
group. Validity is guaranteed if the simplex source is functional. The FTDB also guarantees
agreement and validity on data transmitted between fault-masking groups.
[Figure 6-8: block diagram showing a source station, network interfaces, the network interconnect, and a destination station. Legend: member of FMG, member of non-FMG, voter, fault-containment region, message signer, signature authenticator, network interface unit, station boundary.]
Figure 6-8. FTDB Architecture
The FTDB is built around two FDDI media layers to provide reliable communication
between network stations. Each media layer is physically and electrically isolated from the
other layer and from all network station members; thus a single failure in the system will
disrupt at most one media layer and will not disrupt any station member. A media layer may
be either a single or a dual FDDI ring.
All stations are connected to the network using one of the interface architectures shown
in Figure 6-8. All stations must provide a network interface unit (NIU) to each of the two
redundant media layers. Each NIU must reside in a separate fault-containment region to
ensure Byzantine resilience.
The steps taken by a message as it is transferred through the FTDB are described
below. These steps are illustrated for a source consisting of a triplex AFTA Virtual Group
sending a message to a destination consisting of an arbitrary triplex processing site. The
AFTA Virtual Groups view the interface to the FTDB as simply another type of I/O
Controller, and use the I/O message exchange primitives enumerated in Section 3. To
illustrate the linkage between the AFTA and the FTDB, Figure 6-9 is a redrawing of Figure
3.43, showing how triply redundant VG T1 simultaneously writes voted output data to
multiple IOCs, which in this case are the Signer/Checker (S/C) components of the FTDB.
Figure 6-9. Triply Redundant VG T1 Simultaneously Writes Output Data to
Signer/Checker Components of Fault Tolerant Data Bus
Step 1, Figure 6-10. Data transmitted from the station to the network passes through a
message signer. The signer is in the same FCR as the data source. The signer attaches a
sequence number and an authentication signature to the packet.
Figure 6-10. Step 1 of FTDB Message Transfer
Step 2, Figure 6-11. After the authentication information is attached to the packet, the
packet is transmitted to the network interface units. In the case of an FMG, the redundant
copies of the packet are transmitted to the FTDB interfaces, each of which votes the three
copies to produce two redundant, signed packets.
Figure 6-11. Step 2 of FTDB Message Transfer
Step 3, Figure 6-12. Each packet is then transmitted over the dual FTDB media layers
to the appropriate receiver station.
Figure 6-12. Step 3 of FTDB Message Transfer
Step 4, Figure 6-13. The receiving FTDB NIUs transmit the received packet copies to
the signature authenticator stage of the receiving station. The authenticator stage nominally
receives two copies of each packet, one from each media layer. One copy is guaranteed to
be correct in the presence of any single fault in the network.
Figure 6-13. Step 4 of FTDB Message Transfer
Step 5, Figure 6-14. The authentication stage checks the sequence number and the
signature on the packet to make sure the packet is valid. Both the sequence number test and
the signature authentication test must succeed for the packet to be considered valid. If both
packet copies available at a given checker pass both tests, either copy may be selected as
valid by the receiving member of the destination triplex processing site. If neither copy
passes both tests, the packets are discarded and the fault information is recorded in the
network diagnostic log. Subsequently, the received packets and fault information may be
exchanged using one of the standard AFTA I/O exchange primitives enumerated in Section
3. The specific exchange primitive used will depend on the redundancy level of the
recipient VG and the number of FTDB interfaces that VG possesses.
Figure 6-14. Step 5 of FTDB Message Transfer
6.5.1. Physical Layer
The physical layer of the FTDB is based on the specifications for the Fiber Distributed
Data Interface, or FDDI. The physical layer specification in the FTDB is divided into two
major segments. The physical layer protocol defines the data and control signaling and
clock recovery. The physical layer medium dependent describes the actual electrical and
optical hardware used to implement the inter-station communication link.
6.5.1.1. Physical Layer Protocol
The physical layer protocol (PHY) for FTDB is described in [ANSI148]. The data is
encoded using a 4B/5B code to maintain a DC balance on the output waveform. The code
also ensures that the serial data stream will contain no more than three adjacent zero
symbols. This property assists the clock recovery circuitry by providing enough transitions
to derive the clock from the incoming data stream. The stream of 5 bit codes is converted
into an NRZI serial data stream for transmission over the serial medium.
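The encoding path described above (4-bit data symbols mapped to 5-bit codes, then NRZI) can be sketched in Python as follows. The data symbol table is the standard FDDI 4B/5B assignment. The function names, the choice to send the most significant nibble first, and the initial line level are assumptions made for illustration and are not taken from this report.

```python
from typing import List

# Standard FDDI 4B/5B data symbols (4-bit value -> 5-bit code group).
FOUR_B_FIVE_B = {
    0x0: "11110", 0x1: "01001", 0x2: "10100", 0x3: "10101",
    0x4: "01010", 0x5: "01011", 0x6: "01110", 0x7: "01111",
    0x8: "10010", 0x9: "10011", 0xA: "10110", 0xB: "10111",
    0xC: "11010", 0xD: "11011", 0xE: "11100", 0xF: "11101",
}


def encode_4b5b(data: bytes) -> str:
    """Map each nibble to its 5-bit code (ASSUMED: high nibble first)."""
    groups = []
    for byte in data:
        groups.append(FOUR_B_FIVE_B[byte >> 4])
        groups.append(FOUR_B_FIVE_B[byte & 0x0F])
    return "".join(groups)


def nrzi(code_bits: str, initial_level: int = 0) -> List[int]:
    """NRZI: a 1 toggles the line level, a 0 leaves it unchanged."""
    level = initial_level
    levels = []
    for bit in code_bits:
        if bit == "1":
            level ^= 1
        levels.append(level)
    return levels


def longest_zero_run(code_bits: str) -> int:
    """Length of the longest run of consecutive zero code bits."""
    return max(len(run) for run in code_bits.split("1"))
```

Because a zero code bit produces no transition after NRZI conversion, `longest_zero_run` bounds the time the receiver must coast without a clock edge; for data symbols the bound is three bit times, as stated above.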
The use of the 4B/5B code makes better use of the media bandwidth than Manchester
encoding. For example, the raw bandwidth of the FDDI medium is 125 Mbaud. Using
4B/5B encoding, a data rate of 100 Mbits/sec is obtained. Using Manchester encoding, only
62.5 Mbits/sec would be available. The 4B/5B coding requires a more sophisticated clock
recovery circuit and more accurate oscillators than Manchester encoding. However,
oscillators satisfying the 50 ppm specification for FDDI are widely available, as are
monolithic integrated circuits to perform the clock recovery [AMD89a].
6.5.1.2. Physical Layer Medium Dependent
The FDDI physical layer medium dependent (PMD) standard [ANSI166] defines the
physical medium to be used for the data communication channel. The standard includes
specifications for fiber-optic type, fiber-optic-diameter, light wavelength, transmitter type,
receiver type, and connector dimensions. These specifications are either necessary to
achieve the 125 MHz signalling frequency or to ensure compatibility between stations on an
FDDI network.
Other medium dependent specifications can be used to replace the FDDI PMD,
provided that the replacement specification is compatible with the rest of the FDDI
specification. Two examples of alternate medium dependent specifications are the
Militarized Fiber-Optic Transmission System specified by FDDN [Coh88] and the
SAFENET II media dependent layer [MIL-HDBK-0036]. Each of these specifications
defines militarized components not considered by the FDDI PMD specification. However,
both are compatible with the remainder of the FDDI specification.
The modularity of the FDDI PMD permits the substitution of different physical medium
dependent layers on a per-network basis. Thus, a militarized FTDB network can be
constructed by simply replacing electrical components with their military equivalents, and
replacing the PMD layer with a militarized medium dependent layer, such as one of the two
examples presented above.
6.5.2. Data Link Layer
The data link layer of the FTDB is based on the FDDI data link layer standard and the
IEEE standard for logical link control. The FDDI media access control arbitrates access to
the physical network. The FDDI station management protocol provides a set of primitives
for maintaining the processes in the physical and data link layers. The IEEE logical link
control (LLC) maintains the link between the physical layer and the network layer
protocols, and provides peer-to-peer communication with other LLC entities on the FTDB.
6.5.2.1. Media Access Control
The media access control (MAC) for the FTDB is defined in the FDDI standard
[ANSI139]. The FDDI MAC is based on a token passing protocol. Token passing exhibits
many characteristics desirable for real-time systems, including guaranteed deterministic,
low latency data transmission and the ability to schedule synchronous data exchanges.
Deterministic token rotation is guaranteed using a token rotation timer (TRT). The TRT
is initialized during a bidding process. Each station on the network broadcasts a desired
maximum token rotation time. If the station receives a request for a TRT less than what the
station requested, the station drops out of the bidding. If the received TRT request is
greater than what the station requested, the station ignores the request and continues the
bidding process. The last station in the bidding selects the local TRT request and broadcasts
it to all other stations as the target token rotation time (TTRT). The bidding process ensures
that the shortest TRT request is used for the TTRT.
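The outcome of the bidding process can be sketched in a few lines of Python. This is a simplified model rather than the FDDI claim process itself: all bids are assumed to be visible to every station at once, the function and parameter names are hypothetical, and the tie-breaking rule (lowest station name) is an assumption, since the text does not say how equal bids are resolved.

```python
from typing import Dict, Tuple


def negotiate_ttrt(requested_trt_ms: Dict[str, float]) -> Tuple[str, float]:
    """Return (winning station, TTRT) for a set of requested rotation times."""
    if not requested_trt_ms:
        raise ValueError("at least one station must bid")

    still_bidding = set(requested_trt_ms)
    for station, own_request in requested_trt_ms.items():
        # A station drops out when it hears a request lower than its own;
        # higher requests are ignored.
        if any(other < own_request for other in requested_trt_ms.values()):
            still_bidding.discard(station)

    # ASSUMPTION: equal lowest bids are resolved by station name.
    winner = min(still_bidding)
    return winner, requested_trt_ms[winner]


if __name__ == "__main__":
    print(negotiate_ttrt({"A": 8.0, "B": 4.0, "C": 6.0}))
```

In this model the surviving station is always one whose request equals the minimum, which is the property the bidding process is meant to ensure.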
The TTRT is used during normal data transmission to prevent a station from
monopolizing the network. A token that arrives before the TTRT is an early token and can
be used to transmit either synchronous or asynchronous data. A token that arrives after the
TTRT is a late token and can be used only to transmit synchronous data. Synchronous
bandwidth is allocated such that all synchronous data is guaranteed to fit within one TTRT.
Each station should normally have an opportunity to transmit every token rotation time.
Access to the network is guaranteed within two target token rotation times.
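The early/late token rule and the allocation constraint can be expressed as a short Python sketch. The names are hypothetical, the treatment of a token that arrives exactly at the TTRT as late is an assumption, and the allocation check ignores ring latency and protocol overhead, which a real allocation would have to account for.

```python
from typing import Iterable, Set


def allowed_traffic(time_since_last_token_ms: float, ttrt_ms: float) -> Set[str]:
    """Traffic classes a station may send when the token arrives."""
    if time_since_last_token_ms < ttrt_ms:
        # Early token: both classes may be transmitted.
        return {"synchronous", "asynchronous"}
    # Late token (ASSUMPTION: arrival exactly at TTRT counts as late).
    return {"synchronous"}


def synchronous_allocation_fits(allocations_ms: Iterable[float],
                                ttrt_ms: float) -> bool:
    """Simplified check that all synchronous allocations fit in one TTRT."""
    return sum(allocations_ms) <= ttrt_ms


def worst_case_access_ms(ttrt_ms: float) -> float:
    """Access is guaranteed within two target token rotation times."""
    return 2 * ttrt_ms
```

For example, with a TTRT of 4 ms a token seen after 3 ms may carry either class of data, while one seen after 5 ms may carry synchronous data only.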
The MAC specification also defines the station addressing protocol. FDDI station
addresses are categorized by three characteristics as outlined below.
• Physical, logical, or broadcast. Only one station listens on a physical address.
Multiple stations may listen on a logical address. A station may listen on more
than one logical address. All stations listen on the broadcast address.
• Universal or local administration. Universally administered addresses are
assigned by a single authority and are guaranteed to be unique throughout the
world. Locally administered addresses are assigned by the manager of the local
network. The local manager is responsible for preventing local address conflicts.
• Length. FDDI addresses are either 16 bits (short address) or 48 bits (long
address) in length. Only long addresses can be universally administered.
6.5.2.2. Station Management
The station management of an FTDB station is defined by the station management
(SMT) standard for FDDI [X3T95]. The SMT controls various processes in an FTDB
node, including station insertion and removal, initialization, fault detection, isolation and
recovery, ring bandwidth allocation and scheduling, and configuration management.
Although included with the data link layer in this discussion, SMT actually controls the
local PMD and PHY entities in the physical layer as well as the MAC entity in the data link
layer.
6.5.2.3. Logical Link Control
The FTDB logical link control (LLC) protocol conforms to the LLC defined in
[IEEE8022] and [IEEE8021]. The IEEE LLC is not a part of the FDDI specification, but
most FDDI implementations (including SAFENET II installations) use this LLC by
convention. Conformance to the IEEE standard ensures compatibility with other FDDI
stations using the same FDDI media.
The LLC defines service access points (SAPs) which specify the location to which the
LLC delivers incoming packets. The destination SAP (DSAP) usually indicates a network
layer protocol stack. Most SAPs are reserved by IEEE for use by public protocols.
However, the LLC reserves one SAP for use in a protocol extension, known as the sub-
network access protocol (SNAP), for private protocols. Since the Byzantine resilient
network protocol of the FTDB is considered a private protocol, the FTDB LLC uses the
SNAP extension to distinguish BRNP packets from other types of packets.
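As an illustration of how SNAP can mark a private protocol, the Python sketch below builds and recognizes an LLC/SNAP header. The header layout (DSAP and SSAP of 0xAA, an unnumbered-information control field of 0x03, a 3-byte organization code, and a 2-byte protocol identifier) is the standard IEEE 802.2 SNAP format. The organization code and protocol identifier used for BRNP here are placeholders invented for the example; this report does not give the actual values.

```python
import struct

SNAP_SAP = 0xAA        # SAP value reserved for the SNAP extension
UI_CONTROL = 0x03      # unnumbered information (Type 1 datagram)

# HYPOTHETICAL identifiers for BRNP; the real values are not in the report.
BRNP_OUI = b"\xAA\xBB\xCC"
BRNP_PROTOCOL_ID = 0x0001

_HEADER = struct.Struct("!BBB3sH")   # 8 bytes: LLC (3) + SNAP (5)


def build_llc_snap(oui: bytes, protocol_id: int, payload: bytes) -> bytes:
    """Prefix a payload with an LLC/SNAP header."""
    return _HEADER.pack(SNAP_SAP, SNAP_SAP, UI_CONTROL, oui, protocol_id) + payload


def is_brnp(frame: bytes) -> bool:
    """True if the frame carries the (hypothetical) BRNP SNAP identifiers."""
    if len(frame) < _HEADER.size:
        return False
    dsap, ssap, control, oui, protocol_id = _HEADER.unpack(frame[:_HEADER.size])
    return (
        (dsap, ssap, control) == (SNAP_SAP, SNAP_SAP, UI_CONTROL)
        and oui == BRNP_OUI
        and protocol_id == BRNP_PROTOCOL_ID
    )
```

A receiving LLC entity could use a test like `is_brnp` to hand BRNP packets to the BRNP stack and pass all other traffic to whatever public protocol its SAP indicates.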
The LLC defines both datagram (Type 1) and connection-oriented (Type 2)
communication between SAPs. Implementation of Type 1 is required, whereas Type 2
functionality is optional. The FTDB only requires Type 1 capabilities, since the BRNP is
designed around datagram protocols. However, the FTDB may optionally include Type 2
capabilities if a protocol stack that requires Type 2 capabilities is developed for the FTDB.
6.5.3. Network Layer
The network layer protocols are responsible for fulfilling station requests for message
transmission, address resolution, message authentication, and message delivery to stations.
6.5.3.1. Byzantine Resilient Network Protocol (BRNP)
The Byzantine resilient network protocol implements the Byzantine Resilient Virtual
Circuit (BRVC) model [Har87]. The BRVC model guarantees delivery of all messages
transmitted by BRNP. Most existing network layer protocols, IP, for instance, are best-
effort systems that make no guarantees about message delivery; the transport layer
protocols are responsible for providing reliable communication through retry mechanisms.
However, retry mechanisms can be fooled by Byzantine faults [Ber87]. Therefore, true
Byzantine resilience cannot be implemented unless the underlying network layer protocol
supports Byzantine resilience through redundancy and/or message authentication.
BRNP supports communication between stations of varying redundancy levels. The
three redundancy levels of the AFTA (simplex, triplex, and quadruplex) are currently
supported. BRNP guarantees agreement on data transmitted from a simplex to a fault-
masking group. Validity is guaranteed if the simplex source is functional. The FTDB also
guarantees validity on data transmitted between fault-masking groups.
BRNP supports synchronous and asynchronous bandwidth. Synchronous data will
always pre-empt asynchronous data in the output queue. The token passing media access
control of FDDI supports this model well. This data model blends well with the AFTA run-
time system model of synchronous (rate-group) and asynchronous (background) tasks.
Data link and physical layers connected to BRNP must provide at least two connections
to the media layers they represent. These two connections must be ports into mutually
exclusive paths to all other stations that support BRNP. The mutually exclusive paths are
required so that BRNP can transmit two packet copies that traverse the network with no
interconnecting link or node in common. If the packet copies were to pass through the same
link or node, the FTDB would be susceptible to single point failures.
The transmitting BRNP entity receives message transmit requests from the transport
layer protocols in the station. The transmitting entity communicates with the peer BRNP
entity on the receiving station over the redundant FTDB communication links. The message
received from the transport layer is inserted into a BRNP packet with a sequence number
and a signature from the authentication protocol. The BRNP packet is then delivered to the
LLC entity in the station, which transmits a copy of the packet over each of the dual FTDB
media layers to the receiving station.
The receiving BRNP entity is responsible for resolving the multiple packet copies
arriving over the two media layers into a single copy to be delivered to the transport layer.
The two media layers are not bitwise or token synchronized. However, the maximum
latency of a packet on the physical layer is bounded by the token rotation timer. This
characteristic is used by the receiving BRNP entity to maintain functional synchronization
of the two media layers.
When BRNP receives a packet, the sequence number is checked to see if the other copy
of the packet has already arrived. If so, the packet is treated as the second copy, otherwise
the packet is treated as the first copy. The packet is processed by ATP to determine the
validity of the sequence number and the signature. The results of the validation test are
ANDed together to create a packet status bit (PSB). A fault masking station performs a
source congruency on the PSB of each member to generate a packet status vector (PSV).
The PSV for the first packet copy is treated using the following protocol:
• If no members receive a valid packet, the packet is discarded.
• If a minority of members receive a valid packet, a timeout equal to 2 times the
TTRT is started. If the second packet copy does not arrive before the timeout
expires, the packet is discarded.
• If a majority of members receive a valid packet, a timeout equal to 2 times the
TTRT is started. If the second packet copy does not arrive before the timeout
expires, the packet is delivered and the sequence number is incremented.
• If a unanimity of members receive a valid packet, the packet is delivered and the
sequence number is incremented.
The PSV for the second packet copy is processed using the following protocol:
• If a minority or no members receive a valid packet, both copies of the packet are
discarded.
• If a majority or a unanimity of members receive a valid packet, the second copy of
the packet is delivered and the sequence number is incremented.
The protocol outlined above ensures the earliest delivery of valid packets to their
destination while still guaranteeing correct behavior in the presence of a single Byzantine
fault. One interesting characteristic is that if the first packet is valid on all receiving station
members, that copy will be delivered immediately and the second packet copy will be
discarded as invalid since the sequence number is incremented on delivery. Another
characteristic is that the network peer-to-peer latency is no worse than for a simplex
physical layer, whether or not faults are present. In a fault-free system, the fastest physical
layer will deliver the first copy, which will be immediately delivered to the destination. If
the first packet is lost or corrupted, the second packet copy will still arrive within the
predetermined latency (2 x TTRT).
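The first-copy and second-copy rules for the packet status vector can be summarized as a small decision function. The Python sketch below is one possible reading of those rules, not the authors' implementation. The enum and function names are hypothetical, and treating an even split (for example 2 of 4 members) as a minority is an assumption, since the text does not address that case.

```python
from enum import Enum
from typing import Sequence


class Action(Enum):
    DISCARD = "discard"
    WAIT_THEN_DISCARD = "wait 2*TTRT for second copy, discard on timeout"
    WAIT_THEN_DELIVER = "wait 2*TTRT for second copy, deliver on timeout"
    DELIVER = "deliver and increment the sequence number"


def tally(psv: Sequence[bool]) -> str:
    """Classify a packet status vector (one status bit per member)."""
    valid, total = sum(psv), len(psv)
    if valid == 0:
        return "none"
    if valid == total:
        return "unanimity"
    # ASSUMPTION: an even split is not a majority.
    return "majority" if 2 * valid > total else "minority"


def first_copy_action(psv: Sequence[bool]) -> Action:
    """Rule applied to the PSV of the first packet copy."""
    return {
        "none": Action.DISCARD,
        "minority": Action.WAIT_THEN_DISCARD,
        "majority": Action.WAIT_THEN_DELIVER,
        "unanimity": Action.DELIVER,
    }[tally(psv)]


def second_copy_action(psv: Sequence[bool]) -> Action:
    """Rule applied to the PSV of the second packet copy."""
    if tally(psv) in ("majority", "unanimity"):
        return Action.DELIVER      # the second copy is delivered
    return Action.DISCARD          # both copies are discarded


def second_copy_timeout_ms(ttrt_ms: float) -> float:
    """The wait for the second copy is bounded by two TTRTs."""
    return 2 * ttrt_ms
```

For a triplex receiver, `first_copy_action([True, True, False])` starts the timeout and delivers the first copy if the second never arrives, whereas `first_copy_action([True, False, False])` starts the same timeout but discards on expiry.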
6.5.3.2. Authentication Protocol (ATP)
The Byzantine resilient network protocol uses the authentication protocol (ATP) to sign
outgoing packets and to test the authenticity of incoming packets. The transmitting ATP
entity attaches two pieces of information to each outgoing packet for use by the receiving
ATP entity: a sequence number and a signature. These two items are used by ATP and
BRNP to determine if a given packet was sourced by the appropriate station, and to select a
valid packet from multiple packet copies.
The sequence number defines the sequence of packets leaving a station destined for a
specific address. Two sequence number tables are kept by each station, one for transmit
and one for receive. The transmit sequence number table contains a separate sequence
number for each physical, logical, and broadcast address to which the local station
transmits. When a sequence number is requested by BRNP for a destination address, the
ATP returns the current sequence number in the table and automatically increments it in
preparation for the next packet.
A corresponding sequence number table is kept with sequence numbers for each
physical, logical, and broadcast address the station listens on. When a packet is received
from a remote station, the ATP checks to see if the sequence number on the packet is the
number in the receive table. If the sequence number is correct, the packet passes the
sequence number test. The values in the receive sequence number table are not incremented
automatically on success of the sequence number test. BRNP increments the receive
sequence numbers only on delivery of a packet to a destination.
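The transmit and receive sequence number tables can be modeled with two dictionaries keyed by address. The Python sketch below is illustrative only: the class and method names are hypothetical, and starting every counter at zero with no wraparound is an assumption (the report only says that the tables must be synchronized when a station is installed).

```python
from collections import defaultdict
from typing import DefaultDict, Hashable


class SequenceNumberTables:
    """Per-address transmit and receive sequence numbers for one station."""

    def __init__(self) -> None:
        # ASSUMPTION: counters start at zero and never wrap.
        self._transmit: DefaultDict[Hashable, int] = defaultdict(int)
        self._receive: DefaultDict[Hashable, int] = defaultdict(int)

    def next_transmit(self, destination: Hashable) -> int:
        """Return the current number for a destination, then increment it."""
        number = self._transmit[destination]
        self._transmit[destination] += 1
        return number

    def passes_receive_test(self, address: Hashable, number: int) -> bool:
        """Sequence number test; does NOT advance the receive table."""
        return number == self._receive[address]

    def on_delivery(self, address: Hashable) -> None:
        """Called by BRNP when a packet is delivered to its destination."""
        self._receive[address] += 1
```

Keeping the test and the increment separate is what lets both copies of a packet pass the sequence number test until one of them is delivered; after `on_delivery`, the remaining copy fails the test and is rejected.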
Signatures on outgoing packets are calculated by applying a private key function to the
data contained in the packet, including the sequence number. Each station has a different
private key function, thus no station can impersonate another station by using another
station's key. The same private key function is used regardless of whether the packet is
addressed to a physical, logical, or broadcast address.
The ATP validates the signature by applying a public key function to the packet data
and to the signature. If the results match, the packet passes the signature authentication test.
Each private key function has a different corresponding public key function, so each station
must keep a table of the public key functions for each remote station from which the local
station expects to receive a packet.
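The report does not say which private and public key functions the ATP uses. Purely to illustrate the sign-then-verify flow, the Python sketch below substitutes a toy textbook RSA signature over a SHA-256 digest. It is not secure, it is not the AFTA algorithm, and every name, key size, and packet layout in it is an assumption made for the example.

```python
import hashlib
from typing import Tuple

PUBLIC_EXPONENT = 65537

Key = Tuple[int, int]   # (modulus, exponent)


def make_keypair(p: int, q: int) -> Tuple[Key, Key]:
    """Toy RSA key pair from two primes: returns (private, public)."""
    modulus = p * q
    private_exponent = pow(PUBLIC_EXPONENT, -1, (p - 1) * (q - 1))
    return (modulus, private_exponent), (modulus, PUBLIC_EXPONENT)


def _digest(sequence_number: int, payload: bytes, modulus: int) -> int:
    # The signature covers the packet data including the sequence number.
    data = sequence_number.to_bytes(4, "big") + payload
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % modulus


def sign(private: Key, sequence_number: int, payload: bytes) -> int:
    """Apply the sender's private key function to the packet digest."""
    modulus, exponent = private
    return pow(_digest(sequence_number, payload, modulus), exponent, modulus)


def verify(public: Key, sequence_number: int, payload: bytes,
           signature: int) -> bool:
    """Apply the sender's public key function and compare the results."""
    modulus, exponent = public
    return pow(signature, exponent, modulus) == _digest(
        sequence_number, payload, modulus)


if __name__ == "__main__":
    # Example primes (Mersenne primes, far too small for real use).
    private_a, public_a = make_keypair(2**61 - 1, 2**89 - 1)
    sig = sign(private_a, 7, b"sensor reading")
    print(verify(public_a, 7, b"sensor reading", sig))
```

A receiver would look up the public key for the claimed source station in its table and call `verify`; a packet signed with any other station's private key, or altered in transit, fails the comparison.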
The installation of a new station on the FTDB network requires synchronization of the
sequence number tables and distribution of the public key for the new station.
6.5.3.3. Address Resolution Protocol (ARP)
The address resolution protocol (ARP) maps network addresses to station physical,
logical, or broadcast addresses. The ARP is used by BRNP to determine the correct
physical address to attach to each packet destined for another station. The assignment of
network addresses is static in a particular FTDB implementation, thus the ARP is simple
and fast.
Network addressing in BRNP is designed to allow a station to be addressed as a single
unit, regardless of the station's redundancy level. This concept is similar to the virtual
group identifier (VID) numbers used in the AFTA. Consequently, a single network address
may map to several physical addresses.
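Because the address assignment is static, resolution can be as simple as a read-only table lookup. The Python sketch below shows the idea; the table contents, address values, and function name are all hypothetical.

```python
from types import MappingProxyType
from typing import Tuple

# HYPOTHETICAL static table: network address -> physical addresses of the
# station members (one for a simplex, three for a triplex, four for a
# quadruplex).
_STATIC_TABLE = MappingProxyType({
    0x01: (0x0101,),
    0x02: (0x0201, 0x0202, 0x0203),
    0x03: (0x0301, 0x0302, 0x0303, 0x0304),
})


def resolve(network_address: int) -> Tuple[int, ...]:
    """Return the physical addresses that a network address maps to."""
    try:
        return _STATIC_TABLE[network_address]
    except KeyError:
        raise LookupError(
            f"unknown network address {network_address:#x}") from None
```

With this table, `resolve(0x02)` returns three physical addresses, so a sender can address a triplex station as a single unit.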
6.5.4. Transport Layer
The transport layer protocols provide convenient application user interfaces for different
data communication models. Each transport layer protocol communicates through BRNP.
The protocols handle all redundancy management and fault-masking issues. The user
programming model for each protocol is a virtual simplex unidirectional or bidirectional
data port. All protocols are supported on sites of simplex, triplex, or quadruplex
redundancy level. In the AFTA, the transport layer protocols are implemented in the Ada
run-time system as a part of the I/O systems services. However, the protocols are not tied
to any particular programming language, development environment, or operating system.
Normally, the transport layer is used to build reliable message delivery on top of an
unreliable network layer. For example, the transmission control protocol (TCP) uses a retry
mechanism to ensure message delivery using the best-effort internet protocol (IP).
However, in the FTDB, the network protocol itself is reliable, so the transport protocols do
not need to implement retry or other mechanisms for reliable message delivery.
Most of the transport layer protocols described below use sockets to discriminate
between multiple application tasks using the same protocol. The socket model is similar to
that employed by the TCP/IP protocol suite [Com91].
6.5.4.1. Periodic Datagram Protocol (PDP)
The periodic datagram protocol (PDP) uses a connectionless socket to periodically
transmit or receive data using synchronous bandwidth. An example of a possible
application for PDP is for an intelligent sensor computer that periodically transmits a sensor
reading to one or more controller tasks. The protocol provides a method for synchronizing
sender and receiver. Since the protocol is based on reliable datagrams, the sender does not
care if the receiver is present or not; data delivery is guaranteed if the receiver is present.
Data received on a PDP socket also contains a timestamp, so the user task can determine the
relative age of the data. The protocol provides an exception mechanism for the receiver if an
expected datagram doesn't arrive within the expected timeframe.
6.5.4.2. Services Transaction Protocol (STP)
The services transaction protocol (STP) implements a point-to-point transaction model
for data transfer. A typical transaction-type situation is a server/client paradigm. The client
sends a request for a transaction to the server and waits for a response. The server
completes the transaction and returns the result to the client. The client blocks until the
server responds. The protocol provides a services registration mechanism, so that a server
does not need to always reside at the same station. In fact, a server could be moved if the
station on which a server resides becomes faulty.
6.5.4.3. Asynchronous Datagram Protocol (ADP)
The asynchronous datagram protocol (ADP) uses asynchronous bandwidth and a
connectionless socket to transmit data. ADP sockets are non-blocking, as a response is not
expected within a short timeframe. Reception of an ADP datagram is treated as an
exception. A task can either use an exception mechanism or polling to determine if an ADP
packet has arrived. ADP supports point-to-point and multicast communication. The ADP
programming model is very similar to the send_msg and get_msg primitives of the intra-
cluster communication system in the AFTA Ada run-time system.
6.5.4.4. Network Data Stream Protocol (NDSP)
The network data stream protocol (NDSP) provides a virtual circuit connection between
two sockets. The operation is similar to that provided by TCP. Asynchronous or
synchronous bandwidth can be used. An attempt is made to allocate synchronous
bandwidth, if requested, when an NDSP socket is opened. If the synchronous bandwidth
is not available, asynchronous bandwidth is used until the requested synchronous
bandwidth becomes available. If there is no outgoing data in the NDSP output buffer, the
synchronous bandwidth is wasted, so synchronous NDSP sockets should be used with
discretion.
6.5.4.5. Network Diagnostic Protocol (NDP)
The network diagnostic protocol (NDP) is used to perform diagnostic testing of the
network. An NDP datagram can be pre-routed by the sender and can be transmitted through
a loop back to the sender. NDP can also request nested authentication, so that each station
that supports BRNP will concatenate its signature to the NDP datagram, providing a
mechanism for testing each node on a network.
The NDP includes a diagnostic log which records recently observed authentication
errors by the local authentication protocol. The network FDIR task uses the diagnostic log
to attempt to diagnose faults in the system. Also, the diagnostic log can be copied to a
permanent database for use in system maintenance.
The NDP also provides a mechanism for reconfiguring around diagnosed faults in the
physical layer if the physical layer supports reconfiguration. In the FDDI implementation of
the FTDB, the NDP interfaces to the station management (SMT) entities for reconfiguration
of a media layer.
Since network diagnosis and reconfiguration is dependent on the physical layer, NDP
is dependent on a particular physical layer implementation.
6.5.4.6. Echo Protocol (EP)
The echo protocol (EP) simply returns an echo response whenever an echo request is
received. The echo protocol provides a simple and convenient method for determining the
status of a network station.
6.5.4.7. Time Management Protocol (TMP)
The time management protocol (TMP) maintains a global time value for all TMP
subscribers. The source of the global time value can be either a single external source, such
as an accurate time reference, or a distributed time agreement algorithm based on multiple,
less accurate time references. The TMP is used by the PDP to determine the age of periodic
data.
6.6. FTDB Development Plan
This section presents a proposal for development of a fault-tolerant data bus brassboard
system. The FTDB brassboard can be used to interconnect redundant (C2, AFTA) and
non-redundant (Silicon Graphics, MT-1) computing sites. The development plan is
segmented into subtasks, with each subtask emphasizing a different area of development.
6.6.1. Developmental and Non-developmental Items
This section details the hardware items to be developed or acquired as part of the
proposed FTDB development plan.
The authenticator module (ATM) is a developmental item. The ATM implements the
authentication protocol (ATP) described above. The ATMs reside in the same FCRs as the
AFTA, functioning as a redundant I/O device in the AFTA I/O paradigm. For an FMG,
there are 3 or 4 authenticator modules, depending on the redundancy level of the FMG. The
ATM signs outgoing messages, and authenticates signatures on incoming messages. The
design of the ATM is based on an ATM designed under a CSDL IR&D project.
FDDI interface boards are available as non-developmental items. The boards tentatively
selected for the initial FTDB brassboard are the Interphase V/FDDI 4211 Peregrine boards.
These boards are available with either a single attach or dual attach interface and contain an
embedded AMD 29000 Processing Element to assist in implementation of the data link
protocols, especially the station management protocol. This processor can also implement
parts of the network layer protocols for the FTDB.
The voting interface module (VIM) is a developmental item. The VIM takes 1, 3, or 4
copies of a signed packet from the ATM and votes them. The VIM resides in the same FCR
as the FDDI interface board. The VIM is a very simple device. The design of the VIM is
based on the receive and vote stages of the AFTA Network Element.
An optional developmental item is a single board interface module. The single board
interface reduces the FTDB interface from 2 boards per layer (not including the ATM) to
one board per layer by combining the VIM with an FDDI interface. The development of the
single board interface module requires building a custom FDDI interface using a
commercially available FDDI chip set such as that available from Advanced Micro Devices
or National Semiconductor.
6.6.2. Proposed FTDB Brassboard Development Plan
The proposed FTDB brassboard development plan is segmented into 4 subtasks. Each
subtask is considered an upgrade to the previous subtasks, so the costs of each subtask are
incremental.
6.6.2.1. Subtask 1-Authentication Protocols
The emphasis of subtask 1 is to demonstrate the use of signed messages for
authentication, the embedding of authentication protocols onto an existing network
standard, and compatibility between BRNP and other FDDI traffic. The system developed
under subtask 1 also serves as a base for additional protocol development.
The characteristics of the system developed under subtask 1 are as follows:
• 2 station system of simplex devices
• single attach FDDI
• single media layer
Tasks:
2 authenticator modules
2 voting interface modules
2 single attach FDDI interfaces
2 FCR enclosures
data link layer protocols
BRNP, ATP
system integration
6.6.2.2. Subtask 2-Byzantine Resilience
The system developed under subtask 2, an upgrade to the system built under subtask 1,
demonstrates the Byzantine resilience of the FTDB, support for mixed redundancy, and the
reduction in hardware required for the network interface unit. One of the stations in the
subtask 1 system is upgraded to a fault-masking group, and an additional FMG is
connected.
The characteristics of the subtask 2 system are as follows:
• 3 stations, 2 fault-masking
• single attach FDDI
• dual media layers
Tasks:
5 authenticator modules
4 self-contained interface modules
4 FCR enclosures
system integration
6.6.2.3. Subtask 3-Network FDIR
Subtask 3 demonstrates the ability to diagnose and reconfigure around faults in the
network. To demonstrate this capability, all stations must be upgraded to dual-attach FDDI,
and the network diagnostic protocol and network FDIR tasks must be written and
interfaced to the station management protocol. Software and hardware fault injection is
used to test the functionality of NDP and the network FDIR task.
The characteristics of the subtask 3 system are as follows:
• 3 stations, 2 fault-masking
• dual attach FDDI
• dual media layers
Tasks:
6 dual attach upgrades
NDP, network FDIR, fault injection
6.6.2.4. Subtask 4-Transport Layer Protocols
The purpose of subtask 4 is to develop transport layer protocols for use by user
application tasks on the AFTA or other FTDB subscriber. These protocols are intended to
make the FTDB useful in a real-time system. The subtask 4 system demonstrates the
applicability of these protocols for real-time applications. At the conclusion of subtask 4, all
developmental work for the brassboard FTDB is complete.
Tasks:
transport layer protocols
6.6.3. FTDB Brassboard Development Schedule
Figure 6-15 describes the development of the FTDB brassboard. The development
schedule is intended to correspond to the development of the AFTA FTPP. Delivery of the
FTDB brassboard is targeted for sometime in March, 1993, near the projected delivery date
for the rest of the AFTA brassboard design.
[Figure 6-15: schedule chart covering July 1991 through September 1993. Phases: Detailed Design; Fabrication, Integration, Validation. Recoverable milestone labels: Kick-off Meeting, Preliminary Design Review, Critical Design Review, Deliver Brassboard to Army. Legend: Scheduled Start, Scheduled Completion, Actual Start, Actual Completion, Slip.]
Figure 6-15. FTDB Brassboard Development Schedule
7. Testability and Maintainability
AFTA is designed to be testable for hardware faults at all stages of its lifetime. As a
fault tolerant computing system, it actively tests itself during operational modes in order to
maintain its high reliability. During mission critical operations it is imperative that faulty
components be identified and expunged from the system to prevent the possibility of a
system failure should a second uncovered fault occur. However, because no computing
system is operational 100% of the time and because all digital computing systems require
maintenance, the testing capabilities of the AFTA will also encompass maintenance modes
of operation. Consequently, the system test activities will address all aspects of
determining hardware faults: at the maintenance depot, upon command by an operator, at
power on, and in a mission critical environment.
7.1. Level of testing
The AFTA consists of numerous, individually testable components. Testing of the
AFTA will exercise these components as comprehensively as possible. The components
addressed by the test suites are the processors, Network Elements, I/O controllers, power
conditioners, mass memory devices, and VME buses.
There are essentially 2 levels of testing: component self tests and system tests. The
component self tests are intended to isolate faults in the functional components of a line
replaceable module with the emphasis on isolating the fault to a chip-level component. This
goal can be achieved using on-board diagnostic mechanisms or functionally equivalent
tests. On the other hand, the system tests are designed not only to exercise the numerous
components in a cohesive manner but also to perform these tests while performing mission
critical operations.
7.1.1. Component self tests
The component self tests are diagnostic tests which exercise the various hardware
components of the AFTA. Each line replaceable module (LRM) in the AFTA will have a
suite of tests which exhaustively tests each functional component of the LRM. Whenever
available, these tests will be supplied by the manufacturer of the LRM. The testable
components on the AFTA are the processors, Network Elements, I/O controllers, VME
bus, mass memory and power conditioners.
7.1.2. System tests
The component self tests exercise the functionality of the individual line replaceable
modules. Conversely, the system tests exercise functions requiring multiple components
operating in tandem to effectively test the system. Because the AFTA is designed as a fault
tolerant system, fault detection mechanisms are built into the specially designed
interconnection network and are exercised at every message exchange to provide high
coverage of faults with low fault latency. The goal of the system tests is to test the AFTA
as an operating entity exercising these fault tolerant mechanisms.
Fault tolerance in the AFTA is implemented using hardware redundancy. A specially
designed set of Network Elements operates in tight synchrony to implement fault tolerant
message exchanges among processors grouped into redundant virtual groups. The
constituent processors in a virtual group communicate with the other members of their virtual group
and with other virtual groups by synchronously sending messages via the Network
Elements. The Network Elements perform fault-tolerance-specific operations on messages
and deliver voted messages to all members of the destination virtual group. The voting
process generates a consistent voted copy of the message as well as error syndrome data
which are appended to the delivered message. This error syndrome information can be
used to identify faulty components.
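The voting step described above can be illustrated with a minimal sketch. It assumes a bitwise majority vote over redundant copies and a one-flag-per-channel syndrome; the function name, the syndrome layout, and the rule that a 2-2 tie in a quadruplex resolves to zero are illustrative assumptions, not the actual Network Element design.

```python
from typing import List, Tuple


def vote(copies: List[bytes]) -> Tuple[bytes, List[bool]]:
    """Bitwise majority vote over redundant copies of one message.

    Returns the voted message and a per-channel syndrome flag that is
    True when that channel's copy disagrees with the voted result.
    The syndrome format is an assumption made for this sketch.
    """
    n = len(copies)
    if n < 3:
        raise ValueError("majority voting needs at least three copies")
    if len({len(c) for c in copies}) != 1:
        raise ValueError("all copies must have the same length")

    voted = bytearray()
    for column in zip(*copies):          # one byte from every channel
        out = 0
        for bit in range(8):
            ones = sum((b >> bit) & 1 for b in column)
            if 2 * ones > n:             # strict majority of ones
                out |= 1 << bit
        voted.append(out)

    result = bytes(voted)
    syndrome = [c != result for c in copies]
    return result, syndrome
```

In a triplex group where the third channel delivers one flipped bit, the voted copy equals the two agreeing copies and only the third syndrome flag is set, which is the information a diagnostic task could use to identify the faulty channel.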
7.2. Test Modes
AFTA testing activities shall operate in 4 distinct test modes defined by both the
operational as well as the physical environment. In addition, these modes will dictate the
operator interface. These testing modes are: depot test, maintenance built-in test (M-BIT),
initiated built-in test (I-BIT) and continuous built-in test (C-BIT).
The depot test mode comprises a suite of tests available to the test technician or
automatic test equipment (ATE) for testing the components of the AFTA at a maintenance
repair facility. Specifically, the test suite consists of sets of diagnostic level tests for the
processors, I/O controllers, the Network Element, VME bus, mass memory, and power
conditioners. These depot tests execute outside of the constraints of a real-time
environment with the emphasis on the isolation of chip level faults in these components.
The M-BIT mode is essentially flight-line maintenance which is initiated upon
command by the flight or maintenance crew. Because this test mode is a maintenance mode
with the emphasis on detecting and isolating faults, mission critical operations will not be
active during this stage. The computing resources are devoted entirely to this maintenance
activity. Consequently, the suite of tests can be very extensive in testing the functionality
of each line replaceable module of the AFTA. In addition, the test suite will include tests of
the functionality of the LRM interfaces, particularly the buses. As an exhaustive set of
tests, this test suite will probably require on the order of minutes to complete; however,
abort mechanisms will be provided to terminate the test activity prematurely.
When power is applied to all components of the AFTA hardware, the I-BIT mode
shall be initiated. The objective of testing during this period is to identify faulty
components in a non-mission critical stage to obviate their sudden exposure at a time when
the recovery options are very limited. As a result of the evaluation of the series of I-BIT
tests, the faulty and non-faulty components will be identified and the initial system
configuration will be established concordant with the mission reliability requirements and
the availability of non-faulty components. During a relatively short period (seconds), the
system initializes and tests itself. However, because of the time constraint a broad
spectrum of tests will be executed to test the basic functionality of all LRMs rather than
extensively testing only a couple of LRMs. Testing at this stage of activity with a
comprehensive suite of tests ensures that the system reliability is as high as possible by
determining those LRMs which are faulty and excluding them from the initial system
configuration. Because this stage is not executing mission-critical functions, the suite of
tests can be as extensive as time permits in exercising all functions of the component
without regard to the maintenance of mission critical information. Furthermore, the system
configuration options are far more numerous at this stage than during mission critical
operation where real-time constraints are serious barriers to many reconfiguration
alternatives.
The C-BIT mode will be initiated when mission critical operations are activated.
During this test mode not only will the mission-critical application tasks execute within a
real-time scheduling scheme but, because the system configuration consists of redundant
groups, redundancy management functions such as voting will be active to ensure that
failures do not disrupt correct system operation. Unlike the previous test modes, the fault
detection and analysis functions are constrained within a real-time framework. In addition,
these functions must not interfere with mission critical operations; data integrity must be
maintained and the consumption of computing resources must be minimized.
[Figure 7-1: diagram not recoverable from the source scan. Surviving labels: "initiated", "mission completed".]
Figure 7-1. System Mode and Test Mode Interactions
Figure 7-2 illustrates in greater detail the system operations from an initial power-on
state and the interaction with the AFTA test modes. This sequence is described thoroughly
in Section 5.
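[Figure 7-2: state diagram, only partially recoverable from the source scan. State labels: power on; I-BIT self tests; element sync; configure (label partly illegible); I-BIT system tests; standby operations; mission operations; C-BIT tests; M-BIT self tests; element sync; M-BIT system tests. Transition labels: manual system reset; operator command; mission activated; system reset.]
Figure 7-2. Test Mode State Diagram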
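The mode changes described in Section 7.2 can be summarized as a small transition table. The sketch below models only the triggers named in the text (power applied, operator command, mission activated, mission completed). The state names, the event names, and the return paths to standby are assumptions made for illustration, and the reset transitions are omitted because their endpoints are not described in the text.

```python
# Hypothetical state and event names; return-to-standby edges are assumed.
TRANSITIONS = {
    ("power_off", "power_applied"): "i_bit",
    ("i_bit", "tests_complete"): "standby",            # assumed return path
    ("standby", "operator_command"): "m_bit",
    ("m_bit", "tests_complete"): "standby",            # assumed return path
    ("standby", "mission_activated"): "mission_c_bit",
    ("mission_c_bit", "mission_completed"): "standby",
}


def next_mode(state: str, event: str) -> str:
    """Return the next state, or raise if the event is not allowed here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state!r}"
        ) from None


def run(events, state="power_off"):
    """Apply a sequence of events and return the list of states visited."""
    visited = [state]
    for event in events:
        state = next_mode(state, event)
        visited.append(state)
    return visited
```

Because M-BIT is reachable only from standby in this table, an operator command received during mission operations is rejected, which mirrors the statement that mission critical operations are not active during M-BIT.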
7.3. Operator interface
There are 2 primary operators of the AFTA who are interested in the health of the
AFTA - namely, the vehicle operator and the maintenance crew. Each has drastically
differing requirements regarding the health of the digital computing system. The vehicle
operator is primarily interested in discerning the relative health of the system with regard
to its ability to accomplish the current mission with a sufficient measure of reliability.
Specifically, the vehicle operator needs to know whether the mission configuration is
commensurate with the requested redundancy configuration. Furthermore, the operator needs
to know the reliability of the mission configuration when it differs from the request.
The AFTA operating system should provide some measure of reliability for each critical
functional area - that is, navigation system, flight control - as well as some
determination of the health of redundant sensors and actuators. In addition, it is highly
desirable to filter this information sufficiently such that the information presented to the
vehicle operator is easily decipherable and interpretable given that the pilot may be
immersed in other mission critical operator tasks.
Conversely, the maintenance crew is interested in the isolation of faults to a specific
component. In fact, 2 tiers of fault diagnosis are highly desirable. Field operations
maintenance requires that components be easily replaceable. Consequently, identification
of faulty line replaceable modules (LRM) is important to the field operations. On the other
hand, the expense of LRM replacement warrants the identification of chip level faults
whenever possible. This enables LRMs to be shipped to a maintenance repair facility for
diagnosis of faulty components within the LRM and replacement of those components.
7.4. FTPP C2 Network Element Tests
The FTPP C2 system is the forerunner of the AFTA system. As in the AFTA, the
Network Element is the integral component of this fault tolerant computing system. While
there are many differences between these systems, there are many similarities in the
architecture of the Network Element.
This section describes in detail the component self tests developed for the C2 Network
Element.
[Figure 7-3: block diagram, only partially recoverable from the source scan. Surviving labels: Processing Element; Data Width Converter (32 to 8); FIFO Data Bus; Optical Transmitter; Data To Other Fault Sets; Data From Other Fault Sets; Fault Tolerant Clock.]
Figure 7-3. Block Diagram of FTPP C2 Network Element
7.4.1. Off-line Standalone NE Diagnostic Tests
The following tests verify the correct operation of the FTPP-C2 Network Element (NE)
hardware to the extent made possible by the current design under the control of the local
Processing Element (PE), an MVME-147 board. All tests are standalone in the sense that
no inter-Network Element communication takes place. In fact, the execution of most of
these tests would be disruptive to the synchronous operation of the aggregate NEs. Thus
the test suite can only be executed by each PE running in simplex mode. The tests have
been developed as a program to be downloaded to each processor and executed from RAM.
However, since no static variables have been used, the code could easily be converted to a
PROMable version. The PROMed version could either be called as a subroutine by any
program running on the PE or as a standalone program from the 147-Bug PROMed
debugger. In the first case, the subroutine returns a boolean to the calling program
indicating whether or not any errors were detected. In the second case, the routine simply
returns to the 147-Bug user interface. A second version of the program is provided for use
from the 147-Bug user interface to allow standalone testing of the opto-electrical devices
used in the fiber-optic communications. All error reporting is displayed on the VT-220
terminal attached to the PE. In addition to fully verifying the correct operation of the
Network Element, the tests are intended to serve a routine maintenance function, enabling
an operator to replace faulty components.
The C2 NE hardware comprises six functional blocks. They are:
1) The Processor-Network Element Interface
2) The Network Element Data Paths
3) The Network Element Global Controller
4) The Scoreboard
5) The Inter-Fault Set Communication Links
6) The Network Element Synchronization Method (Fault Tolerant Clock)
The tests in the off-line standalone test suite fully verify the correct operation of the first
four functional blocks. Some of the tests also perform diagnostic analysis of errors detected
during the test suite in an attempt to identify the cause of the error to as fine a level as
possible. In a few cases, the IC responsible for the error can be identified. The on-card
components of the fifth functional block are tested with a separate test suite since these tests
can only be performed after some optical cables are connected in a testing configuration.
Testing the last functional block, the Network Element Synchronization Method, requires
true synchronous NE operation. The functionality of the fault tolerant clock is minimally
tested in the scoreboard tests.
7.4.2. Functional Block: Processor-Network Element Interface
Sub-block: Address Decode and Dtack Generation
Parts List: U0101, U0105, U0109, U0113, U2909
Test Description: A failure in this part of the circuitry will appear to the PE as a
Bus Error. The criterion for passing the test will be that no Bus Error is detected
during a read of the Status Register or the first byte in the Dual Port Memory or a
write to the Class FIFO or the Transmit FIFO. Since it is not possible to isolate any
of the devices, each one must be replaced in turn until the Bus Error is eliminated.
Test Sequence Number: 1. All further testing depends on the ability to correctly
read and write data to this interface.
Sub-block: Reset Generation
Parts List: U0113, U0117
Test Description: Writing to the Reset Location of the DP RAM should cause the
NE to reset itself. To partially verify that the reset function is operational, the reset
location is written to and the necessary time delay is allowed to elapse. The status
register is then read. The value stored in bits one and zero should be 1. Next a byte
is written to the Class FIFO and the macro wrap_serp_vme is executed. The status
register is again read. This time since data should be in the Receive FIFO, the value
of bits one and zero should be 3. The reset location is written to a second time and
the status register is read. The reset function is considered operational if the value
of bits one and zero of the status register is restored to one. If the test fails, either
the reset function is not operational or the status register has failed.
Test Sequence Number: 2. All further testing depends on the ability to
correctly reset the NE.
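The reset test sequence can be sketched against a software stand-in for the NE. The FakeNE class and its method names below are hypothetical and exist only for illustration; the only details taken from the text are the expected status values (1, then 3, then 1 in bits one and zero) and the bit assignments of CTS (bit zero) and DR (bit one). The time delay after a reset is omitted.

```python
class FakeNE:
    """Hypothetical stand-in for the NE registers used by the reset test."""

    def __init__(self, reset_works: bool = True):
        self.reset_works = reset_works
        self.class_fifo = []
        self.receive_fifo = [0xEE]   # stale data makes a broken reset visible

    def write_reset_location(self) -> None:
        if self.reset_works:
            self.class_fifo.clear()
            self.receive_fifo.clear()

    def write_class_fifo(self, byte: int) -> None:
        self.class_fifo.append(byte)

    def wrap_serp_vme(self) -> None:
        # Models the macro: move Class FIFO contents to the Receive FIFO.
        self.receive_fifo.extend(self.class_fifo)
        self.class_fifo.clear()

    def read_status(self) -> int:
        cts = 1                                  # Transmit FIFO unused, so empty
        dr = 1 if self.receive_fifo else 0       # data ready in Receive FIFO
        return (dr << 1) | cts


def reset_test(ne) -> bool:
    """Return True when the low two status bits read 1, then 3, then 1."""
    ne.write_reset_location()
    if ne.read_status() & 0b11 != 1:
        return False
    ne.write_class_fifo(0xA5)
    ne.wrap_serp_vme()
    if ne.read_status() & 0b11 != 3:
        return False
    ne.write_reset_location()
    return ne.read_status() & 0b11 == 1
```

As in the description, a failing result cannot distinguish a broken reset function from a broken status register; the sketch simply reports pass or fail.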
Sub-block: Dual Port Ram
Parts List: U4138
Test Description: A simple read/write pattern test is performed on the 2 Kbytes
of the dual port RAM. To pass this test, the pattern read from a given location must
be equal to the pattern that was written to it. Since byte 0 of the DP RAM performs
a special control function for the NE, it is exempt from this test. A failure of any
part of this test means replacing the device.
Test Sequence Number: 3. This test only depends on Test 1.
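A read/write pattern test of this kind might look like the following sketch. The patterns 0x55 and 0xAA are an assumption (the text does not name the patterns used), and the memory is modeled as any object supporting indexed reads and writes; StuckBitRam is a hypothetical fault model included only to show a failing case.

```python
DP_RAM_SIZE = 2048  # 2 Kbytes, as stated in the test description


def dp_ram_pattern_test(ram, patterns=(0x55, 0xAA)):
    """Write and read back each pattern at every offset except byte 0.

    Returns the list of offsets whose read-back value did not match.
    """
    failures = []
    for offset in range(1, DP_RAM_SIZE):   # byte 0 is a control byte: skip it
        for pattern in patterns:
            ram[offset] = pattern
            if ram[offset] != pattern:
                failures.append(offset)
                break
    return failures


class StuckBitRam:
    """Hypothetical faulty RAM: selected bits at one offset are stuck at 0."""

    def __init__(self, bad_offset: int, stuck_mask: int):
        self.cells = bytearray(DP_RAM_SIZE)
        self.bad_offset = bad_offset
        self.stuck_mask = stuck_mask

    def __setitem__(self, offset: int, value: int) -> None:
        if offset == self.bad_offset:
            value &= ~self.stuck_mask & 0xFF
        self.cells[offset] = value

    def __getitem__(self, offset: int) -> int:
        return self.cells[offset]
```

Using two complementary patterns means every bit of every tested cell is written as both a one and a zero, so a single stuck bit is caught by one of the two passes.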
Sub-block: Dual Port Ram Contention Arbitration Capability
Parts List: U4073
Test Description: Since the on-chip contention circuitry is faulty, additional
logic in the form of this device was added to perform arbitration when both the PE
and the NE try to write to the same location in the DP RAM at the same time. To test
that the contention logic is working, the same location on both sides of the DP
RAM must be accessed simultaneously. First, the local timer is read. The
verify_contention macro is executed. This causes the global controller to lock out
DPCOMM0 for approximately 64 µseconds. Therefore, if the VME side of
DPCOMM0 is accessed now, 64 µseconds should elapse before it receives a Dtack.
After reading this location (the value read is irrelevant), the local timer is again read.
If 64 µseconds have elapsed, the contention logic is operating correctly. The time
which actually expires is reported as part of the test results.
Test Sequence Number: 5. This test depends on the DP RAM test and the test
for the Global Controller.
Sub-block: Receive FIFO
Parts List: U0121, U0125, U1421, U1425, U1205, U1417
Test Description: One byte is written to each of four DP RAM locations
(DPDATA0-DPDATA3). The deliverdpram macro is executed which causes these
bytes to be written in order into the Receive FIFO. The receive FIFO is read as a
long word and the four bytes compared to those written to DP RAM. If the
corresponding bytes are equal, the Receive FIFO is operational. If it fails, the
Receive FIFO is not operational or the DP RAM interface on the Voted Data Bus
which it shares with the Receive FIFO is faulty. There is no software test to
differentiate between these two possibilities. Replace the Receive FIFO and retry.
Test Sequence Number: 6. This test must follow the test for the Global
Controller and the test of DP RAM.
Sub-block: Transmit FIFO
Parts List: U0121, U0125, U1421, U1425, U1409, U1413
Test Description: The macro xmit_to_dpram sends data from Data Width
Converter on the VME bus through the Transmit FIFO to DP RAM through the
Debug Wrap Buffer. This wrap test is performed by writing a known long word
pattern to the Transmit FIFO and then executing the macro xmit_to_dpram four
times. The contents of DP RAM 0 is compared to the corresponding byte in the
long word pattern written to the Data Width Converter. If all four patterns match,
the Data Width Converter, the Transmit FIFO and the Debug Wrap Buffer are
considered operational. In this case, the same test is performed using the wrap_vme
macro. This macro causes the data from the Transmit FIFO to be transferred to the
Receive FIFO, thereby exercising some additional data paths in this interface. If this
exercise produces no errors the result is a passing score for the Transmit FIFO. If
some of the patterns in the xmit_to_dpram test match but others do not, the
corresponding register(s) in the Data Width converter are failed. If none of the
patterns match, either the Transmit FIFO has failed or the Debug Wrap Buffer has
failed. In this case a test to verify the operation of the Class FIFO is performed.
This test is described below. However, it uses the Receive FIFO instead of the DP
RAM as a repository of the wrapped byte. Thus if this test also fails, the Debug
Wrap Buffer appears to be the failed device. If a new device results in the same test
results, the Transmit and Class FIFOs are both faulty and must be replaced.
Test Sequence Number: 7. This test must follow the test of the Global
Controller, DP RAM and Receive FIFO.
Sub-block: Class FIFO
Parts List: U1201
Test Description: This test uses the wrap_serp_vme macro to transfer a byte
from the Class FIFO to the Receive FIFO. If the byte written equals the byte read,
the Class FIFO is considered operational. If they are not equal, the device is faulty.
Test Sequence Number: 8. This test must follow the presence test for the global
controller and the Receive FIFO.
Sub-block: Status Register
Parts List: U0117
Test Description: Following the execution of an NE-reset command, the value
of the lower two bits of the status register should be 01 corresponding to an
asserted value of CTS and a de-asserted value of DR. An asserted value of CTS
corresponds to an empty Transmit FIFO. A de-asserted value of DR corresponds
to an empty Receive FIFO. This part of the test is run implicitly in the reset
function test. However, the ability to clear the CTS bit (bit zero) in the status byte
read from this register can be tested also. This is accomplished by writing a series
of long words to the Transmit FIFO and reading the Status register. CTS should be
de-asserted when the Transmit FIFO is more than half full or after 129 long words
have been written. It should remain in that state while the Transmit FIFO is filled to
hold finally 256 long words. Next the macro wrap_vme should be executed
enough times to transfer half the data from the Transmit FIFO to the Receive FIFO.
This will require adjusting the message size with the write_to_ftc macro. Since the
largest message which can be sent is 15 long words and the smallest message is 4
long words, this can be accomplished with two 15 word messages and one 12
word message. At this point, the CTS should be reasserted. After the contents of
the full Transmit FIFO are transferred to the Receive FIFO, the contents of the
Receive FIFO are read and compared to the outgoing data. Each of the 256 long
words read from the Receive FIFO should match the corresponding word written
out to the Transmit FIFO. When the Receive FIFO is empty, DR should be de-
asserted.
Test Sequence Number: 9.
Sub-block: Debug Wrap Buffer
Parts List: U2701
Test Description: When tests for both the Class FIFO and the Transmit FIFO
fail, this buffer is implicated. No unique test for this device is possible.
Test Sequence Number: N.A.
7.4.3. Functional Block: Network Element Data Paths
Sub-block: Data Paths through My FIFO and Opposite FIFO
Parts List: U2705 (Tri-Statable Pipeline Register), U1667, U1661, U1666, and
U1669 (Debug Router), U0167 and U0149 (My FIFO), U0169 and U1163
(Opposite FIFO), U0244 and U1244 (Voter), U2163 (Synchronous Data Path
Controller), U2244 (Asynchronous Data Path Controller), U2149 (Vote Mask
Register)
Test Description: The purpose of this test is to determine whether or not the data
paths through the FIFOs designated as My FIFO and Opposite FIFO are
functioning correctly. Since the devices comprising the FIFOs cannot be fully
isolated, the hardware comprising the entire data path is also tested both implicitly
and explicitly by the test suite described here. Two data paths are exercised. The
aggregate error information obtained during this process is then analyzed to identify
as closely as possible the source of any errors. In the first test sequence, data is sent
from the Transmit FIFO over My External Bus through the Debug Router to My
FIFO, through the Voter and finally read back from the Receive FIFO. For this data
path, the Vote Mask is set to exclude data from all channels except My FIFO in the
voted result which is returned to the Receive FIFO. In the second test sequence, the
data path is the same except the data path FIFO used is the Opposite FIFO. The
Vote Mask is set to vote only data from the Opposite FIFO. (The voter PALs only
allow simplex "voting" of data from these two particular FIFOs). For both tests the
messages sent are the same: four words which contain bit patterns for byte wide
"marching ones" through a field of zeroes and "marching zeroes" through a field of
ones. Each test answers the following two questions: (1) Does the data read match,
bit for bit, the data written? (2) Were any voter syndromes registered against the
data path FIFO in use? If all patterns read match the patterns written and no voter
syndrome errors are reported against the channel under test, the devices in these
data paths are functional. If no data is delivered to the Receive FIFO for either path,
the functional blocks suspected of being faulty are the Synchronous or
Asynchronous Controllers. If pattern errors are detected on both paths but no voter
syndrome errors are detected, then devices in the Tri-Statable Pipeline Register or
the Debug Router may be faulty. If a pattern mismatch occurs on only one data
path, then the FIFO in that path or its associated pipeline register may be faulty. If
syndrome errors are reported on both data paths, then the Voter or the Vote Mask
Register may be faulty. If any errors are detected, the raw test results of this test are
displayed on the attached monitor in tabular form.
Test Sequence Number: 10.
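The test message and the fault analysis described above can be sketched as follows. Eight "marching ones" bytes plus eight "marching zeroes" bytes give 16 bytes, which is consistent with the four long words mentioned in the text; the byte order within each word and the order in which the diagnostic rules are checked are assumptions. The PathResult type and the diagnose function are hypothetical names, and "no data delivered for either path" is read here as "on at least one path".

```python
from dataclasses import dataclass
from typing import List


def marching_patterns() -> bytes:
    """Byte-wide marching ones through zeroes, then marching zeroes through ones."""
    ones = [1 << i for i in range(8)]
    zeroes = [0xFF ^ (1 << i) for i in range(8)]
    return bytes(ones + zeroes)


def as_long_words(data: bytes) -> List[int]:
    """Pack bytes into 32-bit words (big-endian packing is an assumption)."""
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


@dataclass
class PathResult:
    delivered: bool    # data arrived in the Receive FIFO
    pattern_ok: bool   # data read matched data written, bit for bit
    syndrome: bool     # a voter syndrome was registered against this FIFO


def diagnose(my: PathResult, opposite: PathResult) -> str:
    """Map the two path results onto the suspect list given in the text."""
    paths = (my, opposite)
    if not all(p.delivered for p in paths):
        return "Synchronous or Asynchronous Data Path Controller"
    if all(p.syndrome for p in paths):
        return "Voter or Vote Mask Register"
    if all(not p.pattern_ok for p in paths) and not any(p.syndrome for p in paths):
        return "Tri-Statable Pipeline Register or Debug Router"
    if my.pattern_ok != opposite.pattern_ok:
        bad = "My FIFO" if not my.pattern_ok else "Opposite FIFO"
        return f"{bad} or its associated pipeline register"
    if all(p.pattern_ok for p in paths) and not any(p.syndrome for p in paths):
        return "data paths functional"
    return "undetermined"
```

Combinations that the text does not cover, such as a syndrome on only one path with matching patterns, are reported as "undetermined" rather than guessed.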
Sub-block: Data Paths through Left FIFO and Right FIFO
Parts List: U1667, U1661, U1666, and U1669 (Debug Router), U0161 and
U0153 (Left FIFO), U0165 and U1149 (Right FIFO), U0244 and U1244 (Voter),
U2163 (Synchronous Data Path Controller), U2244 (Asynchronous Data Path
Controller), U2149 (Vote Mask Register)
Test Description: This test and the data analysis are performed in exactly the
same manner as the test for the Data Paths through My FIFO and Opposite FIFO
except that a different Vote mask is used. This test is only performed when no
errors are detected on the previous data paths test suite. The Vote Mask used to test
the Left FIFO includes My FIFO, Opposite FIFO and Left FIFO. For testing the
Right FIFO, the mask is changed to include Right FIFO and exclude Left FIFO. If
any errors are detected by this test, the raw test results of both sets of data path tests
are displayed on the attached monitor in tabular form.
Test Sequence Number: 11
Sub-block: Voter Error Detection Capability
Parts List: U0244 and U1244 (Voter), U2149 (Vote Mask Register), U3877
(Syndrome Accumulator)
Test Description: The purpose of this test is to determine if the error detection
capability of the Voter is functioning properly. This test is only performed if the
Data Paths test suites have detected no errors in at least three of the data paths.
Depending on the number of working FIFOs, all possible configurations are tested.
If all four FIFOs are fully functional, then the quadruplex and three triplex
configurations are tested. If only three FIFOs are operational, then only one triplex
configuration is tested. Each configuration is tested separately. One at a time, each
channel of a given configuration is designated as the channel under test. The
channel so designated is sent a corrupted message while the other channels receive
congruent copies of the valid message. The messages are selected so that error
detection is tested for every bit in the eight-bit-wide voted data path. Furthermore, both
possible error types are tested in each bit position, i.e. a correct value of zero and an
erroneous value of one and vice versa. Error insertion is accomplished by first
writing a valid message to all the FIFOs, then clearing the FIFO under test and
sending all FIFOs the corrupted message. The FIFO under test now holds only one
message, the corrupted one, while the other FIFOs have a valid message followed
by the corrupted message. Since the first messages in the FIFOs are voted together,
this procedure correctly inserts an error in the FIFO under test. Having thus
inserted an error, the message is voted and delivered and the resulting voter
syndrome information is examined. If, in all cases, an error is recorded against the
channel under test, the error detection hardware is deemed to be functioning
correctly. Furthermore, if, in all cases, the voted value of the message correctly
masks the error, the voter hardware is deemed to be functioning correctly.
Otherwise, a malfunctioning component exists among the devices listed above. In
this case, the results of the test are displayed in tabular form on the attached
monitor.
Test Sequence Number: 12
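The error insertion procedure lends itself to a short simulation. The sketch below models each data path FIFO as a queue and the voter as a bitwise majority function; the channel names, the single-byte messages, and the choice of 0x00 and 0xFF as valid messages are assumptions made so that every bit position sees both error polarities. It is not a model of the actual voter PALs.

```python
from collections import deque


def vote_byte(values):
    """Bitwise strict-majority vote over one byte from each channel."""
    n = len(values)
    out = 0
    for bit in range(8):
        if 2 * sum((v >> bit) & 1 for v in values) > n:
            out |= 1 << bit
    return out


def insert_error_and_vote(channels, under_test, valid, corrupted):
    """Follow the insertion steps from the text, then vote the head messages."""
    fifos = {ch: deque() for ch in channels}
    for fifo in fifos.values():
        fifo.append(valid)            # 1. valid message to all FIFOs
    fifos[under_test].clear()         # 2. clear the FIFO under test
    for fifo in fifos.values():
        fifo.append(corrupted)        # 3. corrupted message to all FIFOs
    heads = {ch: fifo.popleft() for ch, fifo in fifos.items()}
    voted = vote_byte(list(heads.values()))
    syndrome = {ch: heads[ch] != voted for ch in channels}
    return voted, syndrome


def voter_error_detection_test(channels) -> bool:
    """Check masking and syndrome reporting for every channel, bit and polarity."""
    for under_test in channels:
        for bit in range(8):
            for valid in (0x00, 0xFF):   # correct 0 read as 1, and vice versa
                corrupted = valid ^ (1 << bit)
                voted, syndrome = insert_error_and_vote(
                    channels, under_test, valid, corrupted
                )
                expected = {ch: ch == under_test for ch in channels}
                if voted != valid or syndrome != expected:
                    return False
    return True
```

The two pass criteria in the text map directly onto the two comparisons in the inner loop: the voted value must equal the valid message (masking), and the syndrome must flag only the channel under test (detection).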
Sub-block: Message Reflection Multiplexer
Parts List: U2657, U2661, U2665, U2669
Test Description: The purpose of this test is to verify that the special data paths
involved in Class 2 exchanges are operating correctly. In particular, this test
exercises the reflection path through the multiplexer which performs the second
round of the Class 2 exchange. Each reflect path is tested in turn. A message is
written to the data path FIFOs with the wrap_to_dp macro. The FIFOs which are
not under test are then cleared. This is followed by the reflect_from_X macro,
where X is either A, B, C, or D, depending on the reflect path under test. The
message is then voted and read from the Receive FIFO. If the patterns match and
no voter syndrome errors are reported, the reflect function is working properly.
Any pattern mismatches or syndrome errors are indications of faulty hardware
along the reflection data path and this information is therefore displayed in a
message on the attached monitor.
NOTE: The microcode does not correctly execute the reflect_from_X macro, so this
test cannot be performed.
Test Sequence Number: 13
7.4.4. Functional Block: Network Element Global Controller
Sub-block: The Global Controller
Parts List: U1473, U2773, U1491, U1485, U0181, U0185, U1477, U0177,
U0173, U2725, U3869, U2777, U2781, U3885
Test Description: Despite its complexity, there is very little visibility into the
operation of this functional block from the PE. A presence test can be performed
and the results read back from DP RAM by executing the macro write_pattern. The
Global Controller is considered "active" if the correct pattern is written to bytes 0
and 1 of the DP RAM. Another macro, verify_counter, causes the Global
Controller to load its counter with 255 and to count down to zero. However, the
successful execution of the presence test and the countdown test does not mean that
the Global Controller is fully operational. Massive failures of other tests implicate
the Global Controller. However, without detailed knowledge of its operation,
ascertaining which devices are faulty is not possible.
Test Sequence Number: 4
Sub-block: ISYNC Test
Parts List: All NE components except the Scoreboard
Test Description: The purpose of this test is to verify the operation of the Global
Controller in performing the initial network element synchronization, ISYNC. Even
though the synchronization taking place is trivial, because synchronization with
oneself is true by definition, the Global Controller does not "know" this fact.
Therefore, it performs the synchronization as if it were trying to synchronize the
quadruplex NE configuration, utilizing all the same logic and state changes required
in a "real" synchronization. At the end of ISYNC, the NE reports the channels
with which it is synchronized in a dual port RAM location. In this case, it should
be synchronized with itself and the three other NE channels, since the debug router
is programmed to transfer data from the simplex channel conducting the test to all
data path FIFOs, not from the external optical network. Following ISYNC, the NE
is in synchronous mode and therefore will no longer respond to commands placed
in the dual port RAM. Instead, it responds only to message sending commands
generated by writing to the Class FIFO with the desired message size and class.
The presence of messages delivered to the Receive FIFO is detected by reading the
Status Register. To return to debug mode the NE must be reset.
Test Sequence Number: 14
7.4.5. Functional Block: The Scoreboard
Sub-block: Message Size Test
Parts List: U3234, U3244
Test Description: The purpose of this test is to verify the operation of the
Scoreboard in sending packets of every allowable length. The Scoreboards of the
various NEs communicate with each other by means of System Exchange Request
Packets (SERPs). The SERPs contain information on the contents of the Class
FIFO on each NE as written by its associated PE. In this simplex mode of pseudo-
synchronous operation, the SERPs are all identical and therefore the voted
SERPs processed by the Scoreboard should always cause the message sent by the
PE to be delivered. One at a time, messages of length 16N bytes (where N has a
value from 1 to 15 inclusive) are written to the Transmit FIFO. The size and class
of the message are then written to the Class FIFO. The Status Register is polled
until it indicates the presence of a message in the Receive FIFO. The delivered
message is then read into a buffer. The contents and size of the message received are
compared with those of the message that was sent. If there is any disagreement, this result
is reported in an error message displayed on the attached monitor. The voter masks
are set so that this test is performed for every possible configuration of channels
at or above the minimum fault masking number, which in this case is three, provided that at
least this many channels are deemed to be working correctly. Thus even if only
three channels are working correctly, the quadruplex and four triplex configurations
are still tested, since this is a normal operational condition. This test is integrated
with the following test for Message Class to exercise every possible combination of
message class and message size with every voting mask that provides a fault
masking group.
Test Sequence Number: 15, the NE must be in synchronous mode following
ISYNC
Sub-block: Message Class Test
Parts List: U3234, U3244
Test Description: The purpose of this test is to verify the operation of the
Scoreboard in sending packets of every allowable class. It is similar in execution to
the Message Size Test. In fact, these two tests are merged to allow every possible
combination of class and size to be written to the Class FIFO. The contents and size
of the message received are compared with those of the message that was sent. If there is any
disagreement, this result is reported in an error message displayed on the attached
monitor. In implementing this test and the previous test for Message Size, every
possible combination of message class and message size is exercised with every
voting mask that provides a fault masking group.
Test Sequence Number: 16, the NE must be in synchronous mode following
ISYNC
7.4.6. Functional Block: The Inter-Fault Set Communication Links
Sub-block: Optical Data Links and TAXIs
Parts List: U4209, U4229, U4245, U4266 (ODLs), U4914, U4214, U4363,
U4250 (TAXIs)
Test Description: The purpose of this test is to verify the correct operation of the
devices used in the C2 optical communication network. This test requires that the
cables which usually connect the optical Transmitters to other fault sets be instead
connected to the Optical Receivers of the simplex channel itself. In this wrap-
around mode, the debug router must be programmed to route data received optically
to the data path FIFOs. The tests which exercise the data path FIFOs, ISYNC, and
the message class and size are repeated for this configuration.
Test Sequence Number: 17
7.4.7. Conclusions
While the series of NE self tests exercises much of its functionality, the testing
procedures demonstrate some deficiencies in the hardware architecture of C2's NE. The NE is
a highly integrated hardware unit. In a few cases a specific IC can be identified as the
source of an error. However, in most cases, a set of ICs is identified as the most likely
source. Even worse, in a few cases, the actual source of the error is completely
indeterminate.
These complaints are most obvious regarding the functionality of the Global Controller,
which is pervasive. Since all the self-testing depends on the full functionality of the Global
Controller, and since there is no way to fully verify the Global Controller itself, any failed
test could potentially be due to faults in the Global Controller, comprising 14 devices, or
due to bad connections between various other devices and the outputs of the Global
Controller.
In the C2 NE, there is no microcode in the Global Controller to support fully
independent standalone testing of the Scoreboard. The operation of the Scoreboard can
only be observed during the synchronous operation of the system. However, it is possible
to operate the NE in a pseudo-synchronous mode in which the NE is synchronized only
with itself. In this mode, the correct operation of the Scoreboard can be inferred from
various positive test results.
Complete end-to-end testing of the inter-fault set communication links can only be
performed in conjunction with other NEs. It is possible, however, to wrap the output of
the optical transmitter to the input of the optical receiver on the same NE, although this
requires operator manipulation of these cables. In this configuration, the operation of the
TAXI chips and the Optical Data Links can be tested.
Because the C2 NE was not designed for testability, many functions of the NE were not
sufficiently modularized or accessible to test facilities. In many cases, it was difficult to
devise tests of functional components because of the inaccessibility of the necessary
information. In those instances, it was necessary to perform numerous tests and to analyze
the entire set of test results rather than to straightforwardly exercise a single function with
a predictable result.
Finally, there is no functional level test of only the fault tolerant clock (FTC). Although
the fault tolerant clock is exercised during the scoreboard testing, the level of testing is
minimal. A fault in the FTC may or may not adversely affect the results of these tests
depending upon the nature of the fault.
7.5. AFTA Maintenance
The AFTA is being designed under the assumption that two domains of maintenance
activity will be used during its operational life. These are field maintenance and depot
maintenance.
7.6. AFTA Line Maintenance Procedure
An overview of the maintenance time line is depicted in Table 7-1. The times shown to
perform the various maintenance steps are extremely preliminary estimates. The AFTA
architectural features relevant to the maintenance discussion are shown in Figure 7-4.
Operation                  Estimated time reqd.      Comments
-------------------------  ------------------------  --------------------------------
Perform M-BIT              TBD                       Initiated via CDU;
                                                     identifies LRU, LRM, Bay on CDU
Open Bay                   30 minutes
Connect PIMA               2 minutes
Download/Print Fault Log   [illegible] minutes
Replace LRM/LRU            30 minutes for LRM,
                           60 minutes for LRU
Perform M-BIT              TBD                       Retest entire AFTA via PIMA/CDU
Close Bay                  30 minutes
Table 7-1. AFTA Maintenance Time Line
Figure 7-4. Maintenance-Related AFTA Features
M-BIT is initiated by crew chief or pilot via CDU or Portable Intelligence Maintenance
Aid (PIMA) and is interruptible at any time via AFTA reset or power cycle. Reset or power
cycle throws AFTA into I-BIT power-up sequence. M-BIT and I-BIT are disabled in
critical vehicle modes such as taxi, flight, etc.
M-BIT indicates fault status of AFTA on CDU. If faults are found, CDU indicates
LRU, LRM, and bay in which LRU can be found. The bay containing the faulted AFTA
LRM/LRU is opened by the maintenance crew. Vehicle-specific tools will be required to
achieve this. Once the vehicle bay is open, the maintenance crew member can connect a
PIMA to a port on the outside of the AFTA LRU (Figure 7-5). PIMA ports will also exist
elsewhere in the vehicle which will not require opening of vehicle bays. According to LH
doctrine, each vehicle possesses a PIMA, whose primary purpose is to assist maintenance
personnel in isolating vehicle faults and diagnosing problems; it is assumed that it will also
be used for AFTA diagnosis and maintenance. The PIMA, essentially a largish ruggedized
laptop computer, will be connected to the vehicle after each flight to interrogate system
status. For the AFTA, the PIMA is capable of displaying and printing (via an AVIS rental
car-like printer) the nonvolatile fault log maintained by the AFTA, sending the AFTA
through I-BIT and M-BIT, resetting the AFTA, and resetting the nonvolatile fault log.
The AFTA system services are also responsible for logging LRM/LRU utilization
information such as number of power cycles, elapsed power-on time, and other
information determined to be of interest to maintenance personnel. This information will be
maintained for each replaceable component, downloaded to the PIMA upon request, and
will accompany defective modules back to the depot via the PIMA printout. In addition to
its diagnosis function the PIMA will keep records, and store and display all flightline
maintenance and logistics publications [MIL-HDBK-59].
To further facilitate maintenance, an annunciator panel located on the exterior of the
LRU indicates which LRM inside is faulty. The panel contains one indicator for each LRM
slot.
Figure 7-5. AFTA LRU (labeled features: PIMA Port, Fault Annunciator Panel)
If the fault cannot be isolated to an LRM, the entire LRU must be replaced. This will
probably require tools to remove the chassis and wiring. An example of such a fault would
be failure of the intra-FCR backplane connecting the LRMs.
If the fault is isolated to one or more LRMs, as indicated by the CDU, PIMA, and
annunciator panel indicators, the maintenance technician opens the AFTA LRU by
removing the access panel. The access panel is sealed to the LRU chassis with manually
operable closures to allow this to be performed without the use of special tools. The faulty
LRM(s) is (are) removed using the injector/extractors attached to the LRMs; Allen
wrenches will be required to unseat and seat the "wedge-locks" on the LRM cold edges
from the chassis cold frame. Replacement LRMs are installed and secured using the LRM
injectors and wedge-locks.
After replacement of the LRU or LRM(s) and before replacement of the access panel,
the M-BIT is repeated, either from the CDU or the PIMA. Assuming that the M-BIT is
passed, the LRU access panel is replaced and the bay is sealed.
It is possible that I-BIT, M-BIT, and C-BIT could be used to diagnose faults down to
the integrated circuit level, and that this information could be recorded and printed out by
the PIMA and accompany the defective LRM/LRU back to the depot. However, because it
is necessary to confirm at the depot that the defective circuit was repaired before shipping it
back to the field, some ATE will be required for depot testing, assuming the AFTA
components cannot test themselves to the IC level at the depot.
8. Common Mode Fault Study
Most current fault tolerant architectures are primarily designed to tolerate random hard-
ware faults. It is assumed that faults in redundant copies of a computation occur
independently and are uncorrelated. This is an accurate assump-
tion for random hardware faults but a poor one for common mode faults such as those
caused by software faults, because all copies of the execution will suffer the fault at the
same time if the copies are identical. Software faults are but a specific class of common
mode faults. Other sources of common mode faults are generic hardware bugs or design
flaws, massive electrical upsets which overwhelm the fault tolerant power/clocking fault
containment mechanisms, etc. More rigorously, a common mode fault is defined to be a
fault that affects multiple fault containment regions simultaneously or nearly simultaneously.
Nearly simultaneous in the AFTA context means that the system has not recovered from a
fault before the next fault arrives.
Under the AFTA program, a methodology for detecting and recovering from common
mode faults in AFTA will be developed. In addition, a plan for verifying the effectiveness
of these techniques will be formulated. In the event that application-specific information
for the study is needed, the helicopter TF/TA/NOE application will be used as a context for
the common-mode fault tolerance study.
8.1. Objective
The objective of this study is to develop a comprehensive methodology for reducing the
probability of failure of synchronous redundant Byzantine resilient computing systems due
to common mode faults. The methodology is to include techniques to avoid, remove, and
tolerate common mode hardware and software faults, the identification of means to verify
the effectiveness of the common mode fault avoidance/removal/tolerance (CMFA/R/T)
techniques, and a timetable for inserting or developing appropriate CMFA/R/T technology
into AFTA.
8.2. Approach
The study comprises four phases. Phases 1 and 2 comprise technology surveys, phase
3 comprises evaluation and planning, and phase 4 comprises initiation of the plan.
8.3. Enumeration of Common Mode Fault Sources
Common mode faults and their sources are extremely diverse. They can be classified in
the same way that all faults are classified in the document "Dependability: Basic Concepts
and Terminology" [Lap90]. They can be classified according to three main viewpoints
which are their nature, their origin and their persistence. The three viewpoints are not mu-
tually exclusive.
8.3.1 Classification by Nature
Faults may be accidental in nature, i.e., they appear or are created fortuitously. Or they
may be intentional in nature, i.e., they are created deliberately.
For AFTA, intentional faults, e.g. Trojan horses, time bombs, viruses, will not be con-
sidered since they are related to secure systems. Security is currently not a requirement for
AFTA applications although it may be at some future point in time.
8.3.2 Classification by Origin
This is further divided into three viewpoints which are not necessarily mutually exclu-
sive:
1. Phenomenological Causes
- physical faults, which are due to adverse physical phenomena;
- human-made faults, which result from human imperfections.
2. System Boundaries
- internal faults, which are those parts of the system's state which, when invoked
by the computation activity, will produce an error;
- external faults, which result from system interference caused by its physical
environment, or from system interaction with its human environment.
3. Phase of Creation
- design faults, which result from imperfections that arise during the develop-
ment of the system (from requirement specification to implementation), subse-
quent modifications, or the establishment of procedures for operating or main-
taining the system;
- operational faults, which appear during the system's exploitation.
8.3.3 Classification by Persistence
1. Permanent Faults
- their presence is not related to internal conditions such as computation activity
or external conditions such as the environment.
2. Temporary Faults
- their presence is related to such conditions and as such they are present for a lim-
ited amount of time.
Since intentional faults are excluded from the current scope of work, there are only 16
possible sources of faults that must be considered. These are all the possible combinations
of the remaining four viewpoints. Of these, the physical, internal, operational faults can be
tolerated by using hardware redundancy. All other faults can affect multiple fault contain-
ment regions simultaneously. These are the sources of common mode faults. However,
only some of these fault classes are meaningful. These are tabulated in Table 8-1. Of
these, the interaction faults which arise from the interaction of the computer system with its
human environment, e.g. an operator, will not be considered here since the man-machine
interface is outside the scope of the AFTA's use as an embedded control system.
Phenomenological      System              Phase of             Persistence           Common Mode
cause                 Boundary            Creation                                   Fault Label
Physical  Human-made  Internal  External  Design  Operational  Permanent  Temporary
--------  ----------  --------  --------  ------  -----------  ---------  ---------  -------------------------
   X                               X                   X                      X      Transient (External) CMF
   X                               X                   X           X                 Permanent (External) CMF
              X          X                  X                                 X      Intermittent (Design) CMF
              X          X                  X                      X                 (Permanent) Design CMF
              X                    X                   X                      X      Interaction CMF
Table 8-1. Classification of Common Mode Faults
Using this methodology, then, only 4 sources of common mode faults need to be con-
sidered for AFTA:
1. Transient (External) Faults, which are the result of interference to the system from its
physical environment such as lightning, High Energy Radio Frequencies (HERF),
heat, etc.
2. Permanent (External) Faults, which are the result of system interference caused by its
operational environment such as heat, sand, salt water, dust, etc.
3. Intermittent (Design) Faults, which are introduced due to imperfections in the require-
ments specifications, detailed design, implementation of design and other phases lead-
ing up to the operation of the system. These faults manifest themselves only part of the
time.
4. (Permanent) Design Faults, which are introduced during the same phases as intermittent
faults, but manifest themselves permanently.
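The counting argument behind this classification (four two-valued viewpoints giving 16 combinations, of which five are labeled and four are retained) can be checked with a short enumeration. This is a sketch only: the assignment of viewpoint values to each label is an assumption inferred from the definitions in Section 8.3, and the function names are hypothetical.

```python
from itertools import product

# The four viewpoints remaining once intentional faults are excluded.
VIEWPOINTS = {
    "cause": ("physical", "human-made"),
    "boundary": ("internal", "external"),
    "phase": ("design", "operational"),
    "persistence": ("permanent", "temporary"),
}

# Assumed mapping of (cause, boundary, phase, persistence) to the labeled
# common mode fault classes; inferred from the definitions, not quoted.
CMF_LABELS = {
    ("physical", "external", "operational", "temporary"): "Transient (External) CMF",
    ("physical", "external", "operational", "permanent"): "Permanent (External) CMF",
    ("human-made", "internal", "design", "temporary"): "Intermittent (Design) CMF",
    ("human-made", "internal", "design", "permanent"): "(Permanent) Design CMF",
    ("human-made", "external", "operational", "temporary"): "Interaction CMF",
}


def all_fault_classes():
    """All combinations of the four two-valued viewpoints."""
    return list(product(*VIEWPOINTS.values()))


def tolerated_by_hardware_redundancy(fault_class):
    """Physical, internal, operational faults are handled by redundancy."""
    cause, boundary, phase, _persistence = fault_class
    return (cause, boundary, phase) == ("physical", "internal", "operational")


def considered_for_afta():
    """Labeled classes, excluding interaction faults (out of scope)."""
    return [label for label in CMF_LABELS.values() if label != "Interaction CMF"]
```

Calling all_fault_classes() yields the 16 combinations mentioned in the text, and considered_for_afta() yields the four classes listed above.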
If the relative likelihoods of these four classes of common mode faults were known,
one could apportion the efforts in dealing with them appropriately. However, the models
to predict the occurrence of design faults do not exist or are not mature enough to be of any
practical value to AFTA. Similarly, the rates of occurrence of transient faults and perma-
nent external faults are very much dependent upon the operational environment. Therefore,
the relative rates of occurrence of the four classes of AFTA common mode faults cannot be
predicted with any certainty. Experience suggests that all of these are sufficiently likely to
be of concern to the designers of AFTA.
8.4. Enumeration of Common Mode Fault Avoidance, Removal, Tolerance
Techniques
There is a wide range of techniques available today to prevent introduction of CMFs
into a Byzantine Resilient fault tolerant computer, to remove CMFs before such a computer
is put into operational use, and to detect and recover from CMFs that do occur during op-
erational use. As such, the techniques and tools can be classified into three major cate-
gories: Fault Avoidance, Fault Removal and Fault Tolerance. These classifications are de-
scribed in this section while their effectiveness and suitability for inclusion in AFTA will be
discussed later.
8.4.1 Common Mode Fault Avoidance
These techniques and tools are used from the requirements specifications to the design
and implementation phases and result in fewer CMFs being introduced into the computer
system. An unprioritized list of these techniques and tools, without regard to their effec-
tiveness in preventing CMFs or applicability to AFTA, is as follows.
8.4.1.1. Formal Methods
These are mathematically based techniques for specifying, developing, and verifying
computer systems with strong emphasis on consistency, completeness and correctness of
system properties. Formal methods have been applied at various levels of specification and
design and to a diverse set of hardware, software and algorithmic parts of fault tolerant
computers. Some of the example applications include the following.
Microprocessor Design: Viper [Cohn88], FM8501 [Hun86], Mini Cayuga [Sri90];
Algorithm Specification and Implementation: Interactive Consistency and Oral Mes-
sages 1 [Bev90], [Bic90];
Fault Tolerant Clock Synchronization: [Da173], [Lam85], [But88], [Rus89];
Specialized Hardware: Communicator-Interstage [Klj88];
Software: Real Time Kernel [Spi90], Ada Formal Verification [Gua90], Formal
Specification [Goe91];
Reliable Computing Platform: [DiV91].
8.4.1.2. Formally Verified Components
Use of hardware and software modules that have previously been verified or have been
developed using formal methods can reduce the incidence of CMFs in these parts of a fault
tolerant computer. Examples of such components include microprocessors such as
VIPER, VIPER2, FM8502, Mini Cayuga, Floating Point Units, and Real Time Kernels.
8.4.1.3. Mature Components
Use of hardware and software modules that have been widely used over a long time
period and whose performance has been monitored and analyzed for correctness of
operation can also cut down the incidence of design faults. Examples of such hardware modules
include popular microprocessors such as Motorola 68020, Intel 80386, etc., floating point
coprocessors, memory management units, and Ethernet and VMEbus controllers. Exam-
ples of mature software modules include Ada Run Time System and CAMP (Common Ada
Missile Packages) software libraries [CAMP]. CAMP products have been developed under
contract to the US Air Force Armament Test Laboratory, Eglin Air Force Base, Florida and
are available from Data & Analysis Center for Software [DACS]. CAMP products consist
of Parts, Armonics Benchmarks, and Parts Engineering System (PES). The CAMP Parts
are 444 reusable Ada components organized into 35 Top-Level Computer Software Com-
ponents (TLCSCs) which contain 137,000 source lines of Ada code (including comments,
package specifications, package bodies, and test code). The CAMP Armonics Benchmarks
are used to evaluate Ada and processor implementations in the armonics domain. The
benchmarks represent typical armonics applications and include missile operational parts as
well as support parts from the mathematical domain. The tests establish the "correctness"
of compiler implementations and measure performance in size and speed of generated code.
The CAMP PES is a catalog that provides a means of identifying and retrieving reusable
software parts.
8.4.1.4. Design Automation Tools
These are tools and techniques that can help automate parts of the hardware and soft-
ware design cycle. By replacing a labor intensive design process with automated tools, the
incidence of human errors can be reduced. In the software arena, more than 50 different
CASE (Computer Aided Software Engineering) tools are available that provide different
levels of automated software generation. The Draper CASE tool has been used, among
other applications, to produce Boeing 737 autoland code in Ada starting from a high level
control law specification. The Ada code was compiled and integrated with the existing
system software on the AIPS Fault Tolerant Processor without any modifications.
In the hardware arena, VHDL (VHSIC Hardware Description Language) is becoming
widely available to describe hardware designs at various levels of abstraction, from a high
level functional description all the way down to the gate level.
A suite of tools, generally known as Silicon compilers, can be used to convert VHDL
or other high level design descriptions through various levels of detailed hardware design,
right down to the Silicon implementation with some help from the human designer. One
such suite of tools is the Silicon 1076 compiler from LSI Logic, Inc. which interfaces with
a VHDL design description at the top and produces a silicon chip at the end of the design
cycle.
8.4.1.5. Architectural Considerations
Human errors are more likely when dealing with complex systems and unconventional
concepts. In a fault tolerant computer, concepts that add to the design complexity of a con-
ventional Von Neumann uniprocessor computer architecture are redundancy management
and distributed and parallel processing. Examples of the additional complexities that a de-
signer faces are: fault containment, error containment, synchronization of redundant pro-
cesses, communication between redundant processes, synchronization of and communica-
tion between distributed/parallel processes, all of these in the presence of one or more
faults, detection, isolation and recovery from faults, and so on.
If the design complexity can be reduced then the incidence of human errors can be re-
duced. Some of the fault tolerance concepts can be stated simply and precisely using a
mathematical formalism. These include the requirements for synchronization, agreement
and validity. Other concepts that can be stated precisely include requirements for fault
containment and error containment. Because of their simplicity, fault tolerant computers that
are based on these concepts and implement these requirements are likely to contain fewer
design errors. (It should be noted here that not all fault tolerant computers implement these
requirements.) Basing designs on precisely stated requirements has an added benefit in the
design verification and fault removal phases.
Another architectural consideration is the hiding of the design complexity. For exam-
ple, certain architectures implement fault tolerance in such a manner that the virtual architec-
ture apparent to the applications programmer and the operating system programmer appears
to be that of a conventional non-redundant computer. The complexities of a redundant ar-
chitecture are made visible only to the tasks that must deal with detection and isolation of
faults and recovery from faults.
Similarly, the complexities of distributed/parallel processing can be hidden from most
of the software designers by providing layered communication protocols such as the ISO
OSI models.
8.4.1.6. Design Diversity
Design diversity is the concept of implementing different layers of a redundant system
using different designs starting from a common set of specifications. The concept can be
applied to hardware, software, programming language, design development environment
and other design activities. This approach can potentially eliminate many common mode
design faults since each redundant layer uses a different design. Some design faults such
as those that result from an incorrect interpretation of ambiguous specifications could still
find their way in multiple or all designs. Design diversity is listed here as a fault avoidance
rather than a fault tolerance technique since it purports to confine each design fault to a sin-
gle fault containment region, thereby avoiding a common mode fault.
When redundant hardware and/or software elements are implemented using different
designs, bit-wise exact consensus cannot be guaranteed between the outputs of redundant
processors. However, it is still possible to provide a Byzantine resilient core fault tolerant
computer in which design diversity is used for applications programs.
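The loss of bit-wise exact consensus under design diversity can be illustrated with a small example. The sketch below is not taken from the AFTA design: it assumes three hypothetical diverse versions that return slightly different floating point results, and it uses mid-value selection with a tolerance check as one commonly used stand-in for an approximate agreement scheme. All names and numeric values are illustrative.

```python
def exact_vote(outputs):
    """Return the value held by a strict majority, or None if there is none."""
    for candidate in outputs:
        matches = sum(1 for o in outputs if o == candidate)
        if 2 * matches > len(outputs):
            return candidate
    return None


def mid_value_select(outputs):
    """Return the median of an odd number of outputs."""
    ordered = sorted(outputs)
    return ordered[len(ordered) // 2]


def within_tolerance(outputs, selected, tol):
    """Flag, per version, whether its output is within tol of the selection."""
    return [abs(o - selected) <= tol for o in outputs]


# Hypothetical results from three diverse implementations of one computation.
diverse_outputs = [1.0000001, 0.9999999, 1.0]
```

Here exact_vote(diverse_outputs) finds no majority even though no version is faulty, whereas mid_value_select(diverse_outputs) returns 1.0 and within_tolerance(..., tol=1e-6) accepts all three versions.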
8.4.1.7. Use of Standards
Over the years, a number of standards have been developed for the design of computer
systems. Although the primary motivation for the development of standards is ease of in-
teroperability, logistics, maintainability, reduced cost, and so on, one of the side benefits of
using standards is the reduction of design errors. Standards usually result in detailed, pre-
cise, and stable specifications that can be adhered to in the design phase and verified against
in the verification phase. The design errors that are normally introduced due to ambiguous
or changing specifications can potentially be eliminated by the use of standards.
Examples of standards include bus protocols such as MIL-STD 1553, PI bus, High
Speed Data Bus and processor Instruction Set Architectures such as MIL-STD 1750 and
more recently the JIAWG ISAs such as the Intel 80960 and the MIPS R3000. Software
standards include the MIL-STD-1815, more commonly known as the Ada language. The
advantages of mature, precise and detailed specifications are not limited to military stan-
dards alone. Commercial products, though not standards, per se, can become de facto
standards. VMEbus is such an example of a backplane bus standard.
8.4.1.8. Good Software Engineering Practices
Common mode faults caused by software are probably the largest single source of fail-
ure in a redundant computer system. Many software errors can be avoided by following
well established software engineering practices. These practices include adherence to the
waterfall software development methodology, that is, an orderly development of require-
ments, specifications, detailed design, code, unit test, module test, integration, and system
test, with traceability of requirements from beginning to end. Rigorous configuration con-
trol and documentation, such as that specified by MIL-STD 2167A, are also considered an
integral part of this methodology. Other good software engineering practices include soft-
ware quality assurance reviews, use of Higher Order Languages such as Ada, modular
code, code reuse, and layering or hierarchical structuring of code such as the 7 layers of the
Open Systems Interconnect model for intercomputer communications.
It should be noted here that many of these software engineering practices overlap with
other fault avoidance techniques discussed in this section that can be used to avoid not just
software but also other common mode faults. For example, software quality assurance re-
views can be considered a part of design reviews which are applicable to hardware as well
as software. Similarly, code reuse overlaps with use of mature components, and so on.
Since software development is a labor intensive process, one of the good software
engineering practices deals with the training and qualification of people. It is important to
assign software development duties to people who have been fully trained in the desired
software language, tools, and development methodologies and possess an appropriate type
and amount of experience in the relevant activities.
One aspect of personnel qualification that might be considered controversial is the
"quality" of people. It has been our experience that not all software designers who are
equally trained and experienced produce equally "good" software where quality of software
is measured by the number of errors. The difference between the best and the worst designers
can be an order of magnitude in the number of errors. Therefore, the track record of
software developers, as much as their training and experience, must be given consideration if the
goal is to produce high quality software.
8.4.1.9. Conservative Hardware Design Practices
Conservative hardware design practices can help keep a healthy margin of safety between
the worst operational conditions and the actual limits of operation of the computer
system. The MIL-SPEC 883C operational temperature range, for example, is -55°C to
+125°C even though such extremes are probably never reached in actual use. Military de-
sign guidelines also call for derating of electronic parts by a certain margin depending on
the environment. For example, the current outputs of drivers, wattage rating of resistors or
maximum voltages of capacitors may be reduced by 20 to 50 percent to keep the operational
parameters well within the maximum design values. Use of military qualified parts, as
specified by MIL-STD 38510 or SMDs (Standard Military Drawings), ensures that the part
designs meet the functional and electrical specifications and that the parts have been manu-
factured and screened as specified by MIL-SPEC 883C.
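The derating arithmetic described above can be illustrated with a short sketch. The function name `derated_limit` and the sample part ratings are hypothetical; only the 20 to 50 percent derating range comes from the text.

```python
def derated_limit(rated_max: float, derating_fraction: float) -> float:
    """Return the largest operating value allowed after derating a part."""
    if not 0.0 <= derating_fraction < 1.0:
        raise ValueError("derating fraction must be in [0, 1)")
    return rated_max * (1.0 - derating_fraction)


# Hypothetical parts: a 0.25 W resistor derated by 50 percent and
# a 50 V capacitor derated by 20 percent.
resistor_limit_w = derated_limit(0.25, 0.50)
capacitor_limit_v = derated_limit(50.0, 0.20)
```

A design check would then compare the worst-case operating value of each part against its derated limit rather than against the rated maximum.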
Similar to the good software engineering practices discussed in the previous section,
the hardware logic can also be designed in accordance with conservative design rules de-
veloped over the years. Some of the example rules are the use of synchronous designs and
the avoidance of metastable states. Timing verification on synchronous designs is much
easier than for asynchronous designs. The worst case performance of a synchronous de-
sign always occurs at the longest propagation delay; a properly designed synchronous cir-
cuit will still function using delta (approaching zero) delays. Synchronous circuits typically
depend on inputs which are edge-sensitive. Edge-sensitive inputs should only be driven by
a signal that is guaranteed to be free of unwanted glitches. Acceptable signals are clocks,
outputs from registers, or flip-flops (not latches), and combinatorial circuits specifically
designed to be free of glitches under all conditions.
The state of a synchronous design should be predictable at all times following power-
up or reset. All state elements, except simple data registers, should be initialized by the re-
set process. Trap states in a state machine can be avoided by defining unconditional transi-
tions from unused states into the set of defined states. While the machine should not nor-
mally enter an unused state, prevention of trap states ensures that if it does enter such a
state, through a metastability problem or a single-event upset, the system will eventually re-
cover.
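The trap-state rule can be sketched in software as follows. This is an illustrative model only: the three-state machine, its state names and the 2-bit encoding are assumptions made for the example, not part of the AFTA design.

```python
from enum import IntEnum


class State(IntEnum):
    # Three defined states held in a 2-bit register; code 3 is unused.
    IDLE = 0
    LOAD = 1
    RUN = 2


def next_state(current: int, start: bool) -> State:
    """Next-state function with an unconditional exit from the unused code."""
    if current == State.IDLE:
        return State.LOAD if start else State.IDLE
    if current == State.LOAD:
        return State.RUN
    if current == State.RUN:
        return State.IDLE
    # Unused encoding (for example, reached after an upset):
    # transition unconditionally back into the set of defined states.
    return State.IDLE
```

Whatever value the register holds, one clock later the machine is in a defined state, so an upset cannot leave it stuck.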
Metastability is another cause of common mode hardware faults that can be easily
avoided by adhering to conservative design rules. Metastability is caused by inputs changing within the setup and
hold time of a flip-flop and can cause finite-state machines to enter unwanted or undefined
states. Asynchronous inputs (those that might change within the setup and hold time) are
easily synchronized by passing the signal through a synchronizing flip-flop stage.
8.4.1.10. Shielding, Packaging and Thermal Management
An appropriate amount and type of shielding and packaging can keep the HERF, Single
Event Upsets (SEUs), and lightning from interfering with the correct operation of the computer
system. Similarly, proper packaging and cooling can keep the dust, saltwater, sand
and other foreign matter outside and also dissipate the heat generated internally to the outside.
Proper packaging techniques can also assure that the hardware will survive the expected
shock and vibration environment. MIL-E-5400 provides the military specifications
for thermal management and shock and vibration design requirements.
8.4.2 Common Mode Fault Removal
Faults that slip past the design process can be found and removed at various stages
prior to the computer system becoming operational. The fault removal techniques and tools
include the following.
8.4.2.1. Design Reviews
Traditionally, informal design reviews and code walk-throughs between engineers and
peers as well as formal design reviews such as PDR (Preliminary Design Review), CDR
(Critical DR), SRR (Software Requirements Review) by supervisors and managers have
been used to uncover gross design and implementation errors. The management reviews
also check the compliance of the design with the intent of the high level requirements and
specifications, which may not always be stated unambiguously and precisely.
8.4.2.2. Simulation
Simulations have been used at various levels to check compliance with design goals.
Functional and timing simulations have been a must before ASICs (Application Specific
Integrated Circuits) can be fabricated. VHDL can now be used to perform behavioral simulations
before a function is even translated into an electronic circuit. VHDL can also be
used to perform more detailed, lower level simulations as behavioral boxes are replaced by
detailed circuit designs, all the way down to transistor characteristics.
8.4.2.3. Testing
For software, testing rather than simulation has been the traditional technique for un-
covering design faults. Unit tests, module tests, functional tests, and code reading have
been used extensively in the past to verify correctness of software. Additionally, structural
analysis tools can also be used to analyze static software behavior.
More recently, a novel approach to testing software, called back-to-back testing, has
been advanced. It involves comparing outputs of the program under test against a func-
tionally identical program that has been produced from the same specifications as the target
program but by a different programming team and/or in a different language. This is simi-
lar to N-version programming in that multiple versions of a program are produced. How-
ever, after the testing phase is completed, only one of the versions is chosen for operational
use. Any miscompares at the outputs can be traced to a design or coding error in one of the
versions, an incorrect or different interpretation of specifications by different programming
teams or a difference due to round-off errors.
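A back-to-back test harness can be sketched as below. The two `mean_version_*` functions stand in for independently written programs built from the same specification, and the tolerance values are assumptions chosen so that round-off differences are not reported as miscompares.

```python
import math


def mean_version_a(samples):
    """Version A: sum the samples, then divide."""
    return sum(samples) / len(samples)


def mean_version_b(samples):
    """Version B: independently written running-average form."""
    avg = 0.0
    for n, x in enumerate(samples, start=1):
        avg += (x - avg) / n
    return avg


def back_to_back(version_a, version_b, test_inputs, rel_tol=1e-9):
    """Run both versions on every input and collect the miscompares."""
    miscompares = []
    for inputs in test_inputs:
        out_a = version_a(inputs)
        out_b = version_b(inputs)
        # A tolerance keeps pure round-off differences from being flagged.
        if not math.isclose(out_a, out_b, rel_tol=rel_tol, abs_tol=1e-12):
            miscompares.append((inputs, out_a, out_b))
    return miscompares


test_inputs = [[1.0, 2.0, 3.0], [0.1, 0.2, 0.3], [10.0]]
clean_run = back_to_back(mean_version_a, mean_version_b, test_inputs)
```

Each reported miscompare still has to be traced by hand to a coding error in one version or to a differing interpretation of the specification.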
Testing with oscilloscopes, logic analyzers and probes has been the traditional hardware
debugging technique. With the advent of VLSI ASICs, very little visibility can be
obtained inside these chips with these tools. Designing chips with testability such that all
the internal nodes can be tested by applying test vectors (test inputs) and observing the out-
puts, has become a field in its own right. Scan-path is now a standard technique to make
ASICs testable. Similarly, automatic generation of test vectors for ASICs is significantly
more advanced than automatic generation of test inputs for software modules.
8.4.2.4. Fault Injection
Insertion of faults in an otherwise fault-free computer system that is designed to tolerate
faults is a powerful technique to exercise redundancy management hardware and software
that is specialized, error-prone, difficult to test and not likely to be exercised under normal
conditions, i.e., likely to stay dormant until a real fault occurs. Fault insertion techniques
can also be used to operate the system in various degraded modes which are expected to be
encountered in operational life of the system. Degraded mode operation stresses not only
fault handling and redundancy management aspects but also task scheduling, task and
frame completion deadlines, workload assignment to processors, inter-task communica-
tion, flow control, and other performance-related system aspects. Fault insertion exposes
the weaknesses in the hardware and software design, the interactions between hardware
and software, and the interactions between redundancy management and system perfor-
mance. It is an accelerated form of testing the hardware, software and the system, analo-
gous to "bake and shake" testing of hardware devices.
Different types of tools have been developed to insert faults at various levels to stress
the fault tolerance capabilities of computer systems. Draper has successfully used a pin-level
hardware fault injector for the past 10 years to uncover subtle design errors in the
FTMP, FTPs and AIPS. A similar tool has also been developed by Laboratoire
d'Automatique et d'Analyses des Systemes (LAAS) of the French National Center for Scientific
Research (CNRS) in Toulouse and used to evaluate a railway signal control computer.
Carnegie Mellon University has developed FIAT to insert memory faults. A memory
mutation technique has also been used independently at Draper to stress the AIPS FTP redundancy
management software. Chalmers University in Sweden has experimented with
Californium as a radiation source to test a self-checking microprocessor pair.
Fault insertions at higher levels such as module, link, and fault containment region have
also been used at Draper for the purposes of design verification.
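A memory-mutation style campaign can be sketched against a simple bitwise majority voter. This is not the Draper or FIAT tooling; the triplex arrangement, the 16-bit word width and all function names are assumptions made for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three channel words."""
    return (a & b) | (a & c) | (b & c)


def inject_bit_flip(word: int, bit: int) -> int:
    """Mutate one bit of a word to emulate a memory fault."""
    return word ^ (1 << bit)


def run_campaign(value: int, width: int = 16):
    """Flip every bit of every channel, one at a time, and list unmasked faults."""
    unmasked = []
    for channel in range(3):
        for bit in range(width):
            copies = [value, value, value]
            copies[channel] = inject_bit_flip(copies[channel], bit)
            if majority_vote(*copies) != value:
                unmasked.append((channel, bit))
    return unmasked


unmasked_single_faults = run_campaign(0xA5C3)
```

An empty result shows that every single-channel fault is masked. Injecting the same flip into two channels at once defeats the voter, which is the behavior a common mode fault would produce.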
8.4.2.5. Discrepancy Reports
A Discrepancy Report (DR) is filed any time anomalous or unexpected behavior of
hardware, software or the system is encountered. A DR deals with the observed symptoms
which may eventually be traced to one or more specific design, coding, manufacturing or
other problems. A DR log can be started as soon as the first phase of testing is begun.
This will normally be a unit test for software and module test for hardware. Alternatively,
the log may be deferred until a unit/module has passed an acceptance test to the satisfaction
of the designer. Delaying the log to this point can cut down the paperwork associated with
the relatively large number of error symptoms which is a normal part of initial debugging.
The risk with delaying the start of the log is that if the designer is not methodical in resolving
all the observed discrepancies then causes of these errors may be left in the unit/module
if the unit/module acceptance test does not produce the error.
In any case, the DRs are logged through all subsequent phases of testing, integration,
and verification and validation activities. Once a DR is traced to an underlying cause or
causes, and the problems are successfully resolved, the DR can be closed out after recording
the cause(s) and the fixes made. Figure 8-1 shows a typical format for a Discrepancy
Report. It should, at a minimum, describe originator, date, problem category, software
identification (if known), hardware identification (if known), document identification, other
related DRs, description of the symptoms or occurrence of an event, conditions, and conjecture
on possible causes. Other fields in the DR form that need to be eventually filled out
include analysis of cause and effect, recommended solution, disposition, and verification
and close-out. Each DR must also be signed by responsible engineers and managers.
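The minimum DR content listed above maps naturally onto a record type. The sketch below is a hypothetical data model, not the Draper DR database; the field names are assumptions derived from the fields named in the text.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DiscrepancyReport:
    number: int
    originator: str
    filed: date
    category: str            # e.g. "COMPUTER PROGRAM/DATA", "HARDWARE", "DOCUMENT"
    description: str         # observed symptoms or occurrence
    conditions: str = ""
    possible_cause: str = "" # conjecture at filing time
    related_drs: List[int] = field(default_factory=list)
    cause: Optional[str] = None   # filled in once analysis is complete
    fix: Optional[str] = None
    closed: Optional[date] = None

    def close_out(self, cause: str, fix: str, when: date) -> None:
        """Close the DR only after both the cause and the fix are recorded."""
        if not cause or not fix:
            raise ValueError("cause and fix must be recorded before close-out")
        self.cause, self.fix, self.closed = cause, fix, when


def open_reports(log):
    """Return the DRs that have not yet been closed out."""
    return [dr for dr in log if dr.closed is None]
```

Listing the open reports at delivery time gives a direct view of symptoms that were observed but never explained.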
It should be pointed out here that a Discrepancy Report is more comprehensive than a
"Bug Report" that is normally associated with the software testing phase. Typically, a Bug
Report is filed when a bug, i.e., a cause of the software error is discovered. A DR pre-
cedes the discovery of the cause. It is filed when an anomalous behavior is discovered
whether or not its cause can be immediately determined. Furthermore, it is not limited just
to software but applies to hardware and the system as well. A further important distinction
between a DR and a bug report is that a DR may eventually lead to the discovery of many
related errors. Typically, in early phases of the system integration several hardware and
software design errors, manufacturing defects, and subtle interactions conspire to produce
bizarre system behavior. As each error or defect is found and corrected, symptoms change
and become less or more bizarre due to the masking effect of one error on another. Sometimes
the error symptoms disappear altogether due to a subtle change in timing of events.
This is where DRs become quite useful in systematically accounting for abnormal behavior.
For example, if at the delivery time, the system passes all the acceptance tests but not all the
DRs have been successfully resolved, it implies that there could still be some latent errors
in the system.
Discrepancy Reports bring a certain amount of discipline to resolving the observed
problems. If the procedures for logging DRs are followed rigorously by all the engineers,
programmers, and technicians working on the program, then the probability of removing
all the known common mode faults is increased considerably. One no longer has to rely on
the memory or methods of an individual designer or tester to keep track of the known problems.
DRs can be used to collect statistical data necessary to predict the software reliability
growth and other software reliability related metrics. The DR shown in Figure 8-1 can be
expanded to record the data that will be necessary to plot the number of software errors discovered,
the mean time between software error occurrence, relationship of errors to software
units and lines of code, and so on. Under an internal R&D program, Draper is developing
a system for automating the recording and searching of the DR database.
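As a small illustration of the kind of statistic a DR log could support, the sketch below computes the mean interval between error occurrences from DR filing dates. The function name and the sample dates are hypothetical, and calendar days are used as the time base purely for simplicity.

```python
from datetime import date


def mean_days_between(occurrences):
    """Mean number of days between consecutive error occurrences."""
    ordered = sorted(occurrences)
    if len(ordered) < 2:
        return None  # not enough data to form an interval
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical filing dates taken from a DR log.
sample = [date(1992, 1, 9), date(1992, 1, 1), date(1992, 1, 3)]
mean_gap = mean_days_between(sample)
```

A reliability growth plot would track how this interval lengthens as errors are found and removed.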
DISCREPANCY REPORT
The Charles Stark Draper Laboratory, Inc.
Cambridge, Massachusetts 02139

DR NUMBER:               PROJECT:               SHEET ___ OF ___

 1. ORIGINATOR:
 2. ORGANIZATION:
 3. DATE:
 4. TELEPHONE #:
 5. PROBLEM CATEGORY:
      [ ] COMPUTER PROGRAM/DATA   [ ] HARDWARE   [ ] DOCUMENT   [ ] OTHER
 6. SOFTWARE IDENTIFICATION (if known):
      CPCI:     CPC:     UNIT:     VERSION/REVISION:
 7. HARDWARE IDENTIFICATION (if known):
 8. DOCUMENT IDENTIFICATION:
 9. RELATED DR:
      [ ] SUPERSEDED   [ ] MODIFIED
10. DESCRIPTION (OCCURRENCE):
11. CONDITIONS:
12. POSSIBLE CAUSE:
13. ANALYSIS OF CAUSE AND EFFECT:
      SIGNATURE          ORG          DATE          TEL.
14. RECOMMENDED SOLUTION:
      SIGNATURE          ORG          DATE          TEL.
15. DISPOSITION:
      [ ] NO ACTION REQUIRED   [ ] ECP NO.   [ ] ECR NO.   [ ] OTHER
      SIGNATURE          ORG          DATE          TEL.
16. VERIFICATION AND CLOSE OUT:
      RESPONSIBLE ENGINEER          DATE
      TECHNICAL MANAGER             DATE

Figure 8-1. Typical Discrepancy Report Format
8.4.2.6. Automated Theorem Provers
Automated theorem provers or mechanical checkers have been used to argue the com-
pliance of an implementation with a set of specifications. In the course of showing the cor-
respondence from one level to the next, one develops a set of arguments to convince the
ATP that one formal statement follows from another. This usually leads to uncovering the
errors in the correctness of implementation.
8.4.3 Common Mode Fault Tolerance
Common mode faults that are not removed prior to operational use of a computer sys-
tem may eventually manifest themselves in the field. At this point the only recourse is to
detect the occurrence of such a fault and take some corrective action. These are fault toler-
ance techniques and following is an unprioritized list of such methods.
8.4.3.1. Common Mode Fault Detection
Before a recovery procedure can be invoked to deal with common mode faults in real
time, it is necessary to detect the occurrence of such an event. Many ad hoc techniques
have been developed over the years to accomplish this objective. Most of these techniques
can also be used prior to operational use of the system to eliminate faults. The difference is
that in the fault removal phase, detection of a fault leads to some trap in the debugging envi-
ronment while in the operational phase it will lead to a recovery routine. Similarly, fault
removal techniques discussed in Section 8.4.2 can also be used to aid in the task of detect-
ing faults in real time, albeit with a high penalty in performance.
a. Watchdog Timers
Watchdog timers can be used to catch both hardware and software wandering into
undesirable states. They are typically used in the Processor Element but can also be em-
ployed in the Network Element. Neither hardware watchdog nor task timers unambigu-
ously indicate the occurrence of a common mode fault. The syndrome in the failed channel
of a physical fault is no different from that of a common mode fault. The syndromes
across redundant channels must be compared in real time to determine the cause.
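The watchdog mechanism can be modeled in a few lines. This is a behavioral sketch only; the class name, the tick-based time base and the latching behavior after expiry are assumptions, not a description of the AFTA hardware watchdog.

```python
class WatchdogTimer:
    """Countdown timer that must be restarted periodically by healthy software."""

    def __init__(self, timeout_ticks: int):
        if timeout_ticks <= 0:
            raise ValueError("timeout must be positive")
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.expired = False

    def kick(self) -> None:
        """Restart the countdown; has no effect once the timer has expired."""
        if not self.expired:
            self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one clock tick and report whether the timer has expired."""
        if not self.expired:
            self.remaining -= 1
            if self.remaining <= 0:
                self.expired = True
        return self.expired
```

Expiry only says that the software stopped restarting the timer in this channel. Deciding whether the cause is physical or common mode still requires comparing the other channels.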
b. Hardware Exceptions
Hardware exceptions such as illegal address, illegal opcode, access violation, privilege
violation, etc. are all indications of a malfunction. Again, syndromes across redundant
channels must be correlated to distinguish between physical and common mode faults.
c. Ada Run Time Checks
Ada provides numerous run time checks such as type checks, range constraints, etc.
that can detect malfunctions in real time. Additionally, the user can define exceptions and exception
handlers at various levels to trap abnormal or unexpected program/machine behavior.
d. Memory Management Unit
The Memory Management Unit can be programmed to limit access to memory and control
registers by different tasks. Violations can be trapped by the MMU and trigger a recovery
action.
e. Acceptance Tests
This is a very broad term and can be applied to applications tasks and various compo-
nents of the operating system such as the task scheduler and dispatcher. The results of the
target task are checked for acceptability using some criteria which may range from a single
physical reasonableness check such as pitch command not exceeding a certain rate to an
elaborate check of certain control blocks to ascertain whether the operating system sched-
uled all the tasks in a given frame.
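A physical reasonableness check of the kind mentioned above can be sketched as a rate limit on successive pitch commands. The function name, the frame time and the rate limit used in the example are hypothetical values chosen for illustration.

```python
def pitch_rate_acceptable(previous_cmd_deg: float,
                          new_cmd_deg: float,
                          frame_s: float,
                          max_rate_deg_per_s: float) -> bool:
    """Accept the new command only if the implied pitch rate is within limits."""
    if frame_s <= 0.0:
        raise ValueError("frame time must be positive")
    rate = abs(new_cmd_deg - previous_cmd_deg) / frame_s
    return rate <= max_rate_deg_per_s


# Hypothetical 20 ms frame and 20 deg/s limit.
small_step_ok = pitch_rate_acceptable(1.0, 1.2, 0.02, 20.0)
large_step_ok = pitch_rate_acceptable(1.0, 3.0, 0.02, 20.0)
```

A check this small costs a handful of instructions per frame, which is why simple acceptance tests fall in the low overhead group.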
It should be noted again that a physical fault can trigger any of these detection mechanisms
just as well as a common mode fault. Therefore, it is necessary to corroborate the
syndrome information across redundant channels to ascertain which recovery mechanism to
use.
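The corroboration step can be sketched as a comparison of per-channel detector outputs. The classification rule used here (one channel affected suggests a physical fault, all channels affected suggests a common mode fault) is a simplifying assumption for illustration; the report does not specify a decision rule.

```python
def classify_syndromes(channel_errors: dict) -> str:
    """Classify a fault from a mapping of channel name to 'detector fired'."""
    affected = [ch for ch, fired in channel_errors.items() if fired]
    if not affected:
        return "no fault"
    if len(affected) == 1:
        return "suspected physical fault"
    if len(affected) == len(channel_errors):
        return "suspected common mode fault"
    # Some but not all channels affected: not resolved by this simple rule.
    return "ambiguous"


single = classify_syndromes({"A": False, "B": True, "C": False})
total = classify_syndromes({"A": True, "B": True, "C": True})
```

The outcome selects the recovery path: reconfiguration around one failed channel, or one of the common mode recovery actions.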
f. Presence Test
Presence test is normally used in FTPs and FTPPs to detect the loss of synchronization
of a single channel due to a physical fault. However, it has also been modified to detect a
total loss of synchronization between multiple channels of an AIPS FTP. This is an indi-
cation of a common mode fault. This technique can be extended to the FTPP as well.
8.4.3.2. Common Mode Fault Recovery
The recovery from CMF in real time requires that the state of the system be restored to a
previously known correct point from which the computation activity can resume. This as-
sumes that the occurrence of the common mode fault has been detected by one of the tech-
niques discussed earlier and that its source has been identified.
a. Exception Handlers: If a common mode fault causes an Ada exception or a hardware
   exception to be raised, then an appropriate exception handler that is written for that abnormal
   condition can effect recovery. The recovery may involve a local action such as
   flushing input buffers to clear up an overflow condition or it may cascade into a more
   complex set of recovery actions such as restarting a task, a virtual group or the whole
   system.
b. Task Restart: If the errors from CMF were limited to a single task and did not propagate
   to the operating system, then only the affected task needs to be restored and/or
   restarted with new inputs. The state can be rolled back using a checkpointed state from
   stable storage. Recovery is then effected by invoking an alternate version of the task
   using the old inputs assuming that the fault was caused by the task software. This is
   termed the backward recovery block approach. If the fault is caused by a simultaneous
   transient in all redundant hardware channels then the same task software can be re-executed
   using old inputs. This is termed temporal redundancy. Alternatively, forward
   recovery can be effected by restarting the task at some future point in time, usually the
   next iteration, using new inputs. This assumes that the fault was caused by an input
   sensitive software that will not repeat with new and different inputs.
c. Virtual Group Restart: In case the CMF resulted in the loss of synchronization, then
   redundant channels must be re-synchronized before rollback can begin. Furthermore,
   the state of the virtual group must be restored before resuming computational activity.
d. System Restart: Finally, if all else fails the whole system can be restarted in real time
   and a new system state established with current sensor inputs.
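The backward recovery block approach described under Task Restart can be sketched as follows. The function signature, the dictionary used as checkpointed state and the use of an acceptance test to trigger rollback are assumptions made for the example.

```python
def run_with_recovery_block(primary, alternate, acceptance_test, inputs, checkpoint):
    """Run the primary task; on failure roll back and run the alternate version."""
    state = dict(checkpoint)            # work on a copy of the checkpointed state
    try:
        result = primary(inputs, state)
        if acceptance_test(result):
            return result, state
    except Exception:
        pass                            # treat an exception as a detected fault
    state = dict(checkpoint)            # roll back to the checkpoint
    result = alternate(inputs, state)   # alternate version, same (old) inputs
    if not acceptance_test(result):
        raise RuntimeError("alternate version also failed; escalate recovery")
    return result, state


def primary_task(inputs, state):
    state["runs"] += 1
    return inputs["x"] / inputs["y"]    # fails when y is zero


def alternate_task(inputs, state):
    state["runs"] += 1
    return 0.0 if inputs["y"] == 0 else inputs["x"] / inputs["y"]


result, state = run_with_recovery_block(
    primary_task, alternate_task, lambda r: -100.0 <= r <= 100.0,
    {"x": 1.0, "y": 0.0}, {"runs": 0})
```

Temporal redundancy would use the same structure with the primary passed in as its own alternate, on the assumption that the transient has cleared.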
8.4.3.3. Performance Overheads of Common Mode Fault Tolerance Techniques
The common mode fault avoidance and fault removal techniques can increase the devel-
opment cost of the program but generally do not result in an operational performance
penalty. By contrast, the fault tolerance techniques can cause significant performance
overheads. Therefore, not all of the techniques discussed in this section may be suitable
for real time TF/TA/NOE application. Although it is difficult to quantify the overheads
without a specific system design, one can separate the techniques qualitatively in low,
medium, and high penalty groups.
Low overhead fault detection techniques include watchdog timers, hardware exceptions,
and presence test since they require execution of zero to a few instructions at infrequent
intervals. Medium overhead techniques include the memory management unit if the
MMU does not add a significant number of wait states to memory accesses. Ada run time
checks can potentially result in a significant performance penalty and are an example of a high
overhead technique. Finally, acceptance tests can be written to be anywhere from ex-
tremely simple such as a rate or a range check on an output variable to an elaborate program
that duplicates the complete functionality of the program being checked. Thus acceptance
tests can be low, medium or high overhead techniques depending upon their complexity.
Most recovery techniques do not add overheads under nominal, non-faulty operational
conditions. The criterion here is the time it takes to recover from a fault since the
TF/TA/NOE application tasks cannot be suspended for a very long time. The time to recover
increases as the level of recovery increases. Thus, exception handlers generally require
the least amount of time, task restart would require a little more followed by virtual
group restart and system restart.
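The ordering of recovery levels by recovery time suggests trying the cheapest level first and escalating only on failure. The sketch below assumes a caller-supplied `attempt` function that reports whether a given level succeeded; that interface is hypothetical.

```python
# Recovery levels in order of increasing time to recover, per the text.
RECOVERY_LEVELS = [
    "exception handler",
    "task restart",
    "virtual group restart",
    "system restart",
]


def recover(attempt):
    """Try each recovery level in turn; return the first level that succeeds."""
    for level in RECOVERY_LEVELS:
        if attempt(level):
            return level
    return None  # no level succeeded


tried = []


def sample_attempt(level):
    # Hypothetical outcome: nothing below a virtual group restart clears the fault.
    tried.append(level)
    return level == "virtual group restart"


chosen = recover(sample_attempt)
```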
These are only general qualitative observations. Whether or not any of these techniques
will be applicable to AFTA will depend on the specific design parameters to be determined
during the detailed design phase of AFTA.
8.4.3.4. Common Mode Fault Examples
This section describes some of the common mode faults that have been observed over
the years in synchronous redundant Byzantine resilient computing systems at Draper.
What one observes in real time is the effect of the fault, i.e., the error symptom. Manifestations
of common mode faults are polygenic. A given error symptom can be caused by
several different CMFs. For example, all members of an FTPP virtual group can go out of
synchronization for a variety of different reasons such as EMI, frame overrun, etc.
Common mode faults are also polysymptomatic. A given CMF can result in different error
symptoms under different conditions. For example, EMI can cause a task in all members
of a virtual group to produce incorrect results or it may cause all members to go out of
synchronization. It is therefore easier to list the observed error symptoms than the causal
CMFs. Table 8-2 lists some of the commonly observed error symptoms, their possible
causes and some plausible means of detection and recovery. It should be emphasized here
that the list is exemplary in nature and is not meant to be exhaustive.
Table 8-2. Commonly Observed Error Symptoms of Common Mode Faults
[Table body not recoverable from the source scan.]
8.5. Effectiveness of Common Mode Fault Avoidance, Fault Removal,
Fault Tolerance Techniques
There are several ways of evaluating the effectiveness of common mode fault avoid-
ance/removal/tolerance techniques. One of the simpler ways is to pair each technique with
the CMF source against which it is effective. Qualitatively, transient (external) CMFs and
permanent (external) CMFs can be avoided by proper shielding, packaging, thermal man-
agement and conservative design practices (9, 10). All other fault avoidance techniques (1-
8) discussed in Section 8.4.1 will be effective against intermittent (design) faults and
(permanent) design faults. All of the fault removal techniques described in Section 8.4.2
should be effective in finding design faults. The fault tolerance techniques described in
Section 8.4.3 should be effective against intermittent design faults. Additionally, all fault
tolerance techniques should be able to tolerate transient common mode faults. None of the
fault tolerance techniques can tolerate permanent design faults or permanent external CMFs.
Table 8-3 summarizes these relationships.
                              Avoidance   Avoidance   Removal   Tolerance
                              (1-8)       (9, 10)     (CMFR)    (CMFT)
Transient (EXT) CMF                       X                     X
Permanent (EXT) CMF                       X
Intermittent (Design) CMF     X                       X         X
(Permanent) Design CMF        X                       X

Table 8-3. Effectiveness of CMF A/R/T Techniques
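The pairing of techniques with CMF sources can also be expressed as a lookup. The dictionary below transcribes the relationships stated in the preceding paragraph; the key strings and the function name are assumptions for the example.

```python
# CMF class -> technique groups stated in the text to be effective against it.
EFFECTIVE = {
    "transient external": {"avoidance 9-10", "tolerance"},
    "permanent external": {"avoidance 9-10"},
    "intermittent design": {"avoidance 1-8", "removal", "tolerance"},
    "permanent design": {"avoidance 1-8", "removal"},
}


def classes_covered_by(technique: str):
    """List the CMF classes against which a technique group is effective."""
    return sorted(cmf for cmf, techniques in EFFECTIVE.items() if technique in techniques)


tolerated = classes_covered_by("tolerance")
```

Querying for `"tolerance"` returns only the transient and intermittent classes, which restates the point that no fault tolerance technique handles permanent design faults or permanent external CMFs.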
Qualitative effectiveness criteria are important but do not provide the information necessary
to determine which techniques one must pursue for the AFTA program. Quantitative
measures would be valuable for this purpose. Unfortunately, the mechanics of how most
common mode faults are introduced is not understood well enough to quantify the fraction
of faults a given technique will be able to prevent, remove or tolerate. There are a few ex-
ceptions to this. For example, one can design a shield of appropriate thickness to prevent
SEUs due to high energy particles of a given intensity or interference from HERF of
a specified energy. However, such quantitative data is not available for most design faults
or the techniques to avoid, remove and tolerate such faults.
At this point, then, only experience, anecdotal evidence and qualitative and subjective
arguments can be used to decide on the relative effectiveness of a particular technique
against a given source of common mode faults.
8.6. Suitability of Common Mode Fault Avoidance, Fault Removal, Fault
Tolerance Techniques for AFTA
In order to determine the suitability of the techniques to deal with common mode faults
for AFTA, it is helpful to divide the AFTA computer system into a hierarchy of elements as
follows.
1.     AFTA Computer System
1.1    Hardware
1.1.1  Processor Element (CPU, FPU, MMU, Memory, Bus Interface, NE Interface)
1.1.2  I/O Element (CPU, Memory, Bus Interface, I/O Interface)
1.1.3  Network Element (Scoreboard, Global Controller, Voter, Fault Tolerant Clock,
       Bus Interface, NE Interface)
1.1.4  Monitor-Interlock (Watchdog, Voter, Output Enabler)
1.2    Software
1.2.1  Ada Run Time System
1.2.2  Core FDIR
1.2.3  I/O Services
1.2.4  Intercomputer Services
1.2.5  Applications Software
1.3    Power
1.3.1  A/C Power Source
1.3.2  FCR Power Source
1.3.3  Monitor-Interlock Power Source
1.4    Algorithms
1.4.1  OMI
1.4.2  Clock Synchronization
1.4.3  Syndrome Analysis
Once the hierarchy has been developed to a sufficient depth, one can make a 2-dimensional
matrix where one dimension is the lowest level AFTA element and the other dimension
is the CMF A/R/T technique. We would then choose to apply certain techniques to
certain AFTA elements based on the effectiveness criteria discussed in Section 8.5 as well
as the following additional criteria.
1. Cost
2. Schedule
3. Maturity of technique/tool
4. Added Complexity
Since the AFTA program is an engineering endeavor constrained by a fixed budget
and schedule, one needs to choose the techniques which are mature, timely and within the
resource constraints of the AFTA program. An additional criterion is the extra complexity
added by the technique. If the added complexity introduces more design errors than the use
of the technique avoids or removes, then it would be a self-defeating exercise. Of course,
the lack of firm quantitative data on the effectiveness of the techniques and the added com-
plexity makes these decisions more subjective than objective.
The following discussion applies to the long term AFTA development program outlined
in Section 1. The time-span covered here includes conceptual study phase, dem/val phase,
FSD phase, operational phase and P3I. A subset of the techniques will be selected in cooperation
with the Army and NASA for demonstration on the AFTA brassboard.
1. AFTA Computer System
Table 8-4 summarizes the choice of CMF A/R/T techniques for each element of the
AFTA hierarchy based on these criteria. The numbers and letters in the table refer to the
techniques discussed in Sections 8.4 - 8.6. At the system level (1.0 AFTA) the AFTA Ar-
chitecture is designed to avoid CM faults (fault avoidance technique 5). For example, the
Byzantine Resilient Virtual Circuit (BRVC) abstraction embodied in the FTPP hides the
complexities of inter-processor communication in the presence of faults from the applica-
tions software. Any extensions to the architecture proposed during the brassboard, FSD
and subsequent program phases will have to pass the test of not violating the BRVC and
other complexity-reducing architectural attributes before they can be implemented.
The table also indicates that Design Reviews and Testing (fault removal techniques 1
and 3) will be used to remove CM faults at level 1.0. The asterisk implies that FR techniques
1 and 3 will be used at all levels of hierarchy below that level as well. At the system
level, the only recovery technique is d (restart the whole system and establish a new system
state with current sensor inputs).
2. Hardware
All of AFTA hardware elements will be designed and/or procured in accordance with
MIL-STD 883C and conservative design practices will be followed (FA technique 9) and
appropriate shielding, packaging and thermal management techniques (10) will be utilized.
The Processor Element and the I/O Element are Non-Development Items in the AFTA
architecture. Mature Components (3) that comply with military or de facto commercial
standards (7) will be procured for the PE and IOE. Additionally, formally verified PEs (2),
if available in the AFTA FSD time-frame, can be used at least for the AFTA hard core
functions. The hard core functions include the redundancy management tasks and the
safety-critical flight control tasks. Using formally verified PEs for all the AFTA functions
may not be practical due to their limited throughput in comparison with mature but formally
unverified PEs.
Watchdog timers and hardware exceptions (a and b) will be used to detect CM faults in
real time in the PE. For real time recovery, redundant PEs will be resynchronized (c), if
necessary.
Most of the Network Element will be designed using Design Automation Tools (4). In
particular, the Scoreboard, the Global Controller and the Voter and Fault Tolerant Clock
will be described at least at the behavioral level using VHDL. This description may also be
carried to the structural level. Formal methods (1) should also be applied to the NE hard-
ware design. All five major blocks of the NE should be formally specified at the abstract
finite state machine level. Formal verification should be carried down through the detailed
hardware design to the Register Transfer Level (RTL).
In the case of ASICs, the logic synthesis from VHDL descriptions (structural and be-
havioral) will also utilize design automation tools. Candidates include LSI Logic's Silicon
1076 tool suite and Autologic and GDT Silicon compilers.
A software simulation (2) of the NE will be constructed. The primary purpose of the
NE simulator is to provide to AFTA system software developers a substitute for the NE
hardware until such time as the hardware becomes available. However, the simulator can
also be used to verify the functionality of the NE design. This can be accomplished by
comparing the NE simulator's response to the AFTA system software to the virtual pro-
gramming model of the NE specified by NE designers.
Hardware fault injection (4) will be used in all parts of the NE hardware to uncover CM
faults.
To detect CM faults in real time in the NE, the presence test (f) will be employed. This
would detect a loss of synchronization of NEs. To recover from this situation, a reset will
be asserted by the processors using the fiber optic links which should force NEs to
resynchronize.
The Processor Bus Interface of the NE will be designed using mature components (3)
as will the NE-NE interface.
The Monitor-Interlock will be subjected to hardware fault injections (4) to uncover CM
faults.
3. Software
Formal methods (1) should be used to specify selected parts of the AFTA system soft-
ware and applications software. A candidate language for formal specification of software
requirements is Z [Spi89]. Selected parts of the software should also be formally verified.
A candidate tool for verifying the correctness of Ada software is Penelope [Hir90]. Penelope
is an interactive system that accepts programs from a subset of Ada and formal specifications
for them. It generates verification conditions which are statements in first-order
logic. Proof of these statements implies that the program satisfies its specifications.
The AFTA software will use the DoD-specified standards (7) such as the programming
language (Mil-Std 1815a, i.e., Ada). Good software engineering practices (8) will also be
followed in the development of AFTA software.
All of the system software will be subjected to hardware faults as well as data errors
and memory mutations (4). Ada run time checks and MMU (c and d) will be used to detect
CM fault occurrences in real time.
The Ada Run Time System will be designed around a mature Ada-compiler-vendor
supplied RTS (3). Architectural attributes (5) will be used to simplify the design of FDIR,
I/O, Intercomputer services and applications software.
Additionally, for the applications software, the two candidate approaches to avoiding
CM faults are the use of design automation tools (4) such as the Draper CASE tool, called
IDEA, and the use of design diversity (6) to produce multiple versions from a given set of
specifications. Acceptance tests can be used to check the reasonableness of the outputs
produced by the RTS and the applications software in real time to detect CM faults. In the
case of CM faults in a single applications task, exception handlers (a) can try to recover
from an abnormal condition. Also, the task can be purged and restarted (b) with fresh in-
puts in the next frame.
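The detection and recovery steps described above for a single applications task can be sketched as follows. This is only an illustration in Python rather than Ada; the names `run_frame`, `accept`, and `fallback` are hypothetical and are not part of the AFTA software. It assumes an acceptance test on the task output, an exception handler as the first line of recovery, and a status value that a scheduler could use to purge the task and restart it with fresh inputs in the next frame.

```python
def run_frame(task, inputs, accept, fallback):
    """Run one iteration of a task and screen its output.

    Returns (output, status). A status other than "ok" tells the
    caller to purge the task and restart it with fresh inputs in
    the next frame (a convention assumed for this sketch only).
    """
    try:
        output = task(inputs)
    except Exception:
        # Exception handler: recover from an abnormal condition.
        return fallback, "exception"
    if not accept(output):
        # Acceptance test: the output is not reasonable.
        return fallback, "rejected"
    return output, "ok"


# Usage: a toy control law whose output must stay within +/-10.
def control_law(x):
    return 100.0 / x


def in_range(y):
    return -10.0 <= y <= 10.0


print(run_frame(control_law, 20.0, in_range, 0.0))  # normal output
print(run_frame(control_law, 1.0, in_range, 0.0))   # fails the acceptance test
print(run_frame(control_law, 0.0, in_range, 0.0))   # raises; the handler recovers
```

Returning a status instead of restarting the task inside the function keeps the restart decision with the scheduler, which is where the frame structure is known.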
As far as CM faults in the power supplies and power distribution system are concerned,
the only recourse is to effect a complete system restart (d). This may be a cold restart if the
system state was not saved. Alternatively, one could provide low voltage detectors which
will force an orderly shut-down of the system, saving the current system state in non-
volatile memory. In this case, a warm restart can be effected when the power comes back
on-line. Low voltage detectors are usually integrated on PEs and cause an interrupt that can
be used to trigger the orderly system shut-down (see Section 4).
Finally, the algorithmic elements of AFTA such as OM1, clock synchronization, and
syndrome analysis are best suited for verification by formal methods (1).
AFTA Element             Fault        Fault      Fault       Fault
                         Avoidance    Removal    Detection   Recovery
1.0 AFTA                 5            1*, 3*                 d
1.1 H/W                  9*, 10*
1.1.1 PE                 2, 3, 7                 a, b        c
1.1.2 IOE                3, 7
1.1.3 NE                 1            2*, 4*     f*          c*
1.1.3.1 SB               4            5
1.1.3.2 GC               4
1.1.3.3 V/FTC            4
1.1.3.4 BI               3
1.1.3.5 NE Int.          3
1.1.4 MI                              4
1.2 S/W                  1*, 7*, 8*   4*         c*, d*
1.2.1 RTS                3                       e
1.2.2 FDIR               5
1.2.3 I/OS               5
1.2.4 ICS                5
1.2.5 Appl. SW           4, 5, 6                 e           a, b
1.3 Power                                                    d
1.4 Algorithms           1*
1.4.1 OM1
1.4.2 Clock Synch.
1.4.3 Syndrome Analysis
Table 8-4. Application of CMF A/R/T Techniques to AFTA
8.7. Plan for Implementation of CMF Avoidance, Removal, Tolerance
Techniques
To deploy a fault tolerant computer system that is resilient to common mode faults as
well, the planning must begin at the earliest phase of the program and the plan must be car-
ried out through development and deployment of the product. It is appropriate to begin by
defining a plan of action in the conceptual study phase. Steps leading to the definition of
the plan of action have included the following activities. Sources of common mode faults
in AFTA have been identified. A three-pronged approach to make AFTA CMF-resilient,
consisting of fault avoidance, fault removal, and fault tolerance, has been developed.
Techniques and tools for each of the three prongs have been enumerated. Each of the three
prongs has been matched to the type(s) of CMFs against which it is effective. This section
now outlines the time-line for using various tools and techniques and other actions that will
be required throughout the AFTA life-cycle to make AFTA CMF-resilient.
8.7.1. Demonstration/Validation Phase
This phase of the AFTA development will last 36 months. Activities during dem/val
include AFTA detailed design, brassboard fabrication, coding, and integration, demonstra-
tion of brassboard with an application and AFTA validation. Demonstration and validation
activities will be carried out initially at Draper and subsequently at Army AVRADA and also
possibly at NASA Langley Research Center.
If CMF-resilience is a serious goal of the AFTA project, then enough funding should
be found to support the following CMF A/R/T activities during the dem/val phase. (In
practice, the FSD funding levels are typically much higher than the dem/val funding levels.
This may necessitate postponing some of the activities from the dem/val to the FSD phase.)
8.7.1.1. AFTA System
Any extensions/modifications to the AFTA computer system should be examined for
compliance with the attributes that help reduce, manage and hide the system complexity
(see Section 8.4.1.5).
8.7.1.2. Hardware Design
The Network Element design should be described at the behavioral and at the structural
level in VHDL. The NE hardware should be designed in accordance with Mil-Std 883B/C
for the full temperature range. Conservative design practices should be followed.
Features that will help detect and tolerate CMFs should be designed into the NE. Examples
include a watchdog timer to reset the NE when it goes into an undesired state and
time-outs for PE input/output buffer full conditions.
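The watchdog behavior mentioned above can be modeled in a few lines. The sketch below is a software model for illustration only, not the NE hardware design; the class name `Watchdog`, its methods, and the time values are assumptions. A healthy element periodically "kicks" the timer, and a reset is due once no kick has arrived within the timeout.

```python
class Watchdog:
    """Software model of a watchdog timer (illustrative only)."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.deadline = now + timeout

    def kick(self, now):
        """Called periodically by a healthy element to postpone the reset."""
        self.deadline = now + self.timeout

    def expired(self, now):
        """True when no kick arrived within the timeout, so a reset is due."""
        return now >= self.deadline


wd = Watchdog(timeout=5.0)
wd.kick(now=3.0)             # element is alive at t = 3, deadline moves to t = 8
print(wd.expired(now=7.0))   # before the deadline: no reset
print(wd.expired(now=8.5))   # no further kicks: a reset would be asserted
```

The same pattern, with the deadline measured from the moment a buffer becomes full, would serve as a time-out for the PE input/output buffer full conditions.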
Formal methods should be used to begin the verification process of the NE hardware
design. As a first step, an abstract finite state machine-level specification of the three major
blocks of the NE that are bus-interface-independent, i.e., the Scoreboard, the Global
Controller, and the Voter/Fault Tolerant Clock, should be constructed.
Rigorous Preliminary and Critical Design Reviews (PDR and CDR) of the NE design
by peers, superiors, and government contract monitors should be carried out.
8.7.1.3. Software Design
The Dem/Val software should be designed using good software engineering practices
outlined in Section 8.4.1.8. A waterfall software development methodology, starting with
software requirements specifications and ending with detailed design, should be followed.
Design automation tools such as Draper's CASE, HTI's 001, and Cadre's Teamwork
should be examined for applicability to AFTA software design and coding.
A selected subset of AFTA system software should be formally specified in the Z language
or a suitable software specification language. A selected subset of AFTA system
software should be formally verified using Penelope or a suitable software formal
verification tool.
The software should be coded in the Mil-Std 1815a Ada language. The Run Time System
should be based on the XD-Ada-supplied and Draper-modified RTS that is currently being
used in the US Navy's SSN-21 Seawolf Ship Control Computer and has also been ported
to the FTPP Cluster C2+.
System software developed under AIPS, Seawolf and FTPP projects such as I/O Sys-
tem Services and Inter-Processor Communications Services should be examined for AFTA
reuse. CAMP software libraries should be acquired and examined for applicability to
AFTA.
CMF detection and recovery mechanisms such as those discussed in Sections 8.4.3.1
and 8.4.3.2, respectively, should be designed into the AFTA operating system. These
techniques should be sufficiently broad to cover the commonly observed error symptoms
of CMFs presented in Table 8-2.
Rigorous Preliminary and Critical Design Reviews (PDR and CDR) of the software
design by peers, superiors, and government contract monitors should be carried out.
8.7.1.4. Hardware-Software Test and Integration
The primary emphasis during the integration phase will be on fault removal techniques.
The three major techniques to be used during this phase are testing, fault/error injection,
and discrepancy reporting.
Software testing should follow the unit, module, and functional testing paradigm. The
major complexities of the AFTA architecture are in the dimensions of hardware redundancy
and parallel processing. The AFTA architecture hides the redundancy dimension from
most of the software by providing a Byzantine Resilient Virtual Circuit abstraction. This
will allow all of the system software, except FDIR, to be tested in a non-redundant envi-
ronment. The parallel processing dimension is also hidden from applications software by
the inter-processor communication services. The testing of most AFTA software can there-
fore utilize the tools and techniques developed for conventional computer architectures.
For the FDIR, inter-processor communications services and possibly some I/O services, it
will be necessary to develop a more sophisticated debugging environment. It is essential to
provide software developers debugging tools that give them visibility into the workings of
the parallel-redundant computer. Development of such a debugging environment should be
a priority of the AFTA dem/val phase.
Hardware testing during dem/val mainly pertains to the NE which can be tested using
traditional hardware testing techniques if the design does not contain any ASICs. The NE
simulator should be used to verify the NE functionality against the NE virtual programming
model specified by the hardware designers. The design of the NE hardware, in turn,
should be verified against the NE simulator.
Discrepancy reports should be filed starting with the testing phase. This activity should
continue through various stages of hardware-software integration. As design and
manufacturing errors are discovered and fixed, tests should be rerun to duplicate the discrepancies.
Once the system has been integrated, it should be subjected to extensive fault and error
injections with and without application code executing. Initial testing will be with
faults/errors in a single fault containment region and only in the NE. System parameters of
interest such as fault detection, identification and recovery times should be recorded.
The cause of any faults not detected, misdiagnosed, or improperly recovered from should be
identified. Impact on performance should be evaluated to ascertain that no scheduling
deadlines were missed due to fault handling transients in the system performance and no
unexpected data corruption occurred. Subsequently, common mode faults should be
injected to test AFTA's ability to tolerate CMFs.
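The bookkeeping implied by the fault and error injection campaign above might look like the following sketch. The record layout (`Trial`) and the summary fields are assumptions made for illustration; the report does not prescribe a format. Each injected fault yields one record, and the summary gives the fraction of faults fully handled, the worst observed detection and recovery times, and the number of cases whose cause needs to be identified.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    detected: bool
    identified: bool
    recovered: bool
    t_detect_ms: float = 0.0
    t_recover_ms: float = 0.0


def summarize(trials):
    """Summarize an injection campaign (field names are hypothetical)."""
    n = len(trials)
    handled = [t for t in trials if t.detected and t.identified and t.recovered]
    return {
        "trials": n,
        "coverage": len(handled) / n if n else 0.0,
        "worst_detect_ms": max((t.t_detect_ms for t in handled), default=0.0),
        "worst_recover_ms": max((t.t_recover_ms for t in handled), default=0.0),
        "to_investigate": n - len(handled),
    }


campaign = [
    Trial(True, True, True, t_detect_ms=2.0, t_recover_ms=15.0),
    Trial(True, True, True, t_detect_ms=4.0, t_recover_ms=12.0),
    Trial(True, False, False),   # detected but misdiagnosed
    Trial(True, True, True, t_detect_ms=1.0, t_recover_ms=30.0),
]
print(summarize(campaign))
```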
8.7.2. Full Scale Development Phase
The CMF A/R/T activities during the FSD phase parallel those during the dem/val
phase. However, they will be on a much larger scale.
For example, the formal specifications of software, applied to a small subset of soft-
ware during the dem/val phase, should be carried to as much software as possible. The
formal description of the NE at the finite state machine level done during the dem/val phase
should be carried down to the gate level. Formally verified PEs should be examined for
inclusion in the FSD hardware demonstration.
The CASE tools for software design and development should be applied more extensively
across a broader spectrum of system and applications software. The software
development should rigorously follow Mil-Spec 2167a. Extensive simulations of the NE ASIC
designs should be carried out. Multiple versions of flight critical application code should
be developed.
More extensive fault detection and recovery techniques should be designed into the
AFTA system software, the NE and the PEs.
The same fault removal techniques as those in the dem/val phase will be applied here as
well. However, their application should be much more extensive. The testing should be
over a wider input range. The DRs should be logged with enough information to construct
the software reliability growth models. The fault/error injection should be expanded to
cover the PEs and the IOEs as well as the NE.
Closed-loop testing of the complete AFTA system including the computer and I/O
devices tied to a dynamic simulation of the helicopter/ground vehicle should be carried out
under normal conditions, with faults, and under maximum application load and faults.
The hardware should also be subjected to various Environmental Screening and Stress
(ESS) tests. Any component failures should be examined for design errors and corrected.
8.7.3. Production Phase
The major sources of common mode faults in the production phase are the manufactur-
ing defects. Appropriate quality control measures are necessary to ascertain that the AFTA
systems are manufactured in accordance with design specifications. It is also very impor-
tant to track the changes in the system requirements that affect the AFTA design very care-
fully. If design changes become unavoidable due to changing requirements, it would be
necessary to go through the critical steps of the previous two phases to make sure that the
design changes do not introduce new errors.
8.7.4. Deployment Phase
When AFTA systems are deployed in the field, data on faults, errors, failures, and
anomalous behavior should be collected and analyzed. The cause of each event should be
examined and categorized as a random hardware fault or a CMF or a potential CMF.
Sometimes a design error can masquerade as a random hardware fault as, for example,
when it causes only a single fault containment region to fail. Therefore, it is important to
do the cause and effect analysis for each observed event. This feedback can then be used to
remove the source of the CMF. The feedback process should also be used to examine the
efficacy of the three-pronged approach, i.e., CMF avoidance/removal/tolerance techniques.
Every CMF that slips through this process should be examined to determine the effective-
ness and the weaknesses of various techniques in avoiding, removing, and tolerating
CMFs. The field data should be studied with the goal of improving the CMF A/R/T ap-
proach.
8.7.5. Pre-Planned Product Improvement Phase
Ideally, all of the CMF A/R/T techniques suggested in this report should be applied to
the development of AFTA during the dem/val and the FSD phases. However, it may turn
out that the AFTA schedule and the availability of the tools and techniques do not coincide.
In that case, some of the activities can be deferred to the Pre-Planned Product Improvement
(P3I) phase. For example, if a formally verified microprocessor is not available in
time to meet the FSD schedule, it can be added later on to the AFTA as a part of the P3I
activity.
9. Analytical Models
This section describes the quantitative models to be used in analytically evaluating
AFTA. Quantitative models are presented for effective throughput, effective intertask
communication bandwidth and latency, effective input/output bandwidth and latency, reli-
ability and availability under two typical AFTA redundancy management policies, weight,
power, volume, and life cycle cost.
Figure 9-1. AFTA Methodology Information Flow
(Diagram relating the model inputs: throughput requirements, PE throughput, context switch
overhead, number of tasks per frame, OS overhead, FDI overhead, NE bandwidth, LRM failure
rates, VID redundancy level, recovery rates, and environment; and the configurable
parameters: number of VIDs, PEs, and FCRs; to AFTA effective throughput, effective intertask
bandwidth, and AFTA reliability and availability.)
The inputs to these models come from MIL-HDBK-217E failure rate data, empirical
test and evaluation, and other sources. The relationship among the analytical models, the
model inputs, the AFTA configurable parameters, and the AFTA requirements, is depicted
in Figure 9-1.
9.1. Performance Model
9.1.1. Delivered Throughput
In the initial stages of architecture synthesis only overall throughput requirements are
available, while in the early stages of development the effective throughput of a given
architecture can in turn only be roughly estimated. A rough delivered throughput model is
suitable for this situation. In this model, one begins with the raw throughput, expressed in
suitable terms such as DAIS MIPS. Denote this quantity X_{VG,raw}. The VG throughput
estimate is reduced to a value denoted X_{VG,delivered} by overheads such as the rate group
(RG) dispatcher, synchronization delays, RM overheads, contention effects, and context
switches.
The effective throughput available to an application running on a parallel processor is a
strong function of the efficiency of the mapping of the application task to the parallel pro-
cessing resources, and is impossible to plausibly generalize. Therefore in the current report
we will calculate the delivered throughput of an AFTA simply as the sum of the throughputs
of its constituent VGs. Thus, in an AFTA configuration consisting of N_{VG} VGs each
having throughput X_{VG_i,delivered}, the delivered throughput is
    X_{AFTA,delivered} = \sum_{i=1}^{N_{VG}} X_{VG_i,delivered}    (9.1)
Estimation of a VG's delivered throughput requires a frame-by-frame analysis of the
AFTA OS and redundancy management (RM) overheads. For convenience, a diagram of
the RG frames used by the AFTA scheduler is repeated below.
Figure 9-2. Mapping of RG Frames to Minor Frames
(Diagram of the RG frames laid out against minor frame indices 0 through 7.)
The overall approach to determining AFTA OS and fault tolerance-related overhead,
and hence delivered throughput, is to calculate and verify the time required to perform these
functions. Time is used instead of a parameter such as the number of instructions required
to execute a particular overhead function because time is a directly measurable and therefore
verifiable quantity, whereas instruction counts are notoriously misleading when used to
estimate performance. In addition, many of the overheads include operations to which an
instruction count and processor throughput are irrelevant, such as accessing the network
element. As the detailed design of AFTA proceeds, the overhead time estimates will be
refined and correlated closely to parameters such as processor instruction execution rate and
Network Element bandwidth. However, it is worth noting that, even at validation and veri-
fication time, the execution times are primary measured parameters.
At each minor frame boundary, a particular set of RGs have completed their iterations
and are ready to initiate a new iteration. These RG sets as a function of the frame boundary
are shown in Table 9-1.
Frame Boundary    Completed RGs    Started RGs
7-0               4, 3, 2, 1       4, 3, 2, 1
0-1               4                4
1-2               4, 3             4, 3
2-3               4                4
3-4               4, 3, 2          4, 3, 2
4-5               4                4
5-6               4, 3             4, 3
6-7               4                4
Table 9-1. Completed/Started RGs vs. Frame Boundary
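The pattern in Table 9-1 follows from the rate group periods. The sketch below reproduces the table under the assumption, inferred from the table itself, that RG4 iterates every minor frame, RG3 every 2 minor frames, RG2 every 4, and RG1 every 8. The names `RG_PERIOD` and `completed_rgs` are illustrative only.

```python
# Assumed rate-group periods in minor frames, inferred from Table 9-1.
RG_PERIOD = {4: 1, 3: 2, 2: 4, 1: 8}
MINOR_FRAMES_PER_MAJOR = 8


def completed_rgs(next_frame):
    """RGs that complete (and restart) at the boundary entering next_frame."""
    return [rg for rg, period in sorted(RG_PERIOD.items(), reverse=True)
            if next_frame % period == 0]


for nxt in range(MINOR_FRAMES_PER_MAJOR):
    prev = (nxt - 1) % MINOR_FRAMES_PER_MAJOR
    print(f"{prev}-{nxt}: {completed_rgs(nxt)}")
```

Because every RG that completes at a boundary is also restarted there, the same function gives both columns of the table.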
At each minor frame boundary several functions are executed which contribute to the
OS overhead†. The chime interrupt handler synchronizes the VG and performs time
management functions; subsequently the dispatcher performs other metabolic housekeeping.
The time required to perform these functions is denoted T_{HK}.
Subsequently, the dispatcher transmits all messages emanating from RGs whose
frames have just completed. The time required to perform this function depends on the
number of RGs that have just completed, the number of tasks in each RG, the number of
messages sent by each task, the size of each message, and contention with other VGs for
the Network Elements' message passing services. The time required to perform this is
denoted
    T_{SEND,i} = \sum_{k=1}^{N_{MESSAGES,i}} (T_{SU} + S_k T_P)    (9.2)
where N_{MESSAGES,i} = the number of messages sent in frame i, T_{SU} is the setup time
required to begin sending a single message, S_k is the size (in Network Element packets) of
outgoing message k, and T_P is the incremental time required to send one packet.
† See Section 5.3 for a detailed description of the dispatcher functionality.
Next, the dispatcher updates the incoming message queues for all tasks which have
received messages during the previous frame, and updates the frame markers for tasks in
RGs which will be started on the current frame boundary. The time required for this
depends on the number of RGs that have just completed, the number of tasks in each RG, the
number of messages received by each task, and the size of each message. The time
required to perform this is denoted
    T_{RECEIVE,i} = \sum_{k=1}^{N_{MESSAGES,i}} (T_{SU} + S_k T_P)    (9.3)
where (overloading names to avoid needless notation proliferation) N_{MESSAGES,i} = the
number of messages received in frame i, T_{SU} is the setup time required to begin receiving a
single message, S_k is the size (in Network Element packets) of incoming message k, and
T_P is the incremental time required to read and process one packet from the Network
Element*.
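Equations 9.2 and 9.3 share one form: a setup cost per message plus a per-packet cost. A minimal sketch of that calculation follows; the function name and the numerical values are hypothetical and are not measured AFTA parameters.

```python
def message_time(packet_counts, t_setup, t_packet):
    """Sum of (T_SU + S_k * T_P) over messages, the form of Eqs. 9.2 and 9.3."""
    return sum(t_setup + s_k * t_packet for s_k in packet_counts)


# Hypothetical frame: three messages of 1, 4, and 2 packets,
# with 20 us of setup per message and 5 us per packet.
t_send_us = message_time([1, 4, 2], t_setup=20, t_packet=5)
print(t_send_us)  # 3 setups plus 7 packets, in microseconds
```

The same function serves for the receive side by substituting the receive setup and per-packet read times.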
Finally, before suspending itself, the dispatcher enables execution of all tasks residing
in RGs which can be started on the current frame boundary by setting an event for them;
the time required for this depends on the time required to set an event, the number of RGs
to be started in the next frame, and the number of tasks in each RG. The time required to
perform this is denoted
    T_{EV,i} = N_{TASKS,i} T_{EV}    (9.4)
where N_{TASKS,i} = the number of tasks to be started in frame i, and T_{EV} is the time
required for the dispatcher to set an event.
T_{D,major}, the time consumed by the dispatcher over all eight minor frames (i.e., one
major frame), is estimated as:
    T_{D,major} = \sum_{i=0}^{7} (T_{HK} + T_{SEND,i} + T_{RECEIVE,i} + T_{EV,i+1})    (9.5)
where i refers to the minor frame just completed, and the index i+1 is computed modulo 8.
* Strictly speaking, a recipient PE reads a packet from an NE upon reception of a packet
delivery interrupt, according to the method outlined in Section 4, and this overhead is
spread out within a frame. The current approach to estimating the overhead consumed by
this process is to lump it all together at the frame's end.
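The dispatcher total of Equation 9.5 can be evaluated as in the sketch below, which takes per-frame lists of send, receive, and event-setting times and wraps the event index modulo 8. The function name and all numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
MINOR_FRAMES = 8


def dispatcher_time_major(t_hk, t_send, t_receive, t_ev):
    """Eq. 9.5: sum over i of T_HK + T_SEND,i + T_RECEIVE,i + T_EV,(i+1) mod 8."""
    assert len(t_send) == len(t_receive) == len(t_ev) == MINOR_FRAMES
    return sum(
        t_hk + t_send[i] + t_receive[i] + t_ev[(i + 1) % MINOR_FRAMES]
        for i in range(MINOR_FRAMES)
    )


# Hypothetical times in microseconds; t_ev[i] is the event-setting time
# for the tasks started in frame i (largest in frame 0, when all RGs start).
total = dispatcher_time_major(
    t_hk=10,
    t_send=[5] * 8,
    t_receive=[3] * 8,
    t_ev=[8, 1, 2, 1, 4, 1, 2, 1],
)
print(total)
```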
The FDI task immediately follows the dispatcher in each minor frame. This task exe-
cutes a complex suite of functions which are described in Section 5. The exact set of func-
tions included in the FDI task is not known at the current phase of development and in fact
may vary from mission to mission depending on a mission's temporal constraints. At the
current level of modeling granularity, the temporal overhead due to FDI task execution in a
minor frame is abstracted as TFDI, minor, and the total temporal overhead due to FDI task
execution over the major frame is therefore
    T_{FDI,major} = 8 T_{FDI,minor}    (9.6)
The context switch time incurred by the R4 tasks, the dispatcher, and the FDI task in
the major frame is equal to the context switch time per task (T_{CS}), times the number of R4
tasks plus two for the dispatcher and the FDI task (N_{TASKS,R4} + 2), times the number of
minor frames per major frame (8):
    T_{CS,major,R4} = 8 (N_{TASKS,R4} + 2) T_{CS}    (9.7)
For lower-frequency RGs, upper bounds for the context switch times are computed.
The context switch time incurred by the R3 tasks is upper-bounded by the context switch
time per task (T_{CS}) times the sum of two terms: the number of R3 tasks (N_{TASKS,R3})
times the number of R3 frames per major frame (4), plus the maximum number of times that
RG R3 can be preempted in a major frame (4):
    T_{CS,major,R3} = (4 N_{TASKS,R3} + 4) T_{CS}    (9.8)
The context switch time incurred by the R2 tasks is upper-bounded by the context
switch time per task (T_{CS}) times the sum of two terms: the number of R2 tasks
(N_{TASKS,R2}) times the number of R2 frames per major frame (2), plus the maximum
number of times that RG R2 can be preempted in a major frame (6):
    T_{CS,major,R2} = (2 N_{TASKS,R2} + 6) T_{CS}    (9.9)
The context switch time incurred by the R1 tasks is upper-bounded by the context
switch time per task (T_{CS}) times the sum of two terms: the number of R1 tasks
(N_{TASKS,R1}) times the number of R1 frames per major frame (1), plus the maximum
number of times that RG R1 can be preempted in a major frame (7):
    T_{CS,major,R1} = (N_{TASKS,R1} + 7) T_{CS}    (9.10)
The total temporal overhead per major frame due to context switches is upper-bounded
by
    T_{CS,major} = T_{CS,major,R4} + T_{CS,major,R3} + T_{CS,major,R2} + T_{CS,major,R1}    (9.11)
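The context switch bound of Equations 9.7 through 9.11 is a simple function of the task counts per rate group. The sketch below evaluates it; the function name, the task counts, and the context switch time are hypothetical.

```python
def context_switch_bound(n_r4, n_r3, n_r2, n_r1, t_cs):
    """Upper bound on context switch time per major frame (Eqs. 9.7 to 9.11)."""
    r4 = 8 * (n_r4 + 2) * t_cs    # Eq. 9.7: R4 tasks plus dispatcher and FDI
    r3 = (4 * n_r3 + 4) * t_cs    # Eq. 9.8: 4 R3 frames, at most 4 preemptions
    r2 = (2 * n_r2 + 6) * t_cs    # Eq. 9.9: 2 R2 frames, at most 6 preemptions
    r1 = (n_r1 + 7) * t_cs        # Eq. 9.10: 1 R1 frame, at most 7 preemptions
    return r4 + r3 + r2 + r1      # Eq. 9.11


# Hypothetical configuration: 3 R4 tasks, 2 R3 tasks, 2 R2 tasks, 1 R1 task,
# and a context switch time of 10 microseconds.
print(context_switch_bound(n_r4=3, n_r3=2, n_r2=2, n_r1=1, t_cs=10))
```

The preemption counts in the comments correspond to the minor frame boundaries that fall inside each RG frame: one per R3 frame, three per R2 frame, and seven in the single R1 frame.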
The total temporal overhead per major frame due to the dispatcher task, FDI task, and
context switches, is
    T_{OVERHEAD,major} = T_{D,major} + T_{CS,major} + T_{FDI,major}    (9.12)
Let T_{major} denote the period of a major frame. Then the fractional overhead due to the
dispatcher task, FDI task, and context switches, is upper bounded by
    OH = T_{OVERHEAD,major} / T_{major}    (9.13)
Let X_{VG,raw} denote the raw throughput of an AFTA VG. As a simple first-order
engineering approximation of the VG's delivered throughput after deduction of the various
overheads, one may use
    X_{VG,delivered} = (1 - OH) X_{VG,raw}    (9.14)
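Equations 9.12 through 9.14, together with the sum over VGs of Equation 9.1, chain into a short calculation. The sketch below is a worked example under assumed numbers (a 100 ms major frame, 5 ms of total overhead, and a 10 MIPS raw VG throughput); none of these values come from the report, and the function names are illustrative.

```python
def delivered_vg_throughput(x_raw, t_dispatcher, t_context_switch, t_fdi, t_major):
    """First-order delivered throughput of one VG."""
    t_overhead = t_dispatcher + t_context_switch + t_fdi   # Eq. 9.12
    overhead_fraction = t_overhead / t_major               # Eq. 9.13
    return (1.0 - overhead_fraction) * x_raw               # Eq. 9.14


def delivered_afta_throughput(vg_throughputs):
    """Eq. 9.1: the AFTA figure is the sum over its constituent VGs."""
    return sum(vg_throughputs)


# Hypothetical values: times in milliseconds, throughput in MIPS.
x_vg = delivered_vg_throughput(
    x_raw=10.0, t_dispatcher=2.0, t_context_switch=1.0, t_fdi=2.0, t_major=100.0
)
print(round(x_vg, 3))
print(round(delivered_afta_throughput([x_vg] * 4), 3))
```

With these numbers the overhead fraction is 5 percent, so each VG delivers 95 percent of its raw throughput.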
9.1.2. Intertask Communication
Intertask communication is achieved in AFTA via sending and receiving messages
according to the design described in Section 5. For uniformity of programming and
transparency of distributed processing resources, message passing is used both for intra-VG
and inter-VG communication. As described in Section 5, RG tasks may enqueue messages
for subsequent transmission by the AFTA intertask communication services†. The time
required to enqueue a message is denoted T_{ENQUEUE MESSAGE} and is a parameter to be
verified during the Dem/Val. Upon completion of an RG frame, the AFTA intertask
communication services transmit packets emanating from the just-completed and all
lower-frequency RGs to the destination VG via the Network Elements. All messages emanating
from a VG are transmitted on an RG boundary, which is in turn determined by the sending
VG's timer interrupt. This has the intent of minimizing the jitter and skew with which
messages are transmitted by a VG. It also has the effect of delaying the transmission of a
task's messages until the end of its RG frame, so the message latency, as measured with
respect to the moment the sending task emits the message, can be up to one RG frame,
depending upon where in the RG frame the task is scheduled. This is illustrated graphically
in Figure 9-3, in which each RG has one message to send, each denoted by a boldface m:
RG4 enqueues m1, RG3 enqueues m2, RG2 enqueues m3, and RG1 enqueues m4. The
messages are transmitted by the AFTA communication services at the frame boundaries. The
latency incurred by message m3 is highlighted in the figure.
† In extenuating circumstances, RG4 tasks may send messages immediately without
waiting for a frame boundary because they are nonpreemptible.
o ii i2 i3 _4 is "6 7 i
_ ,a _\\_ _ ,,X _ ,.iv " '
• s x ! !
sent ml ml rnl ml ml ml ml ml
m2 m2 m2 m2
m3 m3
m4
Figure 9-3. RG Message Passing
The time at which a task is scheduled in its frame and the particular frame within which
it is scheduled have a significant impact on intertask communication latency and bandwidth.
Let T_{latency,RG} denote the time from message enqueueing (the boldface ms in Figure 9-3)
to the task's next RG frame boundary, at which point the message is transmitted by the
communication services (the plainface ms in Figure 9-3). While it must be empirically
validated to obtain an accurate estimate, this parameter is largely under the control of the
application designer. However, it is not recommended that T_{latency,RG} be relied upon for
correct execution of the application task because T_{latency,RG} depends upon the relationship
between the sending task's invocation of the message passing services and the frame inter-
rupt. This in turn depends upon task execution times, which may vary widely from itera-
tion to iteration, inducing unwanted jitter, skew, and validation difficulties. It also makes it
more difficult to modify the tasking schedule when desired. It is preferable to specify tim-
ing parameters with respect to frame boundaries, which, because they are determined by
highly accurate crystal oscillators on the PEs, have low skew, low variance, and are more
validatable. Consequently, in the following analysis all timing parameters will be specified
and verified with respect to the timer interrupt demarcating the boundary of the just-com-
pleted RG frame.
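The quantity T_latency,RG discussed above is the wait from the moment a message is enqueued to the sending task's next RG frame boundary. A minimal sketch of that calculation follows, assuming times are measured from the start of a major frame, that an RG2 frame spans 4 minor frames (consistent with Table 9-1), and a hypothetical 10 ms minor frame. A message enqueued exactly on a boundary is treated here as belonging to the following frame, which is an assumption of the sketch.

```python
def latency_to_rg_boundary(t_enqueue, rg_frame_length):
    """Time from message enqueue to the next boundary of the sender's RG frame."""
    frames_elapsed = t_enqueue // rg_frame_length
    next_boundary = (frames_elapsed + 1) * rg_frame_length
    return next_boundary - t_enqueue


MINOR_FRAME_MS = 10                    # hypothetical minor frame length
RG2_FRAME_MS = 4 * MINOR_FRAME_MS      # an RG2 frame spans 4 minor frames

print(latency_to_rg_boundary(13, RG2_FRAME_MS))  # enqueued early in the frame
print(latency_to_rg_boundary(39, RG2_FRAME_MS))  # enqueued just before the boundary
```

The two calls illustrate why the latency varies with task execution time: the same task sending its message a little later in the frame sees a much shorter wait.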
On frame boundaries, the AFTA communication services transmit all enqueued packets
from completed RGs into the Network Element. The latency incurred by this function
has two components. First, each outgoing packet must be written over the PE-NE bus into
the Network Element; let T_{XMIT} denote this stochastic time interval, which must be paid for
each packet to be transmitted. Second, let T_{NE} denote the stochastic time interval required
for the Network Element ensemble to perform the requested message transmission according
to the exchange rules described in Section 4. The total time required to send a single
message of size S_K packets using this procedure is
    T_{SEND}(K) = S_K (T_{XMIT} + T_{NE})    (9.15)
The packet transmission procedure is illustrated in Figure 9-4.
Figure 9-4. Outgoing Message Processing
(Timeline of PE and NE activity following the timer interrupt: dispatcher housekeeping,
setup of outgoing packets, and transmission of the packets to the NE, with the numbered
steps repeated for all outgoing messages.)
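Reading Equation 9.15 as the message size in packets multiplied by the sum of the PE-to-NE write time and the NE exchange time, the send time for one message can be estimated as sketched below. Both intervals are stochastic in the text; the sketch simply substitutes hypothetical mean values, and the function name is illustrative.

```python
def single_message_send_time(packets, t_xmit, t_ne):
    """Eq. 9.15 as read here: each packet pays the PE-NE write and the NE exchange."""
    return packets * (t_xmit + t_ne)


# Hypothetical mean values in microseconds for a 4-packet message.
print(single_message_send_time(packets=4, t_xmit=6, t_ne=9))
```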
Messages arrive at a destination VG at arbitrary times during a frame. The recipient
PEs of the destination VG receive a packet delivery interrupt from the NE and read the
packet from the NE into the PE's private memory. (The loss of throughput incurred due to
this asynchronous activity is debited in the delivered throughput model.) However, the
packets are not assembled and made available to destination tasks until the termination of
the RG hosting the destination task; therefore the latency (the time between message arrival
and delivery to the destination task) is a function of the phase relationship between message
reception and the frame boundary. Again, this phasing cannot in general be counted upon
to yield a known timing relationship, so the timing specification should be performed with
respect to the destination task's RG frame boundary following the reception of the last
packet of a message. If some desired phase relationship must be maintained, it can be
obtained via the frame phasing technique described in Section 5. Upon a frame
boundary, the AFTA intertask communication services process the received packet queue,
constructing and delivering messages to tasks which are at an RG boundary. Note that,
unless it is a latecomer, the message is already in the PE's private memory because it was
read from the NE during the packet delivery interrupt service routine. The time required to
perform incoming message processing is an increasing function of the number of new
messages to be assembled, the number of packets received in the previous frame, and the
number of new messages completed and ready for delivery to destination tasks. Denote this
time TRECEIVE MESSAGE. A plausible parametric formulation for this time interval is cur-
rently unknown and will be developed and verified during subsequent phases of the AFTA
development.
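The dependence of delivery latency on the phase between message arrival and the frame boundary can be sketched as follows. This is a minimal illustration that assumes a fixed frame period for the destination task's RG and ignores the incoming message processing time, for which the report gives no formula; the function and parameter names are hypothetical.

```python
import math


def delivery_latency(arrival_time, frame_period, frame_offset=0.0):
    """Time from arrival of the last packet to the next RG frame boundary.

    arrival_time: time at which the last packet of the message arrives
    frame_period: period of the destination task's RG frame (assumed constant)
    frame_offset: time of a reference frame boundary (assumed phase)
    """
    elapsed = arrival_time - frame_offset
    frames_needed = math.ceil(elapsed / frame_period)
    next_boundary = frame_offset + frames_needed * frame_period
    return next_boundary - arrival_time


# A message arriving 3 ms into a 10 ms frame waits 7 ms for the boundary.
latency_ms = delivery_latency(13.0, 10.0)
```

A message that arrives just after a boundary waits nearly a full frame, which is why the timing specification is stated relative to the boundary rather than to the arrival instant.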
9.1.3. Input/Output
The input latency is defined to be the time interval between the sampling of a physical
quantity by an input device and the delivery of the digital representation of that quantity to
the recipient task. The output latency is defined to be the time interval between the produc-
tion of a digital quantity by a source task and the delivery of that digital quantity to the de-
vice which converts that digital quantity to a physical quantity.
As described in Sections 4 and 5, AFTA will support numerous I/O devices and con-
trollers, including FTDB, MIL-STD-1553, mass semiconductor memory, rotating media
memory, discretes, analog, RS232, Ethernet, and possibly others to be determined at a fu-
ture date. The techniques used for accessing these devices have been partitioned into two
classes. In the first technique, known as concurrent I/O, an I/O task resident on one or
more members of a VG responsible for accessing an I/O device initiates an I/O transaction
by writing commands and data to the device. Immediately thereafter, the VG may initiate
other I/O, resume processing other I/O, or resume to other tasks while the I/O transaction
completes concurrently, under control of an autonomous I/O controller. After a specified
time interval, the I/O system services may return to the autonomous controller to process
the input and/or status data resulting from the transaction. It is the intent that at any given
time several concurrent I/O transactions will be in progress, probably under the control of
several VGs, in order to maximize AFTA I/O performance. Concurrent I/O is expected to
be useful for operation of complex I/O devices such as network controllers and rotating
storage media, where it is a waste of CPU time to wait around for lengthy transactions to
complete. The second I/O technique available in AFTA is called sequential I/O, in which
one or more members of a VG responsible for accessing an I/O device perform an I/O
transaction and wait until the I/O activity is completed before initiating other I/O or
resuming other tasks. Sequential I/O is suitable for accessing fast, low-latency devices which
may require atomic access, such as discrete input and output complexes, analog to digital
converters, digital to analog converters, etc. It is simpler than concurrent I/O, but should
be used with discretion since it nonpreemptively monopolizes the VG.
Prior to being provided to an output device, data may be transferred from the VG(s)
hosting the source task to the VG(s) responsible for the I/O activity, or, if they are one and
the same, the output data may be voted prior to being output; both such actions utilize the
Network Element message passing capabilities. In addition, input data may have to be
transferred from the I/O VG(s) to the destination VG(s) using one or more Class 2 ex-
changes. A subset of the large number of possibilities is enumerated in Section 5.
During the AFTA Conceptual Study, it was judged that generation of an I/O perfor-
mance model general enough to describe the disparate I/O devices, access techniques, and
input and output data distribution options possible in AFTA would be quite time-consum-
ing. Therefore it has been decided that, for reasons of expediency, construction of the
AFTA I/O performance model(s) will be deferred until more information is obtained about
anticipated I/O devices; this will occur during the Detailed Design phase of the program.
9.2. Reliability and Availability Models
The reliability and availability of an AFTA implementation is a function of the number
of FCRs and PEs, the VG redundancy levels, the mission environment, the operational and
maintenance scenario, and fault recovery procedures. Precise mathematical definitions of
the terms "reliability" and "availability" are used in this report. While high reliability and
availability contribute to dependable operation, they should not be construed to exhaus-
tively connote all attributes of dependable systems.
AFTA fault recovery options are enumerated in Section 5.6.6, and each one has a
strong impact on the overall AFTA reliability and availability. Since the construction of
reliability models for all the options described in Section 5.6.6 is beyond the scope of the
Conceptual Study phase, two have been selected for evaluation as being appropriate for
two extremes of temporal constraints which might be imposed on AFTA.
The first class of options, of which the graceful degradation and Network Element
masking in Section 5.6.6 are examples, are appropriate for an operational mode in which
little if any time is available for fault recovery. In this case, a faulty component in a redun-
dant VG or an NE is immediately disabled upon detection, with no lengthy fault recovery
attempted. No effort is made to discriminate between transient and permanent faults for the
purpose of performing on-line recovery, in effect treating all faults as permanent until a
more relaxed operational regime is entered. This option has the advantage of incurring no
dropout of functionality, but has the disadvantage of irreversibly reducing the redundancy
level of the faulted VG and hastening its demise due to redundancy exhaustion. Therefore
it may be viewed as being best suited for short missions having fast real-time constraints,
such as real-time control of mission-critical helicopter functions.
Figure 9-5 illustrates this fault recovery option: after the first failure of member A of
quadruply-redundant VG1, the faulted member is disabled, reducing VG1's redundancy
level to triplex. A second failure of one of VG1's members, say B, reduces its redundancy
level to "degraded triplex." For a degraded VG, the Network Element's main data path
packet voter masks the input from the faulted member and does not include it in the vote.
The Scoreboard, however, continues to consider a degraded VG's faulted channel when
calculating the VG's voted Output Buffer Not Empty (known as OBNE, an indication that
the VG has a packet to be transmitted from its Output Buffer) and voted Input Buffer Not
Full (IBNF, an indication that the VG is capable of receiving at least one packet in its Input
Buffer)†. This is to allow a faulted member of a degraded VG to remain in synchronization
with its parent VG to facilitate recovery operations. This capability is more robust and use-
ful for degraded quadruplex VGs than for degraded triplex VGs.
A third failure in VG1, say of member C, reduces its redundancy level to simplex, and
a fourth failure results in the loss of the functionality supported by VG1. The probability of
successfully transitioning from a faulted degraded triplex VG to a nonfaulty simplex is
significantly less than unity, and is represented by the "duplex coverage," CD.
† See Section 4 for a discussion of this terminology.
Figure 9-5. Graceful Degradation of Quadruplex VG1
When a fault recovery time on the order of a second or two is permissible, a wider
range of fault recovery options are available. Representatives of this class of options are
listed in Section 5.6.6 as processor resynchronization, processor reintegration, processor
replacement, processor replacement with initialization, task migration, and Network
Element resynchronization. All of these recovery options are characterized by their capability
to seek and find components sufficient to maximize the likelihood of forming a desired con-
figuration of redundant VGs, followed by either initializing or copying the state of the
newly reintegrated component into agreement with the surviving members of the faulted
VG. As mentioned earlier, this process, while maximizing the effective use of the
reconfigurable AFTA components, consumes one to two seconds to perform. As an example
of such a strategy in the context of the previous example, we reconsider the case of a
processor replacement fault recovery option applied to VG1†. After a failure of member A of
VG1, VG1's redundancy level can be restored by switching in (say) the PE adjacent to
member A. After the second failure of member B, a spare processor may be reintegrated,
again restoring VG1's quadruplex redundancy level, and so on (Figure 9-6).
This can continue until all the spares allocated to repairing VG1 are exhausted, at which
point the VG1 fault recovery policy may revert to the graceful degradation policy described
above, or another policy may go into effect.
The more leisurely fault recovery options in this class are more suited to less stressful
real-time operational regimes and missions, such as during the hiatus phase of the flight
mission where availability is to be maximized, or during a long ground mission where one
or two second dropouts are a reasonable tradeoff for significant mission longevity en-
hancement.
† Different VGs may have different fault recovery options, and the same VG's fault
recovery option can vary over the course of a mission.
Figure 9-6. Processor Replacement Redundancy Management for Quadruplex VG1
The following sections present formulations of the probability that AFTA can perform
its intended functions, i.e., form the requisite number of functioning VGs, when managed
according to the two fault recovery policies outlined above. Depending upon the use to
which AFTA is put at a given time, a given formulation will be equivalent to either
"reliability" or "availability". For example, when the processor replacement strategy is
used during hiatus to maximize AFTA availability, the formulation will refer to "AFTA
Mission Availability", whereas when it is used to calculate the probability that the Fault
Tolerant Navigation Processor is capable of performing its intended function during a
mission, the same formulation yields the "FTNP Mission Reliability". To attempt to generalize
the meaning of the formulations, the following notation is adopted.
The probability that AFTA can perform its intended functions, i.e., form the requisite
number of functioning VGs, when managed according to the graceful degradation class of
redundancy management policies is denoted
PGD
The probability that AFTA can perform its intended functions, i.e., form the requisite
number of functioning VGs, when managed according to the processor replacement class
of redundancy management policies is denoted
PPR
Recall from Section 2 that the reliability of the system is equal to the probability that all
functions needed to execute the mission are operational, or
$$R_{sys} = \mathrm{Prob}(F_j \text{ operational},\ \forall\, F_j \in S) \qquad (9.16)$$
The Function Reliability is the probability that a given function Fj can be executed
because its resources are operational
$$R_{F_j} = \mathrm{Prob}(resource_i \text{ operational},\ \forall\, resource_i \in F_j) \qquad (9.17)$$
The System Reliability is then
$$R_{sys} = \mathrm{Prob}(resource_i \text{ operational},\ \forall\, resource_i \in F_j,\ \forall\, F_j \in S) \qquad (9.18)$$
The probability that all needed VGs are functional is
$$R_{sys} = \mathrm{Prob}(VG_i \text{ operational},\ \forall\, VG_i \in F_j,\ \forall\, F_j \in S) \qquad (9.19)$$
If all VG reliabilities were independent, this would reduce to
$$R_{sys} = \prod_{VG_i \in F_j,\ F_j \in S} R(VG_i) \qquad (9.20)$$
Unfortunately they are not: they are correlated through their joint dependence upon fail-
ure of FCRs in which they have common members. However, if conditioned upon the
failure of the Fault Containment Regions in which their members reside, the probabilities
for VG reliability do become independent. For a single VG, one may write
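$$R(VG_i) = R(VG_i \mid \text{no FCR faults})\,\Pr(\text{no FCR faults})$$
$$\quad + \sum_{j=1}^{N_{NEs}} R(VG_i \mid FCR_j \text{ faulty})\,\Pr(FCR_j \text{ faulty})$$
$$\quad + K \sum_{j=1}^{N_{NEs}} \sum_{k=1,\ k \ne j}^{N_{NEs}} R(VG_i \mid FCR_j \text{ faulty and } FCR_k \text{ faulty})\,\Pr(FCR_j \text{ faulty})\,\Pr(FCR_k \text{ faulty}) \qquad (9.21)$$
where
$$K = \begin{cases} 0, & N_{NEs} < 5 \\ 1, & N_{NEs} = 5 \end{cases}$$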
For an AFTA consisting of multiple VGs, conditioning the VG reliabilities upon FCR
fault pattern allows us to conveniently express system reliability as a summation of terms,
each of which is a product of independent probabilities, or
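$$R_{sys} = \left[ \prod_{VG_i \in F_j,\ F_j \in S} R(VG_i \mid \text{no FCR faults}) \right] \Pr(\text{no FCR faults})$$
$$\quad + \sum_{n=1}^{N_{NEs}} \left[ \prod_{VG_i \in F_j,\ F_j \in S} R(VG_i \mid FCR_n \text{ faulty}) \right] \Pr(FCR_n \text{ faulty})$$
$$\quad + K \sum_{n=1}^{N_{NEs}} \sum_{m=1,\ m \ne n}^{N_{NEs}} \left[ \prod_{VG_i \in F_j,\ F_j \in S} R(VG_i \mid FCRs\ n \text{ and } m \text{ faulty}) \right] \Pr(FCR_n \text{ faulty})\,\Pr(FCR_m \text{ faulty}) \qquad (9.22)$$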
where
$$\Pr(\text{no FCR faults}) = R_{FCR}^{N_{NEs}}$$
$$\Pr(FCR_n \text{ faulty}) = \Pr(FCR_m \text{ faulty}) = R_{FCR}^{N_{NEs}-1}\, U_{FCR}$$
$R_x$ = reliability of component x
$$U_x = 1 - R_x$$
$$R_{FCR} = R_{NE}\, R_{PC}\, R_{BUS}$$
$$R_{NE} = e^{-\lambda_{NE} t}$$
$$R_{PC} = e^{-\lambda_{PC} t}$$
$$R_{BUS} = e^{-\lambda_{BUS} t}$$
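The component definitions above can be evaluated numerically as in the following sketch. It assumes constant (exponential) failure rates for the NE, power conditioner, and backplane bus, and computes only the FCR reliability and the probability that no FCR is faulty. The function names and the sample rates are hypothetical.

```python
import math


def fcr_reliability(lambda_ne, lambda_pc, lambda_bus, t):
    """R_FCR = R_NE * R_PC * R_BUS with exponential component reliabilities.

    Failure rates are in failures per hour; t is mission time in hours.
    """
    return math.exp(-(lambda_ne + lambda_pc + lambda_bus) * t)


def prob_no_fcr_faults(r_fcr, n_nes):
    """Probability that all n_nes FCRs are fault free (R_FCR raised to N_NEs)."""
    return r_fcr ** n_nes


# Hypothetical rates: 41, 10, and 5 failures per 10^6 hours, 2-hour mission, 4 FCRs.
r_fcr = fcr_reliability(41e-6, 10e-6, 5e-6, 2.0)
p_none = prob_no_fcr_faults(r_fcr, 4)
```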
9.2.1. Formulation for Graceful Degradation Class of Fault Recovery
We now present a formulation for PGD, the probability that all VGs needed to perform
the AFTA's intended function are operational when managed according to a graceful degra-
dation class of fault recovery policies. A given number of VGs are needed to perform the
functions, and as their members fail, the VGs' redundancy levels are reduced until the VG
is inoperable.
The overall analytical approach is to formulate an expression for the reliability of a VG
conditioned upon a given FCR failure pattern, assuming a graceful degradation redundancy
management policy. Then, the probabilities of the given FCR failure patterns are calcu-
lated.
Let $E(\lambda, \mu, t, r)$ represent the reliability at mission time t of a VG having processor
failure rate $\lambda$, fault recovery rate $\mu$, and redundancy level r, assuming PE faults only. This
is the probability of occurrence of all operational states (redundancy levels of 1, 2, 3, or 4)
of the VG minus the probability that the VG fails due to near-coincident PE faults.
E(;L, It, t, r) =
ci e._._ (l_e__ r(r-1)_, t
It
,r>0
0 ,r<0
(9.23)
where $c_i$ is the probability that a VG of redundancy level i+1 can successfully degrade to a
VG of redundancy level i.
If r > 1, then
$$c_i = \begin{cases} C_D, & i = 1 \\ 1.0, & i = 2 \\ 1.0, & i = 3 \\ 1.0, & i = 4 \end{cases}$$
If r = 1, then
$$C_D = 1.0$$
The parameter CD ranges from 0.5 to 0.90, depending upon the level of effort put into
tolerating faults in duplex VGs. A safe assumption is usually CD = 0.50, since at worst the
redundancy management function can, upon detecting a fault in a duplex VG, randomly
guess which one is faulty and mask it out.
Let nelist(VGi) represent the set of FCRs which contain at most one channel of VGi.
For example, if quadruply redundant VG1 has members in FCRs 0, 1, 3, and 4, then
nelist(VG1) = {0, 1, 3, 4}.
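The conditional VG reliability becomes
$$R(VG_i \mid \text{no FCR faults}) = E(\lambda_{PE}, \mu_{PE}, t, redlev_i) \qquad (9.24)$$
$$R(VG_i \mid FCR_j \text{ faulty}) = \begin{cases} E(\lambda_{PE}, \mu_{PE}, t, redlev_i), & j \notin nelist(VG_i) \\ E(\lambda_{PE}, \mu_{PE}, t, redlev_i - 1), & j \in nelist(VG_i) \end{cases} \qquad (9.25)$$
and
$$R(VG_i \mid FCRs\ j, k \text{ faulty}) = \begin{cases} E(\lambda_{PE}, \mu_{PE}, t, redlev_i), & j \notin nelist(VG_i) \text{ and } k \notin nelist(VG_i) \\ E(\lambda_{PE}, \mu_{PE}, t, redlev_i - 1), & j \notin nelist(VG_i) \text{ and } k \in nelist(VG_i) \\ E(\lambda_{PE}, \mu_{PE}, t, redlev_i - 1), & j \in nelist(VG_i) \text{ and } k \notin nelist(VG_i) \\ E(\lambda_{PE}, \mu_{PE}, t, redlev_i - 2), & j \in nelist(VG_i) \text{ and } k \in nelist(VG_i) \end{cases} \qquad (9.26)$$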
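The case analysis in the conditional reliabilities amounts to reducing a VG's redundancy level by one for each faulty FCR that hosts one of its channels. A minimal sketch of that bookkeeping follows; it assumes each FCR in nelist hosts exactly one channel of the VG, and the function name is hypothetical.

```python
def conditional_redundancy(redlev, nelist, faulty_fcrs):
    """Redundancy level of a VG given a set of faulty FCRs.

    redlev:      fault-free redundancy level of the VG
    nelist:      FCRs hosting a channel of the VG (one channel per FCR assumed)
    faulty_fcrs: FCRs assumed faulty in the conditioning event
    """
    channels_lost = len(set(nelist) & set(faulty_fcrs))
    return redlev - channels_lost


# Quadruplex VG1 with members in FCRs 0, 1, 3, and 4, as in the text's example.
vg1_nelist = {0, 1, 3, 4}
level_if_fcr2_fails = conditional_redundancy(4, vg1_nelist, {2})       # unaffected
level_if_fcr1_fails = conditional_redundancy(4, vg1_nelist, {1})       # one channel lost
level_if_fcr0_4_fail = conditional_redundancy(4, vg1_nelist, {0, 4})   # two channels lost
```

The resulting level is the value that would be passed as the last argument of E when evaluating the conditional reliability.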
This formulation for the conditional VG reliability is used to compute PGD:
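$$P_{GD} = \left[ \prod_{VG_i \in F_j,\ F_j \in S} R(VG_i \mid \text{no FCR faults}) \right] \Pr(\text{no FCR faults})$$
$$\quad + \sum_{n=1}^{N_{NEs}} \left[ \prod_{VG_i \in F_j,\ F_j \in S} R(VG_i \mid FCR_n \text{ faulty}) \right] \Pr(FCR_n \text{ faulty})$$
$$\quad + K \sum_{n=1}^{N_{NEs}} \sum_{m=1,\ m \ne n}^{N_{NEs}} \left[ \prod_{VG_i \in F_j,\ F_j \in S} R(VG_i \mid FCRs\ n \text{ and } m \text{ faulty}) \right] \Pr(FCR_n \text{ faulty})\,\Pr(FCR_m \text{ faulty}) \qquad (9.27)$$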
9.2.2. Formulation for Processor Replacement Class of Fault Recovery
Let NVG denote the number of VGs required to meet the mission throughput and other
performance requirements, and let redlevi denote the redundancy level required for VGi to
meet its mission reliability requirements. If NVGs of the appropriate redundancy levels
cannot be formed, then AFTA cannot meet its mission requirements. In this section we
produce a formulation that a desired AFTA configuration can be constructed from the PEs
and NEs which are nonfaulty, given a redundancy management strategy from the processor
replacement class. For the following analysis it is assumed that the term VG also includes
the IOCs.
Assume that the fault-free AFTA is composed of NNE Network Elements and hence
NNE FCRs. It is assumed that each VG i must have redundancy level redlevi. Not all
VGs need have identical redundancy levels. A total of at least
$$N_{PEs} = \sum_{i=1}^{N_{VG}} redlev_i \qquad (9.28)$$
PEs are required to form NVG VGs, each VGi of which has redundancy level redlevi.
If the AFTA is composed of 4 FCRs, then to construct the required VG configuration each
FCR must contain at least
$$N_4 = \lceil N_{PEs}/4 \rceil \qquad (9.29)$$
PEs, while if the AFTA is composed of 5 FCRs, then each FCR must contain at least
$$N_5 = \lceil N_{PEs}/5 \rceil \qquad (9.30)$$
PEs. Any four PEs resident in different FCRs can form a quad VG and any three a triplex
VG. The probability that the requisite VGs can be formed under a processor replacement
type of redundancy management strategy is
$$P_{PR} = \Pr(\#\text{PEs per FCR} \ge N_4)\,\Pr(\text{exactly 4 FCRs operational})$$
$$\quad + \Pr(\#\text{PEs per FCR} \ge N_5)\,\Pr(\text{exactly 5 FCRs operational}) \qquad (9.31)$$
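The PE counts in Equations 9.28 through 9.30 can be worked through with a short sketch. It assumes the bracketed quotients denote rounding up to the next whole PE, which matches the "at least" wording; the function name and the sample VG configuration are hypothetical.

```python
import math


def pes_required(redlevs, num_fcrs):
    """Return (total PEs needed, minimum PEs per FCR).

    redlevs:  required redundancy level of each VG
    num_fcrs: number of FCRs in the configuration (4 or 5 in the report)
    """
    n_pes = sum(redlevs)
    per_fcr = math.ceil(n_pes / num_fcrs)  # assumed ceiling
    return n_pes, per_fcr


# Hypothetical configuration: one quadruplex VG and two triplex VGs.
total_pes, n4 = pes_required([4, 3, 3], 4)
_, n5 = pes_required([4, 3, 3], 5)
```

With ten PEs required, a four-FCR configuration needs three PEs per FCR while a five-FCR configuration needs two.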
To meet the mission requirements, each FCR must have either N4 or N5 PEs, depending
on the number of FCRs the AFTA configuration possesses. We assume that under
fault-free conditions each FCR has an equal complement of PEs. To increase availability,
spare PEs may be added to each FCR to bring the total number of PEs in each FCR up to
NT. If the AFTA configuration possesses 4 FCRs, then NT > N4; if the configuration
possesses 5 FCRs, then NT > N5. The probability that the requisite number of VGs can be
formed is
T R n U (NT_ n)
PPR = t n=N4 PE PE ....
R FCRU FCR
+K
N.r 1n=Ns_ ]
(9.32)
where
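$$R_{PE} = e^{-\lambda_{PE} t}$$
$$R_{FCR} = R_{NE}\, R_{PC}\, R_{BUS}$$
$$R_{NE} = e^{-\lambda_{NE} t}$$
$$R_{PC} = e^{-\lambda_{PC} t}$$
$$R_{BUS} = e^{-\lambda_{BUS} t}$$
$$U_x = 1 - R_x$$
and
$$K = \begin{cases} 0, & N_{NEs} < 5 \\ 1, & N_{NEs} = 5 \end{cases}$$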
9.2.3. Failure Rate Calculation Methodology
For the purposes of reliability and availability calculations, AFTA is partitioned into
LRMs and LRUs, each of which has an associated failure rate. The LRMs include proces-
sors, Network Elements, power conditioners, and input/output controllers. The LRU's
primary contribution to AFTA failure rate is its backplane bus: if the bus fails, then it is as-
sumed that the entire FCR is unusable. Secondary non-Byzantine resilient techniques may
be used in a given AFTA implementation to reduce the probability of FCR backplane bus
failure.
Most AFTA components are Non Developmental Items (NDI), for which failure rates
and plausible calculational means should be provided by their vendor. These are usually
based on MIL-HDBK-217E analyses and are furnished along with the components' docu-
mentation.
In the current analysis, we focus on the estimation and minimization of NE failure rate.
9.2.3.1. Environmental Effects
The AFTA will potentially reside in a number of different vehicles under a number of
different operational and environmental conditions. These conditions must be specified for
each operational mode of the system.
When possible, MIL-HDBK-217E will be used to estimate the component failure
rates of the AFTA. CECOM/RAMECES field data will also be used when available.
When component failure rates are estimated using MIL-HDBK-217E, the effect of
the operational environment is taken into account by multiplying the component failure rate
by an environmental multiplier $\Pi_E$. Values of $\Pi_E$ for monolithic microelectronic devices in
various operational environments are given below.
Environment                  $\Pi_E$
Ground, Benign               0.38
Ground, Fixed                2.5
Manpack                      3.8
Ground, Mobile               4.2
Airborne, Rotary Winged      8.5
Cannon, Launch               220.
Table 9-2. Environmental Failure Rate Multipliers for Monolithic Microelectronic Devices
9.2.3.2. PE Failure Rate Calculations
PE Mean Time Between Failures (MTBFs) are obtained from the board manufacturers
and are summarized in the table below. The failure rates are assumed to be the reciprocal of
the MTBFs. On the whole, the MTBFs cited by the vendors seem much higher than
experience would indicate. Selected vendors were contacted for details regarding their MTBF
calculation methodology, but no such information was received in time for inclusion into
this report. This information would of course be required for any fielded version of AFTA
as part of the vendors' documentation.
PE Type                      Environment                   MTBF       $\Pi_E$
Radstone PMV 68M CPU-3A      Ground, Mobile, 45C           16,982h    4.2
Lockheed Sanders STAR MVP    Airborne, Uninhabited, 40C    32,000h    6.0†
SAVA GPPM                    Ground, Mobile, 85C*          31,000h    4.2
Table 9-3. PE Cited Failure Rate Data
These failure rate data must be converted from the cited environment to the anticipated
operational environment. The technique chosen to approximate this conversion is to
multiply the MTBF cited by the manufacturer by the cited environment's $\Pi_E$, divided by the
anticipated environment's $\Pi_E$.
† There are six Aircraft, Uninhabited environments specified in 217E, with $\Pi_E$ values ranging
from 3.0 to 9.0; the Lockheed data do not specify which is meant. The AUA, Aircraft,
Uninhabited, Attack environment is assumed as it is approximately the numerical mean of
all Aircraft, Uninhabited multipliers.
* This temperature is far above the 50C specified in 217E as being representative of the
Ground, Mobile environment. Moreover, it is inconsistent with the SAVA maximum
operating temperature specification of 78C. It is therefore assumed to be a typographic
error in the draft SAVA standard.
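The environment conversion just described is a single scaling step, sketched below with the cited Radstone and Lockheed figures. The function name is hypothetical; the multipliers and cited MTBFs are those given in the tables of this section.

```python
def convert_mtbf(cited_mtbf_h, cited_pi_e, target_pi_e):
    """Scale a vendor MTBF from its cited environment to the anticipated one.

    The MTBF is multiplied by the cited environment's multiplier and divided
    by the anticipated environment's multiplier.
    """
    return cited_mtbf_h * cited_pi_e / target_pi_e


# Radstone board: cited for Ground, Mobile (4.2), converted to Ground, Fixed (2.5).
radstone_hiatus_h = convert_mtbf(16982, 4.2, 2.5)

# Lockheed board: cited multiplier 6.0, converted to the rotary winged value (8.5).
lockheed_aircraft_h = convert_mtbf(32000, 6.0, 8.5)
```

A harsher anticipated environment (larger multiplier) shortens the MTBF, and a more benign one lengthens it.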
9.2.3.2.1. Hiatus: Ground Fixed
The hiatus is assumed to occur in the Ground, Fixed environment as specified in 217E,
with a $\Pi_E$ of 2.5.
PE Type                      MTBF       Multiplier        $\Pi_E$
Radstone PMV 68M CPU-3A      28,530h    4.2/2.5 = 1.68    2.5
Lockheed Sanders STAR MVP    76,800h    6.0/2.5 = 2.40    2.5
SAVA GPPM                    52,080h    4.2/2.5 = 1.68    2.5
Table 9-4. PE Hiatus Failure Rate Data
9.2.3.2.2. Aircraft Mission
The aircraft mission is assumed to occur in the Aircraft, Rotary environment as specified
in 217E, with a $\Pi_E$ of 8.5.
PE Type                      MTBF       Multiplier        $\Pi_E$
Radstone PMV 68M CPU-3A      8,391h     4.2/8.5 = 0.49    8.5
Lockheed Sanders STAR MVP    22,588h    6.0/8.5 = 0.71    8.5
SAVA GPPM                    15,190h    4.2/8.5 = 0.49    8.5
Table 9-5. PE Aircraft Mission Failure Rate Data
9.2.3.2.3. Ground Mission
The ground mission is assumed to occur in the Ground, Mobile environment as specified
in 217E, with a $\Pi_E$ of 4.2.
PE Type                      MTBF       Multiplier        $\Pi_E$
Radstone PMV 68M CPU-3A      16,982h    4.2/4.2 = 1.0     4.2
Lockheed Sanders STAR MVP    45,760h    6.0/4.2 = 1.43    4.2
SAVA GPPM                    31,000h    4.2/4.2 = 1.0     4.2
Table 9-6. PE Ground Mission Failure Rate Data
9.2.3.3. NE Failure Rate Calculations
9.2.3.3.1. Methodology
The AFTA Network Element failure rate is calculated using the MIL-HDBK-217E Parts
Stress Analysis technique. The failure rate of two Network Element implementations will
be calculated. The Baseline board is constructed of a combination of NDI ICs and the
Scoreboard ASIC, as described in Section 4.5.2.2. It is believed that this is the minimum
level of integration needed to allow the AFTA NE to fit on a single VMEbus-compatible or
SAVA-compatible module. The High-End board consists of four ASICs, the DPRAM,
and the fiber optic components, as described in Section 4.5.2.4, and represents an aggressive
packaging approach which would allow the NE to readily fit on a JIAWG or smaller
module. (It is likely that the NE can fit on a JIAWG module with a lower level of
integration.)
9.2.3.3.2. Assumptions
The assumptions used in calculating the NE failure rate are as follows. The failure rate of
the "Baseline NE," that is, the NE containing the Scoreboard ASIC, is calculated first. The
High-End board is evaluated as a perturbation to the Baseline. Failure rates are calculated
using the MIL-HDBK-217E Parts Stress Analysis technique. The NE is assumed to consist
of Class B parts, comprising hermetic, ceramic, eutectically bonded integrated circuits
mounted to the board using plated through holes (PTH). The NE board is assumed to
comply with the MIL-STD-344 form factor, and to consist of 6 signal layers. Maximum
integrated circuit power dissipation specifications are used when estimating junction
temperatures. We note that the maximum power dissipation is a worst-case assumption and
can differ from typical power dissipation figures by a factor of two. The learning factor $\Pi_L$
and the voltage stress derating factor $\Pi_V$ are assumed to be 1.0.
The NE failure rate comprises four main contributors: the integrated circuits, the fiber
optic components or "plant", the onboard pins connecting the integrated circuits and back-
plane connector to the printed circuit board, and the backplane pins connecting the NE to
the FCR backplane bus.
The integrated circuit failure rates are calculated according to Section 5.1.2.1 of MIL-
HDBK-217E, as
$$\lambda_p = \Pi_Q \left( C_1 \Pi_T \Pi_V + C_2 \Pi_E \right) \Pi_L \ \text{failures per } 10^6 \text{ hours} \qquad (9.33)$$
where
$\lambda_p$ is the device failure rate in failures per 10^6 hours
$\Pi_Q$ is the quality factor (1.0 for MIL-M-38510 Class B components)
$\Pi_T$ is the temperature acceleration factor, based on technology, given by
$$\Pi_T = 0.1 \exp\left[ -A \left( \frac{1}{T_C + \theta_{JC} P + 273} - \frac{1}{298} \right) \right] \qquad (9.34)$$
A = 4635 for TTL parts, 6373 for CMOS parts
$T_C$ is the case temperature (a function of the operational environment; see
Table 5.1.2.7-4 in MIL-HDBK-217E)
$\theta_{JC}$ is the junction-to-case thermal resistance (a function of the die attach
method and the number of pins; see Table 5.1.2.7-4 in MIL-HDBK-217E)
P is the integrated circuit power dissipation
$\Pi_V$ is the voltage stress derating factor (assumed to be 1.0)
$\Pi_E$ is the application environment factor
$C_1$ is the circuit complexity factor based on gate count and technology (see
p. 5.1.2.1-1 of MIL-HDBK-217E)
$C_2$ is the package complexity failure rate (see Table 5.1.2.7-16 of MIL-HDBK-217E)
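The integrated circuit model above can be exercised with the following sketch. It assumes the temperature factor takes the Arrhenius-style form 0.1 exp[-A (1/T_J - 1/298)], with junction temperature T_J = T_C + theta_JC * P + 273 in kelvin, and that the device failure rate combines the factors as Pi_Q (C1 Pi_T Pi_V + C2 Pi_E) Pi_L. The function names and all sample inputs are hypothetical and are not taken from the handbook tables.

```python
import math


def temperature_factor(a, case_temp_c, theta_jc, power_w):
    """Temperature acceleration factor for an assumed junction temperature.

    a:           technology constant (4635 for TTL, 6373 for CMOS per the text)
    case_temp_c: case temperature in degrees C
    theta_jc:    junction-to-case thermal resistance in degrees C per watt
    power_w:     device power dissipation in watts
    """
    junction_k = case_temp_c + theta_jc * power_w + 273.0
    return 0.1 * math.exp(-a * (1.0 / junction_k - 1.0 / 298.0))


def ic_failure_rate(pi_q, pi_t, pi_v, pi_e, pi_l, c1, c2):
    """Device failure rate in failures per 10^6 hours (assumed form of the model)."""
    return pi_q * (c1 * pi_t * pi_v + c2 * pi_e) * pi_l


# Hypothetical CMOS part: 35 C case, 20 C/W, 0.5 W, Ground Fixed multiplier 2.5.
pi_t = temperature_factor(6373, 35.0, 20.0, 0.5)
rate = ic_failure_rate(1.0, pi_t, 1.0, 2.5, 1.0, 0.02, 0.01)
```

Using the maximum rather than typical power dissipation raises the junction temperature and therefore the temperature factor, which is why the report describes that choice as worst case.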
The fiber optic connector and splitter failure rates are calculated according to Section
5.1.20 of MIL-HDBK-217E. The optical emitter and detector failure rates are calculated
from [APS90].
The failure rate of the onboard pins connecting the integrated circuits to the printed circuit
board is calculated according to Section 5.1.13 of MIL-HDBK-217E. Note that plated
through holes (PTH) are assumed to be used.
$$\lambda_p = \lambda_b \Pi_Q \Pi_E \left[ n_1 \Pi_C + n_2 \left( \Pi_C + 13 \right) \right] \ \text{failures per } 10^6 \text{ hours}$$
where
$\lambda_p$ = failure rate due to onboard pins
$\lambda_b$ = base failure rate (0.000041 per 10^6 hours for wave-soldered boards,
according to MIL-HDBK-217E Table 5.1.13-1)
$\Pi_Q$ = quality factor (assumed to be 1.0)
$\Pi_E$ = environmental factor (MIL-HDBK-217E Table 5.1.13-4)
$n_1$ = number of wave soldered PTHs (obtained by summing the pin counts
of the integrated circuits on the board plus the number of PTHs required to
connect the backplane connector to the circuit board: 192 for VME, 256 for
SAVA, and 250 for JIAWG LRMs)
$n_2$ = number of hand soldered PTHs (assumed to be 0)
$\Pi_C$ = complexity factor (2.0 for a 6-layer NE board)
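A minimal sketch of the onboard pin calculation follows, assuming the model has the form lambda_b Pi_Q Pi_E [n1 Pi_C + n2 (Pi_C + 13)]. The default values are the ones stated in the definitions above; the function name, the pin count, and the environmental factor in the example are hypothetical.

```python
def onboard_pin_failure_rate(n1, n2, pi_e, pi_c=2.0, pi_q=1.0, lambda_b=0.000041):
    """Failure rate of plated-through-hole connections, failures per 10^6 hours.

    n1:   number of wave soldered PTHs
    n2:   number of hand soldered PTHs (0 in the report's assumptions)
    pi_e: environmental factor for the connection model
    """
    return lambda_b * pi_q * pi_e * (n1 * pi_c + n2 * (pi_c + 13.0))


# Hypothetical board: 1000 wave soldered PTHs, no hand soldering, pi_e of 1.0.
pins_rate = onboard_pin_failure_rate(1000, 0, 1.0)
```

Because the rate is linear in the pin count, replacing many small packages with a few ASICs reduces this term roughly in proportion to the pins removed.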
Finally, the contribution to the NE's failure rate due to the mating connector pair be-
tween the NE and the FCR backplane bus is calculated according to Section 5.1.12.2 of
MIL-HDBK-217E as follows.
$$\lambda_p = \lambda_b \Pi_E \Pi_P \Pi_K \ \text{failures per } 10^6 \text{ hours}$$
where
$\lambda_b$ = base failure rate, given by
$$\lambda_b = 0.216 \exp\left[ \frac{-2073.6}{T + 273} + \left( \frac{T + 273}{423} \right)^{4.66} \right] \ \text{failures per } 10^6 \text{ hours}$$
T = operating connector temperature, C
$\Pi_E$ = environmental factor (MIL-HDBK-217E Table 5.1.12.2-4)
$\Pi_P$ = pin factor, a function of N
N = number of active pins (192 for VMEbus, 256 for SAVA, 250 for JIAWG)
$\Pi_K$ = cycling factor (cycle defined as unmating/mating of the connector,
from MIL-HDBK-217E Table 5.1.12.2-7)
9.2.3.3.3. AFTA Hiatus NE Failure Rate
Using the calculations described above the failure rate of the baseline NE under hiatus
conditions is summarized below. The first table shows the NE tentative parts list and fail-
ure rate for each part. This is a preliminary parts list which will probably change as the
Detailed Design phase of the AFTA program proceeds; the AFTA analysis program will
track these design refinements.
CHIPS
Scoreboard ASIC
IDT 7202
Signetics VME Controller
Altera EPM 5064 JC-1
Altera EPM 5128 JC-1
Lattice GAL 18V10-15LP
Lattice GAL 26V12-15LP
IDT 2K x 8 SRAM IDT 6116 LA 20 TD
IDT 4K x 8 DPRAM IDT 7134 L 35 J
IDT 64 X _ FIFO IDT 72402 L 25 P
Lattice 22V10
Altera 5032
TI Octal Bus Transceiver
TI Octal Trans w/Reg SN74ALS_tANT
IDT 16k x 8 DPRAM IDT 7006 S_G
IDT 2910 Microsequencer IDT 39C10C
IDT 4k x 16 RAM w SPC IDT 71502 S 25 J
Dallas Semiconductor watchdog timer DS1232
AMD Taxi Transmitter AM7968-125 JC
•
AT&T Optical Transmitter
AT&T Optical Receiver
PROMs
Cypress Registered PROMs CY7C245A-25WC
Oscillators
Vectron 50 MHz oscillator CO-238A-O
Other Parts
Fiber Optic Connector
Optic Fiber
Failures / 10^6
1.21E+00
1.02E+00
2.33E-01
8.92E-01
3.64E+00
3.07E-01
4.14E-01
1.11E-01
2.05E-01
2.63E-02
3.84E-01
1.12E+00
_.79E-02
6.99E-02
1.21E+00
1.01E-01
1.85E+00
8.72E-03
7.18E-02
2.39E-01
1.63E+00
5.08E+00
1.99E+01
Failure Rate / 10^6
7.31E-02
7.31E-02
Failure Rate / 10^6
8.31E-02
8.31E-02
Failure Rate / 10^6
5.00E-01
5.00E-01
5.00E-01
Table 9-7. AFTA Baseline NE Parts Hiatus Failure Rates
The next table shows the total NE failure rate and how it is broken down according to
the integrated circuits ("silicon" in the table), the fiber optic plant, the onboard pins, and the
backplane pins. This information is depicted graphically in Figure 9-7.
Failure Rates             per 10^6 hours
Silicon                   2.00E+01
FO Plant                  1.50E+00
Onboard Pins              1.90E+01
Backplane Pins            2.54E-01
Total                     4.08E+01
NE MTBF (hours)           2.45E+04
Table 9-8. AFTA Baseline NE Hiatus Failure Rate and Constituents
Figure 9-7. AFTA Hiatus NE Failure Rate Constituents
The calculations indicate that the Baseline AFTA NE has a hiatus failure rate of 41
failures per 10^6 hours, corresponding to an MTBF of 24,500h. From this calculation we
also conclude that approximately 50% of the hiatus NE failure rate is due to onboard
IC-to-board connections, and approximately 50% of the hiatus NE failure rate is due to the
integrated circuitry. Thus increasing the level of integration of the NE could at most
increase its hiatus MTBF by a factor of two.
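The totals and percentages quoted above follow directly from the constituent failure rates. The sketch below sums the hiatus constituents from Table 9-8, converts the total to an MTBF, and computes each constituent's share; the function and variable names are hypothetical.

```python
def ne_summary(constituents):
    """Return (total failure rate per 10^6 h, MTBF in hours, share per constituent)."""
    total = sum(constituents.values())
    mtbf_hours = 1.0e6 / total
    shares = {name: rate / total for name, rate in constituents.items()}
    return total, mtbf_hours, shares


# Hiatus constituents, failures per 10^6 hours, as listed in Table 9-8.
hiatus = {
    "Silicon": 20.0,
    "FO Plant": 1.50,
    "Onboard Pins": 19.0,
    "Backplane Pins": 0.254,
}
total_rate, mtbf_hours, shares = ne_summary(hiatus)
```

The computed silicon and onboard pin shares are each a little under one half, consistent with the "approximately 50%" figures in the text.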
9.2.3.3.4. AFTA Aircraft Mission NE Failure Rate
The failure rate for the AFTA Baseline NE under Aircraft, Rotary Winged environmen-
tal conditions is presented below. For conciseness the parts list is omitted.
Failure Rates             per 10^6 hours
Silicon                   3.66E+01
FO Plant                  1.50E+00
Onboard Pins              1.45E+02
Backplane Pins            1.42E+00
Total                     1.85E+02
NE MTBF (hours)           5.42E+03
Table 9-9. AFTA Baseline NE Flight Mission Failure Rate and Constituents
Figure 9-8. AFTA Flight Mission NE Failure Rate Constituents
In the helicopter environment, the Baseline NE failure rate has increased significantly to
185 failures per 10^6 hours, corresponding to an MTBF of 5,400h. It is of interest to note
that approximately 80% of the flight mission NE failure rate is now due to the onboard
IC-to-board connections, and only 20% of the failure rate is due to the integrated circuitry.
This result is most likely due to the severe vibration characteristic of the helicopter
environment. The obvious implication is that increasing the level of NE integration via the use
of ASICs could have a significant impact on the NE's reliability under a helicopter flight
environment.
9.2.3.3.5. AFTA Ground Mission NE Failure Rate
The failure rate of the AFTA Baseline NE under Ground, Mobile conditions is pre-
sented below.
Failure Rates             per 10^6 hours
Silicon                   2.43E+01
FO Plant                  1.50E+00
Onboard Pins              6.62E+01
Backplane Pins            6.19E-01
Total                     9.26E+01
NE MTBF (hours)           1.08E+04
Table 9-10. AFTA Baseline NE Ground Mission Failure Rate and Constituents
Figure 9-9. AFTA Ground Mission NE Failure Rate Constituents (chart legend: Silicon, FO
Plant, Onboard Pins, Backplane Pins)
The AFTA Baseline Ground Mission NE failure rate is estimated to be 93 failures per
10^6 hours, corresponding to an MTBF of 10,800h. Of this failure rate, approximately
72% is due to onboard pins, with 26% due to the integrated circuitry. Again, a significant
benefit can be obtained by reducing the number of onboard pins through increasing
the level of integration of the silicon.
9.2.3.3.6. Implications and Indicated Course of Action
The AFTA Baseline failure rate calculations are summarized below.
Environment                MTBF, h    % due to ICs    % due to onboard pins
Ground, Fixed              24,500     50              50
Aircraft, Rotary Winged    5,400      20              80
Ground, Mobile             10,800     26              72
Table 9-11. Summary of AFTA Baseline NE MTBF Data
These results imply that a higher level of circuitry integration could reduce the hiatus
NE failure rate by up to 50% and the mission NE failure rate by at most 80%, assuming
that an increased level of integration does not reduce the reliability of the silicon.
This motivates repetition of the calculation using a partitioning of the NE into
VHSIC/VLSI ASICs corresponding to the partitioning described as the "High End Network
Element" in Section 4. In this partitioning, the NE is comprised of four ASICs and 16K x
32 bits of DPRAM. The four ASICs correspond to the Scoreboard, the Global Controller,
the Voter/FTC, and the VME Controller. The gate count, pin count, and power dissipation
estimates are presented in the table below. For a detailed breakdown of the constituents of
these devices, refer to Section 4.5.2.5. Also note that the largest devices, notably the
Global Controller and Dual-Ported RAM (DPRAM), are mostly RAM and ROM.
Device               # Gates    # Pins    Power Consumption, W
Scoreboard           130K       145       2.172
Global Controller    658K       80        0.675
Voter/FTC            55K        193       1.370
VME Controller       156K       131       1.026
DPRAM                1M         272       2.5
Total                2M         821       7.8
Table 9-12. Gates, Pins, and Power Consumption of High-End NE
MIL-HDBK-217E formulations do not extend to gate counts of highly integrated (e.g.,
30,000+ gate) ASICs, so other analytical means will be required to obtain plausible failure
rates. Specifically, the devices will be broken up into logic gate count, RAM bit count,
and package complexity, the failure rate for each segment of the ASIC will be computed
separately, and the three will be combined. Failure rate due to logic gates will use the
formulation
$$\lambda_{p,\mathrm{logic}} = \Pi_Q \left( C_{1,\mathrm{logic}} \, \Pi_T \, \Pi_V \right) \Pi_L \quad \text{failures}/10^6 \text{ hours}$$
failure rate due to RAM will use the formulation
$$\lambda_{p,\mathrm{RAM}} = \Pi_Q \left( C_{1,\mathrm{RAM}} \, \Pi_T \, \Pi_V \right) \Pi_L \quad \text{failures}/10^6 \text{ hours}$$
and failure rate due to package complexity will use the formulation
$$\lambda_{p,\mathrm{package}} = \Pi_Q \, C_2 \, \Pi_E \, \Pi_L \quad \text{failures}/10^6 \text{ hours}$$
where $\Pi_Q$, $\Pi_T$, $\Pi_V$, $\Pi_E$, $\Pi_L$, and the failure rate formulations due to
onboard pins, fiber optic components, and backplane connectors are unchanged from previous
calculations. Because of the large number of pins on each package, $\theta_{JC}$ is estimated
as 25 C/W according to MIL-HDBK-217E.
Scoreboard
50K logic gates      C1,logic = 0.24 (extrapolated from MIL-HDBK-217E)
40Kbits RAM          C1,RAM = 0.20
package              C2 = 0.076
The device failure rate is, in failures per 10^6 hours,
GF      AR      GM
3.4     6.9     4.4
where, as usual, GF refers to the Ground, Fixed hiatus environment, AR refers to an
Aircraft, Rotary Wing mission environment, and GM refers to a Ground, Mobile environ-
ment. Note that this Scoreboard ASIC has a higher failure rate than that in the Baseline
NE. This is because the Baseline Scoreboard ASIC does not have RAM integrated on it as
does the High End Scoreboard ASIC.
Global Controller
2.5K logic gates     C1,logic = 0.04
328Kbits RAM         C1,RAM = 0.60 (extrapolated from MIL-HDBK-217E)
package              C2 = 0.032
The device failure rate is, in failures per 10^6 hours,
GF      AR      GM
0.8     1.9     1.1
Voter/FTC
30K logic gates      C1,logic = 0.16
12Kbits RAM          C1,RAM = 0.10
package              C2 = 0.076
The device failure rate is, in failures per 10^6 hours,
GF      AR      GM
0.9     2.2     1.3
VME Controller
24K logic gates      C1,logic = 0.16
66Kbits RAM          C1,RAM = 0.40
package              C2 = 0.053
The device failure rate is, in failures per 10^6 hours,
GF      AR      GM
1.1     2.5     1.4
DPRAM
512Kbits RAM         C1,RAM = 0.80
package              C2 = 0.10
The device failure rate is, in failures per 10^6 hours,
GF      AR      GM
8.5     16.4    10.7
The total NE device failure rates are, in failures per 10^6 hours,
GF      AR      GM
14.7    29.9    18.9
The single largest contributor to the High End ASIC failure rate is the DPRAM (58%,
55%, and 57% for the three mission environments). The total number of onboard pins in
the High-End NE is 1021 (821 connecting the ASICs to the board and approximately 200
connecting the backplane connector to the board), compared to 1682 for the Baseline NE.
The NE's power dissipation is also reduced from 40W to 13.4W.
The net result of the integration of the NE circuitry into VHSIC/VLSI ASICs is as fol-
lows.
The hiatus failure rate for the High-End board is
Component             Failure Rate, per 10^6 hours
Silicon               14.7
FO Plant              1.5
Onboard Pins          10.5
Backplane Pins        0.25
Total Failure Rate    27
MTBF                  37,106h
The Aircraft Mission failure rate for the High-End board is
Component             Failure Rate, per 10^6 hours
Silicon               29.9
FO Plant              1.5
Onboard Pins          80
Backplane Pins        1.42
Total Failure Rate    113
MTBF                  8,863h
The Ground Vehicle Mission failure rate for the High-End board is
Component             Failure Rate, per 10^6 hours
Silicon               18.9
FO Plant              1.5
Onboard Pins          36.4
Backplane Pins        0.6
Total Failure Rate    57.4
MTBF                  17,421h
The following table compares the MTBFs of the AFTA Baseline NE and the AFTA High End
NE for the three operational environments under consideration. The improvement factor is
calculated as the ratio of the High End NE MTBF to the Baseline NE MTBF.
Environment                Baseline NE    High End NE    Improvement Factor
Ground, Fixed              24,500h        37,106h        1.51
Aircraft, Rotary Winged    5,400h         8,863h         1.64
Ground, Mobile             10,800h        17,421h        1.61
Table 9-13. Comparison of MTBFs of Baseline and High End AFTA Network Element
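The arithmetic behind this comparison can be checked with a short script. The following is a minimal sketch, not part of the original analysis: it assumes the constituents combine as a simple series model (failure rates add, and MTBF is the reciprocal of the total rate), and the function and dictionary names are illustrative. The numbers are copied from the High End tables and the Baseline summary above.

```python
# Failure-rate constituents of the High End NE, in failures per 10^6 hours.
HIGH_END_RATES = {
    "GF": {"silicon": 14.7, "fo_plant": 1.5, "onboard_pins": 10.5, "backplane_pins": 0.25},
    "AR": {"silicon": 29.9, "fo_plant": 1.5, "onboard_pins": 80.0, "backplane_pins": 1.42},
    "GM": {"silicon": 18.9, "fo_plant": 1.5, "onboard_pins": 36.4, "backplane_pins": 0.6},
}

# Baseline NE MTBFs in hours, as summarized in the text.
BASELINE_MTBF_H = {"GF": 24_500, "AR": 5_400, "GM": 10_800}


def mtbf_hours(rates):
    """MTBF in hours of a series combination of rates given per 10^6 hours."""
    return 1.0e6 / sum(rates.values())


def improvement_factor(env):
    """Ratio of High End NE MTBF to Baseline NE MTBF for one environment."""
    return mtbf_hours(HIGH_END_RATES[env]) / BASELINE_MTBF_H[env]


if __name__ == "__main__":
    for env in ("GF", "AR", "GM"):
        print(env, mtbf_hours(HIGH_END_RATES[env]), improvement_factor(env))
```

Small differences from the tabulated MTBFs can arise from rounding or truncation of the last digit.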
9.2.3.4. IOC Failure Rate Calculations
As for all NDI AFTA modules, IOC failure rates must be provided by the manufacturer
of the modules or calculated from detailed design information as outlined above.
Currently, no IOCs have been definitively selected for inclusion in AFTA and therefore no
failure rate data are available. As the Detailed Design phase proceeds and the AFTA I/O
suite is definitized, these data will be obtained and incorporated into the AFTA analytical
model suite.
9.2.3.5. PC Failure Rate Calculations
Power conditioner failure rates for militarized VMEbus- and SAVA-compatible mod-
ules were not obtained under the Conceptual Study phase of the AFTA program. How-
ever, [MA-HDBK] contains a list of JIAWG-like power conditioners and failure rate data.
We make the approximation that a PC packaged in VMEbus or SAVA form factor will have
a similar failure rate, and use the following PC failure rate data.
PC Type                           MTBF       Environment
Varo Power Systems Model 24039    20,000h    Airborne, Uninhabited, Fighter, 71C
General Dynamics Model PS32       15,850h    Airborne, Uninhabited, Fighter*
Table 9-14. PC Cited Failure Rate Data
For the AFTA Conceptual Study analysis we assume the existence of a generic PC
having an MTBF of 17,500h under the Airborne, Uninhabited, Fighter environment. As
usual, a module's failure rate is assumed to be the reciprocal of its MTBF.
9.2.3.5.1. Hiatus
The hiatus MTBF of the generic AFTA PC is estimated as
$$\mathrm{MTBF}_{PC,\mathrm{hiatus}} = 17{,}500 \left( \frac{9.0}{2.5} \right) = 63{,}000\,\mathrm{h}$$
9.2.3.5.2. Aircraft Mission
The aircraft mission MTBF of the generic AFTA PC is estimated as
$$\mathrm{MTBF}_{PC,\mathrm{aircraft}} = 17{,}500 \left( \frac{9.0}{8.5} \right) = 18{,}529\,\mathrm{h}$$
* Environment uncited: assumed to reside in F-16.
9.2.3.5.3. Ground Mission
The ground mission MTBF of the generic AFTA PC is estimated as
$$\mathrm{MTBF}_{PC,\mathrm{ground}} = 17{,}500 \left( \frac{9.0}{4.2} \right) = 37{,}500\,\mathrm{h}$$
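The three PC estimates all scale the generic 17,500h MTBF by a ratio of environment factors. The sketch below reproduces that scaling under stated assumptions: the reference factor of 9.0 for the Airborne, Uninhabited, Fighter environment appears in the text, while the denominators 2.5, 8.5, and 4.2 are inferred here from the stated results (63,000h, 18,529h, and 37,500h). All names are illustrative.

```python
# Generic PC MTBF in its reference environment (Airborne, Uninhabited, Fighter).
GENERIC_PC_MTBF_H = 17_500.0

# Environment factor of the reference environment, as given in the text.
REFERENCE_FACTOR = 9.0

# Environment factors for the AFTA environments. These values are assumptions
# inferred from the stated MTBF results, not quoted from a handbook table.
ENV_FACTOR = {"hiatus": 2.5, "aircraft": 8.5, "ground": 4.2}


def pc_mtbf_hours(env):
    """Scale the reference MTBF by the ratio of environment factors."""
    return GENERIC_PC_MTBF_H * REFERENCE_FACTOR / ENV_FACTOR[env]


def pc_failure_rate_per_hour(env):
    """Failure rate taken as the reciprocal of MTBF, as the text assumes."""
    return 1.0 / pc_mtbf_hours(env)


if __name__ == "__main__":
    for env in ENV_FACTOR:
        print(env, pc_mtbf_hours(env))
```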
9.3. Physical Characteristics (WPV) Models
The weight, power, and volume (WPV) models are simple linear summations of the
WPV of the LRMs and LRUs comprising a given AFTA configuration.
The weight of AFTA is given by
$$WT_{AFTA} = \sum_i WT_{FCR_i} \qquad (9.35)$$
where $WT_{FCR_i}$, the weight of FCR i, is given by
$$WT_{FCR_i} = N_{PE_i} WT_{PE} + N_{NE_i} WT_{NE} + N_{IOC_i} WT_{IOC} + WT_{RACK} + N_{PC_i} WT_{PC} \qquad (9.36)$$
9.3.2. Power
The power consumption of AFTA is given by
$$PWR_{AFTA} = \sum_i PWR_{FCR_i} \qquad (9.37)$$
where $PWR_{FCR_i}$, the power consumption of FCR i, is given by
$$PWR_{FCR_i} = N_{PE_i} PWR_{PE} + N_{NE_i} PWR_{NE} + N_{IOC_i} PWR_{IOC} + PWR_{BT} + PWR_{PC} \qquad (9.38)$$
$$PWR_{PC} = (1 - e_{PC}) \left( N_{PE_i} PWR_{PE} + N_{NE_i} PWR_{NE} + N_{IOC_i} PWR_{IOC} \right) \qquad (9.39)$$
and
$e_{PC}$ = Power Conditioner Efficiency
Since we are assuming that no power-down standby redundancy is being used in the
AFTA, the peak power and the average power are identical.
9.3.3. Volume
The volume (or size) of AFTA is given by
$$VOL_{AFTA} = \sum_i VOL_{FCR_i} \qquad (9.40)$$
where $VOL_{FCR_i}$, the volume of FCR i, is given by
$$VOL_{FCR_i} = N_{PE_i} VOL_{PE} + N_{NE_i} VOL_{NE} + N_{IOC_i} VOL_{IOC} + VOL_{RACK} + N_{PC_i} VOL_{PC} \qquad (9.41)$$
In a modular system such as AFTA it is usually most convenient to specify the volume
of a module in terms of the number of "slots" it occupies, accompanied by the volume of a
slot.
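Because the WPV models are linear sums over fault containment regions (FCRs), they translate directly into code. The following is a minimal sketch of the weight and power models only (volume has the same form as weight). The class and function names are illustrative, the numeric values are placeholders rather than AFTA data, and the power conditioner loss is taken as (1 - e_pc) times the module load, as in the power model above.

```python
from dataclasses import dataclass


@dataclass
class Fcr:
    """Module counts for one fault containment region (FCR)."""
    n_pe: int   # processing elements
    n_ne: int   # network elements
    n_ioc: int  # I/O controllers
    n_pc: int   # power conditioners


def fcr_weight(f, wt_pe, wt_ne, wt_ioc, wt_rack, wt_pc):
    """Weight of one FCR: module weights plus the weight of one rack."""
    return (f.n_pe * wt_pe + f.n_ne * wt_ne + f.n_ioc * wt_ioc
            + wt_rack + f.n_pc * wt_pc)


def fcr_power(f, pwr_pe, pwr_ne, pwr_ioc, pwr_bt, e_pc):
    """Power of one FCR: module load, the PWR_BT term, and conditioner loss."""
    load = f.n_pe * pwr_pe + f.n_ne * pwr_ne + f.n_ioc * pwr_ioc
    pwr_pc = (1.0 - e_pc) * load  # power dissipated in the power conditioner
    return load + pwr_bt + pwr_pc


# Placeholder configuration: four identical FCRs (illustrative values only).
fcrs = [Fcr(n_pe=2, n_ne=1, n_ioc=1, n_pc=1) for _ in range(4)]
total_weight = sum(fcr_weight(f, 1.0, 1.0, 1.0, 5.0, 1.0) for f in fcrs)
total_power = sum(fcr_power(f, 10.0, 10.0, 10.0, 0.0, 0.75) for f in fcrs)
```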
9.4. Fleet Life-Cycle Cost per Service Unit (FLCCPSU) Model
It is desirable at the Conceptual Study stage to conduct a preliminary quantification of
the effect of varying AFTA parameters such as VG redundancy level, sparing, redundancy
management policy, implementation technology, etc. on operational and logistics costs.
Military missions are complex and a full cost model must reflect this complexity; however,
at the Conceptual Study phase little information is available about mission details, while in
turn a full mission life cycle cost analysis is beyond the scope of a Conceptual Study.
Consequently, a simple Fleet Life-Cycle Cost per Service Unit (FLCCPSU) model will be
used to illustrate the effects of varying AFTA parameters on overall cost. The FLCCPSU
model computes the cost of vehicle and AFTA procurement, the cost due to repairing
and/or replacing spare components, and the cost due to mission-critical failures.
FLCCPSU = Initial procurement cost +
Cost due to repair actions +
Cost due to replacement parts +
Cost due to unreliability (9.42)
The constituents of these costs are described below.
9.4.1. Assumptions and Analysis Inputs
The FLCCPSU model presented below is primarily relevant to an iterative aircraft mis-
sion, in which a fleet of vehicles must periodically sortie, perform a mission, and return to
base. This scenario is described in Section 2, and a figure from that section illustrating the
mission scenario is reproduced below.
Figure 9-10. Helicopter TF/TA/NOE Mission Scenario State Diagram (states: hiatus, sortie,
post-flight maintenance, vehicle lost; transitions: MDC met, MDC not met, flight-critical
failure, maintenance complete)
Failure on the part of AFTA to form MDC can prevent a vehicle from sortieing; if it is
assumed that a given number of vehicles must sortie, then additional vehicles must be
procured to achieve this given level of readiness. This is the cost due to unavailability,
and is included in the formulation for initial procurement cost. Failure on the part of AFTA
to perform flight-critical functions during the mission results in loss of the vehicle;
procurement of lost vehicles (not to mention crews) contributes to the cost of maintaining
the given readiness level. This is the cost due to unreliability. Faults occurring either
between or during missions require maintenance actions. The manpower involved in performing
the maintenance actions and the cost involved in refurbishing or reprocuring failed AFTA
modules contribute to the life cycle cost of the fleet.
The following list enumerates the input parameters to the FLCCPSU model.
Fleet Service Life, Ts: It is assumed that the fleet is in continuous operation
over the Fleet Service Life.
Hiatus duration = Th: The vehicle is in a stand down mode, i.e., powered
off and unoccupied, during the hiatus period of Th hours.
Sortie duration = Tm: The sortie lasts for Tm hours.
Number of vehicles required per sortie = Nvs: Nvs vehicles must sortie for
each mission.
Baseline vehicle cost = Cvehb: A vehicle (without AFTA) costs Cvehb dollars.
Cost of an AFTA LRM = Cmodule: An AFTA module costs Cmodule dollars.
Currently, it is assumed that all AFTA modules cost the same.
Number of LRMs in AFTA = NLRM
Cost of an AFTA rack = Crack: An empty AFTA LRU costs Crack dollars.
Number of racks in AFTA = Nrack
Cost to repair a faulty LRM found after hiatus = Crepair,hiatus: The man-
power cost to repair a faulty LRM found after hiatus is Crepair,hiatus dollars.
Cost to repair a faulty LRM found after sortie = Crepair,sortie: The manpower
cost to repair a faulty LRM found after sortie is Crepair,sortie dollars.
Field repair time = tm: The time required to perform field maintenance and
repair of a faulty AFTA component is tm hours.
Field diagnosis time = tdiag,field,nf: The time required to perform field
diagnosis of an AFTA component found to be not faulty is tdiag,field,nf hours.
Field repair man hour cost = Cmh: The cost per man hour to perform field
diagnosis, maintenance, and repair of a faulty AFTA component is Cmh
dollars.
Depot diagnosis time = tdiag,depot: The time required to perform depot-level
diagnosis of a faulty AFTA component is tdiag,depot hours.
Depot diagnosis man hour cost = Cdiag,depot: The cost per man hour to per-
form depot-level diagnosis of a faulty AFTA component is Cdiag,depot dol-
lars.
Depot repair time = trepair,depot: The time required to perform depot-level re-
pair of a faulty AFTA component is trepair,depot hours.
Depot repair man hour cost = Crepair,depot: The cost per man hour to perform
depot-level repair of a faulty AFTA component is Crepair,depot dollars.
Depot condemnation ratio for an AFTA LRM = rcondemn: The conditional
probability that the depot cannot refurbish a returned module is rcondemn.
Cost of refurbishing an AFTA LRM = Crefurbish: The cost of the parts
necessary to repair an AFTA LRM is Crefurbish dollars. Currently, it is
assumed that all AFTA modules cost the same to refurbish.
9.4.2. Application Scenario
It is assumed that the hiatus phase begins with no faults in AFTA. After Th hours, all
vehicles in the fleet attempt to sortie. Vehicles able to sortie because they can form MDC
perform a sortie of Tm hours, while vehicles failing to form MDC do not sortie. Vehicles
suffering faults either during the hiatus or sortie are repaired before the hiatus period begins
again. This cycle is repeated for the entire service life of the fleet (or until the maintenance
and flight crews mutiny).
Nfs, the total number of sorties required per fleet over the fleet's service life, is equal to
Nvs, the number of vehicles per sortie, times Ts/(Th + Tm), the number of sorties per vehi-
cle over the fleet's service life.
Nfs = Nvs(Ts/(Th + Tm)) (9.43)
9.4.3. Procurement Cost
The total number of vehicles procured is the number of vehicles required to meet the
sortie requirement, Nvs, divided by the availability of AFTA. The availability of AFTA is
given by PPR(Th), the probability that MDC can be formed under the processor replacement
class of redundancy management strategies described in Section 5:
Nvp = Nvs/PPR(Th) (9.44)
The baseline cost of the vehicle is Cvehb, and the cost of AFTA is CAFTA, where
CAFTA = NLRMCmodule + NrackCrack (9.45)
The total cost of the vehicle is
Cveh = CAFTA + Cvehb (9.46)
The total cost of vehicles procured to meet the sortie requirement is
Cvp = NvpCveh (9.47)
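The sortie count and procurement cost relations (9.43) through (9.47) can be sketched as follows. This is an illustration only: the function names are hypothetical, the availability PPR(Th) is passed in as a number rather than computed from a reliability model, and the example values are placeholders, not AFTA or vehicle data.

```python
def fleet_sorties(n_vs, t_s, t_h, t_m):
    """Eq. 9.43: vehicles per sortie times sorties per vehicle over the service life."""
    return n_vs * (t_s / (t_h + t_m))


def vehicles_procured(n_vs, p_pr):
    """Eq. 9.44: vehicles needed so that n_vs are expected to form MDC."""
    return n_vs / p_pr


def afta_cost(n_lrm, c_module, n_rack, c_rack):
    """Eq. 9.45: cost of the AFTA modules and racks in one vehicle."""
    return n_lrm * c_module + n_rack * c_rack


def procurement_cost(n_vs, p_pr, c_vehb, n_lrm, c_module, n_rack, c_rack):
    """Eqs. 9.46 and 9.47: total cost of the vehicles procured."""
    c_veh = afta_cost(n_lrm, c_module, n_rack, c_rack) + c_vehb
    return vehicles_procured(n_vs, p_pr) * c_veh


if __name__ == "__main__":
    # Placeholder inputs chosen only to exercise the formulas.
    print(fleet_sorties(n_vs=10, t_s=1000.0, t_h=8.0, t_m=2.0))
    print(procurement_cost(10, 0.5, 10_000.0, 4, 1_000.0, 1, 500.0))
```

Note that Nvp is generally not an integer; a real procurement plan would round up, which this sketch does not do.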
9.4.4. Manpower Cost due to Repairs
The expected number of faulty AFTA LRMs in a single vehicle after a single hiatus,
assuming that all LRMs have approximately the same failure rate and racks do not fail, is
$$f_h = \sum_{j=1}^{N_{LRM}} j \, \left(1 - r_{mh}(T_h)\right)^{j} \, r_{mh}(T_h)^{(N_{LRM} - j)}, \qquad r_{mh}(t) = e^{-\lambda_{mh} t} \qquad (9.48)$$
The expected number of faulty AFTA LRMs in a single vehicle after a single sortie is
$$f_m = \sum_{j=1}^{N_{LRM}} j \, \left(1 - r_{mm}(T_m)\right)^{j} \, r_{mm}(T_m)^{(N_{LRM} - j)}, \qquad r_{mm}(t) = e^{-\lambda_{mm} t} \qquad (9.49)$$
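The expected fault counts fh and fm can be evaluated numerically. The sketch below makes an assumption that should be read with care: it weights each term of the sum by the binomial coefficient, so that the result is the mean of a binomial distribution over NLRM independent modules. Under that assumption the sum reduces to NLRM times (1 - r(T)), which the code uses as a cross-check. The function names and example values are illustrative.

```python
from math import comb, exp


def module_reliability(failure_rate_per_hour, hours):
    """Exponential reliability r(t) = exp(-lambda * t) of a single LRM."""
    return exp(-failure_rate_per_hour * hours)


def expected_faulty_lrms(n_lrm, failure_rate_per_hour, hours):
    """Mean number of faulty LRMs, summing j * P(exactly j faulty).

    Assumes independent, identical modules, so P(exactly j faulty) is the
    binomial probability comb(n, j) * (1 - r)**j * r**(n - j).
    """
    r = module_reliability(failure_rate_per_hour, hours)
    return sum(
        j * comb(n_lrm, j) * (1.0 - r) ** j * r ** (n_lrm - j)
        for j in range(1, n_lrm + 1)
    )


def expected_faulty_lrms_closed_form(n_lrm, failure_rate_per_hour, hours):
    """Closed form of the same mean: n * (1 - r)."""
    return n_lrm * (1.0 - module_reliability(failure_rate_per_hour, hours))


if __name__ == "__main__":
    # Placeholder inputs: 10 LRMs, 1e-4 failures per hour, 100 hour interval.
    print(expected_faulty_lrms(10, 1.0e-4, 100.0))
    print(expected_faulty_lrms_closed_form(10, 1.0e-4, 100.0))
```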
The expected cost of field repairs (not counting parts) for a single vehicle for a single
hiatus/sortie interval is
Crepair,field,ss = fhCrepair,hiatus + fmCrepair,sortie (9.50)
where "ss" stands for "single sortie." We make the reasonable assumption that
Crepair,hiatus = Crepair,sortie. To perform field repair, maintenance crew time must be spent
testing, identifying, and replacing the faulty module according to the procedure described in
Section 6. Denote this maintenance and repair time by tm. With a field maintenance crew
manpower cost per hour of Cmh, the field maintenance manpower cost per vehicle per
sortie is
Crepair,field,ss = (fh + fm)tmCmh (9.51)
The expected manpower cost of field repairs for the entire fleet over its entire service
life is
Crepair,field,fleet = NfsCrepair,field,ss (9.52)
After a defective module has been removed from the vehicle by field maintenance
personnel, to maintain a given level of readiness the spare module used for replacement must
itself be replaced. To achieve this objective two options exist. The defective module may
be condemned and discarded, in which case a spare module must be procured.
Alternatively, the defective module may be shipped back to a depot for repair. We assume that
modules are condemned and discarded either at the field or the depot with probability
rcondemn and successfully repaired with probability 1-rcondemn. If the module is condemned in
the field, the manpower cost in doing so is negligible; other than the inevitable paperwork
the field crew is assumed to throw the defective module in the trash. If the module is
shipped back to the depot, shipping costs are incurred, and the depot maintenance
technician must perform additional testing to determine whether the module is repairable using
depot level tests. If we neglect shipping costs, we can without numerical error assume that
the field crew always ships the module back to the depot for further diagnosis. We assume
that the depot diagnostics require tdiag,depot hours at a cost of Cdiag,depot per diagnostic
hour. If the module is condemned as a result of these tests, negligible additional cost is
incurred other than paperwork time. If the module is repaired, additional manpower and
parts costs are incurred. After diagnosis, repair requires trepair,depot time, at a cost of
Crepair,depot per hour. In addition, spare parts (such as integrated circuits) are required to
replace those found to be faulty on the module. This cost will be accounted for in the
subsequent Section. Putting this all together, the total manpower cost per vehicle per sortie for
depot repair of a faulty module is
Crepair,depot,ss = (fh + fm)(tdiag,depotCdiag,depot + (1-rcondemn)trepair,depotCrepair,depot)
(9.53)
The expected manpower cost of depot repairs for the entire fleet over its entire service
life is
Crepair,depot,fleet = NfsCrepair,depot,ss (9.54)
9.4.5. Cost due to Spares
The expected number of LRMs which must be replaced for the entire fleet over its
service life is equal to the expected number of faulty modules per sortie, times the number of
fleet sorties. Denoting this quantity by Nspares,fleet,
Nspares,fleet = Nfs(fh + fm) (9.55)
The cost of spare parts required to maintain the AFTA fleet over its service life is the
number of module faults incurred during the fleet service life times the cost of replacing a
module or repairing it at the depot.
Cspares,fleet = Nspares,fleet(rcondemnCmodule + (1-rcondemn)Crefurbish) (9.56)
9.4.6. Cost due to Unreliability of AFTA
Assuming that the crew escapes the effects of loss of vehicle-critical computing func-
tions, the cost due to unreliability of the AFTA is the total number of fleet sorties over the
service life times the probability that the AFTA causes the loss of a vehicle during a single
sortie times the cost of the vehicle.
Cur = Nfs(1-PGD(Tm))Cveh (9.57)
where PGD is the probability that the flight-critical computing functions can be per-
formed by AFTA when its redundant components are managed according to the graceful
degradation class of redundancy management policies.
9.4.7. Total FLCCPSU
The Fleet Life Cycle Cost per Service Unit is
FLCCPSU = Cvp + Crepair,field,fleet + Crepair,depot,fleet + Cspares,fleet + Cur (9.58)
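The cost terms of Sections 9.4.4 through 9.4.7 can be combined in a few lines. The following is a minimal sketch under stated assumptions: the procurement cost Cvp, the expected fault counts fh and fm, and the probability PGD(Tm) are supplied as inputs rather than computed, and the function and parameter names are illustrative. The example inputs are placeholders and not AFTA data.

```python
def flccpsu(c_vp, n_fs, f_h, f_m, t_m, c_mh,
            t_diag_depot, c_diag_depot, t_repair_depot, c_repair_depot,
            r_condemn, c_module, c_refurbish, p_gd, c_veh):
    """Fleet Life-Cycle Cost per Service Unit as the sum of its five terms."""
    faults_per_sortie = f_h + f_m

    # Field repair manpower for the whole fleet (Eqs. 9.51 and 9.52).
    c_field = n_fs * faults_per_sortie * t_m * c_mh

    # Depot diagnosis and repair manpower for the whole fleet (Eqs. 9.53 and 9.54).
    c_depot = n_fs * faults_per_sortie * (
        t_diag_depot * c_diag_depot
        + (1.0 - r_condemn) * t_repair_depot * c_repair_depot
    )

    # Replacement modules and refurbishment parts (Eqs. 9.55 and 9.56).
    n_spares = n_fs * faults_per_sortie
    c_spares = n_spares * (r_condemn * c_module + (1.0 - r_condemn) * c_refurbish)

    # Vehicles lost to flight-critical AFTA failures (Eq. 9.57).
    c_unreliability = n_fs * (1.0 - p_gd) * c_veh

    # Total (Eq. 9.58).
    return c_vp + c_field + c_depot + c_spares + c_unreliability


if __name__ == "__main__":
    # Placeholder inputs chosen so each term is easy to verify by hand.
    total = flccpsu(
        c_vp=5.0e6, n_fs=100, f_h=0.01, f_m=0.01, t_m=2.0, c_mh=50.0,
        t_diag_depot=1.0, c_diag_depot=100.0,
        t_repair_depot=4.0, c_repair_depot=100.0,
        r_condemn=0.5, c_module=1000.0, c_refurbish=200.0,
        p_gd=0.999, c_veh=1.0e6,
    )
    print(total)
```

Keeping each term as a separate intermediate makes it straightforward to see which term dominates when a parameter such as the redundancy level or module cost is varied.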
10. VHSIC Hardware Description Language
The VHSIC Hardware Description Language (VHDL) is rapidly becoming a standard
tool for digital logic design. Using VHDL, an engineer can write initial design specifica-
tions, device behavioral characteristics, device structural characteristics, and device timing
characteristics using a single, integrated design environment. This section details the use-
fulness of VHDL for the AFTA brassboard development project, including the areas of
logic design and synthesis, simulation, testing, and documentation for reprocurement.
10.1. VHDL Overview
VHDL (VHSIC Hardware Description Language) is a hardware design language devel-
oped under the VHSIC (Very High-Speed Integrated Circuit) project. The language can be
used for a number of applications, including design, debugging, simulation, testing, per-
formance analysis, and documentation. The use of VHDL is required by the Department of
Defense for all new ASIC designs built to military specifications. The specification of the
VHDL language is standardized in [IEEE1076].
Traditional idealized design procedures prescribe a top-down design methodology. The
goal is to define a design at an abstract level, then traceably refine each major component
into increasing levels of complexity until a description of the design suitable for fabrication
is produced. In practice, this ideal is rarely observed. Instead, the designer may pursue a
bottom-up, middle-out, or a combination of the three approaches. VHDL supports the de-
signer in any of these approaches.
Designs described by VHDL can be simulated before the design is constructed. The
simulation environment is a defined part of the language. VHDL is not a static language,
which just defines interconnections between elements. Processes, which can contain logic,
state, and timing relations can also be defined. The simulator uses these constructs to "run"
the design. A VHDL simulator can provide a viewport into a VHDL model, possibly
through a source-level debugger.
The VHDL resulting from the design process can be used as a form of self-documenta-
tion. The highly abstract description (called the behavioral model) defines the functions of
the design. The low-level description (called the structural model) defines the interconnect
between individual hardware pieces.
10.1.1. Behavioral vs. Structural Models
VHDL supports the varying of abstraction by the designer during the design process.
The mechanism for varying the design abstraction is done by using two types of models:
behavioral and structural. A particular VHDL description can not be truly classified as ei-
ther behavioral or structural. Part of the power of VHDL is the ability to mix structural and
behavioral models. In fact, all VHDL descriptions include some behavioral modeling.
A VHDL description can be represented as a tree structure as shown in Figure 10-1. A
structural VHDL model contains interconnected instantiations of other VHDL models. This
is analogous to describing a circuit as a netlist, with device pins (ports in VHDL) connected
to wires (signals in VHDL). A behavioral VHDL model describes the operation of a par-
ticular device in terms of state, output and timing relations as a function of device inputs.
The lowest level, or leaf-level, models in any VHDL description are always behavioral
models.
Figure 10-1. Hierarchical VHDL Model (tree with Microprocessor (struct.) at the root;
pipeline register, ALU, and register file beneath it; D flip-flop (behav.) as a leaf)
An example of a behavioral model for a D flip-flop is shown in Figure 10-2. The be-
havioral model of the flip-flop defines the state of the outputs Q and !Q as a function of the
clock and D inputs. The behavioral model can also include the timing relations between the
clock and the outputs (clock to Q propagation delay) and the timing relations between the D
input and the outputs (setup and hold times). A properly developed behavioral model
would check for violations of setup time, hold time, pulse width minimums, etc., with any
violations reported to the user. The development of accurate behavioral models with these
characteristics is a very tedious, labor-intensive, time-consuming, and expensive task.
Timing parameters accompanying the flip-flop symbol and function table:
tPLH (clk to Q or /Q) = 16 ns
tPHL (clk to Q or /Q) = 18 ns
tw (clk high) = 14.5 ns
tw (clk low) = 14.5 ns
tsu (D to clk) = 15 ns
th (clk to D) = 0 ns
Figure 10-2. Behavioral Model of a D Flip-Flop*
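The idea of a behavioral model that both computes outputs and reports timing violations can be illustrated outside VHDL. The sketch below is written in Python rather than VHDL and is not the AFTA model: the class and method names are hypothetical, only the setup time and the clock-to-Q delays from the figure are modeled, and pulse-width checks are omitted for brevity.

```python
class DFlipFlop:
    """Behavioral D flip-flop with a setup-time check (times in ns)."""

    T_SU = 15.0    # setup time, D to clk
    T_PLH = 16.0   # clk to Q delay when Q goes (or stays) high
    T_PHL = 18.0   # clk to Q delay when Q goes (or stays) low

    def __init__(self):
        self.d = False
        self.q = False
        self.last_d_change = float("-inf")
        self.violations = []  # list of (time, kind) tuples

    def set_d(self, time_ns, value):
        """Drive the D input; remember when it last changed."""
        if value != self.d:
            self.d = value
            self.last_d_change = time_ns

    def rising_edge(self, time_ns):
        """Clock the flip-flop; return the time at which Q is valid."""
        if time_ns - self.last_d_change < self.T_SU:
            # D changed too close to the clock edge: report, as a
            # properly developed behavioral model would.
            self.violations.append((time_ns, "setup"))
        self.q = self.d
        delay = self.T_PLH if self.q else self.T_PHL
        return time_ns + delay


if __name__ == "__main__":
    ff = DFlipFlop()
    ff.set_d(0.0, True)
    print(ff.rising_edge(20.0), ff.q, ff.violations)
```

A full model would also track the previous clock edge to check hold time and minimum pulse widths, which is part of why such models are expensive to develop.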
The flip-flop can also be represented as a structural model defining instantiations of
logic gates and interconnections between the logic gates as shown in Figure 10-3. In this
case, the logic gates would be represented by behavioral models defining the device func-
tion (truth-table) and timing (propagation delay) as shown in Figure 10-4.
Figure 10-3. Structural Model of a D Flip-Flop‡ (gate-level schematic with inputs Clock
and D and output Q)
A   B | Y
L   L | H
L   H | H
H   L | H
H   H | L
tPLH (A or B to Y) = 11 ns
tPHL (A or B to Y) = 8 ns
Figure 10-4. Behavioral Model of a NAND Gate
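The relationship between a structural description and its behavioral leaf models can be sketched in Python as well. In this illustration the only behavioral element is a NAND function implementing the truth table above; a storage element is then built purely by interconnecting four NAND instances. Two simplifications are assumptions of this sketch: it builds a level-sensitive gated D latch rather than the edge-triggered flip-flop of the structural figure, and it ignores propagation delays, settling the feedback loop by iteration instead.

```python
def nand(a, b):
    """Behavioral leaf model: two-input NAND truth table."""
    return not (a and b)


def d_latch(d, enable, q, q_bar):
    """Structural gated D latch made of four NAND gates.

    q and q_bar are the stored outputs before evaluation; the cross-coupled
    output gates are re-evaluated until the signals stop changing.
    """
    for _ in range(4):  # a few passes are enough for this small network
        s_n = nand(d, enable)        # active-low set
        r_n = nand(s_n, enable)      # active-low reset
        new_q = nand(s_n, q_bar)     # cross-coupled output gate
        new_q_bar = nand(r_n, new_q)  # cross-coupled output gate
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar


if __name__ == "__main__":
    state = (False, True)
    state = d_latch(True, True, *state)    # enable high: latch follows D
    print(state)
    state = d_latch(False, False, *state)  # enable low: latch holds
    print(state)
```

As in the text, the function of the device is much easier to read from a behavioral description than from this interconnection of gates.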
The behavioral description of the flip-flop is sufficient in most instances to define the
function of the circuit. The structural model is too low level for most applications. The
function of the flip-flop is readily apparent from the behavioral description, but not from
the structural description. The flip-flop behavioral model could be instantiated in higher
level structural models. The flip-flop structural model, however, would probably only be
used by an engineer to design the flip-flop itself.
* From Texas Instruments ALS/AS Logic Data Book 1986.
‡ From Texas Instruments TTL Logic Standard TTL, Schottky, Low-Power Schottky
Data Book 1988.
10.1.2. Overview of a VHDL Description
A VHDL model is composed of a number of different elements, including packages,
entities, architectures, and configurations. Each element is contained in a library. An ele-
ment can access other elements by specifying the library name containing the referenced el-
ement and the element name. One library, known as "WORK", is used to store models un-
der development. Other libraries may be used by explicitly declaring their usage within a
model.
A package contains certain constants, type declarations, global signals, functions, and
procedures that are used throughout a model. Examples of packages include TEXTIO, and
STANDARD; these packages are defined as part of the VHDL standard environment
[IEEE1076]. Packages are convenient for hiding definitions from the designer and for al-
lowing incremental changes. A package declaration can be kept separate from the package
body. The package declaration (referred to in VHDL simply as the package) contains decla-
rations for the contents of the package body. The package body actually contains the defi-
nitions. Changes to, and recompilation of, the package body do not require recompilation
of models which use the package.
An entity describes the interface to the model. The entity is analogous to a symbol, such
as the flip-flop symbol shown in Figure 10-2. Ports for input and output signals are de-
fined in the entity. The entity also allows the definition of generics, which provide a
method of passing parameters into the model. Different instantiations of a model can have
different signals attached to the ports and different values assigned to the generics.
The architecture defines the model, either in behavioral or structural terms. A behavior
model, such as that shown in Figure 10-2, can be represented as a set of one or more con-
current processes. A structural model can be represented as a schematic, such as that
shown in Figure 10-3 for the D flip-flop. A structural architecture defines instantiations of
other entities, which can themselves be either structural, behavioral, or a combination of the
two. Note that a model can have more than one architecture defining the model. One archi-
tecture is selected to represent the model in a particular situation.
The selection of architectures is done using a configuration. The configuration tells the
VHDL compiler/analyzer which architecture to "plug into" each entity instantiation. Gener-
ics can also be defined in the configuration section. The configuration can define the entire
VHDL description hierarchically, or it can call other configurations to define lower level
structural models.
10.2. Use of VHDL for AFTA
VHDL will be used extensively in the AFTA Network Element (NE) design. A VHDL
analysis tool will be used to design, analyze, simulate, and document the Network
Element. VHDL descriptions of the Network Element subsections can also be used as a
basis for performing validation of the Network Element design. VHDL will also be used as
the medium for converting appropriate subsections into application-specific integrated cir-
cuits (ASICs).
VHDL analysis and simulation is rapidly becoming a standard tool for performing chip,
board, and system design. The flexibility of VHDL permits almost any digital circuit to be
described and simulated using standard VHDL analysis tools. This capability permits de-
bugging of circuit designs before building them. If accurate models are used in the analy-
sis, the resulting simulations will be highly faithful to the actual observed behavior of the
final design.
Subsections of the AFTA NE design which may require a significant amount of design
and testing include the scoreboard and the fault-tolerant clock. Other sections which will
benefit from the use of VHDL are the data path voter, the global controller, and the ring
buffer manager.
The VHDL description of the scoreboard is particularly important since the scoreboard
is a prime candidate for implementation in an ASIC device. Since an ASIC represents a
significant investment and is not easily changed, the scoreboard design must be robust. The
VHDL analysis tool will permit a significant amount of testing to verify the correct func-
tionality of the scoreboard design. The scoreboard register-transfer level (RTL) VHDL de-
scription can be converted, with the use of a synthesis tool, directly to an ASIC design.
The test bench created to test the scoreboard description can also be used to generate func-
tional test vectors for testing the scoreboard ASIC during foundry testing.
10.2.1. Design
One of the primary uses of VHDL for the AFTA Network Element is for conceptual
and detailed design of custom hardware. VHDL is a useful tool for performing detailed de-
sign, supplementing conventional tools such as schematic capture and Boolean equation
synthesis tools. Conceptual design at a very high level can also be done in VHDL, some-
thing for which no comparable tool exists. The transformation from high level to detailed
design can be performed completely within the VHDL environment.
We shall use the following definitions within the scope of the AFTA NE development.
Definition 1:
Behavioral VHDL is defined to be a VHDL architecture which uses any of
the legal VHDL constructs, including those which do not correspond to
possible hardware realizations of the description (i.e., pure behavioral may
not be synthesizeable).
Definition 2:
Structural VHDL is defined to be a VHDL architecture that consists strictly
of instantiations of other entities and the interconnect between these entities.
Definition 3:
Register Transfer Level (RTL) VHDL is defined to be synthesizeable behav-
ioral VHDL, that is, a behavioral VHDL description that is suitable for input
to a synthesis tool.
10.2.1.1. Behavioral
The architecture of submodules in the Network Element will be developed using VHDL
behavioral modeling. Alternate partitioning of the designs can be done using behavioral
models. Simulations using alternate partitioning can be used to determine what partitioning
provides optimal performance.
The behavioral models will be decomposed into register-transfer level (RTL) descrip-
tions. Some models will be developed only in RTL form. The RTL form is also a behav-
ioral format which specifies the functionality of a block from the standpoint of random
combinational logic and/or synchronous registers. Synthesis of a gate-level design from an
RTL description is a straightforward process using (expensive) logic synthesis tools.
10.2.1.2. Structural
The structural description of Network Element submodules specifies the design of the
submodule as an interconnection of lower level components. The gate-level netlists
synthesized from RTL behavioral descriptions represent structural descriptions.
Structural VHDL requires accurate models defining the behavior of the individual
components. For example, an ASIC design would require high fidelity models defining each
gate, register, and macrocell used by the design. A design utilizing standard devices also
requires models defining the behavior of _S, PROMs, and other MSI/LSI parts. The
development of structural descriptions of the AFTA Network Element is contingent upon
the availability of these models. Writing structural descriptions of subsections for which
suitable models are not available would not be a useful exercise, since the description
would not represent any characteristics not already described more concisely by the
behavioral description.
10.2.2. Simulation
One of the most useful features of VHDL is its built-in simulation capability. VHDL,
unlike many other hardware description languages and schematic capture programs, defines
an integrated time reference. Device models can easily specify timing characteristics such as
propagation delays and can check for timing violations such as setup and hold violations.
Timing specifications rely on the existence of faithful device models; therefore, accurate
timing simulation of Network Element subsections using structural VHDL is contingent
on the availability of device models.
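As a rough illustration of the kind of timing check a device model performs, the
following Python sketch flags setup and hold violations given lists of data-transition
and clock-edge times. It is not VHDL and is not part of the AFTA models; the names
RegisterTiming and check_setup_hold, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterTiming:
    """Hypothetical timing parameters for one register, in nanoseconds."""
    setup_ns: float
    hold_ns: float


def check_setup_hold(data_edges, clock_edges, timing):
    """Return a list of (kind, clock_time, data_time) violations.

    A setup violation is a data transition less than setup_ns before a clock
    edge (or exactly on it); a hold violation is a data transition less than
    hold_ns after a clock edge.
    """
    violations = []
    for clk in clock_edges:
        for d in data_edges:
            if clk - timing.setup_ns < d <= clk:
                violations.append(("setup", clk, d))
            elif clk < d < clk + timing.hold_ns:
                violations.append(("hold", clk, d))
    return violations


# Example: 2 ns setup, 1 ns hold, clock edges at 10 ns and 20 ns.
timing = RegisterTiming(setup_ns=2.0, hold_ns=1.0)
print(check_setup_hold([5.0, 9.0, 20.5], [10, 20], timing))
```

The check is only as good as the setup and hold values supplied, which mirrors the
dependence on faithful device models noted above.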
A test bench in VHDL is a model of a test fixture that can be used to test the device
being designed with VHDL. The test bench is itself written in VHDL, so all the capabilities
of VHDL are available for sophisticated error checking. The test bench provides a
non-proprietary way of stimulating and monitoring a design in a simulator. While some
simulators provide for direct stimulus of a design without using a test bench, this capability
is simulator dependent and is therefore not guaranteed to be present in all VHDL simulators.
Even if the capability is included, different implementations may define different methods
for specifying stimulus.
The test bench will be used for simulating major sections of the Network Element. Test
benches will be developed to test the scoreboard, the fault-tolerant clock, the data path
voter, the global controller, and the ring buffer manager. A test bench will also be designed
to test the aggregate of these subsystems in a complete Network Element design.
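The test bench idea can be sketched outside VHDL as well. The Python fragment below is
an illustration only, not the AFTA test bench: it applies stimuli to a model, compares
each output with an expected result, and collects discrepancies. The bitwise 2-of-3
majority function stands in for a data path voter, and every name here (majority_vote,
faulty_vote, run_test_bench, VECTORS) is hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer words (reference model)."""
    return (a & b) | (a & c) | (b & c)


def faulty_vote(a, b, c):
    """Deliberately broken alternative model that ignores input c."""
    return a & b


def run_test_bench(model, vectors):
    """Apply each stimulus to the model and collect mismatches.

    vectors is a list of ((a, b, c), expected) pairs. The result is a list
    of (inputs, expected, actual) tuples, one per discrepancy.
    """
    discrepancies = []
    for inputs, expected in vectors:
        actual = model(*inputs)
        if actual != expected:
            discrepancies.append((inputs, expected, actual))
    return discrepancies


VECTORS = [
    ((0b1010, 0b1010, 0b1010), 0b1010),  # all channels agree
    ((0b1010, 0b1010, 0b0101), 0b1010),  # channel c faulty, masked by the vote
    ((0b1111, 0b0000, 0b1111), 0b1111),  # channel b faulty, masked by the vote
]

for name, model in (("reference", majority_vote), ("faulty", faulty_vote)):
    print(name, run_test_bench(model, VECTORS))
```

Passing the model as an argument plays the role that selecting an architecture plays in
a VHDL test bench: the same stimuli and checks are reused for each candidate description.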
10.2.3. Testing
The VHDL models for the AFTA design will include test benches with which the mod-
els can be tested. A properly designed test bench can be used for either behavioral or
structural VHDL models. In addition, the test bench can be used as a source of functional
test vectors. The stimulus driven by the test bench and the expected response can be inter-
cepted and saved in a data file. The data file can be used by the manufacturer of a submod-
ule to test the submodule following assembly.
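A minimal sketch of this capture-and-replay idea, in Python rather than VHDL, is given
below. It records stimulus and expected response to a CSV stream and later replays the
stream against a device under test. The file layout, the function names, and the 4-bit
adder used as the model are all assumptions made for illustration; the report does not
define a vector file format.

```python
import csv
import io


def adder4(a, b):
    """Hypothetical reference model: 4-bit adder with the carry discarded."""
    return (a + b) & 0xF


def record_vectors(model, stimuli, out_file):
    """Write one row per stimulus: the inputs and the model's response."""
    writer = csv.writer(out_file)
    writer.writerow(["a", "b", "expected"])
    for a, b in stimuli:
        writer.writerow([a, b, model(a, b)])


def replay_vectors(device, in_file):
    """Re-apply recorded stimuli to a device; return the failing rows."""
    failures = []
    for row in csv.DictReader(in_file):
        a, b, expected = int(row["a"]), int(row["b"]), int(row["expected"])
        if device(a, b) != expected:
            failures.append((a, b, expected))
    return failures


# Record from the model, then replay against the same model and a bad device.
buffer = io.StringIO()
record_vectors(adder4, [(1, 2), (15, 1), (7, 8)], buffer)
buffer.seek(0)
print(replay_vectors(adder4, buffer))
buffer.seek(0)
print(replay_vectors(lambda a, b: a + b, buffer))  # device that keeps the carry
```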
The test vectors derived from the test bench are functional test vectors which are in-
tended to test that the device performs the desired function as defined in the original specifi-
cation. An additional set of test vectors can be generated to test that a particular implemen-
tation of a device is free of faults. These test vectors are typically developed by automatic
test pattern generator (ATPG) tools and test a device against the gate-level design
description, not against the original design specification. Both sets of test vectors should
be used to fully test a device following fabrication.
10.2.4. Documentation
Another function of VHDL is to document the AFTA design. Documentation is re-
quired to enable reprocurement or reimplementation of the AFTA design.
10.2.4.1. Custom Devices
One of the most important applications of VHDL for documentation is for reprocure-
ment of custom devices used in the AFTA design. To maintain vendor independence for
custom parts, a non-proprietary method is needed to unambiguously define the complete
design of custom devices.
Many different options are available for reprocurement of custom devices. These op-
tions are illustrated in Figure 10-5. The path chosen depends on the needs of the repro-
curement. If replacement parts for an existing system are needed, one of the paths with the
least amount of effort should be sufficient. If new parts with architectural enhancements are
needed, more effort will have to be expended to redesign many of the lower levels. The
grayed areas indicate levels of the design to which VHDL can be applied.
[Figure 10-5: block diagram of reprocurement paths. Recoverable panel labels: Original
Implementation; Mask Compatible Process; Scalable Process; New Technology, gate-compatible;
New Technology, higher integration; New Technology, new gates required.]
Figure 10-5. Reprocurement Options for Custom Devices
The mask compatible process assumes that a standard, vendor independent process
specification is used to fabricate an integrated circuit. Such a process would allow multiple
vendors to reuse masks produced during the original fabrication process.
A scalable process is a derivative of an existing process in which the feature size, and
thus the overall chip size, is reduced. If the design rules for the technology are scaled lin-
early with the feature size, a new chip can be fabricated from the original chip layout. New
masks will have to be made, but producing masks from the original chip layout is
straightforward.
Both of the above options are straightforward reprocurement cycles. Little, if any, re-
design is necessary. For these options, VHDL has no application. However, for repro-
curement cycles with more redesign effort, VHDL can be very useful.
The next level of reprocurement is the use of a new technology with a compatible set of
gates. For example, most CMOS, NMOS, GaAs, and TTL technologies use negative logic
in the form of NAND and NOR gates. A gate-level netlist in VHDL that specifies instantia-
tions of these gates could be used to port a design to any of these technologies. The VHDL
behavioral model specifies the timing requirements for the design. The designer must make
sure that the new gate-level netlist in the new technology meets the requirements specified
by the original VHDL behavioral model.
Some alternate technologies use different types of gates. An example is ECL technol-
ogy, which is based on OR/NOR gates. If a device that was originally designed for CMOS
technology is to be reimplemented in ECL, a new gate-level netlist must be synthesized
from the original register-transfer level (RTL) description to make effective use of the
OR/NOR gates in ECL technology. The RTL description, which could be written in
VHDL, specifies blocks at a functional level, either as combinational logic or registers. The
synthesis tool that creates the gate-level netlist from the RTL description uses a library that
specifies what gates are available in the selected technology. Thus, the RTL description is
the first truly technology and vendor independent specification level.
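The following Python sketch illustrates, under simplifying assumptions, why a
function-level description is the portable one: a single Boolean specification is
realized as two different gate-level netlists, one using only NAND gates and one using
only NOR gates, and both are checked exhaustively against the specification. The
example function, the hand-written netlists, and the NOR-only library (a simplified
stand-in for an OR/NOR technology) are hypothetical and do not represent the output of
any synthesis tool.

```python
from itertools import product


def nand(a, b):
    return 1 - (a & b)


def nor(a, b):
    return 1 - (a | b)


def spec(a, b, c):
    """Technology-independent description: f = (a AND b) OR c."""
    return (a & b) | c


def nand_netlist(a, b, c):
    """f built only from 2-input NAND gates: NAND(NAND(a, b), NOT c)."""
    return nand(nand(a, b), nand(c, c))


def nor_netlist(a, b, c):
    """f built only from 2-input NOR gates."""
    a_and_b = nor(nor(a, a), nor(b, b))  # NOR of the inverted inputs is AND
    n = nor(a_and_b, c)                  # NOT((a AND b) OR c)
    return nor(n, n)                     # invert to recover f


def equivalent(f, g, n_inputs):
    """Exhaustively compare two combinational functions on 0/1 inputs."""
    return all(f(*v) == g(*v) for v in product((0, 1), repeat=n_inputs))


print(equivalent(spec, nand_netlist, 3), equivalent(spec, nor_netlist, 3))
```

Exhaustive comparison is practical only for small blocks; it is used here simply to show
that both netlists implement the same function-level description.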
A final reprocurement option is to reimplement a device at a higher level of integration.
Increasing the integration level of a technology allows the designer to either produce
smaller chips or to design chips that use more concurrent hardware for higher throughput.
Architectural enhancements are a common method of improving the performance of a de-
vice without increasing the clock speed. However, to make these architectural enhance-
ments, the designer must use the original high level specification for the device. The un-
partitioned behavioral VHDL model is such a specification. The designer can use the
behavioral model to determine what architectural changes can be made to optimize
performance. Once the design is repartitioned with the architectural modifications, the rest
of the design cycle must be completed before the chip can be fabricated in silicon.
The paths presented above cover the anticipated options for reprocurement
of custom devices for the AFTA Network Element. Reprocurement can never be
effortless, but the amount of effort can be made commensurate with the amount of change
required.
10.2.4.2. Standard Devices
The use of VHDL is not limited to custom devices. VHDL can be used to specify a
complete chip, board, or system level design in a hierarchical manner. VHDL is sufficiently
versatile that a high level model of a system can be decomposed into board level and chip
level components. Chip level components corresponding to custom devices can be decom-
posed further into gate level descriptions. Standard devices are described at the leaf-level by
behavioral models defining the functionality and timing requirements of the devices.
Reprocurement of standard devices must be ensured, just as it must for custom
devices. Typically, devices to be reprocured are required to conform to standard parts
specifications defined in MIL-M-38510. Each of these devices is defined by a "slash sheet"
in a standard, non-proprietary format. Each device is furthermore required to be available
from more than one vendor. Each vendor must ensure that parts to be sold as compliant
with the "slash sheet" meet the specification.
The reprocurement options for systems containing standard devices are described in
Figure 10-6. The options are much more limited than for custom devices, because
either an exact replacement for a part can be obtained, or the design must be
modified to accept a new part. The extent of the redesign is, of course, dependent on the
relative similarity between the old part and the new part. However, redesign, even if simply
a revalidation of the timing characteristics, will always be required if different parts are to
be used.
[Figure 10-6: block diagram of reprocurement paths. Recoverable panel labels: Original
Implementation; 38510 Equivalents; New Device Technology.]
Figure 10-6. Reprocurement Options for Standard Devices
The structural models represent the interconnection of individual standard devices. The
structural model will not represent anything more than the netlist unless accurate behavioral
models for the standard devices are used. If accurate models are not available, the netlist
representation in VHDL has no advantage over more conventional netlist representations,
such as schematic drawings.
The advantage of using VHDL for reprocurement of standard devices is marginal.
Certainly, the behavioral VHDL allows for repartitioning of the design if new devices are to
be used. However, the structural VHDL does not enhance the simple reprocurement of
equivalent devices.
The current approach to VHDL modeling for the AFTA design does not consider the
leaf-level behavioral models for standard devices. The reason for this is the unavailability
of these leaf-level models. Modeling devices at the behavioral level involves complex pro-
gramming in VHDL and often requires significant knowledge of the inner workings of the
device. In addition, these models have no use for the development of custom devices.
Since the AFTA Network Element is targeted for multiple custom devices during full-scale
development, the development of complex structural descriptions of standard devices
seems to be fruitless.
10.2.5. Candidate VHDL Tools for AFTA NE Design
Draper has VHDL design tools and computing platforms in house to perform the above
design functions. The available tools include products by Vantage, Viewlogic, and
Synopsys. The Vantage and Viewlogic tools will be used for NE design capture and simu-
lation. The Synopsys tool will be used to synthesize gate level netlists from the RTL
VHDL description.
10.2.6. Compliance with Data Item Description
The data item description (DID) entitled "VHSIC Hardware Description Language
(VHDL) Documentation," number DI-EGDS-80811 [DID80811], has some relevance to
the development of VHDL models for the AFTA design. While the scope of the DID is
more appropriate for a full-scale development project, certain sections of the DID will be
addressed by the AFTA brassboard design to minimize the impact of future full-scale de-
velopment of the AFTA design. Proposed compliance with, and deviations from, the DID
are detailed in the following sections.
10.2.6.1. Reference Documents
The VHDL language and environment are defined by the IEEE Standard VHDL
Language Reference Manual [IEEE1076]. The use of VHDL for the AFTA design will con-
form to [IEEE1076] in all aspects.
10.2.6.2. VHDL Model Hierarchy
A VHDL model of the Network Element design and of the subsections of the Network
Element design will be provided. The Network Element is the only section that will be de-
scribed using a structural model, unless detailed leaf-level models are provided by external
means for either custom (ASIC) or standard logic devices. The subsections of the Network
Element that will be described by VHDL behavioral models include the scoreboard, the
fault-tolerant clock, the data path voter, the global controller, and the ring buffer manager.
10.2.6.3. Leaf-Level Modules
The leaf-level modules for the AFTA Network Element depend greatly on the targeted
technology and the availability of accurate device models from an outside vendor.
Development of these models is an expensive, time-consuming process and is outside the
scope of the AFTA brassboard development project.
The Government may supply models for use in the AFTA design. These models will be
incorporated as leaf-level modules wherever appropriate.
Major subsections for which a complete suite of models is not available will be supplied
as either a behavioral model or an incompletely specified structural model.
A list of commercial grade parts used in the AFTA brassboard design will be provided
along with the Network Element VHDL description. This list can be used to obtain
appropriate device models at a later date.
10.2.6.4. Entity Declarations
The entity declarations for Network Element subsections will describe, as well as
possible, the timing constraints of the subsection. Since the actual timing constraints depend on
the behavior of the selected devices, the accuracy of these constraints is highly dependent
on the availability of accurate leaf-level models.
The entity declarations for the Network Element subsections will conform to the
interface declaration requirement as specified in the DID wherever possible.
The timing and electrical requirements for the Network Element design will not be
guaranteed by the behavioral models of the NE. Any timing and/or electrical specifications
included in the behavioral models will be derived from data book information. The
structural models for the NE, if written, will include timing and electrical requirements if the
models that make up the leaf-levels of the structural models handle these requirements
properly.
The operating conditions for the Network Element design will not be handled by the
behavioral models of the NE. The structural models for the NE, if written, will include op-
erating conditions if the models that make up the leaf-levels of the structural models handle
these requirements properly.
The high-level structural Network Element description will provide for the addition of
timing, electrical, and operating condition requirements if appropriate leaf-level models be-
come available.
The naming conventions for entities in the Network Element design will conform to the
requirements as specified in the DID wherever possible.
10.2.6.5. Behavioral Body
Behavioral models for each major Network Element subsection will be developed as
part of the AFTA detailed design process. The behavioral models will conform to the re-
quirements as outlined in the DID wherever possible. Timing characteristics of the behav-
ioral models may be based on preliminary analysis and may not reflect the exact timing
characteristics of the brassboard design unless accurate leaf-level models are used to ana-
lyze the structural design.
Behavioral bodies will not be structurally decomposed unless functional partitions dic-
tate that such decomposition is appropriate.
The timing characteristics of the behavioral body will specify, as accurately as possible,
the known timing behavior of Network Element sections. Best, worst, and nominal output
delays will be included, if known. However, many models (particularly those from com-
mercial-grade data books) will only define worst case timing.
10.2.6.6. Structural Body
Structural models for major Network Element subsections will be developed as part of
the AFTA detailed design process if appropriate leaf-level modules are available to define
the structural model. The structural models will conform to the requirements as outlined in
the DID wherever possible. The use of the structural models for logic fault modeling and
test vector generation depends on the accuracy of the models instantiated in the structural
models.
The naming conventions for components and signals in the Network Element structural
design will conform to the requirements as specified in the DID wherever possible.
10.2.6.7. VHDL Simulation Support
The Network Element design will incorporate test benches for simulating the Network
Element as a whole and for simulating major subsections of the Network Element alone.
VHDL test benches will be written to be independent of a particular simulator product.
Each test bench will instantiate the appropriate module, either behavioral or structural,
apply stimuli to the module's inputs, and test the module's outputs against an expected
result. Any discrepancies between actual and expected output will be reported to the simulator
operator. Each test bench will incorporate a configuration to allow selection of the architecture
for the model under test.
A test bench will be developed to test the entire Network Element as a stand-alone
module. In addition, test benches will be developed to test the major subsections of the
Network Element, including the scoreboard, the fault-tolerant clock, the data path voter, the
global controller, and the ring buffer manager as stand-alone modules. Test benches for
lower level entities will not be provided.
10.2.6.8. Error Messages
The format of error messages in the Network Element design will conform to the
requirements as specified in the DID.
10.2.6.9. Annotations
Annotations included in the Network Element design will conform to the requirements
as specified in the DID.
10.2.6.10. Reference to Origin
The models used in the Network Element design will specify the origin of the model as
specified by the DID wherever appropriate.
The Government may supply models for use in the AFTA design. These models may
be purchased by the Government for use in modeling the AFTA design, or provided from
internal Government sources. The origin of these models will be specified, if known.
10.2.6.11. VHDL Documentation Format
The format of VHDL documentation for the Network Element design will conform to
the requirements as specified in the DID wherever appropriate.
A facility for producing an ASCII tape under the specified requirements is currently in
place. The organization of files on the tape will conform to the organization described in the
DID. Not all files specified in the DID will necessarily apply to the AFTA brassboard
VHDL models.
11. AFTA Validation and Verification
AFTA will be used to provide mission- and vehicle-critical services in a range of Army
applications. Some means must be defined to provide a reasonable degree of assurance that
a given AFTA implementation will in fact perform the required mission functions; these
means include the validation and verification processes.
Validation refers to the process of demonstrating that an implemented system correctly
performs its intended functions, e.g., helicopter TF/TA/NOE/FCS, under all reasonably
anticipated operational scenarios, fault conditions, computational loads, etc. Thus, what is
ultimately wanted is a "validated system," about which one can state "This computer can
perform helicopter TF/TA/NOE/FCS." To attempt to capture the relevant characteristics of
a system which is believed to be capable of performing a mission's intended functions, a
system specification is written with which the implemented system must comply, with the
hope that a system meeting the specification will also perform the mission's intended
functions†. Given a well-written specification, one can typically come close to building a
system that meets it. However, human fallibility and the ambiguities of language - both
informal and otherwise - conspire to ensure that no specification can completely describe the
needs of a mission. In recognition of this, lengthy and expensive validation testing is
typically performed on an implemented system, in which it is demonstrated that the system can
in fact perform a specified set of the mission's intended functions over a specified envelope
of operational conditions. The system is then called validated.
Given a specification, say of delivered throughput, the verification process
demonstrates that an implemented system meets the specification. For the helicopter
TF/TA/NOE/FCS example, it is perhaps desired to make the statement "This computer
delivers X_VG,delivered DAIS MIPS to the application program." The belief that a delivered
throughput of X_VG,delivered DAIS MIPS is sufficient to perform helicopter
TF/TA/NOE/FCS is a critical link between successful performance of the mission's in-
tended function and compliance with the system specification. Verification is relatively
more straightforward than proving that an implemented system performs the mission's
intended functions, because, whereas a mission environment incorporates the innumerable
vagaries, uncertainties, and complexities of real life, a specification is ultimately a list of
requirements that can be enumerated and checked off during the verification process. The
computer's delivered DAIS MIPS throughput can be benchmarked. In rare cases, a system
specification is sufficiently exhaustive and accurate that a system which meets the
system's formal specification document is capable of performing the system's intended
function; this assertion itself must of course be proved. In these cases, it is only necessary
to verify that the system as implemented meets the system's formal specifications;
validation is proved by the presumed logical transitivity between the specifications and the
mission's intended function. Even in this idealized example, it is safe to say that the empirical
phase of validation will never be eliminated; nobody would want to fly in an aircraft which
had never been flight tested but which, we are assured, can be formally proven to meet its
specifications.
† In actuality a hierarchy of specifications and requirements is written. For the purpose of
the present discussion it is assumed that a single specification document suffices to describe
the system.
Their limitations notwithstanding, specifications are written which serve as a mutually
understood representation of the mission designer's understanding of what characteristics
the computer system must have to perform the mission's intended functions, and the com-
puter designer's understanding of what requirements the computer system must meet.
Three categories of statements can be made about a system. The first category contains
statements (either regarding the mission's intended function or a specification item) whose
validity can only be affirmed or negated via empirical test and evaluation for each and every
differing implementation. The MTBF of an AFTA LRM is one example of such a state-
ment: as the LRM changes due to technology insertion, or as the operational environment
changes, the LRM's predicted MTBF must be verified via a reliability evaluation plan.
While often unavoidable, use of such statements should be minimized since they comprise
a large contribution to the cost of validating and verifying a system, and do not appropri-
ately leverage experience gained from the implementation of prior systems. The latter two
categories are indicative of a system which is described by the seeming oxymoron
"validated independent of the application."
The second category contains logical statements (either regarding the mission's in-
tended function or a specification item) which can be formally stated and shown to be an
intrinsic property of the system when designed according to simple, unambiguous rules
and guidelines. For example, one such statement refers to AFTA's Byzantine Resilience.
Such statements have value because their relationship to these rules and guidelines has only
to be shown once; subsequent implementations have only to be inspected to ensure that
they comply with the rules and guidelines, and hence the statement about the system holds.
The third category contains quantitative statements (either regarding the mission's
intended function or a specification item) describing certain important characteristics of a
system that are valid, to an extent, independent of the application in which that system is
used. These characteristics are not usually fixed numerical quantifications of system
attributes such as performance, reliability, component MTBF, etc., because these attributes
change from implementation to implementation, as technology insertions occur, etc.
Instead, implementation-invariant formulations are much more valuable in facilitating the
cost-effective, safe, and predictable reuse of AFTA for various applications. One example
might be the AFTA temporal overhead due to fault tolerance, surely an important
determinant of delivered throughput. While it is tempting to state that fault tolerance consumes
some fixed and attractively low temporal overhead based on prior experience, it is in fact
the case that the fault tolerance-related overhead is a function of how often the application
program invokes fault tolerance-related functions such as voting, synchronization,
interactive consistency, etc. Thus it is critical to know the relationship between the application's
invocation of these functions and the loss of throughput to allow the intelligent and
informed design and partitioning of application tasks in AFTA and the accurate prediction of
the actual delivered throughput in an implementation.
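One simple way to picture such an implementation-invariant formulation is a linear
model in which each fault tolerance-related function costs a fixed time per invocation,
so that the overhead fraction is the sum of invocation rate times cost. The Python
sketch below assumes exactly that; the linear form, the function names, and every
number in it are hypothetical and are not taken from the AFTA performance models.

```python
def fault_tolerance_overhead(invocation_rates_hz, cost_per_call_s):
    """Fraction of each second spent in fault tolerance-related functions.

    invocation_rates_hz maps a function name to calls per second;
    cost_per_call_s maps the same names to seconds consumed per call.
    """
    return sum(rate * cost_per_call_s[name]
               for name, rate in invocation_rates_hz.items())


def delivered_throughput(raw_mips, invocation_rates_hz, cost_per_call_s):
    """Raw throughput scaled by the time left over for the application."""
    overhead = fault_tolerance_overhead(invocation_rates_hz, cost_per_call_s)
    if overhead >= 1.0:
        raise ValueError("fault tolerance functions consume all available time")
    return raw_mips * (1.0 - overhead)


# Hypothetical numbers: 100 votes/s at 1 ms each, 50 syncs/s at 2 ms each.
rates = {"vote": 100, "sync": 50}
costs = {"vote": 0.001, "sync": 0.002}
print(fault_tolerance_overhead(rates, costs))    # about 0.2
print(delivered_throughput(10.0, rates, costs))  # about 8.0 of 10 raw MIPS
```

Under this assumed model the per-call costs belong to the implementation while the
invocation rates belong to the application, which is the separation the paragraph
above argues for.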
In reality, general statements about AFTA are supported by statements from all three
categories described above. For example, a statement about AFTA reliability includes
statements about LRM MTBFs (only available via empirical test and evaluation for each
LRM type during Reliability Development and Growth Testing), Byzantine Resilience (a
logical attribute of AFTA which can be shown via adherence to simple architectural rules),
and fault latency (which is partly a function of the frequency at which the FDIR/C-BIT task
is executed on a VG).
In summary, validation is the process of demonstrating, with a high degree of
confidence, that a system correctly performs its intended mission functions, for example,
helicopter flight control. Verification is the process of demonstrating that a system meets its
specifications, for example, 10 MIPS throughput. A verified system becomes a validated
system when a correspondence has been shown between the specifications and the intended
mission functions, for example that 10 MIPS is sufficient throughput to perform helicopter
flight control. Application dependent specifications, such as throughput, cause the validation
process to be repeated for each new application. However, there are certain logical
statements that can be made about AFTA attributes that hold true independent of applications,
such as Byzantine Resilience. In that sense, once these attributes of AFTA have been
demonstrated, one can say that AFTA is partially validated independent of the intended
application requirements.
11.1. Verifiable AFTA Attributes
The following list shows AFTA attributes which will be verified during the Dem/Val
phase. This set of attributes can be expanded upon guidance from the Army.
Functional Correctness and Byzantine Resilience
    Fault Containment
    NE Synchronization
    Interactive Consistency
    Voting
    Message-Release Authorization (Scoreboard)
    Functional Synchronization
    Byzantine Resilient Virtual Circuit Abstraction
    Reconfigurability
    Rate Group Scheduling
    Intertask Communication Services
    I/O Services
    Redundancy Management (FDIR) Software
Performance-related Attributes
    Delivered throughput
    Available memory per VG
    Effective intertask communication bandwidth
    Effective Input/Output bandwidth
    Task iteration rate
Reliability-related Attributes
    AFTA reliability and availability
Cost-related Attributes
    Cost per Unit of Service
Physical Attributes
    Weight
    Power
    Volume
Table 11-1. Verifiable AFTA Attributes
If this list seems somewhat short it should be borne in mind that each attribute listed
above must be plausibly verified within a reasonable cost and time; therefore it makes sense
to keep the size of the list to the minimum needed to specify the core AFTA attributes
needed to perform the mission's intended function.
The AFTA Conceptual Study focuses on "predictive" verification, in which AFTA is
verified using a combination of predictive and corroborative verification means. First, the
AFTA's verifiable attributes are enumerated; these correspond to the requirements
definition format defined in Section 2 of this report. In the predictive phase of the verification,
these AFTA attributes are predicted via performance, reliability, availability, and
cost models as described in Section 9. The predictive phase runs through the Conceptual
Study and Detailed Design phases of the AFTA program. In the corroborative phase,
executed during the Brassboard Fabrication, Integration, and Validation phase of the AFTA
program, critical model inputs are verified via empirical test and evaluation, or sensitivity
studies are performed to obtain bounds on the effects of unverifiable parameters. In
addition, quantities predicted by the models which can be empirically verified are measured to
corroborate the models' accuracy.
In empirical verification it is necessary that either (a) sufficiently large sample sizes
be obtained to produce statistically sound results to support the modeling assertions or
(b) the architecture be designed to minimize reliance on assertions which rely on
statistical parameters which cannot be obtained with a reasonable amount of effort. The
AFTA design is intended to be strongly biased towards the latter approach.
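To give a feel for how quickly option (a) becomes expensive, the Python sketch below
uses the standard zero-failure binomial argument: if n independent trials all succeed,
one can claim with confidence C that the per-trial success probability is at least p
provided p**n <= 1 - C. This argument and the function name trials_required are
supplied here for illustration; they are not taken from the AFTA verification plan,
and the independence of trials is an assumption.

```python
import math


def trials_required(min_probability, confidence):
    """Smallest number of consecutive successful independent trials needed
    to claim, with the given confidence, that the per-trial success
    probability is at least min_probability (zero failures observed)."""
    if not (0.0 < min_probability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(min_probability))


# Demonstrating 99% success at 95% confidence takes roughly 300 clean trials;
# each additional "nine" multiplies the count by about ten.
for p in (0.9, 0.99, 0.999):
    print(p, trials_required(p, 0.95))
```

The rapid growth in required trials as the target probability approaches one is the
practical reason for preferring designs whose claims do not depend on such parameters.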
11.2. Verification of Byzantine Resilience and Operational Correctness
Numerous functions must be performed correctly for AFTA to meet its advertised per-
formance and reliability requirements. These functions may be expressed as a set of logical
statements about the arrangement and operation of the architecture which are independent of
any application in which AFTA is used. Thus, if correctly implemented, these statements
may be viewed as being valid regardless of any application.
The first set of specifications describes architectural features which are required for
AFTA to be a Byzantine resilient message-passing parallel processor; these features are
largely implemented in the Network Element. The second set of specifications describes
the functionality required for the AFTA OS to successfully perform scheduling, message
passing, I/O, and redundancy management.
11.2.1. Fault Containment
AFTA must be partitioned into at least four Fault Containment Regions (FCRs), each of
which contains at least one NE. The FCRs must possess independent sources of power and
independent clocks, be dielectrically isolated from each other, and, if damage tolerance is
required, be physically separated from each other. Verification of each of these requirements
is performed by inspection of the AFTA design and implementation.
11.2.2. NE Synchronization
The AFTA Network Elements must be synchronized to each other to within a known
skew. The NEs achieve synchronization via the use of a set of circuitry known as the Fault
Tolerant Clock (FTC), which executes a Byzantine resilient phase locking algorithm. The
synchronization algorithm itself must first be specified and shown to be Byzantine resilient.
This has been done via journal-style mathematical proof in [Kri85].
The circuitry must then be shown to correctly implement the algorithm. It is preferable
that this be done via the process of formal specification and verification. In fact, because
the NE is a critical component of AFTA it is highly recommended that its entire
functionality be subjected to a well-supported program of formal specification and
verification. This is feasible with the current state of formal verification technology but is
inappropriate within the limited AFTA Brassboard Dem/Val schedule and budget.
Therefore the approach taken in the Brassboard Dem/Val phase is to design the FTC and
other NE hardware according to standard engineering practice, including detailed
specification, design, implementation, and test reviews, with the intent that eventually
formal methods will be applied to this circuitry. Under the Brassboard Dem/Val phase, the
NE synchronization skew will be measured in the absence of faults and in the presence of
Byzantine faults.
11.2.3. Interactive Consistency
Distribution of data from one member to all members of a redundant VG must be per-
formed using a Byzantine resilient interactive consistency algorithm. The algorithm used in
AFTA has been formally specified and demonstrated to be Byzantine resilient in [LSP82].
The circuitry must be shown to correctly implement the algorithm. The preferred approach
to verification of the correctness of the AFTA interactive consistency function is via the
process of formal specification and verification. Again, because of schedule and budget
constraints, a more traditional hardware engineering process will have to do during
Dem/Val. The interactive consistency circuitry will be empirically shown to correctly
implement the algorithm for all possible data sources and all possible data destinations, in the
presence of one Byzantine fault.
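The exchange that such a test exercises can be illustrated in software. The following is a minimal sketch of a two-round, four-participant exchange in the style of the oral-messages algorithm of [LSP82]; it is not the AFTA NE circuitry, and the names `om1`, `relay`, and `DEFAULT` are hypothetical. A faulty source is modeled by sending different values to different receivers, and a faulty receiver by a `relay` function that misreports what it received.

```python
from collections import Counter

DEFAULT = None  # value adopted when no strict majority exists (an assumption)


def majority(values):
    """Return the strict-majority value of `values`, or DEFAULT if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else DEFAULT


def om1(source_sends, relay=lambda sender, receiver, value: value):
    """One source and three receivers, tolerating one Byzantine participant.

    source_sends: dict mapping receiver id -> value the source sent it (round 1).
    relay:        models round 2; returns the value `sender` reports to `receiver`.
    Returns a dict mapping receiver id -> value that receiver decides on.
    """
    receivers = sorted(source_sends)
    decided = {}
    for r in receivers:
        seen = [source_sends[r]]  # value received directly from the source
        seen += [relay(s, r, source_sends[s]) for s in receivers if s != r]
        decided[r] = majority(seen)
    return decided
```

With a faulty source sending `{1: "a", 2: "a", 3: "b"}`, all three receivers decide on `"a"`; with a faulty receiver, the two non-faulty receivers still decide on the value the source sent.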
11.2.4. Voting
Messages emanating from redundant VGs are passed through a majority vote function
implemented in the NEs. The NEs must be capable of voting messages arriving from
triplex and quadruplex VGs. The voter is maskable, and generates vote syndromes for de-
livery to the destination VG. Voting is easily expressed mathematically and demonstration
of its Byzantine resilience is relatively straightforward. The NE's voting circuitry must be
shown to correctly implement the algorithm, and the above comments regarding formal
specification and verification hold. The approach taken in the Dem/Val phase is to design
the voter according to standard engineering practice. The voter will be empirically shown
to correctly vote input messages. In the presence of faulty input messages, the voter will
be shown to correctly generate a syndrome corresponding to the faulty input. The voter
will be shown to have the capability to mask out an input such that it can not contribute to
the voted outcome. The response of the voter to out-of-specification inputs, such as "two-two
splits," will be demonstrated.
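The voter behavior to be demonstrated can be summarized in a small software model. The sketch below is illustrative only: it votes whole messages rather than modeling the NE hardware, the function name `vote` and its return convention are hypothetical, and the treatment of a two-two split (no voted value, every unmasked input flagged) is an assumption rather than the specified AFTA response.

```python
from collections import Counter


def vote(inputs, mask=None):
    """Majority-vote one message across the members of a triplex or quadruplex VG.

    inputs: one message (bytes) per member.
    mask:   optional list of booleans; True excludes that member from the vote.
    Returns (voted_value, syndrome). voted_value is None when no strict
    majority exists; syndrome[i] is True when unmasked member i disagrees
    with the voted value (all unmasked members are flagged if there is none).
    """
    if mask is None:
        mask = [False] * len(inputs)
    active = [msg for msg, masked in zip(inputs, mask) if not masked]
    if not active:
        raise ValueError("all inputs are masked out")
    value, count = Counter(active).most_common(1)[0]
    voted = value if 2 * count > len(active) else None
    syndrome = [(not masked) and msg != voted for msg, masked in zip(inputs, mask)]
    return voted, syndrome
```

For example, `vote([b"a", b"a", b"x"])` yields the voted value `b"a"` with only the third syndrome entry set, and masking a member removes it from both the vote and the syndrome.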
11.2.5. Message-Release Authorization (Scoreboard)
The NE Scoreboard has responsibility for deciding which inter-VG messages may be
transmitted. This calculation is a complicated function of the message request pattern, the
flow control request pattern, the redundancy configuration of AFTA, the sender's and re-
ceiver's redundancy levels, and elapsed time. The baseline approach to specification and
verification of the Brassboard Scoreboard is, again, standard engineering practice. The
Scoreboard will be shown to generate correct message release decisions from a large set of
input message request patterns, where the patterns will in many cases be those resulting
from faulty source and destination PEs.
However, because of its complexity and criticality, the Scoreboard has been targeted
for more advanced specification, design, and verification approaches. If these approaches
are cost-effective they may be used for other parts of the NE. The Scoreboard algorithm
has been expressed in the C programming language and by a VHDL description. These C
and VHDL descriptions have been stimulated with representative message request patterns
to help in identifying errors in the specification. The patterns serve as a verification suite
for successively detailed representations of the Scoreboard. VHDL will be used in con-
junction with ASIC synthesis tools to automatically transform the high-level behavioral de-
scription of the Scoreboard to an implementation. Verification of the fidelity of the
Scoreboard implementation to the behavioral specification will be performed by applying
the message pattern verification suite to the representation of the detailed implementation.
This is facilitated by the ability of automated VHDL synthesis tools to provide mechanical
traceability between the hierarchical representations of the design, all the way from the
high-level behavioral description to the transistor level.
To investigate the feasibility of formal specification and verification of AFTA hardware,
the VHDL and textual description of the Scoreboard is being transformed into a formal
specification under a collaborative effort with Odyssey Research Associates. It is expected
that the expression of the Scoreboard algorithm in a rigorous formalism will assist substan-
tially in revealing incompleteness, ambiguity, and inconsistency in the specification. It will
also serve as a concrete case study for estimation of the time and effort in constructing for-
mal specifications and design verifications of other parts of the NE.
11.2.6. Reconfigurability
The mapping of PEs to VGs in AFTA may change upon command from a VG having
an appropriate redundancy level. It is in fact this capability to reconfigure processing re-
sources in real time which gives AFTA its power to provide high reliability and availability
across a wide variety of missions and mission modes. The mapping of PEs to VGs is ef-
fected through a Configuration Table (CT) resident in the NEs. The CT is used by the
Scoreboard to interpret message request patterns and determine the sources and destinations
of messages. The CT in the NE is changed upon reception of a command known as a "CT
Update" emanating from an appropriate VG. It must be verified that, given a CT, a given
message request pattern results in selection of the appropriate message source and destina-
tion. This is subsumed in the above discussion on verification of the Scoreboard correct-
ness. Next, it must be verified that a given CT Update in fact causes the Scoreboard to cor-
rectly reconfigure the mapping from physical (PE) to virtual (VG) resources according to
the CT Update's contents. This rests on verification of two statements. First, it must be
verified that the CT Update emanating from the VG is correctly voted and delivered to the
NEs. This is subsumed in the verification of the correctness of the NE's voting function.
Second, it must be verified that the Scoreboard receives the voted CT Update and correctly
updates its CT. Verification of this attribute is also subsumed in the Scoreboard verifica-
tion effort.
11.2.7. Functional Synchronization
The members of a redundant AFTA VG are synchronized via the synchronous reception
of copies of a message, as described in Section 3. To perform functional synchronization,
the members of a VG transmit a message to themselves and await its reception. The
synchronized NEs perform identical computations on the message request pattern, vote or
perform interactive consistency on the message, and deliver the message to the destination
VG with a small skew. To verify that functional synchronization in fact synchronizes the
VG members it is necessary to verify that, at synchronization points determined by the
AFTA OS, the VG members transmit a message and await its reception. This is done via
examination of the OS code which purports to achieve functional synchronization. It is
next necessary to verify that the NEs are synchronized, that they perform identical compu-
tation on the message request pattern (via the Scoreboard), that they can vote or perform
interactive consistency on the message, and that they select the correct sources and destina-
tions for the message. Means for verifying these assertions are enumerated above. Func-
tional synchronization will be demonstrated both in the presence and absence of PE and NE
faults. The time required for synchronization and the post-synchronization skew will be
measured.
11.2.8. Byzantine Resilient Virtual Circuit Abstraction
The BRVC, described in Section 3, is the major inter-VG communication abstraction
provided by the NEs comprising the fault tolerant core of AFTA. All higher-level AFTA
OS functionality relies upon this abstraction for message ordering, correctness, and deliv-
ery skew. The BRVC comprises the following Byzantine resilient guarantees:
Guarantee 1: Messages sent by non-faulty members of a redundant source
VG are correctly delivered to the non-faulty members of re-
cipient VGs.
Verification: This guarantee is verified by demonstrating that the NE ag-
gregate can correctly vote messages (Section 11.2.4) and
route them from their source to destination VG (Section
11.2.6).
Guarantee 2: Non-faulty members of recipient VGs receive messages in
the order sent by the non-faulty members of the source VG.
Verification: This guarantee is verified by demonstrating that the NE
Scoreboard releases a message emanating from a source VG
before releasing any others from that same VG (Section
11.2.6).
Guarantee 3: Non-faulty members of recipient VGs receive messages in
identical order.
Verification: This guarantee is verified by demonstrating that all NEs
achieve interactive consistency on the message request pat-
tern (Section 11.2.3), and all Scoreboards correctly execute
the same message release algorithm (Section 11.2.6).
Guarantee 4: The absolute times of arrival of corresponding messages at
the members of recipient VGs differ by a known upper
bound.
Verification: This guarantee is verified by demonstrating that all NEs are
synchronized (Section 11.2.2).
11.2.9. Rate Group Scheduling
The next set of logical statements refers to the AFTA OS. Under the Detailed Design
phase of the program, a Software Development Plan (SDP) and Software Requirements
Specification (SRS) will be constructed for the AFTA OS. The SRS will describe the
functionality of each of the major AFTA OS functions as well as qualification, acceptance,
and verification tests for each. The discussion presented below represents a high-level
overview of the most important functions achieved by the OS and the means for their veri-
fication.
At the highest level of abstraction, the Rate Group dispatcher is responsible for starting
tasks belonging to a given Rate Group (RG) at the beginning of that RG. Each minor
frame demarcates a specified set of RGs; tasks in these RGs must have finished one itera-
tion and are prepared to begin their next iteration. The following table illustrating the RG
boundaries is reproduced from Section 9.
Frame Boundary    Completed RGs    Started RGs
7-0               4, 3, 2, 1       4, 3, 2, 1
0-1               4                4
1-2               4, 3             4, 3
2-3               4                4
3-4               4, 3, 2          4, 3, 2
4-5               4                4
5-6               4, 3             4, 3
6-7               4                4

Table 11-2. Completed/Started RGs vs. Minor Frame Boundary
It must be verified that, on the appropriate frame boundary, the RG dispatcher enables
execution of the tasks whose RG frames are about to begin via setting an event corresponding
to each RG; RG tasks which have just completed their frames are awaiting the setting of
this event to resume execution. To verify that the appropriate RG tasks are started on the
appropriate RG frame boundaries, it must be shown that the RG dispatcher sets the appropriate
events upon each minor frame boundary. In addition, it must be shown that each RG
task awaits the setting of the event corresponding to its RG frame upon completion of each
of its iterations. Formal specification and verification of this and other selected AFTA OS
functionality should be performed; while somewhat costly and difficult given the current
state of the art, the criticality of the OS functionality to the AFTA's reliable operation merits
this effort. Because of cost constraints in the AFTA Brassboard phase, verification of the
correct implementation of the OS functionality will be done via standard software engineering
practice, which includes detailed specification, documentation, review, and exhaustive
testing of the OS and application interface code. In the testing phase, it will be demon-
strated that the RG dispatcher correctly dispatches a specified number of tasks in each RG,
ranging from zero to the maximum number specified by the dispatcher specification. Tasks
shall be constructed which test the error handling capabilities of the dispatcher. For exam-
ple, such tasks shall generate frame overruns in order to test the dispatcher frame overrun
detection and recovery capability, generate Ada exceptions which have no handler and
hence trap to the dispatcher handler, and attempt other malfeasance.
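A test oracle for the dispatcher's event-setting behavior can be derived directly from Table 11-2. The sketch below is a minimal illustration, not AFTA OS code: the periods in `RG_PERIOD` (RG4 every minor frame, RG3 every second, RG2 every fourth, RG1 every eighth) are inferred from the table, and the function names are hypothetical. Since the table lists the same RGs as completed and started at each boundary, one function serves for both columns.

```python
# Hypothetical periods, in minor frames, inferred from Table 11-2.
RG_PERIOD = {4: 1, 3: 2, 2: 4, 1: 8}
FRAMES_PER_MAJOR = 8  # minor frames 0 through 7


def rgs_at_boundary(next_frame):
    """Return the RGs whose events must be set on entry to minor frame `next_frame`."""
    return [rg for rg in sorted(RG_PERIOD, reverse=True)
            if next_frame % RG_PERIOD[rg] == 0]


def boundary_table():
    """Map each frame boundary label (e.g. "7-0") to the RGs started there."""
    return {f"{(k - 1) % FRAMES_PER_MAJOR}-{k}": rgs_at_boundary(k)
            for k in range(FRAMES_PER_MAJOR)}
```

Comparing the events actually set by the dispatcher on each boundary against `boundary_table()` gives a simple pass/fail check for the property described above.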
11.2.10. Intertask Communication Services
AFTA tasks communicate using the intertask communication services described in Sec-
tion 5. The services consist of four main components. The message enqueueing compo-
nent receives a message from an application task, partitions it into Network Element pack-
ets, and places these packets on the message queue corresponding to the RG hosting the
application task. At each minor frame boundary, the packet transmission component
transmits the queues of packets corresponding to just-completed RGs into the Network
Element for transmission to the destination VG(s). The packet delivery interrupt service
routine fields packet delivery interrupts from the Network Element by copying the delivered
packet to the appropriate incoming packet queue. Finally, the message reception compo-
nent updates frame markers, constructs completed messages from the incoming packet
queue, and makes these messages available to destination tasks. Each of these functions
must be specified and verified in detail. Formal methods are again recommended but be-
cause of their cost will not be utilized during the Brassboard construction. Again, standard
software development practice will be used, which will include a rigorous testing program
in the presence of faults. During the testing phase, it will be demonstrated that the intertask
communication services correctly transmit messages for all message sizes up to the maxi-
mum allowable in the intertask communication specification, for all exchange classes, and
for all source and destination task combinations including broadcasts. Tasks shall be con-
structed which test the error handling capabilities of the communication services. Such
tasks shall attempt to overflow their transmit buffers to test outgoing flow control capabil-
ity, cease reading incoming messages to test the incoming flow control and buffer overrun
containment features of the communication services, attempt to cause flow control at selected
destination VGs by sending many large messages to them, send messages to nonexistent
tasks and VGs, send illegal message classes, send illegal message sizes, and exhibit other
erroneous behavior.
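The enqueueing and transmission components described above can be modeled in a few lines for use as a reference in message-size tests. This is a sketch under stated assumptions: the 64-byte value of `PACKET_BYTES` is a placeholder and not a figure taken from this report, and the class and method names are hypothetical.

```python
from collections import defaultdict, deque

PACKET_BYTES = 64  # hypothetical packet payload size; not taken from the report


def packetize(message, packet_bytes=PACKET_BYTES):
    """Split a message into fixed-size packets (the last one may be short)."""
    if not message:
        return [b""]
    return [message[i:i + packet_bytes]
            for i in range(0, len(message), packet_bytes)]


class OutgoingQueues:
    """One packet queue per Rate Group, drained at minor frame boundaries."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def enqueue(self, rate_group, message):
        """Partition a message and queue its packets under the sender's RG."""
        self.queues[rate_group].extend(packetize(message))

    def transmit(self, completed_rgs):
        """Return, in order, the packets queued by the just-completed RGs."""
        sent = []
        for rg in completed_rgs:
            sent.extend(self.queues[rg])
            self.queues[rg].clear()
        return sent
```

A test harness could enqueue messages of every size up to the specified maximum and confirm that concatenating the transmitted packets reproduces each original message.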
11.2.11. I/O System Services
The AFTA I/O System Services (IOSS) are composed of four components. The I/O
dispatcher ensures that I/O requests (IOR) are executed and processed on appropriate
frames. The IOR execution component initiates concurrent and sequential I/O. The IOR
processing component reads input and status data and delivers it to the destination task(s).
The "back end" device drivers to which the IORs interface perform the detailed bit- and
byte-level manipulation of the interface to the IOCs. The correctness of each of these com-
ponents must be verified. It must be shown that the I/O dispatcher invokes execution and
processing of each IOR in the frames in which it is specified. Each IOR must be shown to
execute and process the correct I/O chains when it is scheduled, and interface to the correct
back end I/O driver routines. These are algorithmic functions which lend themselves to
formal specification and verification. While the baseline approach to specification and verification
of the AFTA IOSS components follows standard software development practice,
these services are also being targeted for the use of a formal software specification and ver-
ification tool; if the use of this tool proves cost-effective then it may be used on other AFTA
software components. Finally, because of their highly specific and non-algorithmic nature,
the back end I/O driver routines' correctness will be demonstrated via standard software
development procedures, which will include extensive testing in the presence of faults.
Tasks shall be constructed which stress the error handling capabilities of the IOSS.
11.2.12. Redundancy Management (FDIR) Software
The FDIR software is responsible for testing components, detecting and identifying
faulty components, performing fault recovery actions, and performing reconfigurations as
mission modes change. As can be seen from Section 5, the FDIR function can be extremely
complex and require the collaboration of multiple VGs in a distributed, fault tolerant
algorithm. Moreover, although much of FDIR is active in the absence of faults, the true
utility of FDIR is in the presence of faults, as it determines how AFTA will respond to
faulty behavior. These complexities require the use of a set of verification techniques. The
FDIR functions relevant to a single VG may be specified in detail, possibly using formal
methods, and it can be shown that the FDIR code as implemented correctly reflects its
specification, either through standard software development means or formal means. This
includes showing that the FDIR task, when scheduled by the RG dispatcher, correctly performs
the intra-VG presence tests, and, if a member is faulty, correctly identifies the faulty
member and performs the selected local recovery procedure. AFTA-wide fault detection,
diagnosis, and recovery actions may also be specified and analytically verified, but
analytically showing the correctness of implementation in the presence of faults, especially of the
recovery functions, may be very difficult because the loose synchronization between multiple
VG participants adds temporal complications.
In addition to analytical means, FDIR must at least in part be verified by a process of
empirically examining its response to faulty behavior. Faults may be injected in software
(e.g., FIAT, DEPEND) or hardware (e.g., AIi_ FFMP), each of which has its merits and
will be used judiciously. For each fault recovery policy to be used in an implementation, it
is necessary to fault each LRM while AFTA is in every possible configuration of redundant
VGs and IOCs, and confirm that the fault is detected and the designated recovery policy is
carried out. Timing data will also be obtained. Empirical test and evaluation is particularly
important in verifying multi-VG response to faults. The number of fault and configuration
combinations in a testing program such as this is admittedly large but it is felt that the FDIR
function is of sufficient importance that the effort be made. Certain simplifications can be
made to make the testing program tractable, such as injecting errors only at the LRM level,
and limiting the temporal behavior of faults to permanents and transients. Automating the
fault/error injection and data acquisition processes will be necessary.
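The size of such a campaign, and the input to an automated injection harness, can be generated mechanically. The sketch below simply enumerates the cross product of LRMs, configurations, and the two fault classes named above; the function name, the record layout, and the example LRM and configuration labels are all hypothetical.

```python
from itertools import product


def fault_campaign(lrms, configurations, fault_types=("permanent", "transient")):
    """Enumerate one test case per (LRM, configuration, fault type) combination."""
    return [
        {"lrm": lrm, "configuration": config, "fault_type": fault}
        for lrm, config, fault in product(lrms, configurations, fault_types)
    ]


# Illustrative labels only; a real campaign would draw these from the
# AFTA implementation under test.
example_cases = fault_campaign(
    lrms=["PE-1", "PE-2", "NE-1"],
    configurations=["one quadruplex VG", "one triplex VG plus one simplex VG"],
)
```

Here `len(example_cases)` is 3 x 2 x 2 = 12; the multiplicative growth of this count with the number of LRMs and configurations is why automation of injection and data acquisition is needed.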
11.3. Verification of Performance Predictions
Verification of performance predictions usually requires empirical measurement of the
quantity of interest. The AFTA PEs' throughput(s) may be initially estimated via empirical
benchmarking using Whetstones, Dhrystones, and other mutually agreed upon empirical
benchmarks; alternatively, the throughput(s) may be analytically evaluated as in a DAIS
mix calculation. Empirical timing measurement of existing AFTA components may be per-
formed with the use of processor emulator pods, logic analyzers, processor-asserted dis-
crete outputs to logic analyzers or oscilloscopes, or on-processor timers. All empirical
evaluations must be done in the presence of worst-case faults such as a failure of a VG's
processor or a Network Element. It should be borne in mind that sample sizes obtainable
from empirical evaluation, while overwhelmingly large, may be insufficient to support sta-
tistically viable assertions that hard real-time constraints will be met with a probability
commensurate with the reliability requirements of flight-critical systems, that is, on the or-
der of 0.999,999,999. Therefore the AFTA's quantitative characteristics are intended to
minimally rely on such assertions.
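The scale of the sample-size problem can be made concrete with a standard zero-failure argument, offered here as an illustration rather than as part of the AFTA models. Assuming independent trials, if the per-trial failure probability were p, the chance of observing n failure-free trials is (1 - p)^n; to claim with confidence C that the failure probability does not exceed p, n must satisfy (1 - p)^n <= 1 - C. The function name below is hypothetical.

```python
import math


def trials_needed(failure_prob, confidence):
    """Failure-free independent trials needed to bound the failure probability.

    Solves (1 - failure_prob) ** n <= 1 - confidence for the smallest integer n.
    """
    # log1p(-p) computes log(1 - p) accurately for very small p.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-failure_prob))
```

For a success probability on the order of 0.999,999,999 (p = 1e-9) and 95 percent confidence, this gives roughly three billion failure-free trials.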
11.3.1. Delivered Throughput per VG
The delivered throughput per VG is defined to be the raw PE throughput minus operat-
ing system, redundancy management, and synchronization overheads. The delivered
AFTA throughput is equal to the delivered throughput per VG times the number of VGs in
AFTA. A model relating the raw throughput to the delivered throughput and various over-
heads is presented in Section 9. The following parameters of this model must be verified
either via inspection of the design and application tasks, empirical measurement, or static
calculation of upper-bounds on execution times as in [Pus89].
Parameter                                   Verification Technique

XVG,raw, the raw VG throughput              Determined via benchmarking according to
                                            standard benchmarks; depends critically on
                                            computation characteristics

THK, the dispatcher housekeeping time       Empirical timing measurement on AFTA or
                                            calculation of execution upper-bound

NMESSAGES,i, the number of messages sent    Empirical inspection of application tasks
in frame i

TSU, the setup time required to begin       Empirical timing measurement on AFTA or
sending a single message                    calculation of execution upper-bound

Sk, the size (in Network Element packets)   Empirical inspection of application tasks
of outgoing message k in frame i

Tp, the incremental time required to send   Empirical timing measurement on AFTA or
one packet                                  calculation of execution upper-bound

NMESSAGES,i, the number of messages         Empirical inspection of application tasks
received in frame i*

TSU, the setup time required to begin       Empirical timing measurement on AFTA or
receiving a single message                  calculation of execution upper-bound

Sk, the size (in Network Element packets)   Empirical inspection of application tasks
of incoming message k in frame i*

Tp, the incremental time required to        Empirical timing measurement on AFTA or
deliver one packet*                         calculation of execution upper-bound

NTASKS,i, the number of tasks to be         Empirical inspection of application tasks
started in frame i

TEV, the time required for the dispatcher   Empirical timing measurement on AFTA or
to set an event                             calculation of execution upper-bound

TFDI,minor, the time required for one       Empirical timing measurement on AFTA or
minor frame execution of the FDI task       calculation of execution upper-bound

TCS, the context switch time per task       Empirical timing measurement on AFTA or
                                            calculation of execution upper-bound

NTASKS,R4, the number of R4 tasks           Empirical inspection of application tasks

NTASKS,R3, the number of R3 tasks           Empirical inspection of application tasks

NTASKS,R2, the number of R2 tasks           Empirical inspection of application tasks

NTASKS,R1, the number of R1 tasks           Empirical inspection of application tasks

Table 11-3. Verification of Delivered Throughput
* Name reused to avoid nomenclature proliferation.
11.3.2. Available Memory per VG
The available memory per VG is defined to be the gross memory per VG minus the Ada
Run Time System, dispatcher, and FDI memory requirements. The total AFTA VG mem-
ory is defined to be the available memory per VG times the number of VGs in AFTA.
The verification approach for this attribute is straightforward. The gross memory per
VG is determined by inspection of the design of the PEs comprising the VG; this quantity
is probably specified in the PE procurement specification. The Ada RTS, dispatcher, and
FDI memory requirements for a given mission are empirically determined by inspection of
the memory map of the load module of the VG of interest.
11.3.3. Effective Intertask Communication Bandwidth and Latency
The effective intertask communication bandwidth is the size (number of bytes) of an
intertask message divided by the time required for transmission and reception of the mes-
sage. The latency is the time between the transmission of the message by the sending task
and the reception of the message by the recipient task. A model for the latency is presented
in Section 9. The verification parameters needed to allow this model to accurately predict
the effective intertask communication bandwidth and latency are listed in Table 11-4.
11.3.4. Effective I/O Bandwidth and Latency
The effective I/O bandwidth is defined to be the size (in number of bytes) of an I/O
transaction divided by the time required for transmission (reception) of the transaction by
the source (destination). The input latency is the time in seconds between the sampling of
an input byte by the input device and the availability of that byte at the input of the destina-
tion function. The output latency is the time in seconds between when a computational
function generates an output byte for delivery to an output device and when the output de-
vice receives the output byte.
Because the I/O devices to be used in AFTA are not yet definitized, a detailed list of parameters
to be measured can not be constructed. The general outline presented in Table 11-5,
however, seems appropriate for verification of the performance of any type of I/O device
and transaction type.
Parameter                                   Verification Technique

TENQUEUE MESSAGE, the time required for a   Empirical timing measurement on AFTA or
task to enqueue a message for transmission  calculation of execution upper-bound

Tlatency,RG, the time between the           Empirical timing measurement on AFTA
enqueueing of a message by a sending task
and the task's next RG frame boundary†

TSU,XMIT, the time required for the         Empirical timing measurement on AFTA or
communication services to prepare a         calculation of execution upper-bound
message for transmittal

TXMIT, the time required for a packet to    Empirical timing measurement on AFTA
be transferred from the PE to the NE        (straightforward upper-bound measurement)

TNE, the time required for the network      Empirical timing measurement on AFTA
element ensemble to perform the requested   (straightforward upper-bound measurement)
message transmission

TRECEIVE MESSAGE, the time required for     Empirical timing measurement on AFTA or
the communication services to construct     calculation of execution upper-bound
incoming message from packet queue

Table 11-4. Verification of Intertask Communication Bandwidth and Latency

Parameter                                   Verification Technique

Transaction size                            Examination of application code

For memory-mapped I/O, time between         Empirical timing measurement on AFTA
transaction start and transaction           (straightforward upper-bound measurement)
processing completion

For network output, time between            Empirical timing measurement on AFTA
transaction start and reception of data at  (straightforward upper-bound measurement)
output device

For network input, time between             Empirical timing measurement on AFTA
transaction start and initiation of         (straightforward upper-bound measurement)
processing for transaction

Time required for I/O task to process       Empirical timing measurement on AFTA or
transaction information                     calculation of execution upper-bound

Time required for I/O task to deliver       Subsumed under intertask communication
processed transaction information to local  bandwidth and latency verification
or remote destination task(s)

Table 11-5. Verification of I/O Communication Bandwidth and Latency
† The programming model is designed to be insensitive to this parameter.
11.3.5. Iteration Rate of a Task
The iteration rate of a task is defined to be the frequency at which task iterations are ini-
tiated, it being assumed that the execution time of a task iteration is less than the reciprocal
of the iteration rate. The approach to verifying that a task's specified iteration rate will be
serviced by AFTA begins with showing that the task is assigned to a Rate Group (RG) cor-
responding to its desired iteration rate. It must then be verified that the RG dispatcher ini-
tiates the RG at the requisite frequency. Verification of this functionality is subsumed in
Section 11.2.9. All Rate Group tasks will execute at their desired iteration rate if the total
throughput consumed by all RG tasks,
\[ T_{RG} = f_{R4} X_{R4} + f_{R3} X_{R3} + f_{R2} X_{R2} + f_{R1} X_{R1} \tag{11.1} \]
is less than $X_{VG,delivered}$, as calculated in Section 9:
\[ T_{RG} < X_{VG,delivered} \tag{11.2} \]
where
$T_{RG}$ = total throughput consumed by all RG tasks
$f_{Ri}$ = frame rate of RG i
$X_{Ri}$ = throughput requirement of one iteration of all tasks in RG i
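The test expressed by Equations 11.1 and 11.2 is straightforward to mechanize. The sketch below assumes frame rates in iterations per second and per-iteration throughput requirements in instructions, so that the product is in instructions per second; the function names and the numerical values in the usage note are hypothetical.

```python
def rg_throughput_demand(frame_rates, per_iteration_demand):
    """Equation 11.1: sum over RGs of frame rate times per-iteration requirement."""
    return sum(frame_rates[rg] * per_iteration_demand[rg] for rg in frame_rates)


def rates_serviceable(frame_rates, per_iteration_demand, delivered_throughput):
    """Equation 11.2: True if total RG demand is strictly below delivered throughput."""
    return rg_throughput_demand(frame_rates, per_iteration_demand) < delivered_throughput
```

For example, with illustrative rates of 100, 50, 25, and 12.5 Hz for R4 through R1 and per-iteration requirements of 10,000, 20,000, 40,000, and 80,000 instructions, the total demand is 4,000,000 instructions per second, which is serviceable by a VG delivering 5,000,000 but not by one delivering exactly 4,000,000.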
11.4. Verification of Reliability and Availability Predictions
The AFTA reliability, R(t), is the probability that AFTA correctly performs its intended
function during the time interval (0,t), given that it was in a well-defined operational state at
time t=0. The AFTA availability, A(t), is the probability that the system is capable of performing
its intended function at time t, with temporary outages during the time interval (0,t)
being allowed for repair.
As described in Section 9, several formulations can be used to quantify AFTA reliabil-
ity and availability, depending on the redundancy management and fault recovery options in
use. A general expression for AFTA reliability and availability is related to the probability
that AFTA can perform its intended functions at time t, given a particular redundancy management
option; this probability can correspond to AFTA reliability or AFTA availability,
depending on the mission and/or mission phase that it models. Two representative formu-
lations for this probability are given in Section 9. PGD(t) represents the probability that
AFTA can perform its intended functions at time t when the components are managed under
a graceful degradation class of fault recovery policies, in which no service interruption is
incurred. PPR(t) represents the probability that AFTA can perform its intended functions at
time t when the components are managed under a processor replacement class of fault re-
covery policies, in which brief outages may be incurred for fault recovery. Both calcula-
tions assume that all AFTA components are operational at time t=0.
Since the reliability and availability of AFTA in its redundant configurations are far too
high to verify using accelerated life testing, verification of AFTA reliability and availability
must be performed indirectly through a combination of analytical and empirical means.
In the analytical phase, predictive models are constructed for the figure of merit of
interest; the models in Section 9 are two such predictive models. The models rely on the
Byzantine resilience of the underlying AFTA architecture; verification of this assumption is
described in Sections 11.2.1 through 11.2.8. The models also rely on abstractions of
AFTA behavior in the presence of faults; the detailed behavior of a particular fault recovery
option is specified and verified in Section 11.2.12. The correspondence between the
analytical model's abstraction of the behavior and the actual behavior must be verified by
detailed comparison of the (preferably verified) FDIR implementation and the analytical
model. This phase of reliability and availability verification is performed once and for all,
since the validity of the models is independent of the application.
General classes of numerical inputs to the reliability and availability models are listed
below. The numerical inputs may change from implementation to implementation and from
mission to mission, and therefore have to be reverified for each application.
11.4.1. Component Failure Rate
Component failure rates are a first-order determinant of ultimate AFTA reliability and
availability and must be empirically verified or computed in compliance with acceptable
engineering practice. AFTA LRM/LRU failure rates are computed using the Parts Stress
Analysis techniques as specified in MIL-STD-217E. It is assumed that the use of the
MIL-STD-217E-based approach yields failure rate data which are reasonably accurate. In some
cases failure rate data can be empirically corroborated through Reliability
Growth/Development Testing and field data. In the AFTA analytical models constant
failure rates are assumed. This assumption must be verified using acceptable engineering
practices.
11.4.2. Fault Reconfiguration Time
Lengthy fault reconfiguration times can result in a significant probability of AFTA fail-
ure due to a second fault occurring while AFTA is recovering from the first. AFTA is de-
signed to mitigate the effects of such near-coincident faults during the mission by using a
rapid fault reconfiguration, i.e., reconfiguration within 10 ms of vote error manifestation.
For anticipated AFTA mission durations, this short reconfiguration time can reduce the
probability of VG failure due to near-coincident faults (the second term in Equation 9.23) to
a level which is relatively small compared to the probability of VG failure due to attrition
(the first term in Equation 9.23). Minimization of near-coincident faults as a dominant VG
failure mode reduces the need for acquiring statistically significant measurements of the
reconfiguration time, since the architecture's reliability and availability are designed to be
insensitive to this quantity.
Table 11-6 illustrates a typical relationship between the two failure modes' probabilities
for the helicopter mission. Note that the two contributors to VG failure are commensurate
for quadruplex VGs being used for short mission times, and therefore statistically
significant verification of reconfiguration time becomes an issue in this case. In general, large
variations in the AFTA mission duration or reconfiguration time may force this quantity
into prominence. The AFTA reliability and availability models compute the probability of
failure due to near-coincident faults and attrition to allow estimation of their relative
importance and the consequent need for intensive verification of reconfiguration time. In these
models constant reconfiguration rates are assumed, which yield a pessimistic estimation of
probability of failure due to near-coincident faults.
                              Probability of VG           Probability of VG Failure
VG Redundancy Level           Failure due to Attrition    due to Near-Coincident Faults
1 Hour Helicopter Mission
  Triplex                     6.49E-09                    7.21E-14
  Quadruplex                  7.[illegible]4E-13          1.44E-13
2 Hour Helicopter Mission
  Triplex                     2.59E-08                    1.44E-13
  Quadruplex                  4.84E-12                    2.88E-13
Table 11-6. VG Failure Probability Due to Attrition and Near-Coincident Faults
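The relative importance of the two failure modes can be illustrated with a short Python calculation over the legible entries of Table 11-6. This is a sketch only: the helper name near_coincident_share is hypothetical, and the 1-hour quadruplex attrition entry is left out because one of its digits cannot be read in the source.

```python
# (attrition, near-coincident) VG failure probabilities from Table 11-6.
table_11_6 = {
    ("triplex", 1): (6.49e-09, 7.21e-14),
    ("triplex", 2): (2.59e-08, 1.44e-13),
    ("quadruplex", 2): (4.84e-12, 2.88e-13),
}


def near_coincident_share(p_attrition, p_near_coincident):
    """Fraction of the summed VG failure probability due to near-coincident faults."""
    return p_near_coincident / (p_attrition + p_near_coincident)


shares = {key: near_coincident_share(*probs) for key, probs in table_11_6.items()}
for (level, hours), share in shares.items():
    print(f"{level:10s} {hours} h: {share:.2e}")
```

For the triplex rows the near-coincident share is on the order of 1E-5, while for the 2-hour quadruplex row it is several percent, which is the sense in which reconfiguration time matters more for quadruplex VGs.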
During the Brassboard Dem/Val phase, fault reconfiguration times will be empirically
measured. It is expected that they will depend strongly on the reconfiguration policy in ef-
fect, the throughput of the PEs, and the bandwidth of the various communication compo-
nents like the NEs and FCR backplane bus.
11.4.3. Fault Reconfiguration Coverage
The AFTA analytical models assume unity detection and reconfiguration coverage for
faults occurring in triplex and quadruplex VGs, and quadruplex and quintuplex NEs.
These high coverages are due to the architecture's compliance with the requirements of
Byzantine resilience, and are not generally considered empirically verifiable per se.
Verification of the correctness of the synchronization, voting, and FDIR functions that im-
ply this unity coverage has been discussed above. An additional issue arises of verifying
the coverage of faults occurring in degraded triplex VGs, commonly denoted the "duplex
coverage." Three options exist, each of which impacts verification. First, the graceful
degradation policy described in Section 5 can leapfrog the degraded triplex state and
proceed directly to the simplex state. This transition can occur with unity coverage since a
triplex can accurately diagnose its own health in the presence of a single fault and, specifi-
cally, can identify a nonfaulty survivor simplex within its own VG. A second option is to
assume that a degraded triplex VG suffering an additional fault randomly determines which
of its two members is nonfaulty and designates that member as the surviving simplex. This
results in a duplex coverage which is probably about 0.50. Finally, self-tests can be devel-
oped which can increase the duplex coverage to something better than a random guess. For
verification purposes, the first policy is superior since it introduces no additional verifiable
parameters.
11.4.4. VG Redundancy Levels
Verification of the redundancy levels of the VGs comprising an AFTA configuration is
straightforward.
11.4.5. Mission/Hiatus Time
Verification of the mission and hiatus times for an anticipated AFTA mission is straight-
forward. As an anticipated mission evolves the analyses must be continually repeated to
ensure that the targeted AFTA configuration continues to meet the reliability, availability,
and life cycle cost objectives.
11.5. Verification of Cost Predictions
The Fleet Life Cycle Cost per Service Unit (FLCCPSU) is the life cycle cost of a fleet
of vehicles given a required sortie rate, including the cost of additional vehicles required
due to vehicle unavailability, the cost of repairs and spares, the cost of redundancy, and the
cost of vehicles lost due to vehicle unreliability, over the fleet service life. A simplified
model for the FLCCPSU is given in Section 9. As for the reliability and availability,
FLCCPSU can not be directly measured, but must instead be verified through a combined
program of predictive analysis and empirical means.
A predictive model for FLCCPSU is submitted in Section 9 of this report. Verification
of the adequacy of this model consists of review of the model by appropriate government
representatives, followed by recommendations for improvement and subsequent concur-
rence that it represents FLCCPSU sufficiently well for the purposes of drawing conclu-
sions regarding AFTA redundancy levels, fault recovery policies, and other architectural
and operational parameters. The FLCCPSU model draws upon analytical results from
other AFTA modeling efforts such as performance, reliability, availability, weight, power,
and volume, and hence its validity rests upon theirs. It also requires numerous quantitative
inputs, as listed in Section 9, which must be obtained for each mission scenario of interest.
The FLCCPSU model is intimately tied to the mission scenario, which includes the
maintenance scenario, the dispatch policy, the cost of failing to sortie, the cost of losing a
vehicle, and other mission-specific parameters. Thus, while the general modeling approach
may have utility for many Army missions, the model presented in Section 9 itself can not
be viewed as being a formulation which is valid independent of the application.
11.6. Verification of Weight, Power, and Volume Predictions
Section 9 of this report contains simple models for the total fielded weight, power, and
volume (WPV) of an AFTA implementation. These models are parameterized upon the
number of LRMs and LRUs in the implementation, their power consumption and volume,
and other parameters. WPV-related characteristics of each AFTA component are known
via empirical measurement or engineering predictions: NDI components' characteristics can
be empirically measured, while the NE characteristics can be predicted based on
engineering calculations and past experience. As the Brassboard NE design progresses, these
predictions will increase in accuracy. The AFTA WPV are estimated by the simple summation
of the WPV of the components comprising a configuration. Verification of this linear com-
position model is straightforward.
12. AFTA Architecture Synthesis
AFTA is characterized by numerous physical and operational parameters which may be
adjusted to meet the throughput, reliability, availability, and other requirements of a
particular mission. The effects of varying these parameters upon the AFTA availability,
reliability, weight, power, volume, and cost (for the iterative aircraft mission) are provided by the
analytical formulations in Section 9 of this report. The process of suitably adjusting these
parameters in conjunction with use of the AFTA analytical models is denoted "architecture
synthesis."
In this Section an architecture synthesis procedure is described which uses the mission
requirements described in Section 2 and the analytical formulations developed in Section 9;
this procedure is but one of many equally valid procedures suitable for use at a conceptual
study level of detail. This section subsequently demonstrates the use of this procedure to
determine tentative AFTA configurations for the helicopter TF/TA/NOE/FCS and Ground
Vehicle applications, within the limitations of the incomplete requirements data.
12.1. AFTA Architecture Synthesis
The AFTA architecture synthesis procedure consists of adjustment of the AFTA
configurable parameters. A list of these parameters is presented below.
12.1.1. Configurable Parameters
Number of VGs: Selection of the number of VGs is based on the throughput
requirements of the application. The AFTA may contain from one to forty VGs. The delivered
throughput of a VG has a second-order dependence on its redundancy level which is
sufficiently minor to be neglected in this study. The exact relationship between throughput
overhead and redundancy level will be measured as the AFTA design and development
proceeds.
Redundancy level of each VG: Multiple PEs in the AFTA can be formed into redundant
synchronous Virtual Groups (VGs) to achieve a degree of tolerance of random hardware
faults. Each VG may have a different redundancy level. A VG's redundancy level may be
either simplex, triplex, or quadruplex. A simplex VG has little if any fault tolerance, a
triplex VG is fail-operational/fail-safe, and a quadruplex VG is fail-operational/fail-opera-
tional/fail-safe.
Number of Processing Elements: The number of VGs and the redundancy level of each
VG determine the number of Processing Elements required. Each FCR in an AFTA may
possess up to eight active Processing Elements at a given time. A minimal AFTA
configuration would consist of four FCRs, three of which possess a single PE. A maximal AFTA
configuration would consist of five FCRs, each of which possesses eight PEs for a total of
forty PEs. Spare PEs may be added to any FCR, but only eight may use that FCR's
network element at any given time.
Number of Fault Containment Regions: An AFTA implementation may possess either
four or five FCRs, depending on the reliability and throughput requirements of the
application. A four-FCR AFTA is fail-operational/fail-safe while a five-FCR AFTA is
fail-operational/fail-operational/fail-safe.
Number of Network Elements: Each FCR of an AFTA must possess at least one
network element. Additional spare Network Elements can be added to the AFTA.
Number of Power Conditioners: Each FCR must possess at least one power condi-
tioner. Additional spare PCs may be added for exhaustion resilience.
Number of I/O Controllers: There is currently no constraint on the number or types of
IOCs resident in any given FCR. Neither is there any current constraint on the number or
type of input or output devices that they control.
Redundancy management policy: Wide latitude exists regarding the management of the
AFTA's reconfigurable processing resources. Selection of a redundancy management
policy is dependent upon the real-time constraints of the VG in which a fault is detected, the
mission duration and phase, the resources (throughput, memory, bandwidth, etc.) which
can be devoted to the recovery process, and other considerations. In addition, the redun-
dancy level of any VG can be varied at any time in response to faults, changes in mission
mode, and testing status.
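The numeric limits stated in the parameter list above can be collected into a small consistency check. The Python below is a minimal sketch: the class name AftaConfig, its fields, and the message strings are hypothetical, and only the limits quoted in this section (four or five FCRs, one to forty VGs, at most eight active PEs per FCR) are checked. Spares, NEs, PCs, and IOCs are not modeled.

```python
from dataclasses import dataclass
from typing import List

PES_PER_VG = {"simplex": 1, "triplex": 3, "quadruplex": 4}
MAX_ACTIVE_PES_PER_FCR = 8
MAX_VGS = 40


@dataclass
class AftaConfig:
    num_fcrs: int
    vg_levels: List[str]  # one entry per VG, e.g. ["triplex"] * 6

    def active_pes(self) -> int:
        # An unknown redundancy level raises KeyError.
        return sum(PES_PER_VG[level] for level in self.vg_levels)

    def problems(self) -> List[str]:
        issues = []
        if self.num_fcrs not in (4, 5):
            issues.append("an AFTA has four or five FCRs")
        if not 1 <= len(self.vg_levels) <= MAX_VGS:
            issues.append("an AFTA contains from one to forty VGs")
        if self.active_pes() > MAX_ACTIVE_PES_PER_FCR * self.num_fcrs:
            issues.append("too many active PEs for the available FCRs")
        return issues


six_triplex = AftaConfig(num_fcrs=4, vg_levels=["triplex"] * 6)
print(six_triplex.active_pes(), six_triplex.problems())
```

A six-triplex configuration in four FCRs uses 18 active PEs and passes these checks; nine quadruplex VGs in four FCRs would need 36 active PEs and would not.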
12.1.2. AFTA Architecture Synthesis Procedure
The simplified AFTA architecture synthesis procedure is as follows. It is assumed that
the delivered throughput, availability, and reliability requirements are known.
Step 1. From the throughput requirements and the available throughput per VG,
determine the number of VGs required using the delivered throughput model described in
Section 8. The analytical models provide a curve of delivered throughput versus the
number of VGs, which facilitates this selection. The number of VGs may also be determined
based on other criteria such as functional partitioning or prior experience.
Step 2. From the reliability requirements (either mission reliability or vehicle reliabil-
ity), the mission characteristics (environment, duration, etc.), and the selected mission re-
dundancy management strategy, determine the (mission or vehicle) redundancy level(s) of
the AFTA's VGs using the models described in Section 9. The analytical models provide a
curve of reliability versus the number of VGs and VG redundancy level. The VG
redundancy levels may also be chosen based on some other criteria, such as fail-op/fail-safe, etc.
Step 3. From the availability requirements, determine how many spare PEs are needed
in each FCR using the sortie availability model described in Section 9. The analytical
models provide a curve of availability versus the number of VGs, VG redundancy level,
number of spare FCRs, and number of spare PEs per FCR. The number of spares may
also be chosen based on some other criteria.
Step 4. Using the weight, power, and volume models described in Section 9, deter-
mine the WPV of the configuration. The analytical models provide a curve of WPV versus
the number of PEs and FCRs.
Step 5. For the iterative aircraft mission, the FLCCPSU model provides an estimate of
the life cycle costs associated with the configuration. The analytical models provide curves
of FLCCPSU versus the parameters mentioned above and other cost-related inputs.
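Steps 1 and 2 of the procedure can be sketched as a small search. The Python below is illustrative only: the function synthesize, its arguments, and the toy loss model with its made-up per-VG probabilities are all assumptions, standing in for the delivered throughput and reliability models that the report develops elsewhere.

```python
import math


def synthesize(required_mips, mips_per_vg, max_loss_prob, loss_prob_model,
               levels=("simplex", "triplex", "quadruplex")):
    """Pick the VG count (Step 1), then the lowest redundancy level whose
    predicted loss probability meets the requirement (Step 2)."""
    num_vgs = math.ceil(required_mips / mips_per_vg)
    for level in levels:
        if loss_prob_model(num_vgs, level) <= max_loss_prob:
            return num_vgs, level
    return num_vgs, None  # no candidate level meets the requirement


def toy_loss_model(num_vgs, level):
    """Placeholder model with invented numbers; not the report's model."""
    per_vg = {"simplex": 1e-4, "triplex": 1e-8, "quadruplex": 1e-12}[level]
    return num_vgs * per_vg


choice = synthesize(90.0, 16.0, 1e-7, toy_loss_model)
print(choice)
```

Steps 3 through 5 would follow the same pattern, with the availability, WPV, and FLCCPSU models evaluated for the configuration chosen here.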
12.2. AFTA Architecture Synthesis
12.2.1. AFTA Characteristics Common to Both Missions
Certain AFTA characteristics are to some extent independent of the mission. Assuming
that both missions utilize similar PEs (e.g. the R3000 PE for the flight vehicle mission and
the 68040 for the ground vehicle mission), these characteristics include the delivered
throughput.
12.2.1.1. Delivered Throughput
The following chart depicts the delivered throughput of AFTA as a function of the
number of PEs and the redundancy level into which the PEs are grouped. It is assumed
that PEs having a raw throughput of 20 MIPS (e.g., 68040 or R3000-class) are used, all
VGs in a configuration possess the same redundancy level, and the net overheads due to
the AFTA Operating System and FDIR are equal to 20%.
[Figure: delivered throughput (MIPS, 0 to 700) versus number of PEs (5 to 40); curves for Simplex, Triplex, and Quadruplex VGs]
Figure 12-1. AFTA Delivered Throughput vs. Number of Processing Elements
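The assumptions stated for Figure 12-1 (20 MIPS raw throughput per PE, 20% net overhead, one redundancy level for all VGs) suggest a simple first-order calculation, sketched below in Python. The function name and the treatment of leftover PEs (PEs that do not complete a VG contribute nothing) are assumptions of this sketch, and the second-order dependence on redundancy level is neglected, as in the text.

```python
PES_PER_VG = {"simplex": 1, "triplex": 3, "quadruplex": 4}


def delivered_throughput_mips(num_pes, level, raw_mips=20.0, overhead=0.20):
    """First-order delivered throughput: each complete VG delivers the raw
    throughput of one PE, reduced by the operating system and FDIR overhead."""
    num_vgs = num_pes // PES_PER_VG[level]
    return num_vgs * raw_mips * (1.0 - overhead)


for level in ("simplex", "triplex", "quadruplex"):
    print(level, delivered_throughput_mips(40, level))
```

Under these assumptions forty PEs deliver about 640 MIPS as simplexes, 208 MIPS as thirteen triplexes, and 160 MIPS as ten quadruplexes.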
12.2.2. AFTA Configuration for TF/TA/NOE/FCS Mission
The preliminary requirements analysis of Section 2 indicates that nine processing sites
are needed for flight-critical processing functions in the helicopter TF/TA/NOE/FCS
mission. For reasons outlined in Section 2, it is thought that the throughput obtained using
this processor count greatly exceeds that actually needed when operating system overheads
are extracted and processor throughputs are increased. Therefore, when the presentation
requires for conciseness and concreteness that some number of VGs be specified, it will be
assumed that six VGs are needed for flight-critical processing functions. It should be
borne in mind that this number may still represent a throughput overkill, and that the AFTA
analytical models produce results for any realizable AFTA processor count.
12.2.2.1. Analytical Results
12.2.2.1.1. Failure Rates†
The AFTA component failure rates were calculated assuming that the hiatus
environment corresponds to the Ground, Fixed (GF) environment, and the mission
environment corresponds to the Rotary Wing Aircraft (AR) environment, both described in
MIL-HDBK-217E.
Component     GF failure rate, per h    AR failure rate, per h
PE††          1.92E-5                   6.58E-5
NE*           4.08E-5                   1.85E-4
PC**          1.59E-5                   5.40E-5
FCR Bus†*     1.92E-6                   6.58E-6
Table 12-1. AFTA Component Failure Rates for Helicopter Mission Scenario
12.2.2.1.2. Reliability
For the rotary wing aircraft mission, the AFTA's reliability depends upon the VG
redundancy level and the duration of the mission. The following two charts show the
AFTA reliability for AFTA configurations composed of all simplex, all triplex, and all
quadruplex VGs for mission durations of one and four hours and for AFTA configurations
comprising four FCRs (curves labeled with the "-4" suffix) and five FCRs (curves labeled
with the "-5" suffix). The baseline NE as described in Section 4 is assumed to be used.
I/O failures are not included, and in-flight redundancy management corresponds to the
graceful degradation policy described in Sections 5 and 9.
† Permanent failure rates only.
†† Extrapolated from Lockheed Sanders R3000 VME PE sales literature.
* Baseline NE.
** Extrapolated from Varo Industries JIAWG 27VDC to 5VDC 50A module data.
†* Assumed to be 10% of the PE failure rate.
[Figure: probability of AFTA loss (log scale, 1E-11 to 1E-02) versus number of PEs (5 to 40); curves for Simplex-4, Triplex-4, Quadruplex-4, Simplex-5, Triplex-5, and Quadruplex-5]
Figure 12-2. Probability of AFTA Failure for 1-hour Rotary Wing Aircraft Mission
[Figure: probability of AFTA loss (log scale, 1E-08 to 1E-01) versus number of PEs (5 to 40); curves for Simplex-4, Triplex-4, Quadruplex-4, Simplex-5, Triplex-5, and Quadruplex-5]
Figure 12-3. Probability of AFTA Failure for 4-hour Rotary Wing Aircraft Mission
As expected, the all-simplex VG configurations have significantly higher failure
probability than the redundant configurations and can be eliminated from further
consideration for flight critical processing on this ground. The all-triplex VG
configurations have intermediate reliability levels; their variation with the number of PEs in
the configuration illustrates that PE failure, as opposed to NE failure, is the dominant
failure mode. Interestingly, the all-quadruplex VG configuration residing in four FCRs has
reliability commensurate with the triplex configurations; AFTA failure in this configuration
is dominated by NE failure, as indicated by its flat response to the number of PEs in the
configuration. This implies that if, for whatever reason, only four FCRs can be supported
in the vehicle, then no improvement in mission reliability is achieved by utilizing
quadruplex VGs instead of triplex VGs for the mission times of interest. The
all-quadruplex VG configuration residing in five FCRs surpasses all configurations in
reliability, while its variation of reliability with respect to the number of PEs in the
configuration indicates that AFTA failure in this configuration is dominated by PE loss.
The probability of AFTA failure scales quadratically with the mission time for the
all-triplex VG configurations and cubically for the all-quadruplex configuration residing in
five FCRs; this scaling is typical of attrition-dominated triplex and quadruplex systems. It
is interesting to note, however, that the all-quadruplex configuration residing in four FCRs
anomalously scales quadratically with mission time in a manner more reminiscent of a
triplex system. This can be explained by the fact that AFTA loss probability for this
configuration is dominated by the failure of NEs, as evidenced by its flat response to
variation in the number of PEs in the configuration.
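One way to read a scaling exponent off two data points is to take the ratio of logarithms. The Python sketch below applies this to the triplex VG rows of Table 11-6 (1-hour and 2-hour helicopter missions). Those are VG-level probabilities rather than the AFTA-level curves discussed here, so the example only illustrates the method; the function name is hypothetical.

```python
import math


def time_scaling_exponent(t1, p1, t2, p2):
    """Exponent n such that p is proportional to t**n between two points."""
    return math.log(p2 / p1) / math.log(t2 / t1)


# Triplex VG entries of Table 11-6.
attrition_exp = time_scaling_exponent(1.0, 6.49e-09, 2.0, 2.59e-08)
near_coincident_exp = time_scaling_exponent(1.0, 7.21e-14, 2.0, 1.44e-13)
print(round(attrition_exp, 2), round(near_coincident_exp, 2))
```

The attrition term comes out with an exponent close to 2 (quadratic in mission time) and the near-coincident term with an exponent close to 1 (linear in mission time).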
12.2.2.1.3. Throughput-Reliability Tradeoff
From the VG versus delivered throughput and reliability versus # PEs curves, a com-
posite chart (Figure 12-4) can be constructed which directly shows the tradeoff between
AFTA's delivered throughput and reliability.
[Figure: AFTA failure probability (log scale, 1E-10 to 1E-02) versus delivered throughput (MIPS, 0 to 800); curves for Simplex-4, Simplex-5, Triplex-4, Triplex-5, Quad-4, and Quad-5]
Figure 12-4. Delivered Throughput vs. AFTA Failure Probability for 1-hour Rotary Wing
Aircraft Mission
12.2.2.1.4. Effect of VHSIC/VLSI Network Element Technology on Reliability
Based on the Network Element failure rate calculations presented in Section 9, the ex-
tensive use of VHSIC/VLSI technology to fabricate the Network Element would increase
its MTBF in the rotary wing aircraft environment by a factor of approximately 1.64. The
analytical models can be used to translate this component MTBF improvement into mission
reliability improvement. The results of the analysis using the VHSIC/VLSI-based "High
End" NE are presented in Figure 12-5 for the 1-hour rotary wing aircraft mission. The
relative improvement in failure probability is shown in Figure 12-6, which plots the failure
probability of an AFTA using the Baseline NE divided by the failure probability of an
AFTA using the VHSIC/VLSI-based NE.
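Under the constant failure rate assumption, an MTBF improvement by a given factor corresponds to dividing the failure rate by that factor. The short Python sketch below applies the quoted factor of 1.64 to the Baseline NE failure rate for the rotary wing environment listed in Table 12-1 (1.85E-4 per hour). The resulting numbers are implied by those two figures and are not quoted from the report.

```python
BASELINE_NE_AR_FAILURE_RATE = 1.85e-4  # per hour, Table 12-1
MTBF_IMPROVEMENT_FACTOR = 1.64         # quoted for the VHSIC/VLSI-based NE


def improved_failure_rate(rate_per_hour, mtbf_factor):
    """With a constant failure rate, MTBF = 1 / rate, so the rate scales by 1 / factor."""
    return rate_per_hour / mtbf_factor


high_end_rate = improved_failure_rate(BASELINE_NE_AR_FAILURE_RATE, MTBF_IMPROVEMENT_FACTOR)
print(f"{high_end_rate:.3e} per hour, MTBF {1.0 / high_end_rate:.0f} h")
```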
[Figure: probability of AFTA loss (log scale, 1E-11 to 1E-02) versus number of PEs (5 to 40); curves for Simplex-4, Triplex-4, Quadruplex-4, Simplex-5, Triplex-5, and Quadruplex-5]
Figure 12-5. Probability of AFTA Failure for 1-hour Rotary Wing Aircraft Mission using
VHSIC/VLSI-based Network Element
[Figure: ratio of probability of AFTA loss with VHSIC/VLSI NE to probability of AFTA loss with Baseline NE (0.00 to 1.00) versus number of PEs (5 to 40); curves for Simplex-4, Triplex-4, Quadruplex-4, Simplex-5, Triplex-5, and Quadruplex-5]
Figure 12-6. Ratio of Probability of AFTA Failure for 1-hour Rotary Wing Aircraft
Mission: Baseline NE divided by VHSIC/VLSI-based NE
12.2.2.1.5. Unavailability
From the throughput requirements one may determine the number of VGs required to
perform the mission's functions. From the reliability models one may determine the mini-
mum redundancy level these VGs must possess at sortie to meet the mission's reliability re-
quirements. This complement of resources is denoted the Minimum Dispatch Complement
(MDC). If MDC is not available at sortie due to faults, then the vehicle can not sortie.
AFTA allows the addition of spare components to attempt to increase mission availability.
Figure 12-7 shows the effect on mission availability of adding spare PEs in each FCR
and spare FCRs to an AFTA, assuming that an MDC of six VGs and four FCRs are needed
to sortie. The analysis was performed for configurations comprising all simplex, all
triplex, and all quadruplex VGs. The hiatus interval is assumed to be 23 hours at Ground,
Fixed failure rates. The curves labeled with a "4" suffix refer to a configuration containing
no spare FCR, while the curves labeled by the "5" suffix refer to configurations containing
a spare FCR. For the given model input parameters, addition of more than a single spare
PE per FCR does not significantly enhance availability. Addition of a spare FCR increases
availability somewhat, while the combination of one spare PE per FCR and one spare FCR
significantly enhances availability.
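A basic ingredient of any such availability estimate is the probability that a single component fails during the hiatus. Assuming constant failure rates, as the analytical models do, this is 1 - exp(-rate * time). The Python sketch below evaluates it for the Ground, Fixed rates of Table 12-1 and the 23-hour hiatus; it does not reproduce the sortie availability model of Section 9, and the function name is hypothetical.

```python
import math

HIATUS_HOURS = 23.0
GF_FAILURE_RATES = {"PE": 1.92e-5, "NE": 4.08e-5, "PC": 1.59e-5}  # per hour, Table 12-1


def failure_probability(rate_per_hour, hours):
    """Probability of at least one failure in the interval, constant-rate model."""
    return -math.expm1(-rate_per_hour * hours)


hiatus_failure = {name: failure_probability(rate, HIATUS_HOURS)
                  for name, rate in GF_FAILURE_RATES.items()}
for name, p in hiatus_failure.items():
    print(f"{name}: {p:.3e}")
```

Each component has a failure probability of a few parts in ten thousand over one hiatus, which gives a sense of why a single spare PE per FCR already removes most of the unavailability.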
[Figure: AFTA unavailability (log scale) versus spare PEs per FCR (0 to 4); curves for Simplex,4; Simplex,5; Triplex,4; Triplex,5; Quad,4; and Quad,5, grouped as "No Spare FCRs" and "1 Spare FCR"]
Figure 12-7. AFTA Unavailability after 23-hour Hiatus for Six-VG/Four-FCR MDC
Figure 12-8 shows the effect on mission availability of adding spare PEs in each FCR,
assuming that six VGs and five FCRs are needed to sortie. The analysis was performed
for configurations comprising all simplex, all triplex, and all quadruplex VGs and the
hiatus interval is assumed to be 23 hours at Ground, Fixed failure rates. Again, addition of
more than a single spare PE per FCR does not significantly enhance availability. Note that
a spare FCR can not be added to this MDC configuration since the number of FCRs would
then exceed that supported by AFTA. Consequently the extremely low unavailability levels
associated with the combined spare PEs and FCR can not be achieved.
[Figure: AFTA unavailability (log scale) versus spare PEs per FCR (0 to 6); curves for Simplex, Triplex, and Quad]
Figure 12-8. AFTA Unavailability after 23-hour Hiatus for Six-VG/Five-FCR MDC
12.2.2.1.6. Weight
It is assumed for the calculation of the AFTA physical parameters for the helicopter
mission that JIAWG-class SEM-E packaging is used. Representative component weights
are enumerated in Table 12-2.
Component    Weight
PE           1 lb.
NE†          1.5 lb.
Rack         5 lb.
PC           3 lb.
Table 12-2. AFTA Component Weights for Helicopter AFTA
The weight of AFTA for the helicopter mission is depicted below as a function of the
number of PEs and FCRs.
[Figure: weight (lb., 0 to 90) versus number of PEs (5 to 40); curves for 4 FCR Wt. and 5 FCR Wt.]
Figure 12-9. AFTA Weight for Helicopter Mission
12.2.2.1.7. Power
Representative power consumptions of JIAWG-class components are enumerated in
Table 12-3.
† Includes optical splitters.
Parameter        Value
PE Power         20 W
NE Power         20 W
Bus Power        1 W
PC Efficiency    90%
Table 12-3. AFTA Component Powers for Helicopter AFTA
[Figure: power (W, 0 to 1000) versus number of PEs (5 to 40); curves for 4 FCR Power and 5 FCR Power]
Figure 12-10. AFTA Power for Helicopter Mission
12.2.2.1.8. Volume
The volumes of JIAWG-class SEM-E AFTA modules are enumerated in Table 12-4.
Parameter              Value
PE Slots               1
NE Slots               1
PC Slots               2
Rack Ends (# slots)    2
Slot Volume            1.77E-02 cu. ft. (6.88 x 7.42 x 0.60 in.)
Table 12-4. AFTA Component Volumes for Helicopter AFTA
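The linear composition of weight, power, and volume can be illustrated with the component values of Tables 12-2, 12-3, and 12-4. The Python below is a sketch under stated assumptions, not the Section 9 model: it assumes one NE, one PC, one rack, and one bus per FCR, and it assumes that total input power is the load power divided by the PC efficiency. The function name afta_wpv is hypothetical.

```python
# Component values from Tables 12-2, 12-3, and 12-4.
PE_LB, NE_LB, RACK_LB, PC_LB = 1.0, 1.5, 5.0, 3.0
PE_W, NE_W, BUS_W, PC_EFFICIENCY = 20.0, 20.0, 1.0, 0.90
PE_SLOTS, NE_SLOTS, PC_SLOTS, RACK_END_SLOTS = 1, 1, 2, 2
SLOT_CUFT = 1.77e-2


def afta_wpv(num_pes, num_fcrs):
    """Return (weight in lb, power in W, volume in cu. ft.) by simple summation.
    Assumes one NE, PC, rack, and bus per FCR (an assumption of this sketch)."""
    weight = num_pes * PE_LB + num_fcrs * (NE_LB + RACK_LB + PC_LB)
    load_power = num_pes * PE_W + num_fcrs * (NE_W + BUS_W)
    power = load_power / PC_EFFICIENCY
    slots = num_pes * PE_SLOTS + num_fcrs * (NE_SLOTS + PC_SLOTS + RACK_END_SLOTS)
    return weight, power, slots * SLOT_CUFT


weight_lb, power_w, volume_cuft = afta_wpv(num_pes=18, num_fcrs=4)
print(weight_lb, round(power_w, 1), round(volume_cuft, 4))
```

Because the per-FCR assumptions may differ from those behind Figures 12-9 through 12-11, the results should be read as order-of-magnitude estimates only.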
[Figure: volume (0.00 to 1.40) versus number of PEs (5 to 40); curves for 4 FCR Vol. and 5 FCR Vol.]
Figure 12-11. AFTA Volume for Helicopter Mission
12.2.2.1.9. Cost
The Fleet Life Cycle Cost per Service Unit model described in Section 9 generates vo-
luminous cost data for many possible configurations of AFTA. For conciseness, this re-
port only presents the cost data for the one-hour Rotary Wing Aircraft mission for an MDC
of six VGs. The input parameter values for the cost model are listed below. These
parameters are for illustrative purposes only, and should be updated as additional
information becomes available during AFTA development and mission requirements
acquisition.
Parameter                                         Value
Fleet service life                                100,000 hours
Hiatus time                                       23 hours
Sortie duration                                   1 hour
Number of vehicles required per sortie            100
Baseline vehicle cost                             $6,000,000
AFTA LRM cost                                     $10,000
Number of LRMs in AFTA                            Depends on Configuration
Cost of AFTA rack                                 $10,000
Number of racks in AFTA                           Depends on Configuration
Manpower cost for field repair, per hour          $200
Mean time for field repair, hours                 4
Manpower cost for depot diagnosis, per hour       $100
Mean time for depot diagnosis, hours              4
Manpower cost for depot repair, per hour          $200
Mean time for depot repair, hours                 4
Depot condemnation ratio                          1
LRM/LRU refurbishment parts cost                  $1,000
Table 12-5. FLCCPSU Input Parameters for Rotary Wing Aircraft Mission
For clarity of presentation, the cost analysis results are presented separately for three
different AFTA configurations: a four-FCR MDC configuration with no spare FCR, a
four-FCR MDC configuration with one spare FCR, and a five-FCR MDC configuration
with no spare FCR. The results are presented in tabular format. The leftmost column
indicates the redundancy level of the AFTA's VGs; this may be simplex, triplex, or
quadruplex. It is assumed that all VGs in a configuration are of identical redundancy level,
an assumption which may be relaxed without difficulty. In addition, for a given VG
redundancy level, the leftmost column lists the number of spare PEs per FCR which may
be added to increase availability. For conciseness, this parameter is varied from zero to one
spare PE per FCR, based on the availability analysis' illustration of the rapidly diminishing
return of additional spare PEs for a 23-hour hiatus. The row corresponding to VG
redundancy level and number of spare PEs per FCR contains the corresponding cost data.
The column denoted Fleet Life Cycle Cost is the sum of the cost of unreliability, the
cost of maintenance labor, the cost of AFTA spares, the cost of vehicles, and the initial
procurement cost of the fleet's AFTAs. In addition, the last column of the tables below
indicates the cost of unavailability, i.e., the cost of additional vehicle procurements required
to meet the sortie requirement given non-unity AFTA availability. This cost is included in
the cost of vehicles and cost of AFTAs columns and is included as a separate italicized
column solely for edification.
The results are presented in the following table for an AFTA configuration consisting of
a four-FCR MDC and no spare FCR.
VG Redundancy        Fleet Life    Cost of       Cost of        Cost of     Cost of     Cost of     Cost of
Level                Cycle Cost    Unreli-       Maintenance    Spares      Vehicles    AFTAs       Unavail-
                                   ability       Labor                                              ability
Simplex
  0 spare PEs/FCR    4.03E+09      3.36E+09      5.11E+06       4.26E+07    6.04E+08    1.61E+07    3.88E+06
  1 spare PE/FCR     4.07E+09      3.39E+09      6.13E+06       5.11E+07    6.03E+08    2.01E+07    3.35E+06
Triplex
  0 spare PEs/FCR    7.11E+08      1.81E+06      8.16E+06       6.80E+07    6.05E+08    2.82E+07    4.79E+06
  1 spare PE/FCR     7.23E+08      1.82E+06      9.17E+06       7.64E+07    6.03E+08    3.22E+07    3.42E+06
Quadruplex
  0 spare PEs/FCR    7.24E+08      9.52E+05      9.17E+06       7.64E+07    6.05E+08    3.23E+07    5.10E+06
  1 spare PE/FCR     7.36E+08      9.58E+05      1.02E+07       8.49E+07    6.03E+08    3.62E+07    3.44E+06
Table 12-6. FLCCPSU Output Parameters for Rotary Wing Aircraft Mission, Four-FCR
MDC with no spare FCR
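Since the Fleet Life Cycle Cost is defined as the sum of five constituent costs, a row of the table can be cross-checked by adding them up. The Python sketch below does this for the triplex configuration with no spare PEs, using the figures read from Table 12-6. The dictionary keys are labels chosen for this sketch, and the cost of unavailability is deliberately excluded because it is already contained in the vehicle and AFTA costs.

```python
# Triplex, 0 spare PEs/FCR, four-FCR MDC with no spare FCR (Table 12-6), in dollars.
triplex_no_spares = {
    "unreliability": 1.81e6,
    "maintenance_labor": 8.16e6,
    "spares": 6.80e7,
    "vehicles": 6.05e8,
    "aftas": 2.82e7,
}

fleet_life_cycle_cost = sum(triplex_no_spares.values())
print(f"${fleet_life_cycle_cost / 1e6:.0f}M")
```

The sum agrees with the $711M total quoted for this configuration to the three significant figures shown in the table.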
The analysis indicates that the minimum cost configuration for a six VG, four-FCR
MDC AFTA with no spare FCR consists of six triplex VGs with no spare PEs. The total
cost for this configuration is $711M. Table 12-7 shows the relative contributions to this
cost.
Cost of          Cost of              Cost of    Cost of     Cost of    Cost of
Unreliability    Maintenance Labor    Spares     Vehicles    AFTAs      Unavailability
0.26%            1.15%                9.61%      85.48%      3.99%      0.68%
Table 12-7. FLCCPSU Constituent Costs for Four-FCR MDC with no spare FCR
The analytical results are presented in the following table for an AFTA configuration
consisting of a four-FCR MDC and one spare FCR.
VG Redundancy  Spare PEs  Fleet Life  Cost of        Cost of      Cost of   Cost of   Cost of   Cost of
Level          per FCR    Cycle Cost  Unreliability  Maintenance  Spares    Vehicles  AFTAs     Unavailability
                                                     Labor
Simplex        0          4.03E+09    3.36E+09       5.11E+06     4.26E+07  6.01E+08  1.60E+07  5__6E+05
Simplex        1          4.06E+09    3.39E+09       6.13E+06     5.11E+07  6.00E+08  2.00E+07  1.16E+04
Triplex        0          7.07E+08    1.81E+06       8.16E+06     6.80E+07  6.01E+08  2.81E+07  1.40E+06
Triplex        1          7.19E+08    1.82E+06       9.17E+06     7.64E+07  6.00E+08  3.20E+07  1.33E+04
Quadruplex     0          7.20E+08    9.52E+05       9.17E+06     7.64E+07  6.02E+08  3.21E+07  1.69E+06
Quadruplex     1          7.32E+08    9.58E+05       1.02E+07     8.49E+07  6.00E+08  3.60E+07  1.41E+04

Table 12-8. FLCCPSU Output Parameters for Rotary Wing Aircraft Mission, Four-FCR
MDC with one spare FCR
The minimum cost configuration for a six VG, four-FCR MDC AFTA with one spare
FCR again consists of six triplex VGs with no spare PEs. The total cost for this configu-
ration is $707M. Table 12-9 shows the relative contributions to this cost. Note the
reduced cost due to unavailability due to the added FCR.
Cost of        Cost of      Cost of  Cost of   Cost of  Cost of
Unreliability  Maintenance  Spares   Vehicles  AFTAs    Unavailability
               Labor
0.26%          1.15%        9.61%    85.01%    3.97%    0.20%

Table 12-9. FLCCPSU Constituent Costs for Four-FCR MDC with one spare FCR
Finally, the results are presented below for an AFTA configuration consisting of a five-
FCR MDC.
VG Redundancy  Spare PEs  Fleet Life  Cost of        Cost of      Cost of   Cost of   Cost of   Cost of
Level          per FCR    Cycle Cost  Unreliability  Maintenance  Spares    Vehicles  AFTAs     Unavailability
                                                     Labor
Simplex        1          4.70E+09    4.02E+09       6.39E+06     5.33E+07  6.04E+08  2.01E+07  4.19E+06
Triplex        0          7.20E+08    9.08E+05       8.93E+06     7.44E+07  6.05E+08  3.03E+07  5.38E+06
Triplex        1          7.35E+08    9.15E+05       1.02E+07     8.50E+07  6.04E+08  3.52E+07  4.30E+06
Quadruplex     0          7.36E+08    8.81E+02       1.02E+07     8.50E+07  6.05E+08  3.53E+07  5.71E+06
Quadruplex     1          7.51E+08    8.88E+02       1.15E+07     9.56E+07  6.04E+08  4.03E+07  4.33E+06

Table 12-10. FLCCPSU Output Parameters for Rotary Wing Aircraft Mission, Five-FCR
MDC
Once again, the minimum-cost configuration for a six-VG, five-FCR MDC AFTA consists
of six triplex VGs with no spare PEs. The total cost for this configuration is $720M.
Table 12-11 shows the relative contributions to this cost.
Cost of        Cost of      Cost of  Cost of   Cost of  Cost of
Unreliability  Maintenance  Spares   Vehicles  AFTAs    Unavailability
               Labor
0.13%          1.26%        10.52%   85.55%    4.28%    0.76%

Table 12-11. Minimum-FLCCPSU Constituent Costs for Five-FCR MDC
Of all the configurations modeled, the lowest-cost configuration consists of a four-FCR
MDC AFTA with one spare FCR and no spare PEs at all. Note that this configuration
differs from one selected based on maximum reliability (five-FCR MDC containing all
quadruplex VGs) or maximum availability (one spare FCR and one spare PE per FCR). It
should also be emphasized that these results are presented primarily to illustrate the use of
cost modeling to assist in the architecture synthesis process. The actual results obtained
depend strongly on the input parameters and will lead to different conclusions as these
parameters change.
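The selection of the lowest-cost configuration can be illustrated with a short sketch. It takes the Fleet Life Cycle Cost values from Tables 12-6, 12-8, and 12-10 and picks the minimum. The dictionary layout and function name are assumptions made for this example and are not part of the FLCCPSU tool.

```python
# Fleet life cycle cost in dollars, keyed by
# (MDC configuration, VG redundancy level, spare PEs per FCR).
FLEET_LCC = {
    ("four-FCR, no spare FCR", "simplex", 0): 4.03e9,
    ("four-FCR, no spare FCR", "simplex", 1): 4.07e9,
    ("four-FCR, no spare FCR", "triplex", 0): 7.11e8,
    ("four-FCR, no spare FCR", "triplex", 1): 7.23e8,
    ("four-FCR, no spare FCR", "quadruplex", 0): 7.24e8,
    ("four-FCR, no spare FCR", "quadruplex", 1): 7.36e8,
    ("four-FCR, one spare FCR", "simplex", 0): 4.03e9,
    ("four-FCR, one spare FCR", "simplex", 1): 4.06e9,
    ("four-FCR, one spare FCR", "triplex", 0): 7.07e8,
    ("four-FCR, one spare FCR", "triplex", 1): 7.19e8,
    ("four-FCR, one spare FCR", "quadruplex", 0): 7.20e8,
    ("four-FCR, one spare FCR", "quadruplex", 1): 7.32e8,
    ("five-FCR", "simplex", 1): 4.70e9,
    ("five-FCR", "triplex", 0): 7.20e8,
    ("five-FCR", "triplex", 1): 7.35e8,
    ("five-FCR", "quadruplex", 0): 7.36e8,
    ("five-FCR", "quadruplex", 1): 7.51e8,
}


def minimum_cost_configuration(costs):
    """Return the (configuration, cost) pair with the lowest cost."""
    return min(costs.items(), key=lambda item: item[1])


best, cost = minimum_cost_configuration(FLEET_LCC)
print(best, f"${cost / 1e6:.0f}M")
```

Because the result depends entirely on the tabulated inputs, rerunning the selection with different input parameters may yield a different configuration.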
12.2.3. AFTA Analysis for Ground Vehicle Mission
12.2.3.1. Throughput
The AFTA throughput for the Ground Vehicle mission is presented in Section
11.2.1.1.
12.2.3.2. Reliability
In the context of the Ground Vehicle mission the AFTA's unreliability contributes to the
failure of the vehicle to perform its mission, in accordance with the Ground Vehicle
mission state diagram presented in Section 2. This is not assumed to result in loss of the
vehicle. Moreover, the relaxed temporal constraints associated with the ground mission
and the long mission times dictate the use of an availability maximization redundancy
management policy, described as the "processor replacement" option in Section 5 and
modeled as PPR in Section 9. We therefore parameterize mission success as the probability
that the Minimum Mission Complement (MMC) of VGs and FCRs are available. It is
assumed that, regardless of the redundancy level of the VGs at the beginning of the
mission, the AFTA can perform its intended function as long as there are at least MMC
functioning simplex VGs which may have started out as simplexes, or which may be
degraded triplexes or quadruplexes. Therefore the following charts show PPR as the
probability that MMC simplex VGs are functional, as a function of MMC and the number
of spare PEs per FCR and the number of spare FCRs.
The failure rates are calculated assuming that the mission environment corresponds to
the Ground, Mobile (GM) environment as described in MIL-HDBK-217E.
Component  GM failure rate, per h
PE         3.23E-5
NE         9.26E-5
PC         2.67E-5
FCR Bus    3.23E-6

Table 12-12. AFTA Component Failure Rates for Ground Vehicle Mission Scenario
The following charts show the probability that MMC VGs cannot be formed from the
simplex processing resources in AFTA as a function of the MMC, the number of spare PEs
per FCR, and the number of spare FCRs. The curves are presented for four mission times:
8, 24, 168, and 720 hours.
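The curves in the charts come from the PPR model of Section 9, which is not reproduced here. As a rough illustration of how unreliability grows with the MMC and with mission time, the sketch below makes strong simplifying assumptions: the AFTA is treated as a single pool of identical PEs with exponentially distributed lifetimes at the Table 12-12 PE failure rate, and NE, PC, and FCR bus failures are ignored. The function name and the sample pool sizes are hypothetical, and the numbers it prints are not expected to match the figures.

```python
import math


def pool_unreliability(n_pes, mmc, rate_per_hour, hours):
    """Probability that fewer than `mmc` of `n_pes` PEs survive the mission.

    Simplified k-out-of-n model: independent PEs, exponential lifetimes,
    no repair, and no other component failures.
    """
    r = math.exp(-rate_per_hour * hours)  # survival probability of one PE
    return sum(
        math.comb(n_pes, k) * r**k * (1.0 - r) ** (n_pes - k)
        for k in range(mmc)
    )


PE_RATE = 3.23e-5  # failures per hour, Table 12-12

for hours in (8, 24, 168, 720):
    no_spares = pool_unreliability(20, 20, PE_RATE, hours)
    four_spares = pool_unreliability(24, 20, PE_RATE, hours)
    print(f"{hours:4d} h  0 spare PEs: {no_spares:.2e}  4 spare PEs: {four_spares:.2e}")
```

Even in this simplified form the model shows the two trends the charts display: unreliability rises with mission time and falls when spare PEs are added.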
[Figure: AFTA unreliability (logarithmic scale, 1.00E-06 to 1.00E+00) versus VG MMC (5 to 30), with curves for 0 spares, spare PE, spare NE, and spare PE + NE.]
Figure 12-12. AFTA Unreliability for Eight-Hour Ground Vehicle Mission
[Figure: AFTA unreliability (logarithmic scale, 1.00E-05 to 1.00E+00) versus VG MMC (5 to 30), with curves for 0 spares, spare PE, spare NE, and spare PE + NE.]
Figure 12-13. AFTA Unreliability for 24-Hour Ground Vehicle Mission
[Figure: AFTA unreliability (logarithmic scale, 1.00E-03 to 1.00E+00) versus VG MMC (5 to 30), with curves for 0 spares, spare PE, spare NE, and spare PE + NE.]
Figure 12-14. AFTA Unreliability for 168-Hour Ground Vehicle Mission
[Figure: AFTA unreliability (logarithmic scale, 1.00E-02 to 1.00E+00) versus VG MMC (5 to 30), with curves for 0 spares, spare PE, spare NE, and spare PE + NE.]
Figure 12-15. AFTA Unreliability for 720-Hour Ground Vehicle Mission
12.2.3.3. Weight
It is assumed for the calculation of the AFTA physical parameters for the Ground
Vehicle mission that SAVA-compatible packaging is used. The component weights are
enumerated in Table 12-13 (for lack of any better information, these weights are the same
as for the SEM-E version of AFTA).
Component  Weight
PE         1 lb.
NE†        1.5 lb.
Rack       5 lb.
PC         3 lb.

Table 12-13. AFTA Component Weights for Helicopter AFTA
The weight of AFTA for the Ground Vehicle mission is depicted below as a function of
the number of PEs and FCRs.
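One plausible way to combine the component weights into a total is sketched below. It assumes that each FCR contributes one rack, one NE, and one PC, and that PEs are counted across the whole AFTA; that allocation is an assumption of this example rather than something stated in the text. The function name is hypothetical.

```python
# Component weights in pounds, from Table 12-13.
PE_LB = 1.0
NE_LB = 1.5
RACK_LB = 5.0
PC_LB = 3.0


def afta_weight_lb(n_pes, n_fcrs):
    """Total weight, assuming one rack, one NE, and one PC per FCR."""
    per_fcr = RACK_LB + NE_LB + PC_LB
    return n_fcrs * per_fcr + n_pes * PE_LB


for n_pes in (5, 20, 40):
    print(n_pes, afta_weight_lb(n_pes, 4), afta_weight_lb(n_pes, 5))
```

Under these assumptions the weight grows linearly with the number of PEs, and the five-FCR curve sits a constant offset above the four-FCR curve.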
† Includes optical splitters.
[Figure: AFTA weight in lb. (0 to 90) versus number of PEs (5 to 40), with curves for the 4-FCR and 5-FCR configurations.]
Figure 12-16. AFTA Weight for Ground Vehicle Mission
12.2.3.4. Power
Representative component power consumptions for MIL-STD-344 modules are
enumerated in Table 12-14.
PE Power       15 W
NE Power       15 W
Bus Power      1 W
PC Efficiency  90%

Table 12-14. AFTA Component Powers for Ground Vehicle AFTA
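A hedged sketch of how the Table 12-14 values might be combined follows. It assumes one NE and one bus per FCR, and it assumes that the power conditioner efficiency applies as a divisor on the summed load, so the result is input power drawn from the vehicle. Neither assumption is confirmed by the text, the function name is hypothetical, and the result is not claimed to reproduce the power figure.

```python
# Component powers in watts and PC efficiency, from Table 12-14.
PE_W = 15.0
NE_W = 15.0
BUS_W = 1.0
PC_EFFICIENCY = 0.90


def afta_input_power_w(n_pes, n_fcrs):
    """Input power, assuming one NE and one bus per FCR.

    The load is divided by the power conditioner efficiency to obtain the
    power that must be supplied to the conditioners.
    """
    load_w = n_pes * PE_W + n_fcrs * (NE_W + BUS_W)
    return load_w / PC_EFFICIENCY


for n_pes in (5, 20, 40):
    print(n_pes, round(afta_input_power_w(n_pes, 4), 1), round(afta_input_power_w(n_pes, 5), 1))
```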
[Figure: AFTA power in W (0 to 700) versus number of PEs (5 to 40), with curves for the 4-FCR and 5-FCR configurations.]
Figure 12-17. AFTA Power for Ground Vehicle Mission
12.2.3.5. Volume
The volumes of MIL-STD-344 AFTA modules are enumerated in Table 12-15.
PE Slots             1
NE Slots             1
PC Slots             2
Rack Ends (# slots)  2
Slot Volume          3.55E-2 cu. ft. (10.5 x 7.3 x 0.80 in^3)

Table 12-15. AFTA Component Volumes for Ground Vehicle AFTA
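The slot volume in Table 12-15 can be checked from the stated slot dimensions, and a total volume can then be estimated by counting slots. The sketch below assumes one NE, one PC (two slots), and one set of rack ends (two slots) per FCR; that per-FCR allocation and the function name are assumptions of this example.

```python
CUBIC_INCHES_PER_CUBIC_FOOT = 12 ** 3  # 1728

# Slot dimensions in inches, from Table 12-15.
slot_volume_cuft = 10.5 * 7.3 * 0.80 / CUBIC_INCHES_PER_CUBIC_FOOT

# Slot counts from Table 12-15.
PE_SLOTS = 1
NE_SLOTS = 1
PC_SLOTS = 2
RACK_END_SLOTS = 2


def afta_volume_cuft(n_pes, n_fcrs):
    """Total volume, assuming one NE, one PC, and one set of rack ends per FCR."""
    slots = n_pes * PE_SLOTS + n_fcrs * (NE_SLOTS + PC_SLOTS + RACK_END_SLOTS)
    return slots * slot_volume_cuft


print(f"slot volume: {slot_volume_cuft:.4f} cu. ft.")
for n_pes in (5, 20, 40):
    print(n_pes, round(afta_volume_cuft(n_pes, 4), 2), round(afta_volume_cuft(n_pes, 5), 2))
```

The computed slot volume agrees with the tabulated 3.55E-2 cu. ft. to three significant figures.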
[Figure: AFTA volume in cu. ft. (0.00 to 2.50) versus number of PEs (5 to 40), with curves for the 4-FCR and 5-FCR configurations.]
Figure 12-18. AFTA Volume for Ground Vehicle Mission
Appendix A. References
[Abl88] Abler, T., A Network Element Based Fault Tolerant Processor, MS
Thesis, Massachusetts Institute of Technology, Cambridge, MA,
May 1988.
[AMD89a] The SUPERNET Family for FDDI, Advanced Micro Devices Data
Book, Publication #_734 Rev. C, February 1989.
[AMD89b] Am7968/Am7969-175 TAXIchip(TM) Integrated Circuits, Advanced
Micro Devices Data Sheet, Publication #12834 Rev. A, November
1989.
[ANSI139] "Fiber Distributed Data Interface (FDDI) - Token Ring Media
Access Control (MAC)," American National Standard, ANSI X3.139-
1987, November 5, 1986.
[ANSI148] "Fiber Distributed Data Interface (FDDI) - Token Ring Physical
Layer Protocol (PHY)," American National Standard, ANSI
X3.148-1988, June 30, 1988.
[ANSI166] "Fiber Distributed Data Interface (FDDI) - Token Ring Physical
Layer Medium Dependent (PMD)," American National Standard,
ANSI X3.166-1990, September 28, 1989.
[APS90] Acarlar, M. S., Plourde, J. K., Snodgrass, M. L., "A High Speed
Surface-Mount Optical Data Link for Military Applications,"
IEEE/AIAA/NASA 9th Digital Avionics Systems Conference
Proceedings, October 15-18, 1990, pp. 297-302.
[Bab90] Babikyan, C., "The Fault Tolerant Parallel Processor Operating
System Concepts and Performance Measurement Overview,"
Proceedings of the 9th Digital Avionics Systems Conference, October
1990, pp. 366-371.
[Ber87] Bertsekas, D., Gallager, R., Data Networks, Prentice-Hall, 1987.
[Ber90] Berger, K. M., Abramson, M. R., Deutsch, O. L., "Far-Field
Mission Planning for Helicopters," CSDL Technical Report CSDL-R-
2234, March 1990.
[Bev90] Bevier, W. R., and Young, W. D., "The Proof of Correctness of a
Fault-Tolerant Circuit Design," 2nd International Working
Conference on Dependable Computing for Critical Applications, Tucson,
AZ, February 1991.
[Bic90] Bickford, M., and Srivas, M., "Verifying an Interactive
Consistency Circuit: A Case Study in the Reuse of a Verification
Technology," NASA Formal Methods Workshop 1990, NASA Conference
Publication 10052, November 1990.
[Biv88] Bivens, G. A., "Reliability Assessment of Surface Mount
Technology (SMT)," RADC report RADC-TR-88-72, March 1988.
[Bla91] Black, Uyless, OSI: A Model For Computer Communications
Standards, Prentice-Hall, 1991.
[Boo88] Booth, F., "Advanced Apache Architecture," 8th Digital Avionics
Systems Conference, October 1988.
[Bur89] Burkhardt, L., Advanced Information Processing System: Local
System Services, NASA Contractor Report 181767, April 1989.
[But88] Butler, R. W., "A Survey of Provably Correct Fault Tolerant Clock
Synchronization Techniques," NASA Technical Report TM-
100553, NASA Langley Research Center, February 1988.
[CAMP] CAMP-1 Final Technical Report AFATL-TR-85-93, 3 Volumes,
Available as DTIC AD-B102 654, AD-B102 655, and AD-B102
656 from Defense Technical Information Center, Alexandria, VA
22304-6145.
[Car84] Carlow, G. D., "Architecture of the Space Shuttle Primary Avionics
Software System," Communications of the ACM, 27(9):926-36,
September 1984.
[Cha84] Chambers, F. B., ed., Distributed Computing, Academic Press,
1984.
[Che87] Cheng, S-C., Stankovic, J. A., Ramamritham, K., "Scheduling
Algorithms for Hard Real-Time Systems - A Brief Survey," in Hard
Real-Time Systems, IEEE Computer Society Press, 1988.
[Coh87] Cohn, Marc D., "The Conformance of the ANSI FDDI Standard to
the SAE-9B HART High Speed Data Bus Requirements for Real-
Time Local Area Networks," Society of Automotive Engineers
Aerospace Systems Conference Proceedings, November 1987.
[Coh88] Cohn, Marc D., "The Fiber Optic Data Distribution Network: A
Network for Next-Generation Avionics Systems," AIAA/IEEE 8th
Digital Avionics Systems Conference Proceedings, October 17-20,
1988, pp. 731-737.
[Coh90a] Cohen, G. C., et al., Design of an Integrated Airframe Propulsion
Control System Architecture, NASA Contractor Report 182004,
March 1990.
[Coh90b] Cohen, G. C., et al., Final Report: Design of an Integrated
Airframe Propulsion Control System Architecture, NASA Contractor
Report 182007, March 1990.
[Cohn88] Cohn, A., "Correctness Properties of the Viper Block Model: The
Second Level," Tech. Report 134, Univ. of Cambridge,
Cambridge, England, May 1988.
[Com91] Comer, D. E., Internetworking with TCP/IP, Prentice-Hall, 1991.
[CSDL9214] Completion of the Advanced Information Processing System,
response to NASA Langley Research Center, CBD Announcement
REF SS017, issue PSA-9214, November 12, 1986.
[Cullyer 88] Cullyer, W. J., "Implementing Safety-Critical Systems: The VIPER
Microprocessor," VLSI Specification, Verification and Synthesis,
Kluwer Academic Publishers, 1988.
[CVC2] "System Specification for Combat Vehicle Command and Control
(DRAFT)," CVC2 Systems Implementation Working Group, 31
October 1990.
[DACS] Defense & Analysis Center for Software, Kaman Sciences
Corporation, P.O. Box 120, Utica, NY 13503.
[Dal73] Daly, W. M., Hopkins, A. L., and McKenna, J. F., "A Fault-
Tolerant Digital Clocking System," 3rd International Symposium on Fault
Tolerant Computing, Palo Alto, CA, June 1973.
[Deu88] Deutsch, O. L., Desai, M., "Development and Demonstration of an
On-Board Mission Planner for Helicopters," CSDL Technical
Report CSDL-R-2056, April 1988.
[DID80811] "VHSIC Hardware Description Language (VHDL) Documentation,"
Data Item Description, DD Form 1664, DI-EGDS-80811, May 11,
1989.
[DiV90] Di Vito, B. L., Butler, R. W., Caldwell, J. L., Formal Design and
Verification of a Reliable Computing Platform for Real-Time
Control, NASA Technical Memorandum 102716, October 1990.
[DiV91] Di Vito, B., Butler, R., and Caldwell, J., "High Level Design Proof
of a Reliable Computing Platform," 2nd International Working
Conference on Dependable Computing for Critical Applications,
Tucson, AZ, February 1991.
[Dol82] Dolev, D., "The Byzantine Generals Strike Again," Journal of
Algorithms, Vol. 3, 1982, pp. 14-30.
[Dol84] Dolev, D., Dwork, C., Stockmeyer, L., "On the Minimal
Synchronism Needed for Distributed Consensus," IBM Research Report RJ
4292 (46990), 5/8/84.
[Fel90] Felter, S. C., Douglas, P. H., Smith, C. A., "Avionics System
Integration for the MH-53J Helicopter," 9th Digital Avionics Systems
Conference, October 1990.
[Fis82] Fischer, M. J., Lynch, N. A., "A Lower Bound for the Time to
Assure Interactive Consistency," Information Processing Letters, Vol.
14, No. 4, 13 June 1982, pp. 183-186.
[Foh89] Fohler, G., Koza, C., "Heuristic Scheduling for Distributed Real-
Time Systems," Research Report No. 6/89, Institut fur Technische
Informatik, Technische Universitat Wien, Vienna, Austria, April
1989.
[Gal90] Galetti, R. R., Real-Time Digital Signatures and Authentication
Protocols, Master of Science thesis, Massachusetts Institute of
Technology, May 1990.
[Goe91] Goel, A. L., and Sahoo, S. N., "Formal Specifications and
Reliability: An Experimental Study," 1991 International Symposium on
Software Reliability Engineering, Austin, Texas, May 1991.
[Gua90] Guaspari, D., Marceau, C., and Polak, W., "Formal Verification of
Ada Programs," IEEE Transactions on Software Engineering,
Special Issue on Formal Methods in Software Engineering, Vol. 16,
No. 9, September 1990.
[Han89] Hanaway, J. F., Morrehead, R. W., Space Shuttle Avionics
System, NASA SP-504, 1989.
[Har87] Harper, R., Critical Issues in Ultra-Reliable Parallel Processing,
PhD Thesis, Massachusetts Institute of Technology, Cambridge,
MA, June 1987.
[Har88a] Harper, R., Lala, J., Deyst, J., "Fault Tolerant Parallel Processor
Overview," 18th International Symposium on Fault Tolerant
Computing, June 1988, pp. 252-257.
[Har88b] Harper, R., "Reliability Analysis of Parallel Processing Systems,"
Proceedings of the 8th Digital Avionics Systems Conference,
October 1988, pp. 213-219.
[Har91a] Harper, R., Lala, J., "Fault Tolerant Parallel Processor," J. Guidance,
Control, and Dynamics, V. 14, N. 3, May-June 1991, pp. 554-563.
[Har91b] Harper, R., Alger, L., Lala, J., "Advanced Information Processing
System: Design and Validation Knowledgebase," NASA Contractor
Report 187544, September 1991.
[Hir90] Hird, G. R., "Formal Methods in Software Engineering," 9th
AIAA/IEEE Digital Avionics Systems Conference, Virginia Beach,
VA, October 1990, pp. 230-234.
[Hun86] Hunt, W. A., "FM8501: A Verified Microprocessor," Proceedings
of IFIP Working Group 10.2 Workshop, North Holland,
Amsterdam, 1986.
[Hwa84] Hwang, K., Briggs, F., Computer Architecture and Parallel
Processing, McGraw-Hill, 1984.
[IEEE1076] "VHDL Language Reference Manual," IEEE Standard, IEEE Std
1076-1987, March 31, 1988.
[IEEE8021] "Local and Metropolitan Area Networks: Overview and
Architecture," IEEE Standard, IEEE Std 802-1990, May 31, 1990.
[IEEE8022] "Logical Link Control," IEEE Standard, IEEE Std 802.2-1989,
August 17, 1989.
[IEEE8023] "Carrier Sense Multiple Access with Collision Detection
(CSMA/CD) Access Method and Physical Layer Specifications,"
IEEE Standard, IEEE 802.3-1988, June 9, 1988.
[IEEE8024] "Token-Passing Bus Access Method and Physical Layer
Specifications," IEEE Standard, IEEE 802.4-1990.
[J88N2] "Linear Token Passing Multiplex Data Bus Protocol," Joint
Integrated Avionics Working Group Standard, Document J88-N2.
[J8701] "Advanced Avionics Architecture (A3) Standard," Joint Integrated
Avionics Working Group Standard, Document J87-01.
[Klj89] Kljaich, J., Jr., Smith, B. T., and Wojcik, A. S., "Formal
Verification of Fault Tolerance Using Theorem-Proving Techniques," IEEE
Transactions on Computers, Vol. 38, No. 3, March 1989.
[Kop89] Kopetz, H., et al., "Distributed Fault-Tolerant Real-Time Systems:
The MARS Approach," IEEE Micro, 9(1):25-40, February 1989.
[Kop91] Kopetz, H., et al., "The Rolling Ball on MARS," Institut fur
Technische Informatik Research Report No. 13/91, Technische
Universitat Wien, Vienna, Austria, November 1991.
[Kri85] Krishna, C. M., Shin, K. G., Butler, R. W., "Ensuring Fault
Tolerance of Phase Locked Clocks," IEEE Trans. Computers, Vol. C-
34, No. 8, August 1985.
[Lal84] Lala, J. H., "An Advanced Information Processing System," 6th
AIAA-IEEE Digital Avionics Systems Conference, Baltimore, MD,
December 1984.
[Lal85] Lala, J. H., "Advanced Information Processing System: Fault
Detection and Error Handling," AIAA Guidance, Navigation and
Control Conf., Snowmass, CO, Aug. 1985.
[Lal86a] Lala, J. H., "Fault Detection, Isolation, and Reconfiguration in the
Fault Tolerant Multiprocessor," Journal of Guidance, Control, and
Dynamics, Sept-Oct. 1986.
[Lal86b] Lala, J. H., "A Byzantine Resilient Fault Tolerant Computer for
Nuclear Power Plant Applications," 16th Annual International
Symposium on Fault Tolerant Computing Systems, Vienna, Austria, 1-4
July 1986.
[Lal89] Lala, J. H., et al., "Study of a Unified Hardware and Software
Fault Tolerant Architecture," NASA Contractor Report 181759,
January 1989.
[Lal91] Lala, J. H., R. Harper, K. Jaskowiak, G. Rosch, L. Alger, and A.
Schor, "AIPS for Advanced Launch System: Architecture Synthesis
Report," NASA Contractor Report 187544, September 1991.
[Lam85] Lamport, L., Melliar-Smith, P. M., "Synchronizing Clocks in the
Presence of Faults," Journal of the ACM, 32(1):52-78, January
1985.
[Lap90] "Dependability: Basic Concepts and Terminology," J. C. Laprie,
Editor, Published by International Federation for Information
Processing (IFIP) Working Group 10.4 on Dependable Computing and
Fault Tolerance, December 1990.
[Leh87] Lehoczky, Sha, Ding, The Rate Monotonic Scheduling Algorithm -
Exact Characterization and Average Case Behavior, Technical
Report, Department of Statistics, Carnegie-Mellon University, 1987.
[Liu73] Liu, C. L., Layland, J. W., "Scheduling Algorithms for
Multiprogramming in a Hard Real-Time Environment," J. ACM, 20(1):46-61,
1973.
[LSP82] Lamport, L., Shostak, R., Pease, M., "The Byzantine Generals
Problem," ACM Transactions on Programming Languages and
Systems, Vol. 4, No. 3, July 1982, pp. 382-401.
[MA-HDBK] Modular Avionics Handbook, Document No. 21530(0-6), FSCM
51993, Draft C, U. S. Air Force ASD-ALD/AX, 19 April 1990.
[Ma78] Martin, D. L., Gangsaas, D., "Testing of the YC-14 Flight Control
System Software," AIAA Journal of Guidance, Control, and
Dynamics, Vol. 1, No. 4, July-August 1978.
[McE88] McElvany, M. C., "Guaranteeing Deadlines in MAFT," IEEE Real-
Time Systems Symposium, Huntsville, AL, December 1988.
[MIL-HDBK-0036] "Survivable Adaptable Fiber Optic Embedded Network II -
SAFENET II," Military Handbook, MIL-HDBK-0036, 1 March
1990.
[MIL-HDBK-59] MIL-HDBK-59, "Computer-Aided Acquisition and Logistic
Support (CALS) Program Implementation Guide," 20 December 1988.
[MIL-HDBK-217E] MIL-HDBK-217E, "Reliability Prediction of Electronic
Equipment," 2 January 1990.
[MIL-STD-344] MIL-STD-344 (draft), "Standard Army Vetronics Architecture," 14
September 1990.
[MIL-STD-785B] MIL-STD-785B, "Reliability Program for Systems and Equipment
Development and Production," 15 September 1980.
[MIL-STD-1553] "Aircraft Internal Time Division Command/Response Multiplex Data
Bus," Military Standard, MIL-STD-1553B, 12 February 1980.
[MIL-STD-1815A] MIL-STD-1815A, "Reference Manual for the Ada Programming
Language," 17 February 1983.
[NAS1-18565-14] Statement of Work for NASA Contract NAS1-18565, Task 14,
June 1990.
[Osd88] Osder, S. S., "Digital Fly-by-Wire System for Advanced AH-64
Helicopters," 8th Digital Avionics Systems Conference, October
1988.
[Pe80] Pease, M., Shostak, R., Lamport, L., "Reaching Agreement in the
Presence of Faults," Journal of the ACM, Vol. 27, No. 2, April
1980, pp. 228-234.
[PEI90120] XTP® Protocol Definition, Revision 3.5, Published by Protocol
Engines Inc., September 1990.
[Pek88] Pekelsma, N. J., "Optimal Guidance with Obstacle Avoidance for
Nap-of-the-Earth Flight," NASA Contractor Report 177515,
December 1988.
[Pus89] Puschner, P., Koza, C., "Calculating the Maximum Execution Time
of Real-Time Programs," Real-Time Systems, 1(2):159-176,
September 1989.
[Rad90] PMV 68 CPU-3A Specification, Issue 3, Publication No.
681/SA/04085, Radstone Technology plc, 1990.
[Rus89] Rushby, J., von Henke, F., "Formal Verification of a Fault Tolerant
Clock Synchronization Algorithm," NASA Contractor Report 4239,
June 1989.
[SAE91] SAE/AS-2A Subcommittee RTMT Statement on Requirements for
Real-Time Communication Protocols (RTCP), Issue #1, SAE
ARD50007, August 2, 1991.
[San90] STAR MVP Technical Description, Document No. 4069718,
Lockheed Sanders, 25 June 1990.
[Sch91] Schutz, W., "On the Testability of Distributed Real-Time Systems,"
Proc. Tenth Symposium on Reliable Distributed Systems, Pisa,
Italy, September 1991.
[Spi89] Spivey, J. M., The Z Notation: A Reference Manual, Prentice Hall
International (UK) Ltd, 1989.
[Spi90] Spivey, J. M., "Specifying a Real-Time Kernel," IEEE Software,
Special Issue on Formal Methods, Vol. 7, No. 5, September 1990.
[Sri90] Srivas, M. and Bickford, M., "Formal Verification of a Pipelined
Microprocessor," IEEE Software, Special Issue on Formal
Methods, Vol. 7, No. 5, September 1990.
[Sta87] Stankovic, J. A., Ramamritham, K., "The Design of the Spring
Kernel," Proc. of the Real Time Systems Symposium, December
1987.
[Sun74] Sundstrom, R. J., "On-Line Diagnosis of Sequential Systems,"
PhD Thesis, University of Michigan, 1974.
[Tan88] Tanenbaum, A. S., Computer Networks, second edition, Prentice-
Hall, 1988.
[X3T95] "FDDI Station Management (SMT)," Preliminary Draft Proposed
American National Standard, X3T9.5/84-49, Rev. 6.2, May 18,
1990.
Xpress Transfer Protocol®, XTP®, and Protocol Engine® are registered trademarks of
Protocol Engines, Incorporated.
Appendix B. Glossary Of Terms and Acronyms
AFTA-Army Fault-Tolerant Architecture-A computer designed for both high reliability and
high throughput. The AFTA is based on the FTPP architecture.
aperiodic-A set of tasks whose iteration rates are unknown or undefined.
ASIC-Application Specific Integrated Circuit-A type of integrated circuit that can be custom
designed by the hardware engineer so that it will perform a particular logic or processing
function and at the same time save circuit board space and power consumption. The advent
of VLSI design techniques has made ASICs a more flexible and practical option for
hardware designers.
ATP-Authentication Protocol-A protocol utilized by the BRNP to sign outgoing packets
and to test the authenticity of incoming packets.
ATPG-Automatic Test Pattern Generation-The generation of test vectors directly from a
netlist for verification of device functionality. Test vectors from an ATPG program do not
test the correct functionality of the device; they only test that the device is a correct
implementation of the design as specified by the netlist.
behavioral VHDL is defined to be a VHDL architecture which uses any of the legal VHDL
constructs, including those which do not correspond to possible hardware realizations of
the description (i.e., pure behavioral may not be synthesizable). A level of description
that specifies a device functionally in terms of output reactions to input stimulus. A
behavioral description can also specify the timing relationships of inputs to outputs.
BIT-Built In Test-This is an internal diagnostic testing system that is included as part of the
AFTA design. There are three forms of BIT: I-BIT is the initial power-on test system,
M-BIT is for maintenance testing, and C-BIT is the continuous in-flight test system.
BRNP-Byzantine Resilient Network Protocol-A network layer protocol which implements
the Byzantine Resilient Virtual Circuit in order to guarantee that all messages are delivered
accurately.
broadcast addressing-A method of station addressing using an identifier that causes all
stations to respond to the specified address.
_-The ability to effectively isolate a node from the network without disrupting the
continuity of the network.
Byzantine Resilient-Capable of tolerating Byzantine faults. A Byzantine Resilient system is
capable of handling arbitrarily malfunctioning components that may supply faulty informa-
tion to other parts of the system thereby causing a spread of faulty information within the
system.
C3-Cluster 3-An FTPP model number. Composed of either 4 or 5 FCRs, 3-40 processors,
1-40 VIDs, simplex, triplex, and quadruplex processor redundancy levels. Previous FTPP
models were C1 (4 FCRs, 16 processors, 4-16 VIDs, simplex, duplex, triplex, and
quadruplex processor redundancy levels) and C2 (4 FCRs, 4 processors, one fixed quad
VID).
cache-A form of memory that is typically much faster and much smaller than main memory.
Through utilization of cache memory, a processor's throughput will be increased. Typi-
cally cache memory acts as a staging area for data; information will be pulled from main
memory and temporarily stored in cache while it undergoes processing.
CDU-Cockpit Display Unit-A cathode ray tube display located in the vehicle cockpit for
display of system status. The CDU may display overall AFTA system status, LRU level
status, or LRM level status.
CID-Communication Identification-A designation assigned to each task which is used for
intertask communication.
class test-A test of the Network Element voting mechanism that requests a non-congruent
message exchange selectively on each channel of a fault masking group.
cluster-An FTPP consisting of 4 or 5 FCRs containing at least one virtual processing site.
Multiple clusters could be connected by a network device (such as a fault-tolerant data bus)
to provide even greater throughput than a single cluster. Most references to an FTPP refer
to a single cluster design.
CMF-Common Mode Fault-A type of malfunction which will cause multiple faults or
complete execution failure within a redundant processing group. Common mode faults
may result from software flaws, hardware bugs, design flaws, massive electrical upsets,
etc.
_-Input/Output processes that allow the associated virtual group to perform
other tasks while I/O is collecting data. This allows for greater processor throughput.
CRC-Cyclic Redundancy Check-An error detecting code used in data communications that
allows the unit receiving a message to ensure through binary mathematics that it is the same
message sent by the transmitting unit.
CSMA/CD-Carrier Sense Multiple Access with Collision Detection-A form of media access
control whereby a potential transmitting station will monitor the bus to ensure that it is clear
before transmission begins. During transmission, the station also monitors the bus to
check for message collisions. If a collision occurs, the message must be re-transmitted.
CT-Configuration Table-A table stored on the Network Element that contains the current
configuration of the system, i.e., which processors are members of which virtual groups.
DAIS-Digital Avionics Instruction Set-A benchmark for measuring processor throughput.
_-A set of diagnostic level tests executed outside of the constraints of a real-time
environment with emphasis on the isolation of chip level faults in these components. These
tests would occur at a maintenance repair facility in contrast to the various forms of built-in
testing.
DPRAM-Dual-Port Random Access Memory-The type of memory that occupies the data
segment. It provides a buffer between the NE and the PE; both the NE and the PE may
access the data segment asynchronously, provided that they do not attempt to access the same
location.
DR-Discrepancy Report-A report that is filed whenever unexpected behavior of the hard-
ware, software, or system is encountered. By recording observable symptoms of the sys-
tem throughout testing, integration, verification and validation, one may better trace and
identify system flaws.
entity-A specific instance of a protocol element in an Open Systems Interconnection layer or
sublayer.
FCR-Fault Containment Region-Usually comprised of a number of line replaceable
modules such as Processing Elements, Network Elements, input/output controller, and power
conditioners. The AFTA is made up of four or five FCRs, and each FCR usually resides
on a single circuit board (with the exception of the power conditioner). An interchangeable
term for the FCR is Line Replaceable Unit or LRU.
FDDI-Fiber Distributed Data Interface-A networking standard developed by the American
National Standards Institute to provide high bandwidth for Local Area Networks.
FDIR-Fault Detection, Identification and Recovery-FDIR software designed for the AFTA
allows it to sustain multiple successive faults by identifying a faulty component and
reconfiguring the AFTA system operation to compensate for the fault.
FIFO-First In First Out-A type of information buffer in which the data that is stored first
chronologically will be the first to be extracted.
FMEA-Failure Modes and Effects Analysis
FMG-Fault Masking Group-A logical grouping of three or four processors to enhance the
reliability of critical tasks. The members of an FMG execute the same code with the same
data and periodically exchange messages to ensure that they produce the same outputs.
FTC-Fault Tolerant Clock-A distributed digital phase-locked loop used for synchronization
of AFTA fault containment regions.
FTDB-Fault Tolerant Data Bus-A local area network designed around principles of
Byzantine resilience. Its primary objective is to provide an optimal internetworking system
between simplex and redundant processing sites.
FTNP-Fault Tolerant Navigation Processor-The initial ground vehicle application for the
AFTA is for the navigation system in Armored Systems Modernization vehicles.
FTPP-Fault-Tolerant Parallel Processor-A computer designed for both high reliability and
high throughput. The core of the FTPP is the Network Element.
functional reliability-The probability that a given function can be executed because its re-
sources are operational.
functional synchronization-In maintaining synchronous operation, the members of a VID
perform a synchronizing act after some sequence of functions has been completed. The se-
quence of functions between the synchronization points is referred to as a frame.
Page B-4
GC-Global Controller-A microcoded finite-state machine used to coordinate the functions
throughout the Network Element.
graceful degradation-Through self-testing, a virtual group may identify a faulty member
and gracefully degrade its redundancy level using a configuration table update message to
eliminate the faulty channel.
IOC-Input/Output Controller-These devices connect the AFTA to the outside world, and
they must be compatible with the bus connecting elements of the FCR. They may have a
programmable processor on board to drive the I/O, or they may require off-board
processors for operation.
IPS-Instructions Per Second-The number of machine language instructions that a processor
will execute every second. This measurement is used to reference the speed of the proces-
sor.
ISO/OSI-International Standards Organization/Open Systems Interconnection-A specification
and model for computer communication networks.
LAN-Local Area Network-A network topology that interconnects computer systems separated
by relatively short distances (2-2000 meters). LAN technology is usually based on a
shared medium with no intermediate switching nodes required.
leaf-level-(VHDL) The models at the bottom of the model tree. Leaf-level models in VHDL
are always pure behavioral models.
LERP-Local Exchange Request Pattern-A string of bytes describing the current state of the
input and output buffers for each processor in an FCR. The LERP is used to generate the
SERP. Each FCR has a different configuration, therefore the LERPs for each FCR will be
different. For this reason, LERPs must be treated as single-source data.
link-An element in a physical network that provides interconnection between nodes.
LOC-Loss of Control-This will occur as a result of a failure in any flight critical portion of
the Flight Control System. For analysis purposes, LOC will be considered as a total loss
of the vehicle.
Page B-5
local FDIR-Each virtual group will exercise its own fault detection and identification
processes to monitor failures among its processors. Also, each virtual group may initiate its
own recovery options.
logical addressing-A method of station addressing using an identifier that may select a
group of stations to respond to the specified address.
LRM-Line Replaceable Module-The physical unit for field diagnosis and repair. Typically
it consists of one circuit card assembly with one or more Processing Elements.
LTPB-Linear Token Passing Bus-A media access control method whereby stations pass a
token along a virtual ring from one to another. A station may only transmit when it pos-
sesses the token.
MDC-Minimum Dispatch Complement-This specifies the absolute minimum level of
operability for the AFTA system to be cleared for a sortie.
media access control-The method by which access to the physical network media is limited
to a single node so that communications over the media are undisturbed.
media layer-One or more physical layer media. Multiple media layers are physically and
electrically isolated from each other to the same degree as a fault-containment region in a
fault-tolerant computer. Most traditional LANs use only a single network layer. A Byzan-
tine resilient network usually employs multiple media layers for redundancy.
memory alignment-A process whereby the RAM and registers in each processor of a virtual
group are made congruent as part of the resynchronization of a virtual group.
mission reliability-Arithmetically speaking, mission reliability is one minus the probability
that failure of the AFTA causes abortion of the mission.
MMC-Minimum Mission Complement-This specifies the minimum level of AFTA oper-
ability for the vehicle to continue its mission.
NDI-Non-Developmental Item
NE-Network Element-The hardware device which provides the connectivity between virtual
groups. The primary function of the NE is to exchange and vote packets of data provided
by the processors. The ensemble of Network Elements forms a virtual bus network
to which all virtual groups are connected.
NEID-Network Element ID-The name by which a Network Element is known in the physical
AFTA configuration. An NEID refers to a specific Network Element in the system, i.e.
the same NEID on different FCRs refers to the same Network Element. The NEID is also
used to refer to the FCR in which the referenced Network Element resides. By convention,
letters are used to denote the NEID.
netlist-A list defining interconnections of components. Netlists are typically used for de-
signing printed circuit boards or ASICs.
NIU-Network Interface Unit-The connection between a station and the FTDB.
node-An element in a physical network that provides the necessary interface between a
station and the network media.
nonpreemptible I/O dispatcher-A task on the virtual group that manages the execution of
certain I/O instructions that cannot be interrupted.
packet-A block of data consisting of a header, data, and a trailer exchanged between peer
protocol entities. The term packet is somewhat generic and is applied at all levels of the
protocol hierarchy.
packet-A string of data of fixed or variable length for transmission from one processor to
another through an inter-processor network. A message-passing network handles data in
packets. The term packet is used here to refer to a fixed-size (64 bytes) block of data which
is transmitted by the Network Elements.
PDU-Protocol Data Unit-A fancy name for a packet. PDU is the name used by OSI.
PE-Processing Element-A hardware device which provides a general or special purpose
processing site. A minimal PE configuration contains a single processor and local memory
(RAM and ROM). PEs may optionally have private I/O, making them a combination PE
and IOC.
PEID-Processing Element ID-The name by which a Processing Element is known in the
physical AFTA configuration. Each PE in an FCR has a unique PEID. However, the same
PEID may be used by another processor in another FCR. A combination of NEID and
PEID is used to uniquely identify a single Processing Element within a cluster.
physical addressing-A method of station addressing using a unique identifier such that at
most one station responds to the specified address.
PIMA-Portable Intelligent Maintenance Aid-A system resembling a laptop computer which
will initiate the maintenance built in testing (M-BIT), interrogate AFTA for fault
information logged during a mission, and extract maintenance records for system components.
PMD-Physical layer Medium Dependent-The standard which defines the physical medium
that is used for the data communications channel on a network.
presence test-The polling of various components to determine if each is active and syn-
chronized. The testing may be performed on members of virtual groups or on the virtual
groups themselves.
primitive-A function or procedure that one entity provides to another. The primitive
definition specifies the inputs, outputs, and data formats for the primitive.
PROM-Programmable Read Only Memory-A form of computer memory that will store a
permanent copy of one or more subroutines specifically intended for use by a particular mi-
croprocessor. PROM's allow for a certain level of hard-wired software control over the
processor.
quadruplex-A virtual group consisting of four processing sites.
rate group dispatcher-An RG4 task that is responsible for controlling the execution of the
rate group tasks and providing reliable communication between the rate group tasks
throughout the system.
Register Transfer Level (RTL) VHDL-A behavioral format which specifies the functionality
of a block from the standpoint of random combinational logic and/or synchronous registers.
For the purpose of the AFTA NE development, RTL is defined to be synthesizable
behavioral VHDL, that is, a behavioral VHDL description that is suitable for input to a
synthesis tool.
reprocurement-The act of obtaining new parts to replace parts in an existing system, or to
build additional copies of an existing design.
Page B-8
RG-Rate Group-A set of tasks whose iteration rate is well-defined and whose execution
times do not exceed the iteration frame (the inverse of the iteration rate).
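The relationship between iteration rate and iteration frame can be sketched in Python as follows. The function names are hypothetical, and the check assumes that the tasks of a rate group run back to back, so that their summed execution time is what must fit within the frame; the definition above does not state this explicitly.

```python
def iteration_frame(rate_hz):
    """Iteration frame in seconds: the inverse of the iteration rate."""
    return 1.0 / rate_hz

def fits_in_frame(rate_hz, exec_times_s):
    """True if the summed task execution times fit within one iteration frame.

    Summing the times is an assumption made for this illustration.
    """
    return sum(exec_times_s) <= iteration_frame(rate_hz)
```

For example, a 100 Hz rate group has a 10 ms frame, so tasks needing 2 ms and 3 ms fit, while two 6 ms tasks do not.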
RISC-Reduced Instruction Set Computer-A type of microprocessor which utilizes a limited
set of machine language instructions to allow for more rapid execution of those instructions
and thus greater throughput for the computer.
RTS-Run Time System
SAVA-Standard Army Vetronics Architecture
sequential I/O-Input/Output processes that require the managing virtual group to completely
supervise the activity. In other words, the virtual group must block itself until the I/O is
finished.
SERP-System Exchange Request Pattern-A string of bytes describing the current state of
the input and output buffers for each processor in the system. The SERP is used to determine
if packets can be sent from one virtual group to another. The LERP from each FCR is
exchanged using a source congruency to generate the SERP. Because the SERP originates
from a source congruency exchange, it can be considered congruent throughout all
functioning FCRs.
SIFT-Software Implemented Fault Tolerance-System fault tolerance functions achieved
primarily through operating system programming rather than primarily through dedicated
hardware.
simplex-A virtual group consisting of only one processing site.
single-source data-An element of information which originates from a single point. Examples
of single-source data include sensor readings, input values, and syndromes.
Single-source data must be distributed to fault-masking groups using a source congruency
exchange to maintain Byzantine resilience.
sortie availability-One minus the probability that the vehicle is prevented by the AFTA from
beginning a mission at the desired time.
source congruency-A type of exchange used to distribute data from a single source, such as
an input device, to members of a fault-masking group. The source congruency, which is
also known as a class 2, 2-round exchange, or interactive consistency, is a primary
requirement for a Byzantine resilient system.
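The two-round idea behind such an exchange can be sketched in Python under simplifying assumptions. In this illustration, a single source sends a value to each of three receivers (round 1), every receiver relays what it received to the others (round 2), and each receiver then takes a majority vote over the relayed values. The function names, the `relay` hook used to model a faulty receiver, and the use of `None` as the default value when no majority exists are all assumptions made for the sketch; they do not describe the AFTA Network Element implementation.

```python
from collections import Counter

DEFAULT = None  # value adopted when no strict majority exists (assumption)

def vote(values):
    """Strict-majority vote over the relayed values."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT

def source_congruency(round1, relay=None):
    """Simulate a two-round exchange among len(round1) receivers.

    round1[i] is what the source sent to receiver i; a faulty source may
    send different values to different receivers.
    relay(i, j, v) is what receiver i reports to receiver j after having
    received v; the default models fault-free receivers.
    Returns the value each receiver settles on.
    """
    n = len(round1)
    relay = relay or (lambda i, j, v: v)
    results = []
    for j in range(n):
        # Receiver j uses its own round-1 value plus the round-2 reports.
        seen = [round1[j] if i == j else relay(i, j, round1[i]) for i in range(n)]
        results.append(vote(seen))
    return results
```

With a faulty source that sends 5, 5, and 9, all fault-free receivers see the same set of relayed values and therefore settle on the same result. With a fault-free source and one receiver that relays a wrong value, the remaining receivers still obtain the source's value.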
station-A device connected to a network that can transmit or receive data over the network.
Often a station is a processing site. In the FTDB, a station can be a redundant processing
site.
structural VHDL-A level of description that specifies a VHDL architecture by defining
interconnections of instantiations of VHDL entities. A structural description resembles a
conventional netlist.
syndrome-A bit field indicating the observance of unusual behavior somewhere in the
system. Syndromes can be used in an attempt to diagnose and repair faults in the system.
system FDIR-A process that will coordinate system status and fault information as well as
testing and analyzing shared components.
task migration-The movement of a necessary task from a failed processor to another
processor within the same fault containment region.
test bench-A model of a test fixture that is used to test a device being designed with VHDL.
The test bench is written in VHDL and provides a non-proprietary way of stimulating and
monitoring a design in a simulator.
testability-The ability to unambiguously ascertain the functionality of each Line Replaceable
Module of the AFTA.
TF/TA/NOE-Terrain Following/Terrain Avoidance/Nap of the Earth-A typical helicopter
mission application for which the AFTA will be designed.
THT-Token Holding Timer-A method used with token passing media access protocols to
limit the amount of time each station can transmit on the network.
timeout-A value of time used to monitor skew between processors of an FMG. All proces-
sors in an FMG should be synchronized to within one timeout value, so if a processor does
not respond within the timeout period, that processor is considered faulty, and the other
processors will continue uninhibited. Timeouts are necessary on the AFTA to prevent
faulty processors from halting the system.
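A timeout check of this kind might be sketched in Python as follows. The function name and data layout are hypothetical, and measuring the timeout from the earliest arrival is a simplifying assumption made for the illustration rather than the AFTA mechanism.

```python
def timed_out_members(arrivals, timeout):
    """Return the members considered faulty under a simple timeout rule.

    arrivals maps a member name to its arrival time in seconds, or to None
    if the member never responded. The deadline is the earliest arrival
    plus the timeout (an assumption made for this sketch).
    """
    arrived = [t for t in arrivals.values() if t is not None]
    if not arrived:
        return sorted(arrivals)
    deadline = min(arrived) + timeout
    return sorted(m for m, t in arrivals.items() if t is None or t > deadline)
```

The members that are not flagged can proceed without waiting for the late or silent ones, which is the purpose of the timeout described above.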
Page B- 10
timestamp-A 32-bit quantity that indicates the relative time within the cluster. The Network
Element places a timestamp in the input info block for each packet successfully delivered to
a virtual group.
TNR-Transient NE Recovery-The procedure by which a Network Element which has suffered
a transient fault is reintegrated into the cluster. The first part of TNR is similar to the
ISYNC procedure. TNR also specifies the realignment of the Network Element state.
transient recovery policy-A recovery option whereby the faulty component is immediately
disabled and an attempt is made to reintegrate the component into the system.
triplex-A virtual group consisting of three processing sites.
validation-The process of demonstrating that an implemented system correctly performs its
intended functions under all reasonably anticipated operational scenarios.
validity-In a Byzantine resilient system, a condition in which all functioning members of a
fault-masking group are guaranteed to possess correct data. The validity condition also
implies the agreement condition.
vehicle reliability-One minus the probability that the vehicle is lost due to failure of the
AFTA.
VG-virtual group-A grouping of one or more processors to form a virtual (possibly redun-
dant) single processing site. All processors in a virtual group execute the same instruction
stream. If a virtual group has more than one member, those members must reside in different
FCRs. Virtual groups of 3 or more members are known as fault-masking groups.
VHDL-VHSIC Hardware Description Language-A language for specifying hardware design.
VHDL designs can be expressed in a behavioral or a structural method. VHDL also
defines a simulation environment and incorporates an intrinsic sense of time.
VHSIC-Very High Speed Integrated Circuit-A Government-funded project to develop
technologies to be applied to new, high speed integrated circuits. The VHSIC Hardware
Description Language (VHDL) was developed under the VHSIC program.
VID-Virtual Identifier-The name by which a virtual group is known to the system. Also,
sometimes used as a synonym for virtual group.
Page B-11
voted message-A message sent by all members of a redundant processing group. This
message type is only used when exact consensus among all redundant members is
expected. This is also known as a Class 1 message.
voter test-A test of the Network Element voting mechanism that seeds non-congruent values
selectively on each channel of a fault masking group.
WAN-Wide Area Network-A network topology that interconnects computer systems sepa-
rated by long distances. WAN systems usually use packet switched technology.
watchdog timer-A simple timekeeper that will monitor operations in both the Processing
Elements and the Network Elements to keep the hardware and software from wandering into
undesirable states.
working group-The set of FCRs in a cluster which are synchronized and in the operational
phase. An FCR which suffers a fault drops out of the working group. The working group
may attempt to reintegrate the failed FCR into the working group.
WPV-Weight Power Volume-These are physical characteristics used to describe the AFTA.
Page B-12

REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE: July 1992
3. REPORT TYPE AND DATES COVERED: Contractor Report
4. TITLE AND SUBTITLE: Advanced Information Processing System: The Army Fault Tolerant
Architecture Conceptual Study - Volume II: Army Fault Tolerant Architecture Design
and Analysis
5. FUNDING NUMBERS: WU 505-64-52-53; C NAS1-18565; TA 14
6. AUTHOR(S): R. E. Harper, L. S. Alger, C. A. Babikyan, B. P. Butler, S. A. Friend, R. J. Ganska,
J. H. Lala, T. K. Masotto, A. J. Meyer, D. P. Morton, G. A. Nagle, and C. E. Sakamaki
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES):
The Charles Stark Draper Laboratory, Inc.
555 Technology Square
Cambridge, MA 02139
8. PERFORMING ORGANIZATION REPORT NUMBER
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES):
National Aeronautics and Space Administration
Langley Research Center
Hampton, VA 23665-5225
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: NASA CR-189632, Volume II
11. SUPPLEMENTARY NOTES: Technical Monitor: Carl R. Elks, Aerostructures Directorate, AVRADA,
AVSCOM, Langley Research Center, Hampton, VA
12a. DISTRIBUTION/AVAILABILITY STATEMENT: Unclassified - Unlimited; Star Category 62
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
The Army Avionics Research and Development Activity (AVRADA) is pursuing programs that would enable effective and
efficient management of large amounts of situational data that occurs during tactical rotorcraft missions. The
"Computer-Aided Low Altitude Night Helicopter Flight Program" has identified automated Terrain Following/Terrain
Avoidance, Nap of the Earth (TF/TA, NOE) operation as key enabling technology for advanced tactical rotorcraft to enhance
mission survivability and mission effectiveness. The processing of critical information at low altitudes with short reaction
times is life-critical and mission-critical, necessitating an ultrareliable/high throughput computing platform for dependable
service for flight control, fusion of sensor data, route planning, near-field/far-field navigation, and obstacle avoidance
operations.
To address these needs the Army Fault-Tolerant Architecture (AFTA) is being designed and developed. This computer
system is based upon the Fault-Tolerant Parallel Processor (FTPP) developed by Charles Stark Draper Laboratory, Inc.
(CSDL). AFTA is a hard real-time, Byzantine fault-tolerant parallel processor which is programmed in the Ada language.
This document describes the results of the conceptual study (Phase I of a 3-year project) of the AFTA development. This
document contains detailed descriptions of the program objectives, the TF/TA NOE application requirements, architecture
overview, hardware design, operating systems design, analytical models and development plan.
14. SUBJECT TERMS: Fault-tolerant, Real-time digital computer, Terrain-following/terrain avoidance helicopter
operation
15. NUMBER OF PAGES: 436
16. PRICE CODE: A19
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT
20. LIMITATION OF ABSTRACT
NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89), Prescribed by ANSI Std. Z39-18
