A system-level supervisory approach to mitigate single event functional interrupts in data handling architectures. by Maqbool, Shazia.
ProQuest 10130240
Published by ProQuest LLC (2017). Copyright of the Dissertation is held by the Author.
A System-level Supervisory Approach to Mitigate 
Single Event Functional Interrupts in Data Handling
Architectures
By
Shazia Maqbool
Submitted for the degree of Doctor of Philosophy
School of Electronics and Physical Sciences 
University of Surrey
May 2006
© Shazia Maqbool 2006
In the Name of the God
Dedicated to Farooq
Abstract
This thesis examines the effective mitigation of Single Event Effects (SEEs) in 
commercial State-Of-The-Art (SOTA) data handling devices to provide high-performance, 
fault-tolerant data handling architectures for space missions. It concentrates on Single 
Event Functional Interrupts (SEFIs), whereby a single particle hit in a sensitive device 
cross-section leads to unexpected device behaviour.
Reports of SEFIs are increasing in all key data handling technologies, e.g. memories, 
microprocessors, field programmable gate arrays (FPGAs) and on-board local area networks 
(LANs). Constructing a high-performance on-board data handling (OBDH) architecture will 
therefore require a large number of resources to cope with the problem of SEFIs. This research 
proposes an architectural/system-level approach to SEFI mitigation, in which a global supervisor 
is added to the architecture to monitor heterogeneous OBDH units. In the proposed OBDH 
architecture, all units are connected through a router-based SpaceWire network. The supervisor 
is a radiation-hardened microprocessor that is part of the router unit. The supervisor 
expects to receive special detection and diagnosis (DAD) packets from the underlying units. 
Health information collected through these packets is compared with the designer’s input to 
produce a fault signature, if any. The supervisor intervenes when the state of a unit does not match 
expectations or when DAD packets stop arriving. In such an event, the supervisor applies a 
recovery procedure based on the fault signature observed and any previous recovery record for 
that unit.
Theoretical and experimental analyses are presented to establish the practicality of the scheme. 
The outcome of this thesis is a SEFI-tolerance methodology aimed at applications that 
demand SOTA commercial technologies and increased availability but cannot afford the high 
cost and resources associated with traditional redundancy-based mitigation. This is 
particularly useful for small satellites, where very limited mass, volume and power resources 
preclude the use of multiple-redundant system-based architectures. It therefore promises a 
measurable increase in small satellite utility across a range of mission performance 
requirements.
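The supervisory behaviour summarised in this abstract (compare each DAD packet against the designer's expectation, treat missing packets as a fault, and choose a recovery from the unit's recovery history) can be illustrated with a minimal sketch. It is not the thesis implementation: the class and method names, the bit-flag encoding of unit state, the XOR fault signature, and the policy of escalating by one recovery level per repeated fault are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UnitRecord:
    """Per-unit bookkeeping held by the supervisor (hypothetical layout)."""
    expected_state: int            # designer's input: state flags the unit should report
    timeout: float                 # longest allowed gap between DAD packets, in seconds
    last_seen: float = 0.0         # arrival time of the most recent DAD packet
    recoveries: List[int] = field(default_factory=list)  # recovery levels already applied


class Supervisor:
    MAX_LEVEL = 3  # assumption: three escalating recovery procedures

    def __init__(self) -> None:
        self.units: Dict[str, UnitRecord] = {}

    def register(self, name: str, expected_state: int, timeout: float, now: float = 0.0) -> None:
        self.units[name] = UnitRecord(expected_state, timeout, last_seen=now)

    def on_dad_packet(self, name: str, reported_state: int, now: float) -> Optional[int]:
        """Handle one DAD packet; return a recovery level, or None if the unit is healthy."""
        unit = self.units[name]
        unit.last_seen = now
        signature = reported_state ^ unit.expected_state  # non-zero bits mark mismatches
        if signature:
            return self._recover(unit)
        unit.recoveries.clear()  # a healthy report resets the recovery history
        return None

    def poll(self, now: float) -> Dict[str, int]:
        """Check every unit for silence; return the recovery level chosen for each silent unit."""
        actions: Dict[str, int] = {}
        for name, unit in self.units.items():
            if now - unit.last_seen > unit.timeout:
                actions[name] = self._recover(unit)
                unit.last_seen = now  # restart the timer after a recovery attempt
        return actions

    def _recover(self, unit: UnitRecord) -> int:
        # Escalate one level per consecutive fault, capped at MAX_LEVEL.
        level = min(len(unit.recoveries) + 1, self.MAX_LEVEL)
        unit.recoveries.append(level)
        return level


sup = Supervisor()
sup.register("OBC", expected_state=0b01, timeout=2.0)
print(sup.on_dad_packet("OBC", 0b01, now=1.0))  # healthy packet
print(sup.on_dad_packet("OBC", 0b11, now=2.0))  # state mismatch
print(sup.poll(now=5.0))                        # DAD packets stopped arriving
```

In this sketch a mismatch and a timeout share one escalation path; a real supervisor would more likely select the recovery procedure from the specific fault signature as well as the history.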
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABBREVIATIONS
1. INTRODUCTION
1.1 MOTIVATION
1.2 OBJECTIVES AND SCOPE
1.3 NOVEL CONTRIBUTIONS
1.4 STRUCTURE OF THESIS
2. TRENDS IN ON-BOARD DATA HANDLING
2.1 THE RADIATION ENVIRONMENT
2.1.1 Trapped Radiation Belts
2.1.2 Galactic Cosmic Rays
2.1.3 Solar Particle Events
2.2 BASIC MECHANISMS AND THEIR TRENDS
2.2.1 Single Event Effects
2.2.1.1 LET Cross-Section Curve
2.2.1.2 Single Event Upset (SEU)
2.2.1.3 Single Event Transient (SET)
2.2.1.4 Single Event Latch-Up (SEL)
2.2.1.5 Single Event Functional Interrupt (SEFI)
2.2.1.6 Single Event Hard Error (SHE)
2.2.1.7 Single Event Gate Rupture (SEGR)
2.2.2 Multiple Bit Upset (MBU)
2.2.3 Total Ionising Dose Effects
2.3 TRENDS IN LOGIC DEVICES
2.3.1 Memories
2.3.1.1 DRAMs
2.3.1.2 SRAMs
2.3.1.3 Flash Memories
2.3.1.4 EEPROMs
2.3.2 Microprocessors
2.3.3 Field Programmable Gate Arrays (FPGAs)
2.3.4 Data Handling Networks
2.4 TRENDS IN OBDH ARCHITECTURES
2.5 CONCLUSIONS
3. INVESTIGATION INTO FUNCTIONAL INTERRUPTS
3.1 RANDOM ACCESS MEMORIES
3.1.1 DRAMs
3.1.2 SDRAMs
3.2 FLOATING GATE MEMORIES
3.3 MICROPROCESSORS
3.4 FIELD PROGRAMMABLE GATE ARRAYS
3.4.1 Device De-Configuration
3.4.2 Interruptions from JTAG Operations
3.4.3 Configuration Upsets Leading to a SEFI
3.4.4 Activating Output Drivers on an Input Pin
3.5 DATA HANDLING NETWORKS
3.6 CONCLUSIONS
4. FAULT TOLERANCE STRATEGIES
4.1 MITIGATION TECHNIQUES FOR MEMORIES
4.1.1 Error Detecting Codes
4.1.2 Error Detection And Correction Codes (EDAC)
4.1.3 SEFI Detection for Ground Testing
4.2 MITIGATION TECHNIQUES FOR MICROPROCESSORS
4.2.1 Fault Detection
4.2.1.1 Watchdog Timers
4.2.1.2 Lockstep
4.2.1.3 Built-In Testing
4.2.2 Fault Handling
4.2.2.1 Hardware Fault Tolerance
4.2.2.2 Software Fault Tolerance
4.2.3 Computer State Recovery
4.2.3.1 Hardware Rollback Recovery
4.2.3.2 Software Rollback Recovery
4.2.3.3 Message Passing Rollback Error Recovery
4.2.3.4 Roll-Forward Error Recovery
4.2.4 Computer State Recovery for Space Systems
4.2.4.1 Triple Modular Redundant Flight Computer
4.3 MITIGATION TECHNIQUES FOR FPGAS
4.3.1 Bitstream Repair Technique
4.3.1.1 An Example Architecture: Adaptive Instrument Unit (AIM)
4.3.2 Redundancy Techniques for FPGAs
4.4 MITIGATION TECHNIQUE FOR DATA NETWORKS
4.4.1 X2000 4-Layer Fault Tolerance Strategy
4.4.2 SSTL’S CAN PROTECTION
4.5 CONCLUSIONS
5. PROPOSED FAULT-TOLERANT ARCHITECTURE
5.1 DESIRABLE FEATURES OF THE ARCHITECTURE
5.2 CANDIDATE LANS FOR SPACEFLIGHT
5.2.1 SpaceWire
5.2.2 Radiation Response of SpaceWire
5.2.3 The SpaceWire Router
5.3 PROPOSED FAULT-TOLERANT ARCHITECTURE
5.3.1 Requirements for the Supervisor
5.3.2 Role of Interface Node
5.3.3 Role of Code Store
5.3.4 Requirements for an OBDH Unit
5.4 OVERVIEW OF THE SUPERVISORY PROTOCOL
5.4.1 OBC
5.4.2 Payload System
5.4.3 SSDR
5.4.4 Code Store
5.5 ENVIRONMENT MONITORING
5.6 COMPARISON WITH OTHER SEFI MITIGATION APPROACHES
6. SUPERVISORY PROTOCOL FOR THE OBC
6.1 OBC UNIT
6.1.1 Spacecraft Operating System
6.2 ASSUMPTIONS
6.3 SEFI DETECTION POLICY
6.3.1 Fault Sources and Associated Signatures
6.3.2 Monitoring the Current Consumption of the OBC
6.4 OBC STATE RECOVERY
6.4.1 OBC Software Architecture
6.4.2 Checkpointing an OBC Task
6.4.2.1 Checkpointing with Occasionally Cooperating Tasks
6.4.2.2 Checkpointing with Frequently Cooperating Tasks
6.4.3 State Recovery Task
6.4.3.1 Incremental Checkpointing of State Recovery Task
6.4.4 Initialization of the System after a Reset
6.4.5 Monitoring the State Recovery Task
6.5 SEQUENCE OF EVENTS IN THE DAD TASK
6.5.1 DAD Packet Format
6.6 SCREECH PACKET
6.7 RECOVERY PROCEDURES
6.7.1 Recovery #1
6.7.2 Recovery #2
6.7.3 Recovery #3
6.8 AVAILABILITY OF THE OBC NODE
6.8.1 Prioritize Interrupts
6.8.2 Validate Execution Time of the Test Sequence
6.9 RECOVERY RECOMMENDATIONS
7. LATENCY BOUND OF THE SCHEME
7.1 SYSTEM MODEL
7.1.1 Intel 386EX CPU
7.1.2 Operating System
7.1.3 SMCS332 Communication Controller
7.1.4 SMCSlite
7.1.5 Interface between OBC and Communication Controller
7.1.6 8051 Architecture
7.2 DEFINITIONS AND ILLUSTRATIONS
7.3 LATENCY BOUND CALCULATIONS
7.3.1 DAD Turnaround Time
7.3.2 Communication Delay Experienced by a DAD Packet
7.3.3 Supervisor Processing Time
7.3.4 Timers Required at the Supervisor
7.3.5 Supervisor Time Slice
7.3.6 Invocation Period
7.3.7 Fault Detection Latency
7.3.8 Recovery Latency
7.4 CONCLUSIONS
8. SYNCHRONIZATION BETWEEN THE OBC AND THE SUPERVISOR
8.1 COORDINATION METHODS
8.1.1 Using One Timer Only
8.1.2 Using Two Timers
8.2 TESTBED
8.3 FUNCTIONAL DESCRIPTION
8.3.1 Address Resolution Protocol
8.3.2 User Datagram Protocol
8.4 ERROR PERFORMANCE OF THE TEST BED
8.4.1 Components of the System
8.4.2 Systematic Errors
8.4.3 Statistical Errors
8.5 EXPERIMENTAL RESULTS
8.5.1 Discussion
8.4 RESYNCHRONIZATION AFTER AN OBC-PROGRAM CRASH
8.5.1 Coordination method 1
8.5.2 Coordination method 2
8.6 COMPARISON OF COORDINATION METHODS
8.7 IMPLICATIONS OF THE EXPERIMENTAL RESULTS
8.8 CONCLUSIONS
9. CONCLUSIONS AND FUTURE WORK
9.1 CONCLUDING OVERVIEW
9.2 ACHIEVEMENTS OF THIS RESEARCH
9.3 FUTURE WORK
REFERENCE
APPENDIX A: LIST OF PUBLICATIONS
APPENDIX B: SOFTWARE CODE
LIST OF FIGURES
Figure 2-1 Trapped Radiation Belts Around Earth, After [Kaya-03]
Figure 2-2 SEE Cross-Section versus LET Curve, After [Asen-98]
Figure 2-3 Relationship between Feature Size and Critical Charge, After [Pete-82]
Figure 2-4 X2000 Data Handling Architecture, After [Chau-99]
Figure 2-5 SNAP-1 System Block Diagram
Figure 3-1 Block Diagram of the Intel Flash Memory, After [Schw-97]
Figure 4-1 TREMOR Block Diagram, After [Angi-04]
Figure 4-2 AIM Architecture, After [Cond-99]
Figure 4-3 FPGA TMR Design, After [Gais-02]
Figure 4-4 Logic Partitioning, After [Carm-99]
Figure 4-5 Logic Duplication, After [Gais-02]
Figure 4-6 Device TMR, After [Gais-02]
Figure 5-1 A Dual Redundant SpaceWire Network, After [Park-03]
Figure 5-2a (Left): High Performance OBDH Architecture, Figure 5-2b (Right): Proposed OBDH Architecture
Figure 5.3a: Proton Environment Measured by the KITSAT-1 Cosmic-Ray Experiment (CRE) at 1330 km Altitude, After [Unde-96]
Figure 5.3b: S80/T Program Memory Upsets at 1330 km Altitude, After [Unde-96]
Figure 6-1: SSTL OBC-386, After [Sstl-05]
Figure 6-2: The OBC Unit
Figure 6.3: SSTL OBC Software Structure, After [Jack-05]
Figure 6-4 Synchronization Protocol
Figure 6-5 Flow Chart for SR Task
Figure 6-6 DAD Packet Format
Figure 7-1 System Model
Figure 7-2 Process States, After [Goor-89]
Figure 7-3 Interface between the OBC and SMCS332
Figure 7-4 8051 Architecture, After [Inte-94]
Figure 7-5 Illustration of Definitions
Figure 7-6: Worst Case Detection Latency
Figure 8-1 Experimental Test Bed
Figure 8-2a Celoxica RC 203 (Rear of the Board)
Figure 8-2b Celoxica RC 203 (Front of the Board)
Figure 8-3a ARP Packet, Ethernet Type is 0x0806
Figure 8-3b UDP Packet, Ethernet Type is 0x0800
Figure 8-4 Left: Coordination method 1, Right: Coordination method 2
Figure 8-5 Flow Chart for Interface FPGA Program
Figure 8-6 Flow Chart for the Supervisor Program
Figure 8-7 Flow Chart for the OBC Program
Figure 8-8 Ethereal Graphical User Interface
Figure 8-9a: Configuration 1
Figure 8-9b: Configuration 2
Figure 8-10 File Transfer Utility
LIST OF TABLES
Table 2.1 Upset Threshold of Microprocessors, After [John-98]
Table 3-1 SEFIs in Flash Memory
Table 3-2 SEFIs in Microprocessors
Table 5-1 Comparison of LANs
Table 6-1 Comparison of the OBC Monitoring Schemes
Table 6.2: Functions of OBC Tasks
Table 6-3 Send Table Entries
Table 6-4 Sent Table Entries
Table 6-5 Record of Previous Recovery Line
Table 6-6 Description of the Flags Field
Table 6-7 Fault Types and Recovery Procedures
Table 6-8 Inputs for Recovery of the OBC after a Screech
Table 7-1 8051 Instructions Details
Table 7-2 Time Sharing vs Overlapped STS
Table 8-1: Experimental Results: Configuration 1
Table 8-2 Results Summary for Coordination Method 1
Table 8-3 Results for Coordination Method 2
ACKNOWLEDGMENTS
First and foremost, I am grateful to the most beneficent, the most merciful God, Who created 
me, and has been providing me with all that is required to live a happy life. I put my faith in 
Him.
I would like to take this opportunity to express my deep thanks to my supervisor, Dr. Craig 
Underwood. I have no doubt that this PhD program has greatly enhanced my professional 
capabilities and his supervision has certainly been one of the most contributing factors.
I am sincerely thankful to Prof. Sir Martin Sweeting for his valuable feedback on my work 
throughout this research program.
Mr. Chris Jackson, Mr. Tim Plant and, in particular, Mr. Adrian Woodroffe are warmly 
thanked for making me familiar with SSTL’s design practices. I am also thankful to Mr. Bary 
Cohn from Celoxica for helping me with the debugging of my Handel-C programs.
My thanks go to the British Government for awarding me an overseas research studentship to 
partially sponsor this PhD. This work was also supported in part by Surrey Space Centre of 
the University of Surrey, and BAE SYSTEMS, Air Systems Warton and the UK Department 
of Trade and Industry as part of the SPAESRANE (Solutions for the Preservation of 
Aerospace Electronics Systems Reliability in the Atmospheric Neutron Environment) project, 
and Government of Pakistan.
Last but not least, I am grateful to my family including my dear parents, who worked hard to 
enable their children to get the best education, my loving brothers and sister, my sweet 
grandmother, who prays daily for my success, and most importantly my husband whose love is a 
treasure of my life, and many other family members and friends.
ABBREVIATIONS
ADC Attitude Determination Control
ALU Arithmetic Logic Unit
API Application Program Interface
ARP Address Resolution Protocol
ASIC Application Specific Integrated Circuit
BER Backward Error Recovery
BIT Built-In Test
CAN Controller Area Network
CAN-SU CAN for Space Use
CMOS Complementary Metal Oxide Semiconductor
COTS Commercial Off The Shelf
CPU Central Processing Unit
CRC Cyclic Redundancy Check
DAD Detection And Diagnosis
DHD Diagnostic Health Data
DMA Direct Memory Access
DMC Disaster Monitoring Constellation
DPRAM Dual Port RAM
DRAM Dynamic Random Access Memory
DS Data Strobe
ECSS European Cooperation on Space Standards
EDAC Error Detection And Correction
EEP Error End of Packet
EEPROM Electrically Erasable Programmable Read-Only Memory
ESA European Space Agency
ESTEC European Space Research and Technology Centre
FCB Frame Control Block
FCT Flow Control Token
FER Forward Error Recovery
FPGA Field Programmable Gate Array
GCR Galactic Cosmic Ray
GEO Geosynchronous Earth Orbit
GPS Global Positioning System
GSFC Goddard Space Flight Centre
GUI Graphical User Interface
HDLC High-level Data Link Control
HST Hubble Space Telescope
IC Integrated Circuit
IDE Integrated Development Environment
I/O Input/Output
IP Internet Protocol
ISR Interrupt Service Routine
JPL Jet Propulsion Laboratory
LAN Local Area Network
LEO Low-Earth Orbit
LVDS Low Voltage Differential Signalling
M2M Metal to Metal
MOS Metal Oxide Semiconductor
MTBF Mean Time Between Failure
NASA National Aeronautics and Space Administration
NIC Network Interface Card
NMI Non Maskable Interrupt
NMR N Modular Redundancy
OBC On-Board Computer
OBDH On-Board Data Handling
OMNI Operating Missions as Nodes of the Internet
ONO Oxide-Nitride-Oxide
OSI Open System Interconnect
OTP One Time Programmable
PAL Platform Abstraction Layer
PDK Platform Developer’s Kit
PSL Platform Specific Layer
ROM Read Only Memory
RS Reed Solomon
SAA South Atlantic Anomaly
SBST Self-Based Self Test
SCOS Spacecraft Operating System
SDRAM Synchronous Dynamic Random Access Memory
SEE Single Event Effect
SEFI Single Event Functional Interrupt
SEL Single Event Latchup
SET Single Event Transient
SEU Single Event Upset
SFR Special Function Register
SOI Silicon On Insulator
SOS Silicon On Sapphire
SOTA State Of The Art
SPE Solar Particle Event
SR State Recovery
SRAM Static Random Access Memory
SSC Surrey Space Centre
SSDR Solid State Data Recorder
SSTL Surrey Satellite Technology Ltd.
TAP Test Access Port
TDMA Time Division Multiple Access
TID Total Ionizing Dose
TMR Triple Modular Redundancy
UDP User Datagram Protocol
UTC Coordinated Universal Time
VLSI Very Large Scale Integration
XTMR Xilinx Triple Modular Redundancy
CHAPTER 1
INTRODUCTION
1.1 Motivation
One of the primary design considerations for a space mission is the survivability of its electronic 
systems in an ionizing radiation environment. In the past, rad-hard technology derived from 
military programmes has dominated the space industry and military applications have had 
significant influence on the direction and capability of the electronic industry. However, 
continued reductions in military budgets, coupled with dramatic growth of the consumer 
electronics industry, have reduced this influence. Market drivers for consumer electronics have 
taken their products and supporting technologies far beyond the capability of specialized military 
technology. Consequently, rad-hard technology lags behind the commercial development by 
about two generations.
The increasingly complex mission requirements of modern spacecraft for military, commercial 
and scientific applications have mandated the need for highly capable microelectronics -  
particularly in the context of their on-board data handling systems. This trend has been 
accompanied by the need to provide affordable, light weight, low volume and low power systems. 
Both these trends have resulted in the need to adopt commercial microelectronics for space 
missions as well as to examine emerging technologies. However, the use of commercial off-the- 
shelf (COTS) technology brings new issues on the reliability in the space environment.
By their very nature, COTS technologies are constantly changing. One of the most significant 
trends is the move towards smaller dimensions -  i.e. “scaling”. Scaling has some benefits for 
radiation effects resilience: As gate oxide thicknesses decrease, hole trapping and interface trap 
build-up are becoming less significant, leading to a trend toward improved total-ionising dose 
(TID) performance [John-98, Dres-98]. Similarly, the increasing use of buried epitaxial substrates 
over the last 20 years [John-00, Oldh-03] has helped decrease single-event latch-up (SEL) 
sensitivity -  both by limiting the charge-collection volume, and by decreasing the substrate series 
resistance [Schr-80, Ocho-81, Dodd-01]. In addition, the concomitant trend towards reduced 
supply voltage levels suggests that device threshold voltages should soon fall below that required 
to sustain latch-up. Indeed, if continued scaling forces the industry to adopt silicon-on-insulator 
(SOI) technology, latch-up should soon be eliminated [John-98, Oldh-03].
However, whilst some radiation effects are becoming less of an issue, new threats have emerged: 
Another significant trend in COTS technology is the move towards the use of “smart” logic, in 
device architectures, e.g. advanced memories, complex micro-controllers and field-programmable 
gate arrays (FPGAs), etc. Here the performance of the logic device requires the inclusion of 
complex control circuitry internal to the die. Whilst transparent to the user, such circuitry may 
offer a significant target to ionising particles and thus be prone to single-event effects (SEEs), 
such as a single-event upset (SEU) or single-event transient (SET). In such circuitry, these can 
manifest themselves as single event functional interrupts (SEFIs), whereby the device exhibits an 
unexpected change in its observable output state [Koga-97].
SEFI reports are increasing in all key COTS device technologies we would wish to use to 
construct cost-effective and capable space systems. Whilst device cross-sections are usually 
relatively low, and thus such errors are not expected to occur all that frequently in the space 
environment, nonetheless the consequences of SEFIs are important for system reliability and 
availability.
Generally, a device affected by a SEFI is not available to the system. Depending on the role of the 
device, this may lead to a disruption in system performance. For instance, a SEFI in an on-board 
computer (OBC) can cause it to lock up or to exhibit continuous exceptions. As the OBC is 
responsible for monitoring spacecraft health and for issuing commands for its safety and for 
performing mission tasks, a SEFI in an OBC may pose a threat for spacecraft survival.
The use of “n”-level redundancy and “lock-step” processing remain powerful tools to mitigate 
such effects [Agra-88], but the additional system overheads that these methods entail (in terms of 
volume, mass and power) are always a problem in spaceflight. Indeed, they are particularly so in 
the context of “micro/nano” space systems such as those designed at Surrey, where the entire 
spacecraft is typically only a few tens of kg in mass [Sstl-05]. Also, it is worth mentioning that 
lockstep conditions for commercial devices must be well thought out. Design for a lock-stepped 
system is quite involved because of strict synchronization requirements of the scheme. In 
particular, the TID degradation of COTS devices must be examined for clock skew effect. This 
may potentially cause "false" triggers if logic devices get out of time synchronization [Labe-96a].
Therefore, to mitigate the effect of SEFIs, simple watchdog timers are generally used to detect such 
an event. For example, at Surrey a payload task is executed by the OBC. One of the functions of
this task is to perform a watchdog handshake with the attitude safety unit. The function of this 
watchdog is to allow checking of the OBC to ensure that it is still operational, and in the event 
that it stops responding, for the other systems to place the spacecraft into a safe state. Once in this 
state, all payload operations will cease and the system will require resetting via ground control. 
Such a situation may lead to long mission down times and is therefore not desirable.
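The watchdog handshake described above can be sketched as follows. This is a minimal illustration under stated assumptions, not SSTL's implementation: the class name `Watchdog`, the timeout value and the explicit `now` timestamps are hypothetical, chosen only to show why a missed handshake latches the system into a safe state until an external reset arrives.

```python
class Watchdog:
    """Toy watchdog: the monitored task must 'kick' it before the timeout expires."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s   # hypothetical handshake deadline
        self.last_kick = 0.0         # time of the most recent handshake
        self.safe_mode = False       # latched once a handshake is missed

    def kick(self, now: float) -> None:
        """Handshake performed periodically by the OBC payload task."""
        if not self.safe_mode:
            self.last_kick = now

    def check(self, now: float) -> bool:
        """Run by the monitoring unit; returns True while the OBC is considered alive."""
        if now - self.last_kick > self.timeout_s:
            self.safe_mode = True    # stays set until a ground reset
        return not self.safe_mode

    def ground_reset(self, now: float) -> None:
        """Models the reset that, in this scheme, only ground control can issue."""
        self.safe_mode = False
        self.last_kick = now
```

Once `safe_mode` is latched, further handshakes are ignored, which mirrors the long downtime the text identifies as the weakness of a simple watchdog.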
1.2 Objectives and Scope
For a COTS based system to be robust for use in a space radiation environment it must be tolerant 
against TID and SEEs. TID is usually not a problem in a low earth orbit (LEO). In a 
geosynchronous earth orbit (GEO) or medium earth orbit (MEO) shielding can mitigate the 
effects. Only orbits within the heart of the inner proton belt are forbidden.
For SEEs, SEU/SET can be mitigated by EDAC and by clock speed control, and single event gate 
rupture (SEGR) by careful choice of device technologies. SEL requires fast-acting over-current 
protection.
However, SEFI remains problematic. This research proposes a SEFI mitigation method, which 
meets the following objectives:
1. It must be cost effective
2. It must promote use of the SOTA microelectronics and standards in a small satellite 
scenario
3. It must deliver increased system availability
A detailed literature review was carried out, looking into the radiation response of COTS data 
handling device technologies, with particular attention paid to their SEFI response. Also, 
currently available mitigation approaches and trends in data handling architectures were 
investigated.
Broadly, a device is likely to exhibit one of the following signatures in a SEFI.
1. Non-responsive device, e.g. a hung microprocessor, a non-responding communication 
interface, a floating gate memory stuck in a read/write loop etc.
2. Error rate higher than expected, e.g. a memory device is likely to show a greatly increased 
error rate as compared to its normal operating conditions, where errors are caused by SEUs, 
and increased exceptions on microprocessors etc.
3. Variations in current consumption of the affected device.
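The three signatures listed above lend themselves to a simple check of observed values against an expected operating envelope. The sketch below is illustrative only: the `Expected` and `Observation` records, their field names and the signature labels are assumptions introduced for this example and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Expected:
    """Hypothetical nominal operating envelope for one OBDH unit."""
    max_errors: int          # highest error count still explained by ordinary SEUs
    current_ma: float        # nominal supply current
    current_tol_ma: float    # tolerated deviation from the nominal current


@dataclass
class Observation:
    """Hypothetical health data gathered for one unit in one window."""
    responded: bool          # did the unit answer within its deadline?
    error_count: int         # errors logged in the observation window
    current_ma: float        # measured supply current


def sefi_signatures(obs: Observation, exp: Expected) -> List[str]:
    """Return the SEFI-like signatures present in one observation."""
    found = []
    if not obs.responded:
        found.append("non_responsive")
    if obs.error_count > exp.max_errors:
        found.append("high_error_rate")
    if abs(obs.current_ma - exp.current_ma) > exp.current_tol_ma:
        found.append("current_anomaly")
    return found
```

An empty list means the unit is behaving as expected; more than one signature may be reported for the same event.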
An OBDH architecture comprises two or more units, where each unit is capable of carrying out a 
distinct function and is comprised of several devices. It is important to note that there can be
physically two or more boards, which work together to perform a particular function (e.g. using 
TMR with three computer boards to act as an OBC in the system). Architectural or system level is 
defined as a combination of two or more heterogeneous units. It is possible to apply mitigation at 
one of the following three levels:
1. Device level: i.e. by using specialized devices to avoid fault occurrence or by 
incorporating fault mitigation within the device. For example use of rad-hard device technologies 
or devices that have some special internal features, e.g. the ERC 32, Leon [Gais-00] etc., 
incorporating redundancy within the device, e.g. through the use of TMR or logic duplication in 
FPGAs.
2. Unit level: i.e. by adding mitigation within the unit to improve overall reliability. For 
example, application of error detection and correction (EDAC) algorithms to protect memory 
cells, TMR and lockstepping for microprocessors, configuration bitstream scrubbing for 
reconfigurable FPGAs etc.
3. System level: i.e. to have a mitigation which has a global (involving more than one unit) 
effect in the data handling architecture.
Provision of device-level mitigation is rarely available 
in true “COTS” components (with the exception of the FPGAs). By definition, a SEFI in a 
reconfigurable FPGA results in huge errors in configuration bitstream and therefore cannot be 
mitigated with internal redundant functional units. Therefore it can be stated that almost all 
mitigations developed for SEFIs to date are targeted at unit level. However, recent efforts to 
place a local area network (LAN) on board and to build spacecraft as nodes connected to the Internet 
are changing the scene. Inclusion of an on-board LAN will allow the spacecraft design to 
benefit from distributed computing. The European Space Agency (ESA)/European Space Research and 
Technology Centre (ESTEC) programme European Cooperation on Space Standards (ECSS) has 
recommended IEEE 1355 (SpaceWire), whilst the National Aeronautics and Space Administration 
(NASA)/Jet Propulsion Laboratory (JPL) X2000 program has adopted the IEEE 1394 (FireWire) 
network standard [Buch-03]. By their nature, LANs have a global effect on the system and 
therefore, a system level mitigation approach becomes a natural choice.
Generally, recovery from a SEFI requires resetting or power cycling of the affected device. This 
recovery requirement demands a two-step procedure. For instance, a SEFI in an OBC will require 
its resetting or power cycling. In order to get the OBC back into operation, its operating system 
and user tasks are required to be reloaded into its volatile main memory. Either the spacecraft 
needs to have a non-volatile storage of these programs or these will need to be uploaded from the 
ground.
By definition, the exact location of the fault in the device is unknown in a SEFI and it can only be 
detected by the malfunctioning behaviour of the affected device. Generally small satellites require 
ground-intervention to recover from events like SEFIs, resulting in long downtimes. On-board 
monitoring is thus desirable for future missions in order to meet increasing demand of maximum 
system availability. This implies that each unit in the data handling architecture should be 
provided with some kind of intelligent monitor to detect SEFI-like signatures as quickly as 
possible. Because of the question of “who guards the guards”, such a monitor must itself be 
rad-hard. This research proposes a solution whereby a rad-hard monitor is provided once within the 
OBDH architecture. The fact that a fast communication between on-board data handling units is 
now available is exploited to make such a scheme favourable. Thus, a system level solution is 
proposed, where a single intelligent supervisor is added to the data-handling network. This 
monitors all the spacecraft systems and checks for unexpected changes or loss in functionality. 
The supervisor needs to be rad-hard; however, this is not onerous given that it is a single unit 
which does not have to perform other processor intensive tasks. All the other sub-systems on the 
OBDH network can be COTS based. In addition to this supervisor, the architecture contains a 
code store to hold set-up and configuration data for underlying units to provide speedy system 
recoveries.
This thesis presents a SEFI-tolerant, reusable and scalable architecture comprising SOTA OBDH 
technologies. Once such an architecture has been developed, future missions can take advantage of 
the adaptability of this scheme. For example, adding a new unit to the system will not require 
inclusion of any further SEFI mitigation hardware. Also, the overheads associated with this 
approach come in the form of the level of complexity of the software running on the supervisor, 
units under observation, and network traffic that can always be traded according to the mission 
requirements.
After proposing a novel SEFI mitigation scheme, the next task was to investigate its feasibility. 
This task was divided into three activities
1. Defining the supervisory protocol. It includes identifying possible fault sources in a unit, their 
fault signatures, ways of detecting these signatures and appropriate recovery against each 
signature observed. In this work monitoring of an OBC with proposed supervisory approach is 
considered. A packet format is defined along with description of the information, which is
required for SEFI detection. It is assumed that the OBC will execute a special task, called the 
detection and diagnosis (DAD) task along with other user tasks. This DAD task will include a test 
sequence, which will be executed by the OBC and the results will be sent to the supervisor along 
with other information in the DAD packet. The Surrey Satellite Technology Ltd. (SSTL)’s OBC 
uses the spacecraft operating system (SCOS), developed by VyTek wireless. SCOS has been part 
of 34 small spacecraft missions [Rash-00]. A detailed study of this operating system was made to 
develop a practical instance of the supervisory protocol.
2. In one supervisory cycle, the supervisor will cycle through all OBDH units. In order to 
produce an estimate for invocation period of the supervisory cycle, supervisor time slice length, 
and detection and recovery latency bound, a system model was prepared and analysed. The 
system model consisted of three units, the OBC, the code store and the supervisor. The OBC was 
heavily based on the existing OBC design of SSTL with the exception that a SpaceWire node 
(SMCSlite from Astrium [Chri-01]) was added to the unit. A SMCS 332 device [Chri-99] was 
chosen to act as the router. Choice of the components was based on the availability of the 
required information for this analysis. The Intel 8051 microprocessor architecture, which represents a 
simple 8-bit microprocessor, was investigated to act as the supervisor and was assumed to be part 
of the router unit. This choice was driven by the desire to investigate the feasibility of using one of the 
simplest and most basic microprocessors as the supervisor. The discussion has been 
extended to take into account the effect of the supervisor monitoring multiple units.
3. In order to establish periodic communication between the OBC and the supervisor, the DAD 
task is required to be invoked at regular intervals and coordination is required between both units. 
Two well-known coordination methods can be used. The first polls the OBC, 
which requires only one timer at the supervisor; the second uses another timer at the OBC to invoke the 
DAD task in addition to the supervisor timer. In the former method, the supervisor is responsible for 
both invoking the DAD task and tracking the deadline for a DAD packet, whilst in the latter 
approach the supervisor only tracks the time-out for the DAD packet. A test bed was developed to 
demonstrate and compare both coordination methods. The test bed consisted of two personal 
computers (PCs) and an FPGA-based board from Celoxica Ltd [Celo-04]. The latter method 
produced better results. As it is based on the concept of two timers at two units on a data network, 
synchronization is required between both timers to avoid false error triggers at the supervisor. 
This synchronization can be achieved either by inclusion of a margin on the supervisor to account 
for the difference between the clocks at the two nodes or by using a common time source such as the SSTL global 
positioning system (GPS) receiver that can provide time to OBDH units on-board [Jack-05].
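The margin mentioned in the third activity can be illustrated with a first-order calculation. The sketch below assumes that each node's oscillator stays within a known tolerance (in parts per million) and that the supervisor restarts its timer whenever a DAD packet arrives; network delay jitter is deliberately ignored. The function names and the tolerance figures are hypothetical and are not taken from the test bed.

```python
def timeout_margin(period_s: float, obc_ppm: float, supervisor_ppm: float) -> float:
    """Worst-case disagreement between the two timers over one DAD period.

    Assumes each oscillator stays within its +/- ppm tolerance, so the two
    clocks can drift apart by at most the sum of the tolerances.
    """
    return period_s * (obc_ppm + supervisor_ppm) * 1e-6


def dad_packet_overdue(elapsed_s: float, period_s: float, margin_s: float) -> bool:
    """True when the supervisor should raise a time-out for the DAD packet."""
    return elapsed_s > period_s + margin_s
```

With a one-second period and two 50 ppm oscillators the margin is 100 microseconds; a packet arriving 50 microseconds late would raise a false trigger without the margin but not with it.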
This combination of an intelligent supervisor using packets transferred on the data network as an 
indicator of health status of all key devices in a unit and system architecture is a novel approach 
to the problem of mitigating complex radiation effects phenomena in COTS devices operating in 
the space environment. The automatic recovery promised by such a scheme will enhance the 
availability of COTS-based equipment on-board spacecraft.
1.3 Novel Contributions
At the close of this research, the following novel contributions have been made.
>  In contrast to the traditional approach of addressing one device technology at a time for 
SEFI mitigation, a global approach has been adopted to monitor heterogeneous OBDH 
units with diverse device technologies to lower overall cost and resource requirements of 
the OBDH architecture. In order to demonstrate the supervisory approach, a novel data 
handling architecture, which is based on SOTA design practices, has been presented. This 
architecture is used to outline different requirements and recommendations associated 
with the supervisory approach.
>  A novel approach of detecting SEFIs has been adopted, where operational parameters of
an OBDH unit are compared against expected values. Any deviation from expected
behaviour is treated as a fault signature.
>  The programmable nature of the supervisor allows multiple recovery procedures. 
Application of a recovery procedure is based on the fault signature observed and previous 
recovery records for that OBDH unit.
>  Computer state recovery techniques have been used for terrestrial applications. However, 
this is not the case for space applications. In particular, the concept of checkpointing
OBC states is new. This technique will not only speed up the system recovery after an
OBC crash but will also enhance the diagnosis capability of the supervisor.
>  Two coordination methods to be used with the supervisory approach have been 
compared on an experimental test bed and it is found that using a timer at the supervisor 
along with a timer at the other OBDH unit in question exhibits better performance, 
compared with a polling approach where there is only one timer at the supervisor, which 
is used to poll the other OBDH unit.
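The idea that a recovery procedure is chosen from the observed fault signature together with the previous recovery records can be sketched as a simple escalation rule. The ladder below (`reset`, `power_cycle`, `flag_for_ground`) and the signature labels are hypothetical placeholders; the actual fault types and recovery procedures are those defined later in the thesis.

```python
# Hypothetical escalation ladder, from least to most disruptive action.
ESCALATION = ("reset", "power_cycle", "flag_for_ground")


def choose_recovery(signature, history):
    """Pick the next recovery action for a unit.

    signature: label of the fault signature just observed.
    history:   list of (signature, action) pairs already applied to this unit.
    """
    tried = [action for sig, action in history if sig == signature]
    for action in ESCALATION:
        if action not in tried:
            return action          # first action not yet tried for this signature
    return ESCALATION[-1]          # ladder exhausted: keep the last resort
```

A programmable supervisor could replace the fixed ladder with a per-unit table without changing the selection logic.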
1.4 Structure of Thesis
Chapter 2 summarises trends in on-board data handling. A brief overview of the space 
environment is given prior to defining radiation effects and their trends. SEEs and TID effects are 
defined and their trends with device scaling and advanced device architectures are described. This 
is followed by a discussion of the data handling device technologies of interest. This chapter also 
covers approaches for constructing OBDH architectures.
Chapter 3 aims at providing an insight into functional interrupts. This chapter summarizes SEFI 
reports from the literature. In each case, a brief introduction to the device architecture is given 
followed by the SEFI signatures observed in different examples and possible reasons for each 
SEFI. The purpose of this chapter is to demonstrate that SEFI signatures assumed in the following 
chapters are in accordance with existing knowledge of SEFIs.
Chapter 4 examines different fault tolerance techniques, which have been adopted to mitigate 
radiation effects in data handling device technologies. This chapter demonstrates that so far SEFI 
mitigation has typically been considered at unit level.
Chapter 5 introduces the proposed architecture and establishes its requirements. It gives a top-level 
description of the system. Chapters 6-8 address the practicality of the proposed scheme. Chapter 6 
details the proposed supervisory protocol. It was not possible to develop the supervisory protocol 
for each unit in the OBDH. Instead the OBC unit is considered as a particular case study because 
of its importance in an OBDH architecture.
Chapter 7 presents latency bound calculations of the scheme. The proposed SEFI tolerance 
scheme is based on the concept of periodic testing, which falls under the category of non-concurrent 
test strategies that trade off between fault-detection latency and performance overheads. The 
scheme requires the OBC to run a special DAD task to support supervisory fault detection and 
diagnosis. In such a case, the DAD task is another process that has to compete with user processes 
for system resources including central processing unit (CPU) cycles, memory and communication 
input/output (I/O). To alleviate the system’s overhead, the DAD task should run in a minimum 
number of clock cycles. Fault detection latency depends on the time interval between two 
consecutive DAD-task executions, as well as the DAD task execution time. The time interval is 
specified as a trade-off between user-program performance and fault detection latency with the 
capability of the system to detect SEFIs. A system model is presented to discuss DAD task
turnaround time, supervisor time slice, invocation period of the supervisory cycle, and detection 
and recovery latencies of the scheme.
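The trade-off described above between fault-detection latency and performance overhead can be made concrete with a first-order estimate. The sketch assumes the DAD task runs periodically (start to start) and that a fault occurring just after one run begins is only caught when the next run completes; network transfer and supervisor processing times are ignored. The function names and figures are illustrative assumptions, not the system model of Chapter 7.

```python
def worst_case_detection_latency(interval_s: float, dad_exec_s: float) -> float:
    """Upper bound on the time from fault occurrence to the end of the next DAD run."""
    if interval_s <= 0 or dad_exec_s < 0:
        raise ValueError("interval must be positive and execution time non-negative")
    return interval_s + dad_exec_s


def cpu_overhead(interval_s: float, dad_exec_s: float) -> float:
    """Fraction of CPU time consumed by the DAD task."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return dad_exec_s / interval_s
```

Halving the interval roughly halves the latency bound but doubles the overhead, which is the tension the chapter analyses.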
Chapter 8 presents an experimental test bed. This test bed consists of three components including 
2 PCs corresponding to the OBC and the supervisor, and an FPGA-based development board to 
act as the OBC interface node, which is connected to the OBC on parallel port and has an 
Ethernet interface with the supervisor. The proposed scheme is a system level approach, therefore 
both the supervisor and the OBC have different clock domains. This chapter explores 
synchronization/coordination requirements of the scheme. Two coordination methods have been 
demonstrated on the test bed. This chapter compares the time required, standard deviation, 
overheads required and capability of getting back into synchronization after an OBC crash for 
both schemes.
Chapter 9 concludes this thesis and outlines future directions.
CHAPTER 2
TRENDS IN ON-BOARD DATA HANDLING
COTS technology is attractive for use in space missions but it is prone to space radiation effects. 
This chapter presents the space radiation environment and its effects on the microelectronics of 
interest. The effects of radiations on COTS technologies have widely been discussed in the 
literature. This chapter summarizes these discussions to outline trends in key data handling 
technologies including memories, microprocessors, FPGAs and LANs. Trends in OBDH 
architectures are also presented.
2.1 The Radiation Environment
One of the major factors that have to be considered when planning a space mission is the ionising 
radiation environment. The interaction of such radiation with electronic devices can cause failure, 
degradation or malfunctioning in their performance. In general, the level of ionising radiation 
encountered by an Earth-orbiting satellite is much greater than that usually found in a terrestrial 
environment. There are three major sources of radiation, which collectively form the corpuscular 
radiation environment.
1. The Van Allen belts or trapped radiation
2. Galactic cosmic rays (GCRs), which consist of interplanetary protons and ionised heavy 
nuclei
3. Solar Particle Events (SPEs) comprising protons and heavy ions
2.1.1 Trapped Radiation Belts
The space flight of a radiation monitor in 1958 showed unusual regions of high counts, which 
Van Allen identified as regions of radiation trapped in the Earth’s magnetic field [Dyer-03]. 
These belts mainly consist of electrons of energies up to a few MeV and protons up to several 
hundred MeV. These can be divided into two belts, an inner belt extending to 2.5 Earth radii and 
comprising both electrons and protons, and an outer belt comprising mainly electrons extending 
out to 10 Earth radii. The Earth's atmosphere removes particles from the radiation belts and low 
Earth orbits (LEO) can be largely free of trapped particles [Dyer-03]. The Earth’s magnetic field 
is not geographically symmetrical; local distortions are caused by an offset and tilt of the 
magnetic axis and by geological influences; one of these distortions is the extension of the inner radiation 
belt to low altitudes in the region of the South Atlantic, which is known as the South Atlantic Anomaly 
(SAA). Spacecraft orbiting in Low Earth Orbits pass through the inner belt in the SAA region, 
and high inclination LEO satellites will pass through electrons of the outer belt near the poles.
Geostationary satellites orbit within the outer-belt, and elliptical orbits cross both inner and outer 
belts [Unde-96].
Figure 2-1 Trapped Radiation Belts Around Earth, After [Kaya-03]
2.1.2 Galactic Cosmic Rays
The particles encountered in space are collectively known as cosmic-rays. They are often 
distinguished according to their source. GCRs are primary cosmic rays that originate outside the 
solar system, but are associated with the galaxy and provide a continuous, low-flux component of 
the radiation environment. They comprise about 85% protons, 14% alpha particles and 1% 
heavier nuclei with energies extending to 1 GeV and beyond. They are partly kept out by the 
Earth's magnetic field and have easier access at the poles compared with the equator [Dyer-03].
2.1.3 Solar Particle Events
In an eleven year cycle, intense solar activities known as SPEs occur inside the Sun leading to a 
greater stream of charged particles being released into space [Asen-98]. These events last up to 
several days and comprise both protons and heavy ions with variable composition from event to 
event. Their energies are slightly lower than galactic cosmic rays. Such events can produce 
significant effects on high inclination or high altitude systems [Dyer-03].
2.2 Basic Mechanisms and Their Trends
Good reviews on effects of scaling on radiation response of microelectronics can be seen in 
References [John-98, Dres-98, Laco-03] and are summarized hereafter.
2.2.1 Single Event Effects
SEEs are the result of free-charge generation in semiconductor materials caused by a single 
ionizing particle strike. Different phenomena may arise depending on the location and the instant 
of the particle strike.
2.2.1.1 LET Cross-Section Curve
The results of ion strikes are usually presented as curves of SEE cross-section versus the ion’s 
effective Linear Energy Transfer (LET) [Pete-92]. The LET or mass stopping power of an ion is 
commonly defined as the rate of energy loss by the ion per unit distance, as it traverses a device. 
The LET threshold is taken to be that which deposits the minimum charge required for upsetting 
the most sensitive node of the device and the saturation cross-section is that cross-section at 
which an increase in ion LET will have no further effect on the upset rate. Ideally, the saturation 
cross-section should approximate the total area of the sensitive regions on the device.
Figure 2-2 SEE Cross-Section versus LET Curve, After [Asen-98] (horizontal axis: Linear Energy Transfer, MeV/(mg/cm²))
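The threshold and saturation behaviour described above is commonly summarised by fitting measured points with a four-parameter Weibull function. That choice of curve is an assumption made here for illustration and is not taken from this thesis; the parameter names and the numbers in the usage note are likewise hypothetical.

```python
import math


def see_cross_section(let, let_threshold, sigma_sat, width, shape):
    """Weibull-type SEE cross-section as a function of LET.

    let:           ion LET, MeV/(mg/cm^2)
    let_threshold: LET below which no events are observed
    sigma_sat:     saturation cross-section, cm^2
    width, shape:  fit parameters controlling how fast the curve rises
    """
    if let <= let_threshold:
        return 0.0                  # below threshold: no upsets
    x = (let - let_threshold) / width
    return sigma_sat * (1.0 - math.exp(-(x ** shape)))
```

For example, with a threshold of 2, a width of 10, a shape of 1.5 and a saturation cross-section of 1e-3 cm², the function returns zero at or below an LET of 2 and approaches 1e-3 cm² for very high LET values.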
All SEEs are most efficiently caused by heavy ions, which have a high LET and thus can cause a 
high specific ionisation (i.e. deposit a significant charge in a small volume). However, high-energy 
(10s-100s MeV) protons and neutrons may also cause SEEs, albeit indirectly, via 
nuclear reactions.
2.2.1.2 Single Event Upset (SEU)
Single-Event Upset (SEU) is a non-destructive single-event-effect. These are soft errors 
associated with changes in the state of memory elements -  i.e. unexpected bit-flips due to particle 
strikes. Re-writing the data will restore the memory states.
As technologies moved towards smaller feature sizes and thus had less charge stored on their 
circuit nodes, it was expected that the critical charge (and thus the LET threshold) for SEU would 
decrease [Ake-95]. Indeed this trend was observed in the 1980’s in the now older generation of 
devices. However, results on more recent technologies, with minimum feature sizes ranging from 
3 µm to 0.35 µm do not follow this trend; in fact, the threshold LET has remained approximately 
constant. This levelling off at a minimum LET has been postulated to be the result of 
manufacturers taking into account the need to maintain a low upset rate from naturally-occurring 
radioactivity (e.g. alpha particles) or atmospheric neutrons during the design and manufacture of 
their parts [John-98, Dres-98]. Recent SEE testing of complementary metal oxide semiconductor 
(CMOS) technologies (0.15 µm, 0.13 µm, and 90 nm) conforms with these predictions and it is 
found that the decreased size of the shrinking latches combined with robust design efforts has 
stabilized the critical neutron cross sections of state of the art static latches. Indeed, the mean time 
between failures (MTBF) calculated for a million logic gate FPGA fabricated with a 90-nm 
technology is better than that of the MTBF of a similar complexity array fabricated in 150- or 
130-nm processes [Lese-05].
Figure 2-3 Relationship between Feature Size and Critical Charge, After [Pete-82]
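The link between critical charge and LET threshold mentioned above can be made concrete with a first-order estimate. The sketch below assumes that all charge generated along a fixed collection depth in silicon is collected (no funnelling or diffusion), uses the usual values of 3.6 eV per electron-hole pair and 2.33 g/cm3 for silicon, and takes a hypothetical critical charge and depth; none of these numbers come from the thesis.

```python
ELECTRON_CHARGE_C = 1.602e-19      # coulombs
SI_DENSITY_MG_PER_CM3 = 2330.0     # 2.33 g/cm^3 expressed in mg/cm^3
EV_PER_PAIR_SI = 3.6               # energy to create one electron-hole pair

def charge_per_micron_fc(let):
    """Charge (fC) liberated per micron of track in silicon for a LET in MeV-cm2/mg."""
    mev_per_cm = let * SI_DENSITY_MG_PER_CM3        # energy loss per cm of track
    ev_per_um = mev_per_cm * 1e6 * 1e-4             # MeV -> eV, cm -> um
    pairs_per_um = ev_per_um / EV_PER_PAIR_SI
    return pairs_per_um * ELECTRON_CHARGE_C * 1e15  # C -> fC

def let_threshold(q_crit_fc, depth_um):
    """LET (MeV-cm2/mg) needed to deposit q_crit_fc over depth_um (simplified model)."""
    return q_crit_fc / (charge_per_micron_fc(1.0) * depth_um)

# Hypothetical node: 20 fC critical charge, 1 um collection depth.
print(f"{charge_per_micron_fc(1.0):.2f} fC/um per unit LET")
print(f"LET threshold ~ {let_threshold(20.0, 1.0):.2f} MeV-cm2/mg")
```

Under these assumptions a unit LET corresponds to roughly 10 fC per micron, so a lower critical charge translates directly into a lower LET threshold.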
2.2.1.3 Single Event Transient (SET)
These are also a class of non-destructive soft-error that can cause changes of logical state in 
combinational logic, or may be propagated in sequential logic through “glitches” on clock or set/ 
reset lines, etc.
So far, it has been observed that the radiation response of logic devices is dominated by errors in 
registers and memory cells -  i.e. SEUs [John-00]. However, as devices are further scaled down to 
smaller feature sizes and faster speeds, soft errors in combinational logic, SETs, are expected to 
become more probable. In contrast to SEUs, which do not show frequency dependence, SETs 
depend significantly on the operating speed of the devices in question [Dres-98, John-00]. 
According to models developed by Shivakumar et al., soft errors in combinational logic are likely 
to become comparable to SEUs in unprotected memory elements by the year 2011 [Shiv-02].
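The frequency dependence of SETs noted above can be illustrated with a simple window-of-vulnerability estimate: a transient is only captured if it overlaps the latching window of a clocked element, so the capture probability grows with clock rate. This first-order model and its parameter names are assumptions made for illustration and are not taken from the cited references.

```python
def set_latch_probability(pulse_width_ns, clock_mhz, latch_window_ns=0.0):
    """Rough probability that a transient of the given width is latched.

    Assumes the transient arrives at a uniformly random time within the clock
    period and is captured if it overlaps the latching (setup/hold) window.
    """
    period_ns = 1000.0 / clock_mhz
    return min(1.0, (pulse_width_ns + latch_window_ns) / period_ns)

# Hypothetical 0.5 ns transient at increasing clock rates.
for mhz in (50.0, 100.0, 200.0, 1000.0):
    print(f"{mhz:7.1f} MHz -> P(latch) = {set_latch_probability(0.5, mhz):.3f}")
```

In this simplified picture, doubling the clock frequency doubles the chance that a given transient becomes a stored error, whereas an SEU in a storage element is counted regardless of clock rate.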
2.2.1.4 Single Event Latch-Up (SEL)
Single Event Latch-Up (SEL) is a potentially destructive single event effect. It can occur due to 
the presence of parasitic bipolar transistors inherent in CMOS structures, and may lead to an 
effective short circuit through the device. There are two types of SEL: In traditional or destructive 
SEL, the device current consumption increases immediately above the maximum specified for the 
device, causing permanent damage to the device through the burn-out of bond-wires, etc. The 
other type is known as a micro-latch-up, and during such an event, the device current increases 
above its normal operating level but not above the maximum specified. Micro-latch-up halts 
device operation and a power reset is required to recover the device. The device may still operate 
but shows signs of permanent damage, such as exhibiting a step-change in its current 
consumption. Alternatively the damage may not be immediately apparent. Sometimes, a series of 
micro-latch-up events in quick succession can lead to destructive SEL.
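The distinction between the two latch-up types can be expressed as a simple comparison of the measured supply current against the nominal and maximum specified values. The following is a minimal sketch under stated assumptions: the 20% margin above nominal, the function name and the example currents are all hypothetical, and a real supervisor would also need filtering against normal load variations.

```python
from enum import Enum

class CurrentState(Enum):
    NOMINAL = "nominal"
    MICRO_LATCHUP = "possible micro-latch-up"
    DESTRUCTIVE_SEL = "possible destructive SEL"

def classify_current(measured_ma, nominal_ma, max_rated_ma, margin=0.2):
    """Classify a supply-current sample (mA) against nominal and maximum ratings."""
    if measured_ma > max_rated_ma:
        # Above the maximum specified: treat as traditional (destructive) SEL.
        return CurrentState.DESTRUCTIVE_SEL
    if measured_ma > nominal_ma * (1.0 + margin):
        # Elevated but within the rating: consistent with a micro-latch-up.
        return CurrentState.MICRO_LATCHUP
    return CurrentState.NOMINAL

# Hypothetical device: 100 mA nominal, 250 mA maximum rating.
for sample in (95.0, 160.0, 400.0):
    print(sample, "mA ->", classify_current(sample, 100.0, 250.0).value)
```

Either non-nominal state would normally call for power removal, since the text notes that repeated micro-latch-ups can lead to destructive SEL.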
As device technology has evolved, buried epitaxy technology has replaced bulk substrates. 
Epitaxial substrates can decrease latch-up sensitivity, both by limiting the charge-collection 
volume and by decreasing the substrate series resistance [Schr-80, Dodd-01]. As supply voltages 
reduce, it is expected that they will soon fall below the required latch-up holding 
voltage, and so SEL susceptibility should diminish. Indeed, if continued scaling forces industry to 
adopt silicon-on-insulator (SOI) technology at some point, latch-up will be eliminated [John-98, 
Oldh-03a].
However, the long-term trend for SEL-sensitivity is more difficult to discern than for SEU or 
SET. For example, recent 0.15 µm SRAM technology has shown increased latchup susceptibility 
[Page-05].
2.2.1.5 Single Event Functional Interrupt (SEFI)
One of the most significant trends in COTS technology is the move towards device architectures 
with in-built “smart logic” -  that is for devices which have internal operations transparent to the 
user. Smart logic requires the inclusion of more control circuitry internal to the die. For example, 
flash memory devices contain a state machine that generates the signals needed to perform 
read/write operations. Many dynamic random-access memory (DRAM) devices, microprocessors 
and field-programmable gate-arrays (FPGAs) have a built-in self-test capability, which again 
increases device complexity. Many DRAMs include redundant memory rows and columns 
intended to increase device reliability. When the DRAM is powered up, weak or bad rows or 
columns are replaced with their redundant counterparts and a redundancy-latch has to be used to 
hold the device configuration. Many synchronous-DRAM (SDRAM) devices also have internal 
state machines that implement pipelines, programmable refresh modes and various power states. 
In most of these cases, this control logic is not directly accessible to the user.
Experience from spaceflight and ground-based ion-beam testing of many COTS devices reveals 
that this internal control circuitry is susceptible to radiation effects, often manifesting itself as the 
class of single event effects (SEEs) called single event functional interrupts (SEFIs), whereby an 
internal upset in this logic causes an unexpected change in the observable state of the device. 
SEFI was first mentioned in 1996 [Eia-96, Koga-97], although SEFI-like signatures had been 
reported earlier [Koga-85, Vela-92, Labe-92]. Early reports were confined to microprocessor 
SEFIs; however, emerging data handling devices, such as advanced memories and FPGAs, have 
also been found to be susceptible.
2.2.1.6 Single Event Hard Error (SHE)
Some memory devices have been seen to exhibit single-event hard-error (SHE) in ground-based 
ion-beam tests -  particularly when under heavy-ion high-LET particle bombardment [Koga-91, 
Dufo-92]. These errors lead to the permanent fixing of the state of one or more bits in the device 
-  i.e. the bit becomes stuck at one or stuck at zero and cannot be over-written as in a conventional 
SEU. Here the failure mechanism would appear to be one of total-dose damage (see Section 
2.2.3) but at the level of an individual circuit element, due to energy deposited by a single ion 
strike. Smaller feature sizes seem to be a risk factor for such phenomena.
2.2.1.7 Single Event Gate Rupture (SEGR)
Single event gate rupture (SEGR) [Fisc-87] is caused when a heavy ion passing through an 
insulator under high field conditions leads to the catastrophic breakdown of the insulator with a 
consequent thermal runaway condition. Such events may occur in the gate dielectric of 
non-volatile SRAM or EEPROM during a write or clear operation. The increasing use of such 
technology in data handling systems means that SEGR is an increasing risk factor in COTS 
systems.
2.2.2 Multiple Bit Upset (MBU)
Sometimes, more than one memory cell can be corrupted by the charge deposition originating 
from the same particle. Such events are characterized as single-event multiple-bit upsets (MBUs). 
These errors are mainly associated with memory devices, although any register is a potential 
target. MBUs, which have been observed during ground-based ion-beam testing and during 
spaceflight, may be attributed to three mechanisms:
• A single particle strike affects more than one bit in one or more words;
• Coincidental, independent SEUs (from different particle strikes) appear as an MBU;
• A single SEFI event causes MBUs (termed as block SEFIs).
The second mechanism is not in fact a true MBU, but is instead an artefact of the diagnostic 
algorithm and the underlying (usually high) SEU rate. Both SRAMs and DRAMs are prone to 
MBUs, with DRAMs generally having a higher probability of such errors. MBU rates are 
expected to increase with device scaling, as device scale lengths are now comparable with the 
thickness of the charge-track left by an ionizing particle (circa 100 nm diameter). However, in 
some cases, less dense devices have been found to show higher MBU rates than more dense 
devices [Unde-96]. On the other hand, reports of block SEFI events are increasing. Floating gate 
memories, which do not exhibit traditional MBUs, are susceptible to this class of MBUs.
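A diagnostic algorithm of the kind referred to above typically compares each memory word read back with its expected value and counts the flipped bits. The sketch below is a hypothetical illustration (the function name and data are not from the thesis). It also shows why the second mechanism is an artefact: two independent SEUs landing in the same word between scans are indistinguishable from a true MBU at this level.

```python
def classify_word_errors(expected, observed):
    """Return {address: 'SEU' or 'MBU'} for every word that differs from its expected value."""
    result = {}
    for addr, (exp, obs) in enumerate(zip(expected, observed)):
        flipped = bin(exp ^ obs).count("1")  # number of bit positions that differ
        if flipped == 1:
            result[addr] = "SEU"
        elif flipped > 1:
            result[addr] = "MBU"
    return result

# Hypothetical 8-bit test pattern and read-back data.
expected = [0x00, 0xFF, 0xAA, 0x55]
observed = [0x01, 0xFF, 0xA9, 0x55]
print(classify_word_errors(expected, observed))
```

A block SEFI would instead show up as a long run of consecutive erroneous addresses (a whole row or column), which would need to be detected by looking at the pattern of addresses rather than at individual words.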
2.2.3 Total Ionising Dose Effects
Total ionising dose effects in semiconductor devices depend on the creation of electron-hole pairs 
within dielectric layers and subsequent generation of traps at or near the interface with the 
semiconductor or of trapped charge in the dielectric. This can produce a variety of device effects 
such as flat-band and threshold voltage shifts, surface leakage currents, and speed degradation of 
the devices. Many current COTS parts are very susceptible to total dose damage, and may fail at 
total-dose levels of 5 krad(Si) or even less. Total dose is therefore an issue in spaceflight where 
long mission lifetimes, and/or exposure to the Van Allen radiation belts means that these dose 
levels can be easily exceeded.
As commercial devices are scaled to smaller feature sizes, so gate oxide thicknesses are also 
decreasing. The threshold voltage shift that results from hole traps in gate oxides decreases as the 
square of oxide thickness, assuming that the hole trapping efficiency is unchanged as the oxide 
thickness is reduced [Ma-89]. It has been shown that hole trapping will reduce further because of 
tunnelling as oxides are scaled below 10 nm [Saks-84]. Thus, for the highly scaled devices, hole 
trapping and interface traps are not likely to remain major concerns for devices operating in a 
radiation environment. However, it should be noted that even in highly scaled devices there 
would be isolation dielectrics, which are still relatively thick. Thus it is the isolation/field oxides 
that dominate the total dose response of commercial devices [Unde-96, John-98, Dres-98].
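The square-law dependence of the gate-oxide threshold voltage shift quoted above can be written out as a short worked example. This is an illustrative sketch only: the proportionality constant k is not given in the source and is assumed to be the same for both oxide thicknesses (i.e. unchanged hole trapping efficiency), and the thicknesses are hypothetical.

```latex
\Delta V_{th} = k\, t_{ox}^{2}
\quad\Rightarrow\quad
\frac{\Delta V_{th}(t_2)}{\Delta V_{th}(t_1)} = \left(\frac{t_2}{t_1}\right)^{2},
\qquad
\text{e.g. } t_1 = 20\ \text{nm},\ t_2 = 10\ \text{nm}:\quad
\left(\frac{10}{20}\right)^{2} = \frac{1}{4}
```

Halving the gate oxide thickness would therefore reduce the shift due to trapped holes to a quarter, before any additional reduction from tunnelling is taken into account.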
2.3 Trends in Logic Devices
This section presents a review of radiation issues in SOTA and emerging microelectronics.
2.3.1 Memories
2.3.1.1 DRAMs
DRAM technology has grown rapidly in recent years. The first DRAM, introduced in the early 
1970s, contained only 1,024 bits, but modern DRAMs are available containing 256 megabits or 
more. Large DRAMs incorporate a controller to provide parallelism in operations and hence an 
increased memory throughput. Their high density and superior performance characteristics make 
them an attractive candidate for on-board data recording to fulfill increasingly higher mission data 
storage requirements. In addition, recent technology has permitted the stacking of dies into single 
units providing even denser packaging of DRAMs into a single device. However, the use of 
DRAMs for space applications does not come without certain penalties. DRAMs have historically 
been considered as devices very sensitive to SEUs, since circuit errors were first observed on these 
components back in the seventies. Their high sensitivity to ionising particle strikes comes from 
their characteristic of passively storing bit information as charge stored in a circuit node. 
Although the amount of charge stored decreases steadily from one technology generation to the 
next, unexpectedly, the error rate has been found to decrease or stay constant [Mass-96, Facc-00, 
John-00]. However, the observation of non-traditional SEEs, e.g. SEFIs, raises new concerns over 
the reliability of these devices in radiation environment.
Both the Cassini and Hubble Space Telescope (HST) missions have used solid state data 
recorders (SSDRs) made from DRAM [Labe-98]. Observed SEE results include the standard
single bit upset (SEU) errors, stuck bits, MBUs, and column or row errors (where a single ion 
strike induces a partial or full address column or row to be in error in a single device) [Labe-98, 
Swif-01]. LaBel termed column and row errors as block SEFIs [Labe-96b]. Ground-based 
ion-beam testing of DRAMs has also shown standard SEFIs (e.g. with the device entering test or 
standby modes) in addition to above-mentioned errors [Labe-96b, Koga-97].
It should be noted that advanced DRAM architectures are also emerging. An example of an 
advanced DRAM is Synchronous DRAM (SDRAM) that can provide even more storage capacity 
on a chip along with an increase in access speed. These devices have underlying characteristics of 
stuck bits and MBUs. Increased levels of complexity in these devices lead to higher susceptibility 
to SEFIs [Rodg-01].
2.3.1.2 SRAMs
These devices are susceptible to SEUs, generally with quite low thresholds (often below 1 
MeV-cm2/mg) [Coss-99]. They are also subject to MBUs. However, their MBU rates are 
significantly smaller than their SEU rates. There is no evidence from in-orbit observations that the 
denser devices are necessarily more prone to MBUs than the older devices [Unde-96, Unde-99] 
although ground based tests suggest this. Stuck bits are sometimes observed, but only during 
irradiation with particles of high LET (i.e. heavy-ions) [Facc-00]. Newer generations of SRAMs,
using a 6T cell design, are expected to have an improved SEU and total dose response [Leli-96,
Poiv-98, Facc-00].
2.3.1.3 Flash Memories
Flash memory technology comes with advantages in terms of density and non-volatility. It 
provides fast read access, comparable to that of DRAMs. However, erasing and writing new data 
in flash memories is a more complex operation, requiring the application of high voltage. Newer 
flash devices use complex internal architectures to improve write and erase times, as well as to 
make device operation more transparent to the user.
There are two types of flash architecture: NOR and NAND. Whilst both architectures have the 
same basic storage element, it is the interconnection of these memory cells that distinguishes their 
structure. In the NOR flash memory array, the bit line logic goes to “0” if any of the memory cell 
transistors is on. In the NAND flash memory array, all memory cell transistors are required to be 
on for bit line to go to “0” state. The NAND architecture is more sensitive to radiation [Miya-03].
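The bit-line behaviour of the two architectures described above can be captured in a few lines. This is a purely logical sketch of the statement in the text (cells in parallel for NOR, in series for NAND); the function names are hypothetical and no electrical detail is modelled.

```python
def nor_bit_line(cells_on):
    """NOR array: the bit line goes to 0 if any cell transistor on it conducts."""
    return 0 if any(cells_on) else 1

def nand_bit_line(cells_on):
    """NAND array: the bit line goes to 0 only if every cell transistor in the string conducts."""
    return 0 if all(cells_on) else 1

string = [True, False, True, True]   # hypothetical conduction states of four cells
print("NOR :", nor_bit_line(string))   # one conducting cell is enough to pull the line low
print("NAND:", nand_bit_line(string))  # a single non-conducting cell keeps the line high
```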
Radiation testing has been performed on Intel, Samsung, Toshiba, AeroFlex, and SanDisk Flash 
memory devices [Schw-97, Nguy-99, Nguy-02]. The observed SEEs are dominated by errors in 
their complex internal architectures rather than in the non-volatile storage elements. Test results 
have shown that flash devices are immune to SEUs when powered off. The probability of getting 
an upset is particularly high during write mode. Static and dynamic read modes have also 
demonstrated complex control-related upsets, where the functional consequences of these errors 
can be multiple. Sometimes, a steep increase in the current consumption is observed during or 
after irradiation.
2.3.1.4 EEPROMs
Electrically Erasable Programmable Read-Only Memory (EEPROM) provides non-volatile 
storage like Flash memories. The principal difference between the two technologies is that an 
EEPROM requires data to be written or erased one byte at a time whereas flash memory allows 
data to be written or erased in blocks. This makes flash memory faster. Radiation test results on 
EEPROMs are presented in [Koga-97, Labe-97]. Similar to flash memories, the devices are found 
to be more susceptible during write mode. SEFI has been observed in these memories as well.
2.3.2 Microprocessors
The most complex type of digital logic device used in data-handling systems is the 
microprocessor. Using commercial processors for space missions brings potential advantages in 
terms of cost and design time reduction. These devices are accompanied by many COTS software 
applications, development tools, and operating systems. Processors such as the 80Cx86 family 
and the Power PC are being flown in space despite being susceptible to ionizing radiation. These 
processors are not radiation hardened and have been found to be susceptible to SEUs.
Microprocessors work by continuously moving data between registers and the memory unit, with 
arithmetic and logic unit operations, flag register updates, etc. It is unlikely that all sections 
will be in use during the processing of a program. The application software, which is being 
executed, determines how many sections are being used at any given time. Moreover, each 
section might have a different sensitivity to SEU. Therefore, the SEU-induced effects in 
microprocessors are strongly application dependent. Many commercial processors have been 
tested to evaluate their suitability for space applications. These include the 80x86 architectures, 
SPARC, Motorola 68020, Pentium, Alpha, and PowerPCs [Estr-93, Mora-95, Labe-96b, Howa-01, 
Seif-02, Irom-02]. All devices experienced some type of SEFI or device lockups at fairly low
threshold LETs and authors have suggested that reliable operation in the radiation environment 
will require at least detection and reset capability.
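A common way to provide the "detection and reset capability" mentioned above is a watchdog timer: the processor must signal that it is alive at regular intervals, and a reset is issued if it fails to do so. The sketch below is a hypothetical software model (class name, timeout and reset action are all assumptions) with time passed in explicitly so that the behaviour can be followed step by step; in flight hardware the watchdog would normally be an independent circuit.

```python
class Watchdog:
    """Minimal watchdog model: calls reset_action if no kick arrives within timeout_s."""

    def __init__(self, timeout_s, reset_action):
        self.timeout_s = timeout_s
        self.reset_action = reset_action
        self.last_kick = 0.0
        self.resets = 0

    def kick(self, now):
        """Called by the monitored processor to show it is still running."""
        self.last_kick = now

    def poll(self, now):
        """Called by the supervisor; returns True if a reset was issued."""
        if now - self.last_kick > self.timeout_s:
            self.reset_action()
            self.resets += 1
            self.last_kick = now  # restart the timeout after the reset
            return True
        return False

log = []
wd = Watchdog(timeout_s=1.0, reset_action=lambda: log.append("reset"))
wd.kick(0.0)
print(wd.poll(0.5))   # processor still within its deadline
print(wd.poll(1.6))   # no kick since t=0: lockup assumed, reset issued
print(log)
```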
One result, which was certainly unexpected, is that the threshold LET has essentially not changed 
over many years [John-98]. Table 2.1 compares threshold LET of different 
microprocessors.
Table 2.1 Upset Threshold of Microprocessors, After [John-98]
Device          Manufacturer   Feature Size (approx)   Threshold LET (MeV-cm2/mg)
Z-80            Zilog          3 µm                    1.5-2.5
8086            Intel          1.5 µm                  1.5-2.5
80386           Intel          0.8 µm                  2-3
68020           Motorola       0.8 µm                  1.5-2.5
LS64811         LSI            1.2 µm                  2-2.5
90C601          MHS            1.2 µm                  2-2.5
80386           Intel          0.6 µm                  2-3
PC603e          Motorola       0.4 µm                  1.7-3
Pentium         Intel          0.35 µm                 2-3
PowerPC 750     Motorola       0.25 µm                 2-2.5
PowerPC 7455    Motorola       0.18 µm                 1
2.3.3 Field Programmable Gate Arrays (FPGAs)
This technology offers a number of advantages including a highly compact solution, high integrity, 
flexibility, reduced cost, faster and cheaper prototyping, and reduced lead-time before flight. The 
capacity and performance of FPGAs suitable for space flight have been increasing steadily for 
more than a decade. The application of FPGAs has moved from simple glue logic to complete 
platforms that combine several real-time system functions on a single chip [Gais-02].
The choice of the type of storage for the configuration drives the radiation performance of the 
device [Katz-97]. FPGAs can be split into two categories: namely re-programmable and one time 
programmable (OTP). Reprogrammable technology offers volatile SRAM or non-volatile 
EEPROM/Flash cells to hold the device configuration. SRAM-based technology has been 
evolving at a faster pace than OTP technology, and now features a million system gates or more
on a single chip, and hence has become an attractive choice for high performance applications. 
However, one must be aware not only of radiation issues, but also of concerns such as the total 
loss of configuration from single ion hit. OTP technology on the other hand offers radiation 
performance comparable to application specific integrated circuits (ASICs). The most commonly 
used OTP FPGAs use antifuses for storing their configuration, either using oxide-nitride-oxide 
(ONO) or metal-to-metal (M2M) antifuse structures [Katz-99].
FPGAs are available at different reliability levels, ranging from inexpensive COTS devices to 
high reliability, rad-hard devices, as well as different device speeds, capabilities, and 
configurations. Both Actel and Xilinx have been offering radiation tolerant devices for space 
applications. The Actel OTP devices have been the primary FPGA technology used in space 
flight systems to date. Actel’s ACT1, ACT2 and ACT3 have been very popular among space 
designers [Katz-97]. Commercial SX and SX-A series have also been reported as being capable 
of tolerating total doses as high as 100 krads(Si)* or more. These devices are latch-up immune as 
well. The flip-flop element of the basic device is considered as being radiation “tolerant”; 
enhanced radiation hardened levels can be achieved via incorporation of software or hardware 
implemented mitigation strategies [Katz-99]. Although ONO antifuses are susceptible to SEGR, 
the threshold is found to be quite high [Katz-97, Cron-98]. The probability of an SEGR event is 
low and one rupture does not cause permanent damage to the device. Typically it would take at 
least 10 ruptures to cause a failure. Actel introduced products using a M2M antifuse. It is reported 
that this antifuse technology is immune to SEGR effect to a currently tested LET threshold of 80 
MeV-cm2/mg [Cron-98].
Actel Corporation offers two classes of rad-hard FPGAs, RH1020 and RH1280 with equivalent 
gate densities of 2,000 and 8,000 respectively. These products are latchup immune and can 
withstand total dose in excess of 300 krads (Si). Their SEU performance is predicted to be 10^-6 
upsets per bit-day in a 90% worst-case geosynchronous earth orbit. Actel proposes mitigation 
techniques to improve SEU performance of these devices by incorporating TMR or by avoiding 
flip-flops in the sequential units [Acte-97]. The adaptation of one of these techniques is 
recommended for critical design sections.
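The triple modular redundancy (TMR) technique referred to above stores each value three times and takes a majority vote, so that an upset in any single copy is masked. A minimal sketch of the bitwise voter is given below; it illustrates the general principle only and is not Actel's specific implementation, and the register values are hypothetical.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote of three redundant copies of a register value."""
    return (a & b) | (a & c) | (b & c)

good = 0b1010
upset = good ^ 0b0100          # one copy suffers a single bit-flip
print(bin(tmr_vote(good, good, upset)))   # the two intact copies out-vote the upset one
```

The vote fails only if the same bit position is corrupted in two of the three copies, which is why the technique is usually combined with periodic refreshing of the stored values.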
Another rad-hard FPGA, the RHAX250-S, is to be launched in a couple of years [Pate-04]. This 
device technology offers a gate count of 250,000 with even better radiation performance. In 
addition to these rad-hard products, Actel offers radiation tolerant devices such as RTAX-S. This 
device technology has a TID response comparable to the RH1020 and RH1280 devices, and latchup 
and SEU immunity to 104 and 60 MeV-cm2/mg respectively [Acte-05].
* The rad is still often used as a unit for ionizing dose even though it is not an SI unit. The SI unit is the Gray:
1 Gy = 1 J/kg = 100 rad, so 1 krad = 10 Gy.
Xilinx offers the XQR series for space applications. This series is a subset of their commercial 
XC40000XL family. Although these devices have demonstrated acceptable total dose and latch- 
up performance for many space applications, they are susceptible to single event effects including 
SEFIs [Full-00, Gais-02]. Work is in progress at Xilinx for the development of the single event 
immune reconfigurable FPGA (SIRF) [Bogr-05].
SEUs in reprogrammable FPGAs can be grouped into three categories [Gais-02]:
Configuration upsets are defined as the upsets in the configuration memory of the device and can 
be detected by read-back. The likelihood of failure will depend upon the upset location and the 
specific design utilisation of the device resources. A configuration upset can sometimes lead to 
high current states. For example, a SEU may cause two output drivers to be connected together 
[Wang-99].
User logic upsets represent upsets in the circuit elements that are not directly accessible by read- 
back. The effect of these is dependent upon the particular logical design implemented by the user.
System upsets are those upsets that occur in the control circuitry of the FPGA. As an example, it 
is possible for a single upset in the configuration control circuitry to change many configuration 
bits simultaneously [Carm-01]. This kind of anomaly can only be detected by noting the 
malfunction of the device.
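Of the three categories above, only configuration upsets are detectable by read-back. The idea can be sketched as a comparison of the read-back configuration against a stored reference ("golden") image, followed by rewriting of any frames that differ. The code below is a hypothetical illustration: the frame representation and function names are assumptions and do not correspond to any vendor's configuration interface.

```python
def find_configuration_upsets(golden, readback):
    """Return the indices of configuration frames that differ from the golden image."""
    if len(golden) != len(readback):
        raise ValueError("golden image and read-back must have the same number of frames")
    return [i for i, (g, r) in enumerate(zip(golden, readback)) if g != r]

def scrub(golden, readback):
    """Return a repaired copy of the configuration and the list of frames rewritten."""
    upset_frames = find_configuration_upsets(golden, readback)
    repaired = list(readback)
    for i in upset_frames:
        repaired[i] = golden[i]   # rewrite only the corrupted frames
    return repaired, upset_frames

# Hypothetical 4-frame configuration with one corrupted frame.
golden = [b"\x10\x20", b"\x30\x40", b"\x50\x60", b"\x70\x80"]
readback = [b"\x10\x20", b"\x30\x41", b"\x50\x60", b"\x70\x80"]
repaired, rewritten = scrub(golden, readback)
print("frames rewritten:", rewritten)
```

User logic upsets and system upsets would pass this check unnoticed, which is consistent with the text's remark that system upsets can only be detected by observing the malfunction of the device.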
2.3.4 Data Handling Networks
Commercial data handling networks are of increasing interest for space missions. Results on 
radiation testing of both SpaceWire and Firewire have been published in [Sied-02, Buch-03, 
Buch-05]. Broadly, errors can be divided into two categories: soft errors, which do not disrupt 
data transmission across the network, and hard errors or SEFIs, which stop communication and 
are most likely to be caused by an upset in protocol registers or control logic.
2.4 Trends in OBDH Architectures
Space missions typically require substantial computing resources. Mission-supporting computer 
systems include those computers on-board the spacecraft, as well as those on the ground.
Traditionally, the OBDH subsystem for a space mission consists of a computer or series of 
computers programmed to perform the following tasks:
• Process and store payload data.
• Gather and process routine housekeeping data, such as health data of different 
subsystems.
• Process and carry out instructions from ground controllers.
Broadly, space missions can be divided into two categories: manned and unmanned. Unmanned 
spacecraft computers differ from manned spacecraft computers in that they are designed to work 
much longer and use fewer spacecraft resources. Clearly, these circumstances cause different 
requirements for flight computers. This thesis focuses on mitigations, which are more suitable for 
the unmanned missions. Manned missions provide a different order of risk, which may well 
require additional robustness.
Of the two types of unmanned spacecraft, one is designed for earth orbit operations and the other 
flies to the Moon, planets, or deep space. Earth orbiters usually need no navigation after 
achieving orbits; space probes, however, are critically dependent on proper guidance. Earth 
orbiters can be commanded nearly instantaneously from the ground during the roughly 10% of 
the time they are visible to ground stations. Interplanetary probes need to be autonomous or at 
least capable of independent routine operation, due to longer periods out of Earth control. 
Therefore, the basis of fault handling on an interplanetary probe is failure detection and repair. 
Although desirable for earth orbiters, autonomous fault detection and recovery is less critical. 
Generally, the earth orbiters concentrate on ‘safing’ the spacecraft until the ground stations can 
help out [Toma-05].
There are two extreme approaches to the design of spaceborne computer systems. One approach 
favours the use of highly reliable, large central data processors to control and process several 
subsystems on the entire spacecraft. The opposing approach favours the development of small yet 
flexible, low power, dedicated processors. A wholly centralized computer solves many of the 
information exchange and coordination problems, but creates another in the number and diversity 
of its interfaces with all the sensors and it poses a big challenge to fault-tolerant computer and 
system design. A compromise between the two extremes, which seems to combine their 
advantages, is a distributed system, where information processing is done at various levels. The 
level at which a particular process takes place is almost entirely a function of the sophistication, 
reaction time and bandwidth requirements associated with a process [Nasa-05].
As the number of science missions per year is increasing while individual mission budgets are 
smaller, recent developments are designed to enable proven computer systems and techniques to 
fly or support more than one mission, reducing the cost associated with customized solutions. 
Also, there is a continuing reliance on multiple small computers operating in a network as 
opposed to large single computers [Toma-05]. This philosophy is reflected in deep space system 
technology program (DSSTP), which is managed at the JPL and is also called X2000 [Woer-98]. 
Each X2000 delivery will receive its requirements from a set of planned missions. For example, 
its first delivery is aimed at these five missions: Pluto/Kuiper express, Europa orbiter, solar 
probe, DS4/Champollion, and Mars sample return. The X2000 system architecture is distributed, 
scalable, and fault-tolerant. Among the COTS products deployed in this architecture, the use of 
commercial bus standards is the highest payoff application. This is due to the fact that unlike a 
system component or subsystem, a bus interface has a global impact on system cost and 
capability. The IEEE 1394 (also called Firewire) and I2C buses have been selected to act as 
system buses. The system architecture is shown in Figure 2-4.
Figure 2-4 X2000 Data Handling Architecture, After [Chau-99]
The University of Surrey (Surrey Space Centre -  SSC) and SSTL have been among the pioneers 
of the “smaller, faster, cheaper” design philosophy for space missions. Surrey’s satellites are 
largely based on COTS products and standards. In order to cope with the risk associated with 
COTS technology, Surrey relies mainly on sound design practices (application of appropriate 
mitigations etc.) and experience gained from previous missions.
In terms of physical architecture, Surrey’s spacecraft similarly use a COTS bus to link elements 
of the spacecraft OBDH network together, in this case using the two-wire Controller-Area 
Network (CAN) bus. Figure 2-5 shows a typical configuration -  in this case as adopted for the 
SNAP-1 Nano-satellite [Maqb-04]. The OBDH architecture can have multiple computers (OBC 
and payload computers); however, distributed computing is not supported. The processors and 
memories are not necessarily identical since they are assigned different tasks to perform. They 
can communicate on CAN, which makes each computer look like an I/O device to the other.
Figure 2-5 SNAP-1 System Block Diagram
2.5 Conclusions
Developments in COTS device technology reveal a trend towards improved total ionising dose 
and latch-up response. The SEU sensitivity of many commercial technologies appears to be close 
to the limit imposed by alpha particles generated by naturally occurring radio-isotopes. Although 
this in itself results in a very low threshold LET, it seems at present that this is unlikely to get 
worse as devices are scaled further. However, the probability of getting complex failure modes 
(SEFIs) is growing with increased device complexity.
It is important to note that advanced memories including floating gate memories, DRAMs and 
SDRAMs are found to be prone to SEFIs. Microprocessors are considered to be the most complex
of logic systems. Although susceptible to single event effects, commercial microprocessors have 
successfully been deployed to space missions with careful mitigations.
Although OTP FPGAs present good radiation characteristics, there is growing interest in 
reprogrammable FPGAs because of their higher gate count and reprogrammable nature. Single 
event effects, in particular SEFIs, present a serious challenge to logic designers. OTP technology 
itself is not prone to SEFI-like errors; however, an application running on an OTP FPGA can still 
suffer from SEFIs. For example, if an OTP FPGA is used as a network interface, a SEU in 
network protocol registers can manifest itself as a SEFI on network.
Although modern COTS technology has exhibited an improved latch-up response, the 
significance of SEL protection cannot be neglected. Device current consumptions should be 
carefully monitored, and prompt action (i.e. removal of power) should be taken in the event of an 
over-current condition being detected. Current monitoring can also help in detecting SEFI events.
Reusable and scalable OBDH architectures are required to lower the cost per mission. This 
requirement has led to adoption of commercial data networks. Fault tolerance is desirable to 
ensure reliable operation and to increase the availability of the system, and it becomes particularly 
important for interplanetary probes.
CHAPTER 3
INVESTIGATION INTO FUNCTIONAL INTERRUPTS
This chapter elaborates on the concept of functional interrupts. An overview of device 
complexities is provided followed by examples of SEFIs that have been observed in different 
devices during their exposure to radiation.
3.1 Random Access Memories
3.1.1 DRAMs
A DRAM memory cell stores information on a tiny capacitor, accessed through a metal oxide 
semiconductor (MOS) transistor. To read a cell, the bit line is first precharged to a voltage 
halfway between HIGH and LOW, and then the word line is set HIGH. Depending on whether 
the capacitor voltage is HIGH or LOW, the precharged bit line is pulled slightly higher or lower. 
A sense amplifier detects this small change and recovers a 1 or 0 accordingly. In addition, 
DRAM-based memory systems use refresh cycles to update every memory cell periodically. In 
order to make the refreshing task simple and manageable, DRAMs are organized in two- 
dimensional arrays. Larger DRAMs have larger arrays and often have multiple arrays. One 
advantage of multiple arrays is to ease the electrical and physical design problems that would 
occur with an extremely large array. But even more important is the parallelism that can be 
achieved with multiple arrays. Taking advantage of the multiple arrays, a modern DRAM 
controller can perform several operations in parallel -  for example, completing a write operation 
in one array while initiating a read operation in another. In this way, effective throughput of the 
memory technology is increased [Wake-01]. However, these features make the internal structure of a 
DRAM quite complicated, and hence there is an increased susceptibility to functional interrupts. 
The following are examples of SEFIs in DRAM memory technology.
SEFIs in DRAMs were first reported in 1997, when irradiation caused a device to enter test mode 
[Koga-97]. In test mode, the device places a specified sequence of bits at its outputs and does 
not respond to the usual read and write commands. Reading the device in this mode reveals an 
unexpectedly high upset rate. The device often stays in test mode unless a proper signal is 
given to terminate it. The device tested in this example was an Oki Semiconductor 4 Mb DRAM.
The next example is an in-flight anomaly experienced by the HST mission [Labe-98]. An 
SSDR containing DRAM for telemetry storage was installed on the HST in 1997. The SSDR 
comprises 1440 16 Mbit IBM DRAMs. Initial SEE testing, with heavy ions and protons, was 
performed on DRAM dies from the HST flight lot [Labe-96b]. Heavy ion testing revealed the 
signature of a SEFI (a block of bad memory locations within a single die) with a very small cross-section. 
The predicted error rate was very small (< 1 per 200 years for the entire SSDR), and no SEFI 
was observed during proton testing. However, the HST mission experienced two SEFIs during the 
first nine months of its operation. The authors conclude that more extensive ground test data are 
required to predict a realistic error rate for a mission. They also suggest that this 
particular event might have been caused by an error in a redundancy latch that is internal to the 
DRAM device. Each device contains redundant rows and columns. When the device is powered 
up, weak (those where the data retention of the cells is suspect) or bad (those where bits are 
always incorrect) rows and columns are replaced with redundant rows and columns. A 
redundancy latch maintains the device configuration. A single particle strike can affect this 
configuration circuit. When this occurs, a weak or bad row or column may be placed back into 
the device configuration. A reset or power cycle of the device is required to remove the 
anomalous condition. Anomalous events, such as those observed in the HST mission, have 
demonstrated that future missions will require a modified fault tolerant system design capable of 
detecting such events and cycling power through the affected devices to mitigate them. In addition to these events, temporary 
block errors have also been observed, where writing correct data to the erroneous locations cures the 
problem.
Makihara reports SEFIs in an NEC 16 Mbit DRAM [Maki-00]. The author suggests that these 
errors might have occurred because of an upset in the control circuitry, which turned on the gates of 
the access transistors before the bit lines were precharged, resulting in unknown data being stored 
in the cells.
3.1.2 SDRAMs
SDRAM technology represents the state of the art in high density, volatile memory. SDRAMs 
have a complex internal architecture that controls their operation and also affects their 
susceptibility to radiation effects.
These devices have internal state machines to provide pipelines, programmable refresh modes and 
power states etc. A mode register is included on chip, which is used to configure basic device
operation. The mode register is set up once at power up. This register is found to be susceptible to 
radiation [Rodg-01, Layt-03, Nasa-03]. A number of bit combinations in the mode register are not 
defined and are reserved for future use. A single hit to the mode register can result in loss of 
functionality until the device is reset or, in some cases, power is cycled through the device. In [Nasa-03], 
the author writes that toggling of “not connected” pins is also possible and that it can cause 
damage to the device. Power cycling is required to recover the device.
In an SDRAM, the memory array is divided into 2 or more banks. This allows one bank to be 
precharged while the other is being accessed. Hence, parallelism in operations can provide 
improved throughput. However, an SEU on the control circuit can cause block errors when active 
bank information is lost [Rodg-01]. Row-based errors have also been observed [Layt-03]. 
Typically, a large number of errors were seen in sequential rows, i.e. several rows that were next 
to each other would have a similar number of errors. The author termed these logic SEFIs. Several 
types of logic SEFIs were observed. One type was accompanied by an increase in standby current 
of approximately 0.5mA to 2mA. In most cases, a device can be recovered from a SEFI by 
reinitializing the affected device and rewriting its mode register.
Koga et al. present their results on radiation testing of 128 and 256 Mbits SDRAMs from 
Samsung, IBM, Hitachi and Hyundai [Koga-01]. All devices that were examined underwent SEFI 
with heavy ions. In most cases, neither the read nor the write mode is accessible, while in others one 
mode, read or write, functions properly. These results tend to indicate that some state machine 
bits in the control section have experienced an SEU. If they are not properly restored, the device 
loses its read and/or write functions. The authors report that a current increase of about 30 mA 
normally accompanied a SEFI. Power cycling was required in order to recover from a SEFI and 
therefore the authors termed these permanent SEFIs [Koga-01].
3.2 Floating Gate Memories
This memory technology has a floating gate MOS transistor at every bit location. Each transistor 
has two gates. The “floating” gate is not connected and is surrounded by high-impedance 
insulating material. In contrast to read operations, writing to floating gate memories is a slow 
process. Internal write and command state machines have been included in flash device 
architectures to improve the apparent write time by interposing page buffers and successive 
execution commands [Schw-97]. The basic architecture used by the Intel 16-Mb flash devices is 
shown in Fig 3-1.
[Figure 3-1: block diagram; legible blocks include the address input buffers, address timing and decoding, interface control logic, flash array, command state machine, write state machine and write enable]
Figure 3-1 Block Diagram of the Intel Flash Memory, After [Schw-97]
It can be seen that flash memories have some parallels with microprocessors because they use a 
complex internal architecture, which makes them susceptible to complex single event effects. However, the 
visibility of internal changes caused by the radiation depends upon the way that the part is 
applied. For instance, no SEU has been reported during an unpowered mode. Table 3-1 
summarizes SEFI signatures that have been observed in flash memories along with proposed 
recovery procedures [Schw-97, Miya-98, Nguy-02].
Table 3-1 SEFIs in Flash Memory
SEFI Type | Manufacturer | Recovery Method
Read operation locked into an endless loop, with an increase in supply current | AeroFlex | Power cycling
Read operation locked | AeroFlex, Toshiba, SanDisk, Intel, Samsung | Repeat the read process or cycle power
Write operation locked | Intel, Samsung | Power cycling
Row/column changes: large portions of the memory array change state | Intel | Power cycling
Block-erase: the device is stuck in a “busy” state | AeroFlex, Toshiba, SanDisk, Intel, Samsung | Power cycling
Partial-erase: the device requires repeated erase commands for one or more blocks | AeroFlex, Toshiba, SanDisk, Intel | Repeat the erase operation
SEFIs have also been observed in EEPROMs. Koga reported two types of SEFIs in Atmel and 
Xicor EEPROMs during read mode [Koga-97]. The first type manifests as repeated errors 
once the first error has been detected during ion irradiation. The errors were altered 
bits in one word at various address locations. Simultaneously with the observation of the first 
error, the device bias current increased from 20mA (normal value) to 26mA. The bias current 
remained at 26mA until the reading process stopped, at which time it dropped to 
0.2mA. Reading the device again produced the same higher current and higher error rate. However, once 
power was cycled the device became functional again, i.e. normal current value and normal upset rate. 
The second type manifests as “00” (Atmel part) or “FF” (Xicor part) in all address 
locations. Power cycling was required in all cases to recover from the SEFI.
3.3 Microprocessors
Microprocessors are often sensitive to SEUs, which can occur in any basic element of the 
processor including the program counter, the sequence controller, the register, and the arithmetic 
logic unit (ALU) [Koga-85]. Depending on time and location of upset, it will manifest itself as a 
calculation error or SEFI caused by alteration of control or address bits. In addition to these 
events, errors have been observed where the processor simply reset itself or stopped functioning 
without showing any observable bit errors. The latter is called lock-up SEFI (or simply lock-up) 
[Koga-97]. Tests on commercial unhardened microprocessors have shown their sensitivity to this 
particular type of single event effect. Johnston has postulated that for modern commercial 
microprocessors approximately 10% of test runs result in a “hung” processor. A dependence on 
the operating system running on the target machine has also been reported [John-00]. Test results on 
Power PC and 80X86 architectures are presented in Table 3-2 [Mora-95, Labe-96b, Howa-01, 
Irom-02].
Table 3-2 SEFIs in Microprocessors
SEFI Type | Device under test | Recovery Method
Device entered halt state | 80486 | Power cycling
Device locked up, current stayed at typical operating level | 80386 | Hardware resetting
Device locked up, current dropped to a value indicative of standby mode | 80386 | Power cycling
All types | Pentium III and AMD K7 | Resetting or power cycling
Device locked up and was unable to respond to an external interrupt | Power PCs | Resetting or power cycling
3.4 Field Programmable Gate Arrays
Several different types of SEFIs have been observed in FPGAs [Cami-99, Full-00, Facc-00, Gais-02, 
Matt-01] and are summarized below.
3.4.1 Device De-Configuration
An upset that occurs in the device power-on reset circuit will cause the device to lose its 
configuration data. Mattsson postulates that this type of error is detected when all shift registers 
stop functioning at the same time. The device requires a reset and complete reconfiguration for 
recovery [Matt-01].
Another example of this phenomenon, with a different signature, was observed during heavy ion testing of the 
Virtex XQVR300 FPGA [Full-00]. The author observed that the number 
of observed upsets exceeded the total number of particles incident on the die by as much as 10 
times. The observed threshold LET was between 8 and 16 MeV·cm²/mg.
3.4.2 Interruptions from JTAG Operations
The standard test access port (TAP) controller implementation is a 4-bit binary encoded state 
machine. A single event upset in one of these state register bits can move the controller to any of the 
available TAP states. This may result in activation of the boundary-scan registers and disengagement of the 
I/Os from standard operation. This type of SEFI can be observed in both SRAM-based and 
antifuse devices.
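The effect of an upset in the TAP state register can be sketched in Python as below. The transition table follows the IEEE 1149.1 TAP state diagram; the 4-bit encoding (the position of each state in the table) is a hypothetical assignment, since real devices may encode the states differently. The recovery shown, holding TMS high for five TCK cycles, is a general property of the standard state diagram and is added here for illustration; it is not a mitigation described in the cited FPGA reports.

```python
# state: (next state if TMS=0, next state if TMS=1)
TAP = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle": ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan": ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR": ("Shift-DR", "Exit1-DR"),
    "Shift-DR": ("Shift-DR", "Exit1-DR"),
    "Exit1-DR": ("Pause-DR", "Update-DR"),
    "Pause-DR": ("Pause-DR", "Exit2-DR"),
    "Exit2-DR": ("Shift-DR", "Update-DR"),
    "Update-DR": ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan": ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR": ("Shift-IR", "Exit1-IR"),
    "Shift-IR": ("Shift-IR", "Exit1-IR"),
    "Exit1-IR": ("Pause-IR", "Update-IR"),
    "Pause-IR": ("Pause-IR", "Exit2-IR"),
    "Exit2-IR": ("Shift-IR", "Update-IR"),
    "Update-IR": ("Run-Test/Idle", "Select-DR-Scan"),
}
STATES = list(TAP)  # hypothetical encoding: state code = index in this list


def upset(state, bit):
    """Model an SEU as a flip of one bit of the 4-bit state code."""
    return STATES[STATES.index(state) ^ (1 << bit)]


def clock(state, tms_bits):
    """Advance the controller by one TCK cycle per TMS value."""
    for tms in tms_bits:
        state = TAP[state][tms]
    return state


# Every single-bit upset moves an idle controller into some other state
after_upset = {upset("Run-Test/Idle", bit) for bit in range(4)}
# Five cycles with TMS high return any state to Test-Logic-Reset
recovered = {clock(s, [1] * 5) for s in STATES}
```

Because all sixteen codes are valid states, the controller has no illegal state that could flag the upset by itself; the disturbance is visible only through its side effects on the I/Os.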
3.4.3 Configuration Upsets Leading to a SEFI
It is reported that on average 6.5 configuration upsets are required to produce a functional failure 
in a device with no mitigation at all [Carm-01]. Section 3.4.4 is an example of such a situation.
3.4.4 Activating Output Drivers on an Input Pin
For any given input pin, multiple configuration cells must be upset to activate the 
output driver. Although this condition is extremely unlikely, it may occur and can cause bus 
contention. An SEU may also cause two output drivers internal to the chip to be connected, resulting 
in an unintentional high current state that may exceed current density requirements for reliable 
operation.
3.5 Data Handling Networks
In network interfaces, erroneous behaviour is usually caused by an upset in protocol registers or 
state machines. For instance, Seidleck et al. present an example of a soft error in the 
asynchronous request filter low (ARFL) register, which is a link layer register of the IEEE 1394 
bus [Sied-02]. The function of this register is to enable reception of asynchronous packets. When 
an asynchronous request packet is received, the bit corresponding to the source ID is examined in 
the ARFL register. If an SEU turns a bit from 1 to 0, packets with the corresponding source ID will neither 
be acknowledged nor be queued.
A SEFI can also occur if an SEU strikes bit 17 of the host controller 
control register on the link layer. This bit is set to “1” when the system is ready to begin 
operation. If an SEU switched it to “0”, the link layer would immediately be disconnected from 
the 1394 bus, and no packets would be received or transmitted. To resume communications, a “1” 
would have to be restored via a software instruction [Sied-02].
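The two register upsets described above can be illustrated with a few lines of Python. This is only a sketch: the registers are modelled as plain integers, the helper names are hypothetical, and only the behaviour stated in the text (one filter bit per source ID, bit 17 as the enable bit) is represented.

```python
LINK_ENABLE_BIT = 17  # bit position quoted in the text; the constant name is illustrative


def flip_bit(register, bit):
    """Model an SEU as a single bit flip in a register value."""
    return register ^ (1 << bit)


def accepts_request(arfl, source_id):
    """A request is accepted only if the filter bit for its source ID is 1."""
    return bool((arfl >> source_id) & 1)


def link_connected(hc_control):
    """The link layer stays on the bus only while the enable bit is 1."""
    return bool((hc_control >> LINK_ENABLE_BIT) & 1)


def restore_link(hc_control):
    """Software recovery: write the enable bit back to 1."""
    return hc_control | (1 << LINK_ENABLE_BIT)
```

In the first case only traffic from one source is silently dropped, so the fault may go unnoticed for some time; in the second, all traffic stops, which is easier to detect but requires software to rewrite the bit.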
Similarly, the SpaceWire standard specifies a state machine to hold a sequence of steps for 
recovering from link disconnect errors [Ecss-03]. An SEU on one of the entries can cause 
erroneous behaviour and it will not be evident unless the node experiences a link disconnect error.
Conclusions
SEFI is a collective term for device interrupts due to radiation. Its detection is purely based on 
observation. In a microprocessor, a SEFI is an event that causes the processor to lock up, to 
raise continuous exceptions, to enter standby mode, or to go into some unknown, 
unrecognizable state. In a RAM device, an error rate higher than expected will be treated as a 
SEFI event. For a flash memory device, it is possible to see a high upset rate or a hang in its 
operation, e.g. it is stuck in a read operation. In a reprogrammable FPGA, a SEFI can completely 
disrupt its functionality. A data network SEFI would result in loss of communication on the 
affected link. All device technologies may exhibit variations in current consumption during a 
SEFI event. Resetting (reconfiguration in the case of reconfigurable FPGAs) or power cycling is 
required in almost all cases to restore a device from a SEFI.
CHAPTER 4
FAULT TOLERANCE STRATEGIES
In order to use microelectronic devices effectively and reliably, different approaches have been 
adopted. Broadly, there are two ways to achieve reliability in microelectronics: fault avoidance 
and fault tolerance. Fault avoidance attempts to eliminate faults from the system at the design 
stage, e.g. by hardening of the hardware cells, adopting rigorous software development processes 
or formal verification techniques [DeVa-04]. On the other hand, a fault tolerant system designer 
assumes that faults will occur but that the system should have sufficient mitigation to perform the 
required tasks despite the faults in the system [Shir-01]. Currently, a particular stream of the 
space industry has become driven by the “faster, better, cheaper” design philosophy and 
therefore, in these regimes, fault tolerance has become the prevailing practice.
This chapter summarises different mitigation/fault tolerance strategies that are currently being 
applied at different levels to increase reliability of a system for space missions.
4.1 Mitigation Techniques for Memories
4.1.1 Error Detecting Codes
A system that uses an error-detecting code generates, transmits, and stores only encoded words 
(i.e. the original data plus additional code bits). Thus, an error in a bit string can be detected by a 
simple rule: if the bit string is a valid encoded word, it is assumed to be correct; if it is not a 
valid encoded word, it is assumed to contain an error [Wake-01]. Two commonly used 
error-detecting codes are the parity check and the cyclic redundancy check (CRC). Parity, usually a single bit 
added to the end of a data structure, indicates whether an odd or even number of ones were in the 
structure. This method detects an error if an odd number of bits are in error, but if an even number 
of errors occurs, the parity will still be correct. Hence the method, whilst simple to implement, is 
far from robust. The CRC technique is based on performing modulo-2 arithmetic operations on a 
given data stream, which is interpreted as a polynomial. When encoding occurs, the data 
message is modulo-2 divided by the generator polynomial. The remainder of this operation 
becomes the CRC character that is appended to the data structure. For decoding, the encoded bit 
pattern is divided by the generator polynomial. If the remainder is non-zero, it is an indication 
that an error has occurred [Skla-88]. This is a more complex but more robust error detection 
scheme.
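Both schemes can be sketched in a few lines of Python. Bits are held as lists of 0/1 integers, and the generator polynomial x^3 + x + 1 is an arbitrary choice for illustration; practical systems use longer, standardised generators.

```python
def even_parity_bit(bits):
    """Return the bit that makes the total number of ones even."""
    return sum(bits) % 2


def parity_ok(word):
    """True if a word (data plus parity bit) has an even number of ones."""
    return sum(word) % 2 == 0


def mod2_remainder(bits, generator):
    """Remainder of modulo-2 (XOR) long division of bits by generator."""
    work = list(bits)
    for i in range(len(work) - len(generator) + 1):
        if work[i]:
            for j, g in enumerate(generator):
                work[i + j] ^= g
    return work[-(len(generator) - 1):]


def crc_encode(data, generator):
    """Append the CRC remainder to the data bits."""
    padded = list(data) + [0] * (len(generator) - 1)
    return list(data) + mod2_remainder(padded, generator)


def crc_ok(codeword, generator):
    """True if the remainder is zero, i.e. no error is detected."""
    return not any(mod2_remainder(codeword, generator))


GENERATOR = [1, 0, 1, 1]           # x^3 + x + 1, illustrative only
data = [1, 0, 1, 1, 0, 0, 1, 0]
parity_word = data + [even_parity_bit(data)]
crc_word = crc_encode(data, GENERATOR)
```

Flipping any two bits of `parity_word` leaves `parity_ok` true, which is the weakness noted above, whereas a single flipped bit in `crc_word` always gives a non-zero remainder.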
4.1.2 Error Detection And Correction Codes (EDAC)
Error detection and correction codes are used to protect digital data against errors that can occur 
in storage cells or transmission channels. One commonly used code is the Hamming code. The 
early UoSAT series of satellites used a Hamming (12,8) code to protect each byte of program 
memory for the on-board computers. This code uses 12 bits to store and protect an 8-bit byte. The 
EDAC circuit generates the 4 extra parity bits whenever a byte is written to memory. When the 
CPU requests a byte, all 12 bits are read into the EDAC circuit, which performs bit correction 
before passing the corrected data to the CPU. However, the EDAC circuit does not write 
corrected data back to memory. Thus, in order to keep the program memory free from error 
accumulation, each memory location must be periodically read and written back to the memory. 
This process is called memory washing or scrubbing. The Hamming (12,8) code can detect and 
correct any single bit error in each word. Single event upsets are therefore relatively easily handled by 
a Hamming EDAC scheme. However, multiple bit upsets (MBUs) and stuck bits are of greater 
concern as they can defeat the code. A simple way of reducing multiple bit upsets is to choose 
devices in which the bits of a logical word are not physically adjacent to one another. However, these data are 
not readily available for commercial devices. One possible way to obtain them is to perform ground testing (e.g. 
with lasers) to map out the memory structure of a device. Another solution to mitigate MBUs is to 
re-arrange the memory so that it is constructed from devices with a “x 1-bit” architecture (i.e. 
bit-plane memory). However, the major implication of this scheme is that it requires several devices 
to be enabled at once in order to read or write the memory, and hence it increases the 
power consumption of the memory system. Another disadvantage is that, if a memory device 
fails, this scheme will no longer be able to cope with any SEUs. More recently, SSC has devised a 
modified Hamming-like (16,8) code, which takes 8 data bits and generates 8 parity bits. This code 
is capable of detecting and correcting 2 errors in one code word [Hodg-99], and has been found to 
be sufficiently robust to protect Surrey’s current satellite program memory systems.
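A Hamming (12,8) code with scrubbing can be sketched as follows. The bit layout is the textbook one (parity bits at positions 1, 2, 4 and 8) and is an assumption; the actual UoSAT EDAC circuit may arrange its bits differently, and the function names are illustrative.

```python
PARITY_POS = (1, 2, 4, 8)                    # 1-indexed positions, textbook layout
DATA_POS = (3, 5, 6, 7, 9, 10, 11, 12)


def hamming_encode(byte):
    """Encode an 8-bit value into a 12-bit codeword (list of bits)."""
    code = [0] * 13                           # index 0 is unused
    for i, pos in enumerate(DATA_POS):
        code[pos] = (byte >> i) & 1
    for p in PARITY_POS:
        # Each parity bit covers the positions whose index has bit p set
        code[p] = sum(code[k] for k in range(1, 13) if k & p and k != p) % 2
    return code[1:]


def hamming_decode(word):
    """Return (byte, syndrome); a single-bit error is corrected on the fly."""
    code = [0] + list(word)
    syndrome = 0
    for k in range(1, 13):
        if code[k]:
            syndrome ^= k                     # XOR of the positions holding a 1
    if 1 <= syndrome <= 12:
        code[syndrome] ^= 1                   # syndrome names the faulty position
    byte = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return byte, syndrome


def scrub(memory):
    """Read every word, and write back a clean codeword where an error was seen."""
    rewritten = 0
    for addr, word in enumerate(memory):
        byte, syndrome = hamming_decode(word)
        if syndrome:
            memory[addr] = hamming_encode(byte)
            rewritten += 1
    return rewritten
```

Scrubbing matters because the decoder only repairs the copy handed to the CPU: a second upset in a word that still holds an uncorrected first upset produces a double error, which this code miscorrects.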
Block error codes, such as the Reed-Solomon (RS) code [Skla-88], provide more powerful error 
correction. A Reed-Solomon encoder takes K words of data and produces L parity words, where 
the correctable number of words is L/2. Note that the code symbol is the data word, not the data bit. 
The NASA very large scale integration (VLSI) design centre has developed an RS (255,223) code on 
a single integrated circuit (IC). This particular code is capable of correcting up to 16 consecutive 
bytes in error in any 223 byte block [Labe-96a]. The UoSAT satellites use an RS (255,252) code 
that corrects one byte and detects two bytes in error for their SSDRs. Although less probable, 
single event upsets causing bit flips in two or more bytes might occur. A significant improvement 
in terms of EDAC capability can be achieved by moving from the traditional RS (255, 252) code to a 
modified RS (256, 252) code, the latter being capable of correcting up to two bytes [Unde-96]. 
This new EDAC scheme has been designed and developed at SSC [Hodg-99].
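The correction capability quoted for each of these codes follows from the L/2 rule, taking L as the number of parity symbols (code length minus data length) and rounding down. A short Python check, with a hypothetical helper name:

```python
def rs_correctable_symbols(n, k):
    """Symbols correctable by an RS(n, k) code: half the parity symbols, rounded down."""
    return (n - k) // 2


capabilities = {
    (255, 223): rs_correctable_symbols(255, 223),
    (255, 252): rs_correctable_symbols(255, 252),
    (256, 252): rs_correctable_symbols(256, 252),
}
```

The single extra parity symbol of the (256, 252) code is what raises the capability from one correctable byte to two.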
EDAC codes can be implemented in either hardware or software. The main considerations in the 
application of EDAC schemes are the associated area overhead and performance penalties [Lima-02]. 
The implementation of a technique in software results in reduced availability of the computer 
for other computational tasks. When the EDAC scheme is implemented in hardware, the penalty 
comes from additional board space, weight and power requirements. Another important 
consideration is the speed associated with the implementation. Hardware implementations are 
faster than software ones.
Built-in test (BIT) can serve as another EDAC scheme. This method essentially performs a read 
operation on unused memory cells, compares the read values with the known values, and writes 
back the correct value if the two differ [Sied-95]. This scheme requires the storing of data in two 
different memory units, where one should be superior to other in terms of radiation response. In 
addition to bit compare, error-detecting codes can also be used to detect the error in a data 
structure.
4.1.3 SEFI Detection for Ground Testing
Guertin et al. present dynamic SEFI detection and recovery for ground testing of Hynix/Hyundai 
SDRAM devices. They define a SEFI as an event in which n out of N bits in a memory region are in 
error. They have calculated an n:N ratio of 0.375, i.e. a SEFI is declared when 96 out of 256 bits are 
in error, or 384 in 1024, and so on [Guer-04].
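The n:N criterion can be written as a one-line test. The sketch below assumes that a SEFI is declared when the error count reaches or exceeds the ratio; the text does not say whether the bound is inclusive, and the function name is illustrative.

```python
SEFI_RATIO = 0.375  # n:N ratio quoted from [Guer-04]


def is_sefi(bit_errors, region_bits, ratio=SEFI_RATIO):
    """Declare a SEFI when the fraction of bad bits in a region reaches the ratio."""
    return bit_errors >= ratio * region_bits
```

Error counts below the threshold would be treated as ordinary upsets and left to the EDAC scheme, while counts at or above it would trigger the SEFI recovery procedure.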
It is also reported that the probability of the occurrence of SEFIs in a SDRAM device can be 
reduced by periodically rewriting its mode register [Koga-01, Guer-04].
4.2 Mitigation Techniques for Microprocessors
A fault tolerant system can provide two types of recovery, backward or forward. In backward 
recovery, the system goes back to a previous, error free state and performs the task. On the other 
hand, forward recovery constructs a valid, error free new state from existing (usually redundant) 
information [Aviz-97]. When a permanent fault is detected, fault removal is performed by either 
substituting a good spare subsystem, or by reconfiguring the system to operate without the faulty 
unit.
4.2.1 Fault Detection
A system fault can be detected manually or automatically depending on operating modes and how 
quickly the system needs to be restored. The art of designing an automatic fault detection system 
requires an understanding of the expected behaviour of the device under consideration. Faults can 
be detected either on-line or off-line. With the off-line detection, the device is unable to perform 
any function during the test.
4.2.1.1 Watchdog Timers
One of the most commonly used on-line detection techniques for microprocessors is the 
application of watchdog timers. Watchdog timers can be classified into two categories: active and 
passive. In the active watchdog technique, one device is programmed to send a pulse to another 
independent device at a specified time interval. If the first device fails to send this pulse, the 
second device will take a recovery action, e.g. it might reset the first device. In the passive watchdog 
technique, the normal operating conditions of a device are monitored. For example, in the normal 
operating scenario of spacecraft X, it receives uplink messages from the ground station, say, every two 
hours. There is a timer on-board the spacecraft that times out if no uplink is received within this 2 
hour period. The spacecraft then initiates an action such as a switch to a redundant antenna or 
uplink interface, a power cycling of the uplink interface, etc. What makes this a passive watchdog 
is that no specific signal needs to be sent between devices, but a monitoring of normal operating 
conditions is sufficient. Multi level watch-dogging has also been demonstrated and is summarized 
below [Labe-92, Labe-96a].
• A software task executing in the main microprocessor times-out if a value is not passed
by a second software task and that restarts the processor from a known state.
• A programmable interrupt signal from the main microprocessor provides a reset pulse to 
an external timer circuit that times-out if not written to within an N second window 
causing a hardware reset pulse to occur to the processor.
• An "I'm okay" pulse between the prime and secondary processors must occur once every 
X seconds; if it does not, the secondary processor may remove or cycle power to the main 
processor or place the system in safe mode for external intervention.
• A multi-day timer places the system into a safe mode if proper system operations have 
not occurred within a 24 hour period.
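An active watchdog of the kind described above can be sketched in Python as follows. Time is passed in explicitly so that the example is deterministic; the class name, method names and the recovery callback are illustrative assumptions rather than any flight implementation.

```python
class ActiveWatchdog:
    """Monitors a periodic pulse and fires a recovery action on time-out."""

    def __init__(self, timeout_s, on_timeout):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout      # e.g. reset or power-cycle the monitored device
        self.last_kick = 0.0

    def kick(self, now):
        """Called by the monitored device to signal that it is alive."""
        self.last_kick = now

    def poll(self, now):
        """Called by the monitoring device; returns True if recovery was triggered."""
        if now - self.last_kick > self.timeout_s:
            self.on_timeout()
            self.last_kick = now          # restart the window after recovery
            return True
        return False


actions = []
wd = ActiveWatchdog(timeout_s=2.0, on_timeout=lambda: actions.append("reset"))
wd.kick(0.0)
first = wd.poll(1.0)     # pulse seen 1.0 s ago: within the window
wd.kick(1.5)
second = wd.poll(3.0)    # pulse seen 1.5 s ago: within the window
third = wd.poll(4.0)     # no pulse for 2.5 s: recovery action fires
```

A multi-level scheme would chain several such timers with increasing time-outs and increasingly drastic recovery actions.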
4.2.1.2 Lockstep
Operating two identical circuits with synchronized clocking is termed a lockstep system [Ksch- 
91, Labe-96a]. Lockstep or duplication of the logic can serve as a means of detecting the faults
on-line. If the lockstep systems disagree, they halt and an error flag is raised to enable a recovery 
procedure. An important consideration during the design of such a technique is the total ionising 
dose response of the devices, as clock skew can increase with increasing dose. If each 
device responds even slightly differently to dose, the system may experience false triggers.
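The comparison at the heart of a lockstep pair can be sketched as below. The two units are modelled as Python functions fed with the same inputs; the fault model (one flipped output bit for a particular input) and all names are hypothetical.

```python
def run_lockstep(unit_a, unit_b, inputs):
    """Feed both units the same inputs and halt at the first disagreement."""
    outputs = []
    for step, value in enumerate(inputs):
        out_a, out_b = unit_a(value), unit_b(value)
        if out_a != out_b:
            # Error flag raised: the caller starts its recovery procedure
            return {"halted": True, "step": step, "outputs": outputs}
        outputs.append(out_a)
    return {"halted": False, "step": None, "outputs": outputs}


def healthy(x):
    return (3 * x + 1) & 0xFF


def upset_at_two(x):
    # Hypothetical fault: one output bit flips when the input is 2
    return healthy(x) ^ 0x10 if x == 2 else healthy(x)


clean_run = run_lockstep(healthy, healthy, range(5))
faulty_run = run_lockstep(healthy, upset_at_two, range(5))
```

Note that the comparison detects a disagreement but cannot tell which unit is wrong, which is why lockstep is a detection technique and needs a separate recovery step.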
4.2.1.3 Built-In Testing
Another common methodology is built-in testing (BIT). BIT ranges in complexity from a lamp 
that lights when a system fails, to a resident computer that generates test signals and evaluates 
system responses [Jsc-03]. BIT can be either active or passive. In active BIT, a device is taken 
off-line, a test pattern is written to it and the response is compared with the expected pattern. Passive BIT 
monitors system performance on-line without the use of a test pattern generator.
Czajkowski et al. propose the use of a special purpose hardened core (H-core) within an OBC unit to 
mitigate SEFIs in the OBC processor. The OBC processor sends its status through a 
dedicated line to the H-core. A SEFI in the microprocessor results in a failure to send the status 
signal. The H-core chip detects the occurrence of the SEFI and asserts an interrupt signal to the 
microprocessor [Czaj-05]. That is, a unit-level approach has been adopted. The research proposed in this thesis 
has a wider scope, as in this case a rad-hard supervisor is added to the OBDH 
architecture. Both schemes are capable of including active or passive BIT attributes because of the 
programmable nature of the monitoring entity. However, in Czajkowski’s approach a rad-hard 
core is used to monitor the OBC processor, whereas in this thesis a rad-hard microprocessor is 
used to monitor a chain of devices (microprocessor, OBC interface and the supervisor interface). 
Detection latency will increase somewhat in our approach. However, the cost of Czajkowski’s 
approach multiplies with the number of units to be protected.
4.2.2 Fault Handling
There are two ways to achieve fault tolerance in computer systems. Either hardware overheads 
can be included in the design to make it robust, or the same purpose can be achieved by the 
provision of fault tolerant software.
4.2.2.1 Hardware Fault Tolerance
Hardware fault tolerance is the ability of hardware to detect and recover from a fault that is 
happening or has already happened.
Fault Masking
In this approach, a system is designed to perform correctly by hiding the effects of failure in any 
unit. It certainly requires redundancy. The most popular technique for fault masking is N modular 
redundancy (NMR). This technique masks the effect of the fault by performing voting on three or 
more units. This technique can be implemented with or without spare processors. If there are 
spare processors in the system, they can replace a faulty processor and hence can improve 
dependability of the system. NMR can handle both permanent and transient faults. The voter is a 
critical node in an NMR design; however, it is implemented in combinational logic, so the chance 
of it being upset is low [Lima-02]. NMR can be used in memory systems as well [Unde-96]. The 
main disadvantage of this technique is its large overhead. Fault masking is an example of a 
forward recovery technique.
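A voter for N modular redundancy can be sketched in Python as below, together with the bitwise form commonly used when N = 3. The function names are illustrative, and a real voter would be implemented in hardware logic rather than software.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by more than half of the units."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count > len(outputs):
        return value
    raise ValueError("no majority: the fault cannot be masked")


def tmr_bitwise(a, b, c):
    """Bit-by-bit 2-out-of-3 vote, as a combinational TMR voter would compute it."""
    return (a & b) | (a & c) | (b & c)
```

With three units, `majority_vote([5, 5, 7])` masks the single faulty output; the bitwise form goes further and can mask upsets in different units as long as no bit position is corrupted in more than one of them.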
Dynamic Redundancy
In this technique, only one copy of a computation is running at a time and the system has a spare 
processor to take over in fault conditions. This is also known as cold redundancy. Cold 
redundancy is adopted at Surrey to tolerate permanent faults in computing units [Unde-96]. 
Another form of dynamic redundancy is where two unchecked copies of the same computation 
are running on the main and spare processors. In case of a fault on the main processor, the system 
switches to the spare one. This is also known as hot redundancy. In contrast to fault masking, where 
voting detects the faulty unit, dynamic redundancy requires a detection mechanism to declare that 
the main processor is not functioning correctly.
Dynamic redundancy offers savings in hardware compared to a voting system. Its 
disadvantages are that computational delays occur during fault recovery, fault coverage is often 
lower, and it usually requires some form of software fault tolerance as well.
4.2.2.2 Software Fault Tolerance
Software fault tolerance is the ability of software to detect and recover from a fault that is 
happening or has already happened in either hardware or software of the system in which the fault 
tolerance software is running [Shi-04].
Multiversion Software
In a multiversion or N-version software system, each software task is built with N different 
implementations. A software voter performs voting to determine the correct output. Each version 
of the software is (preferably) built by a separate team using slightly different methodologies, but
with the provision that all versions need to be inter-operable. Design diversity is an important 
consideration while developing multiversion software systems. The purpose of multiversion 
software is to avoid common failure modes among different implementations. Usually, this kind 
of software technique is used with N-modular hardware units - hence, delivering a very robust 
architecture.
Recovery Blocks
The recovery block technique operates with an adjudicator -  i.e. an acceptance test, which 
confirms the result of a computation. In a system with recovery blocks, the system software is 
broken down into fault recoverable blocks. Each block contains the following elements:
• A primary routine, which is analogous to the control routine.
• An alternate routine, which can perform the same task with a diverse and simpler method.
• An acceptance test, which can be a generic fault detection function or application 
specific.
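The three elements listed above can be sketched as follows. This is an illustrative sketch only: the square-root routines and the acceptance test are hypothetical examples, and the routines are side-effect free, so no explicit state restoration is needed before the alternate is tried.

```python
import math

def recovery_block(primary, alternates, acceptance_test, *args):
    """Run the primary, then each alternate in turn, until a result is accepted."""
    for routine in (primary, *alternates):
        try:
            result = routine(*args)
        except Exception:
            continue  # a crash is treated like a failed acceptance test
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("all routines failed the acceptance test")

def primary_sqrt(x):
    return -1.0  # hypothetical faulty primary routine

def alternate_sqrt(x):
    # Simpler, diverse method: Newton iteration.
    guess = x if x > 1 else 1.0
    for _ in range(50):
        guess = 0.5 * (guess + x / guess)
    return guess

def sqrt_acceptance(result, x):
    # Application-specific acceptance test: squaring must give back the input.
    return result >= 0 and math.isclose(result * result, x, rel_tol=1e-9)

root = recovery_block(primary_sqrt, [alternate_sqrt], sqrt_acceptance, 2.0)
```

Here the primary result is rejected by the acceptance test, so the block falls back to the alternate routine, which is executed serially after the primary.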
The differences between the recovery block method and multiversion software are not numerous, 
but are important. The N-version method has always been implemented with N-modular hardware 
redundancy [Shi-04]. On the other hand, the recovery block idea is associated with serial 
execution of primary and alternate blocks. Another important difference is between the 
acceptance test and the voter. The recovery block method requires that each block build a specific 
acceptance test. For the N-version method, a relatively simple voter can be used.
Distributed Recovery Blocks
Generally, the recovery block technique is based on the backward recovery concept and therefore 
suffers from the disadvantage of significant latency in fault recovery. An improvement on this 
scheme is the distributed recovery block (DRB) methodology. The primary routine is run on the 
main machine, whilst the redundant machine executes the alternative routine. If the primary result 
fails the acceptance test, the system takes its output from the alternative routine on the redundant 
computer. In this way, the system can cope with both hardware and software faults.
Primary/Follower
In this technique, two microprocessors are used, one of which acts as primary and the second as 
follower. Both of these processors are equipped with software tasks, and their associated 
acceptance test and an exception handler. The primary node runs the task. If it does not pass the 
acceptance test, the exception handler is invoked, and output is obtained from the follower node.
If the application task crashes or a hardware or operating system failure occurs in the primary, 
then control is taken over by the follower node.
4.2.3 Computer State Recovery
Software fault tolerance has been described in the preceding section. The main purpose of 
software fault tolerance is to prevent the software crashing. However, if a crash does occur, the 
system needs to be restarted. Thus, computer state recovery is aimed at recovering from a crash in 
a graceful manner while preserving computational state and critical data. State recovery gives an 
application or system the ability to save its state, and tolerate failures by enabling a failed process 
to recover to an earlier safe state.
Computer state can be recovered using backward/roll-back error recovery (BER) through 
checkpointing and forward/roll-forward error recovery (FER) through redundant hardware. BER 
schemes can be further distinguished by whether they are implemented in hardware, software, or 
message-passing systems [Sori-02].
4.2.3.1 Hardware Rollback Recovery
In BER schemes, the state of the system is checkpointed periodically. When a checkpoint is 
executed, a snapshot of all program states is saved into some non-volatile, machine accessible 
medium. An error is tolerated by recovering to a previously checkpointed state.
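The checkpoint-and-rollback cycle can be sketched in a few lines. The class name and the in-memory snapshot are assumptions made for illustration; in a real system the snapshot would be written to non-volatile storage as described above.

```python
import copy

class CheckpointedTask:
    """Holds a task state plus one saved snapshot that it can roll back to."""

    def __init__(self, state):
        self.state = state
        self._snapshot = copy.deepcopy(state)

    def checkpoint(self):
        # Stand-in for saving program state to a non-volatile medium.
        self._snapshot = copy.deepcopy(self.state)

    def rollback(self):
        # Recover to the most recently checkpointed state.
        self.state = copy.deepcopy(self._snapshot)

task = CheckpointedTask({"step": 0, "total": 0})
for step in range(1, 6):
    task.state["step"] = step
    task.state["total"] += step
    if step % 2 == 0:
        task.checkpoint()  # periodic checkpoint on every second step

# An error is detected after step 5: recover to the checkpoint taken at step 4.
task.rollback()
```

After the rollback the task state is that of step 4, so the work done in step 5 is lost and must be repeated. This lost work is the recovery latency inherent in backward error recovery.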
Hardware BER schemes have often utilized the caches and/or the cache coherence protocols. The 
cache-aided rollback error recovery (CARER) scheme for uniprocessors uses a normal cache with 
a write back update policy to assist rapid rollback recovery [Hunt-87]. This scheme is integrated 
with the cache controller, checkpointed system state is maintained in main memory, and 
checkpoints are established whenever a modified cache block needs to be replaced. Ahmed et al. 
extend CARER for multiprocessors by synchronizing the processors whenever any of them needs 
to take a checkpoint [Ahme-90]. Wu et al.'s multiprocessor extension of CARER allows a 
processor to write into its private cache between checkpoints [Wu-90].
4.2.3.2 Software Rollback Recovery
Software checkpointing schemes have been developed at radically different engineering costs from 
hardware BER schemes. Tandem machines use a checkpointing scheme in which every process 
periodically checkpoints its state on another processor [Serl-84]. If a processor fails, its processes 
are restarted on the other processors that hold the checkpoints. Condor, a batch job management
tool, can checkpoint jobs in order to restart them on other machines [Litz-97, Sori-02]. 
Applications need to be linked with the Condor libraries so that Condor can checkpoint and restart 
them. Other schemes use software to periodically checkpoint applications for purposes of fault 
tolerance [Wang-95, Wang-97, Plan-98]. These schemes differ from each other primarily in the 
degree of support required from the programmer, linked libraries, and the operating system.
4.2.3.3 Message Passing Rollback Error Recovery
Numerous BER schemes exist for message passing systems. Elnozahy et al. provide an excellent 
survey of this area of research, which will now be discussed in some detail [Elno-99]. 
Checkpoint-based rollback recovery techniques for message passing systems can be classified 
into two categories: uncoordinated and coordinated.
Uncoordinated Checkpointing
Uncoordinated checkpointing allows each process the maximum autonomy in deciding when to 
take checkpoints. The main advantage of this autonomy is that each process may take a 
checkpoint when it is most convenient. However, this can result in useless checkpoints if 
dependencies exist between tasks.
Upon failure of one or more processes in a system, these dependencies may force some of the 
processes that did not fail to roll back, creating rollback propagation. For example, consider a 
situation where a sender of a message has been rolled back to a state that precedes the sending of 
the message. The receiver of the message must also be rolled back to a state that precedes receipt 
of the message to produce a consistent system. Under some scenarios, rollback propagation may 
extend back to the initial state of the computation, losing all the work performed before a failure. 
This situation is known as the domino effect [Elno-99]. Uncoordinated checkpointing is prone to 
this problem.
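Rollback propagation can be illustrated with a small model. The sketch below is not taken from [Elno-99]; the data layout (sorted checkpoint times per process, messages as sender/receiver tuples with distinct, positive time stamps) and the function names are assumptions made for illustration.

```python
import bisect

def latest_checkpoint(times, limit, strictly_before=False):
    """Latest checkpoint time <= limit (or < limit if strictly_before).

    `times` is sorted and starts with 0, the initial state; `limit` is positive.
    """
    if strictly_before:
        index = bisect.bisect_left(times, limit)
    else:
        index = bisect.bisect_right(times, limit)
    return times[index - 1]

def propagate_rollback(checkpoints, messages, failed, fail_time):
    """Return {process: time rolled back to}; unaffected processes are absent.

    checkpoints: {process: sorted checkpoint times starting with 0}
    messages: list of (sender, send_time, receiver, receive_time)
    """
    restored = {failed: latest_checkpoint(checkpoints[failed], fail_time)}
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            sender_time = restored.get(sender, float("inf"))
            receiver_time = restored.get(receiver, float("inf"))
            # The receiver remembers a message whose sending has been undone.
            if sender_time < sent and receiver_time >= received:
                restored[receiver] = latest_checkpoint(
                    checkpoints[receiver], received, strictly_before=True)
                changed = True
    return restored

checkpoints = {"P": [0, 3, 7], "Q": [0, 5, 9]}
messages = [("Q", 2, "P", 2.5), ("P", 4, "Q", 4.5),
            ("Q", 6, "P", 6.5), ("P", 8, "Q", 8.5)]
result = propagate_rollback(checkpoints, messages, failed="P", fail_time=10)
```

In this example each checkpoint of one process is separated from the previous checkpoint of the other by a message, so the failure of P forces both processes back to their initial state at time 0, which is the domino effect.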
Coordinated Checkpointing
Coordinated checkpointing is recommended for cooperating tasks to avoid the domino effect. In 
coordinated checkpointing, processes coordinate their checkpoints to produce a system-wide 
consistent state. This therefore requires each process to maintain only one recovery line (i.e. the 
most recent consistent set of checkpoints) on non-volatile storage and eliminates the need for 
garbage collection protocols. Garbage collection is defined as deletion of useless recovery 
information from non-volatile storage. A common approach to garbage collection is to identify 
the recovery line and discard all information relating to events that occurred before that line. Koo 
and Toueg’s scheme uses an exchange of messages to coordinate checkpointing [Koo-87],
whereas other schemes assume synchronized physical clocks to coordinate checkpointing without 
an exchange of messages [Rama-88, Cris-91].
4.2.3.4 Roll-Forward Error Recovery
FER schemes use redundant hardware to mask errors. A typical FER scheme is TMR. Other FER 
schemes can be used to detect errors (requiring only duplicate redundancy) or mask more than 
just a single error (with higher degrees of redundancy). For example, the Stratus computer system 
uses two pairs of processors to mask errors. Within each pair, the two processors compare results; 
if the results do not match, an error has been detected and the other pair takes control [Sori-02].
4.2.4 Computer State Recovery for Space Systems
Computer state recovery is relatively new in space systems. An example from the 
literature is presented below.
4.2.4.1 Triple Modular Redundant Flight Computer
Angilly presents a triple modular redundant flight computer (TREMOR) [Angi-04]. In case of 
detection of a SEU in any one of three processors, the remaining two processors are signalled to 
save their state on local non-volatile flash memory and then the faulty processor is reset. 
Processors in this scheme run the RTLinux operating system, which supports state recovery. 
However, a uniprocessor system cannot support this method because, if the only processor of the 
system is faulty, it cannot be used to copy task states from main memory to flash memory.
Figure 4-1 TREMOR Block Diagram, After [Angi-04]
4.3 Mitigation Techniques for FPGAs
As mentioned earlier, antifuse FPGAs present advantages in the context of radiation effects. 
However, these are still prone to SEUs. Actel antifuse FPGAs consist of two types of logic units: 
the register cell (R-cell) and the combinational cell (C-cell). Actel recommends sequential 
elements to be implemented in one of three fault tolerant ways: CC, TMR, or TMR-CC [Lima-02]. 
The CC technique uses combinational cells with feedback to implement a storage element. 
This technique relies on the fact that latches are more prone to SEUs as compared to 
combinational logic. The TMR technique consists of implementing three registers to perform the 
same job and then using voting logic to determine the accepted output. TMR-CC is again a TMR 
technique but this time each register is composed of combinational cells with feedback.
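The voting logic of the TMR technique can be modelled in software as a bitwise majority function over three register copies. The following is a behavioural sketch only, not Actel's implementation; the helper `upset`, which models an SEU as a single bit flip, is a hypothetical name.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three register copies held as integers."""
    return (a & b) | (a & c) | (b & c)

def upset(word, bit):
    """Model an SEU by flipping one bit of a stored word."""
    return word ^ (1 << bit)

stored = 0b10110010
copies = [stored, upset(stored, 5), stored]  # one of the three copies is hit
voted = tmr_vote(*copies)
```

The voted value equals the stored value as long as no bit position is corrupted in more than one copy. Two upsets in the same bit position of different copies defeat the voter.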
As reprogrammable technology has been evolving at a rapid pace, a number of mitigation 
techniques have been proposed to cope with radiation issues. Readback with partial 
reconfiguration and redundancy are widely used [Gais-02, Lima-02] and are summarized 
below.
4.3.1 Bitstream Repair Technique
In this technique, the device configuration can be read back at any time without an interruption to 
service. The Virtex configuration interface has the additional capability of addressing small 
portions of the configuration map for read and write operations. This is referred to as partial 
configuration and provides an extremely efficient means for SEU correction [Carm-99]. These 
features allow two techniques for maintaining coherency of the bit stream. Scrubbing simply 
rewrites the device bit stream. Continuous read-back in conjunction with a detection algorithm 
(bit compare, CRC etc.) provides data on any upsets encountered. Then, partial reconfiguration 
repairs any section of the device where an error is detected [Fall-00].
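The two approaches can be contrasted with a simple software model. The sketch below treats the configuration memory as a list of frames held as bytes objects and uses CRC-32 as the detection algorithm. The frame size, the number of frames and all function names are assumptions made for illustration, and no real device interface is modelled.

```python
import zlib

def frame_crcs(frames):
    """CRC-32 of every configuration frame of the reference bitstream."""
    return [zlib.crc32(frame) for frame in frames]

def scrub(device_frames, golden_frames):
    """Scrubbing: rewrite the whole bitstream without checking it."""
    device_frames[:] = golden_frames

def detect_and_repair(device_frames, golden_frames, golden_crcs):
    """Read back each frame, compare CRCs, and rewrite only corrupted frames.

    Returns the indices of the frames that were repaired.
    """
    repaired = []
    for index, frame in enumerate(device_frames):
        if zlib.crc32(frame) != golden_crcs[index]:
            device_frames[index] = golden_frames[index]  # partial reconfiguration
            repaired.append(index)
    return repaired

golden = [bytes([n] * 16) for n in range(8)]  # hypothetical 8-frame bitstream
crcs = frame_crcs(golden)
device = list(golden)

corrupted = bytearray(device[3])
corrupted[2] ^= 0x10                          # single-bit upset in frame 3
device[3] = bytes(corrupted)

repaired = detect_and_repair(device, golden, crcs)
```

Read-back with detection reports which frame was upset, which scrubbing alone does not, at the cost of storing reference check values and performing the comparison.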
This technique does not cope with the fact that the device is likely to have been in error for some 
time before the error is detected and corrected, and in that time the state and consequences of the 
error are unknown.
4.3.1.1 An Example Architecture: Adaptive Instrument Unit (AIM)
The AIM is one of the experimental payloads on the Australian spacecraft FedSat. It makes use of a 
Virtex XQR 4062 FPGA to serve as a reconfigurable processing unit [Cond-99]. A block diagram 
of the AIM board is shown in Figure 4-2.
Figure 4-2 AIM Architecture, After [Cond-99]
The Xilinx FPGA operates under the control of a rad-hard UTMC 80C196 microprocessor, which 
loads the configuration into it from a flash memory. The configuration data are serially read out 
from the Xilinx device to perform error mitigation. The read-back function is performed using a 
combination of circuitry in an Actel FPGA device and software. To start the readback process, the
microprocessor sends a pulse to put the Xilinx device into read-back mode and initializes the 
circuitry in the Actel FPGA, which implements dual 128-bit shift registers. The read-back data 
are loaded into one shift register at a time. When the shift register is full, an interrupt is generated 
for the microprocessor and the read-back data are fed into the other shift register. When the first 
128 bits of data have been transferred to the second shift register, another interrupt is given to the 
microprocessor and it will initialize the loading of the next 128 bits into the first shift register, and so on. 
A complete readout of the configuration memory takes about one second. Next, the read-back 
configuration data are compared with the one pre-stored in the flash memory. Thus, configuration 
readback is a complex task.
4.3.2 Redundancy Techniques for FPGAs
TMR has been proposed at different levels for FPGAs. Carmichael et al. propose TMR within 
the device, i.e. all combinational and sequential elements of the design are triplicated, with static 
SEU correction through configuration scrubbing [Carm-04, Gais-02]. Devices that incorporate 
a partial reconfiguration capability allow enhanced performance for SEU correction.
Figure 4-3 FPGA TMR Design, After [Gais-02]
This approach is reported to mitigate errors in configuration as well as user memory. This 
technique has been tested for radiation effects and results were analyzed for designs with and 
without TMR (no bit stream collection was performed). Carmichael reports that no soft errors 
were observed with the TMR design. However, functional interrupts were seen. These were 
thought to be due to the accumulation of errors in the configuration memory. Scrubbing of the 
configuration bitstream was then applied, and an enhancement in robustness was observed [Carm-04]. 
3D Plus has adopted TMR and scrubbing to mitigate a Virtex-II FPGA to be used as part of 
their radiation tolerant memory unit for space application [Darg-05].
Xilinx offers a TMR tool, called Xilinx triple modular redundancy (XTMR), which can work with 
any hardware description language (HDL) and any synthesis tool to build TMR automatically 
into any Xilinx FPGA design [Bogr-05]. Recent results also confirm that the static
SEFI cross-section is the dominating factor for calculating orbital error rates for any Virtex-II 
design when mitigated with full XTMR and scrubbing [Carm-05].
The limitation of this TMR technique is obviously the overhead associated with TMR design -
i.e. this technique requires that the user logic should not exceed one third of the FPGA logic 
resources. However, in cases where the design exceeds 1/3 of the FPGA size, logic partitioning 
can be considered. This is shown in Figure 4-4. All outgoing signals, whether they are internal 
signals between units or design outputs, are required to be mitigated before going off-chip. 
Therefore, this scheme may be somewhat complex. Nevertheless, it can offer required 
performance as long as care is taken to not stretch critical design paths across multiple chips 
[Carm-99].
Figure 4-4 Logic Partitioning, After [Carm-99]
Logic duplication is another proposed mitigation technique [Carm-99, Gais-02]. The main 
disadvantage of this technique is the possible skew in the output transition times of the two 
devices.
Figure 4-5 Logic Duplication, After [Gais-02]
TMR at device level has also been mentioned as the most effective solution against SEUs and 
SEFIs. It is also the most costly solution, and is found to provide only a marginal actual 
improvement over alternative methodologies [Carm-99, Gais-02].
Figure 4-6 Device TMR, After [Gais-02]
4.4 Mitigation Technique for Data Networks
4.4.1 X2000 4-Layer Fault Tolerance Strategy
The X2000 program has adopted the IEEE 1394 bus as the main system bus. This is a commercial 
bus standard. The 1394 bus supports two modes of communication between nodes: asynchronous 
and isochronous. The isochronous mode guarantees on-time delivery but does not require 
acknowledgment from the receiver; it is therefore useful for audio or video data transfer. On the other 
hand, the asynchronous mode does not guarantee on-time delivery but offers increased reliability 
of data and is therefore useful for data file transfers. The 1394 bus standard supports different 
mitigations with both of the transfer modes.
A multi-layer fault protection scheme was applied to the system bus as follows [Chau-99]:
Layer 1: Native Fault Protection: The 1394 bus standard supports many built-in fault detection 
mechanisms as summarized below:
• Data and packet header CRCs for both transfer modes.
• Acknowledgment packets in asynchronous mode.
• Parity bit to protect acknowledgment packets.
• Built-in time-out conditions, response time-out, arbitration time-out, acknowledgement 
time-out.
Layer 2: Enhanced Fault Protection: The X2000 architecture enhances the fault detection 
capability of the target bus with “heartbeat” and polling.
Layer 3: Design Diversity: Most of the COTS devices suffer from the hazard of single point 
failure (transient or permanent). In case of the 1394 bus, its tree topology presents a serious risk 
to mission performance. Failure of a branch node can partition the bus into two or more branches. 
In the X2000 architecture, this limitation of the 1394 bus is dealt with by connecting all system 
nodes to a redundant bus with a different topology. The I2C bus has a multi-drop bus topology, 
and has been adopted to assist in the 1394 bus fault isolation and recovery.
Layer 4: System Level Redundancy: In high reliability applications, system level redundancy is 
considered as an important tool. The X2000 architecture duplicates the COTS bus sets to provide 
system level redundancy.
4.4.2 SSTL’s CAN Protection
Surrey’s satellites contain a dual redundant CAN bus system. On power up, a unit communicates 
on the primary link. If a unit does not receive any CAN message for 5 minutes, it assumes link 
failure and switches to the redundant link [Wood-04].
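The time-out rule can be sketched as a small state machine. The class and method names are hypothetical, time is passed in explicitly in seconds, and the behaviour after the switch (the unit simply stays on the redundant link) is an assumption, since it is not described in [Wood-04] as summarized here.

```python
class DualCanLink:
    """Selects between the primary and redundant CAN links of one unit."""

    TIMEOUT_S = 5 * 60  # five minutes without any CAN message

    def __init__(self, now=0.0):
        self.active = "primary"  # a unit starts on the primary link at power up
        self.last_rx = now

    def on_message(self, now):
        """Record the arrival time of any CAN message."""
        self.last_rx = now

    def tick(self, now):
        """Called periodically; switches link after TIMEOUT_S of silence."""
        if self.active == "primary" and now - self.last_rx >= self.TIMEOUT_S:
            self.active = "redundant"  # assume link failure
            self.last_rx = now
        return self.active

link = DualCanLink(now=0.0)
link.on_message(now=100.0)
before = link.tick(now=399.0)  # 299 s of silence: still on the primary link
after = link.tick(now=400.0)   # 300 s of silence: switch to the redundant link
```

Passing the current time as an argument keeps the sketch deterministic and easy to test; a flight implementation would read a hardware timer instead.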
4.5 Conclusions
Memory device technology has traditionally been mitigated against upsets in storage cells only. 
Hamming codes are a simple way to provide single bit correction; however, such codes may be 
ineffective against MBUs unless the device architecture is such that physically near-neighbouring 
cells do not map to logically near-neighbouring cells, and often they do. Majority voting can serve as 
another effective means to mitigate MBUs -  albeit with high storage overhead. In order to avoid 
the accumulation of MBUs due to coincident independent SEUs, the washing of memory systems 
is required to be carried out at a sufficient rate in comparison to the mean SEU rate.
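The limitation of single-error-correcting codes against MBUs can be seen with the textbook Hamming(7,4) code. The sketch below uses this code only as an illustration; the code word layout and function names are assumptions, and it does not describe the EDAC of any particular memory device.

```python
def hamming74_encode(data):
    """Encode 4 data bits as the 7-bit code word [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
code = hamming74_encode(word)

single = list(code)
single[4] ^= 1   # one upset: corrected by the decoder

double = list(code)
double[0] ^= 1   # two upsets in neighbouring cells of the same word:
double[1] ^= 1   # the decoder "corrects" a third bit and returns wrong data
```

Decoding `single` recovers the original word, whereas decoding `double` does not. This is why an MBU that falls within one logical word defeats the code, and why mapping physically adjacent cells to different logical words helps.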
Hardware-based EDAC schemes are recommended for the program (or CPU) memories, as the 
memory contents need to be corrected at the instant they are interrogated by the CPU. More 
robust but complex EDAC schemes, such as the Reed-Solomon (RS) code, can be used to protect 
SSDRs where the speed of access is not so critical, but memory overhead is.
Increasing reports of SEFIs in memory devices have led to a modified ground testing 
methodology, which takes into account the occurrence of SEFIs. The number of upsets is 
collected in the device under test and is compared against a predefined threshold to declare a 
SEFI.
Microprocessors are considered to be the most complex of logic systems. Although susceptible to 
single event effects, commercial microprocessors have successfully been deployed to space 
missions. Different mitigation strategies are available ranging from software watchdogs to the use 
of multiprocessor systems operating in lockstep with voting. Generally, a “safing” approach has 
been adopted, which uses simple watchdog timers, or redundancy has been deployed to bring 
availability and fault coverage.
Software fault tolerance as well as computer state recovery techniques have been used for 
terrestrial applications but have rarely been considered for spaceflight. Resetting and power 
cycling requirements associated with SEFI recovery advocate the adoption of state recovery 
techniques for space missions. Because of redundancy requirements of FER techniques, the BER 
techniques seem more attractive for small satellites.
Reconfigurable FPGAs are desirable but are prone to radiation effects. Prevailing practice is the 
combination of configuration scrubbing to protect configuration memory and TMR (or other 
redundancy based) techniques to mitigate radiation effects in user logic. This combination 
presents an excellent improvement on SEU response and pushes the cross-section for functional 
error for any design in any orbit to at least one order of magnitude below the established cross- 
sections for device level SEFIs. However, this combination does represent an expensive and 
complex solution.
Unit/device level approaches have always been considered for memories, microprocessors and 
FPGAs.
Data networks require additional layers on top of their native fault tolerance features. The 
magnitude of fault tolerance, which needs to be applied, will depend on mission requirements, 
network topology and its fault tolerance features.
All key data handling technologies are susceptible to SEFIs. However, mitigations have always 
been developed targeting individual technologies.
CHAPTER 5
PROPOSED FAULT-TOLERANT ARCHITECTURE
This research consists of developing an OBDH architecture using emerging commercial data 
handling technologies with a novel mitigation scheme to handle functional interrupts in the 
system. Chapters 1-4 discuss three levels at which mitigation for a SEFI can be addressed and 
present the different approaches that have been adopted so far. In this chapter, a proposed fault 
tolerant architecture is presented. A system level supervisor for all data handling units makes it a 
novel architecture. The proposed scheme uses a DAD packet from each unit to act as an indicator 
of health status of all key devices in the corresponding unit.
This chapter describes an OBDH network with example data handling units running under 
monitoring of a rad-hard supervisor. Key components of the scheme include a supervisor to 
provide global SEFI monitoring, a code store to hold recovery data, a data network to connect all 
units, and an interface node, which connects each unit with the network. The roles of each of 
these components will be described, establishing requirements for the proposed architecture. Fault 
coverage of the scheme will be discussed and will be compared with traditional approaches.
5.1 Desirable Features of the Architecture
Adoption of On-Board LAN: A LAN is a high-speed data network that covers a relatively small 
geographic area. LAN protocols operate at the physical and data link layers of the open system 
interconnect (OSI) model [Cisc-05]. There is a need for high-speed and reliable communications 
between scientific instruments, mass memory, computers and downlinks on spacecraft [Buch-03]. 
The use of commercial standards brings benefits of standard interfaces and minimizes 
development efforts. An on-board LAN in a standard configuration will simplify the interface 
between various spacecraft components. Desirable features of the network include [Buch-03]:
>  High data rate (> 100 Mbps)
> Low power interface, such as low voltage differential signalling (LVDS)
>  Scalable, i.e. increase bandwidth by adding nodes to the network
> Standards available to minimize development efforts
>  Simple to implement (low complexity)
> Robust i.e. components for space flight are available and adding mitigation such as 
redundancy is possible.
> Support equipment compatibility and reuse to reduce cost of development and to simplify 
integration and testing.
Several LAN card developments are currently under way and should provide 100 Mbps 
connectivity. These include [Schn-00]:
> JPL X2000 FireWire interface board
> Goddard Space Flight Center (GSFC) and ESA SpaceWire activities; flight-qualified 
network interface cards (NICs) are currently in development.
Provision of Internet Access to Spacecraft: NASA/GSFC operating missions as nodes of the 
Internet (OMNI) project is aimed at the demonstration of the use of standard Internet protocols 
for spacecraft communication systems [Rash-00]. This capability will become increasingly 
significant in the years to come as both Earth and space science missions fly more and more 
sensors and the present labour-intensive, mission-specific techniques for processing and routing 
become prohibitively expensive. Therefore, it is important to define an architecture that allows 
science missions to be deployed “faster, better and cheaper” by using the technologies that have 
been extremely successful in terrestrial Internet. In principle IP could be transported across 
standard serial or even MIL-STD-1553 interfaces [Schn-00]. Most of the recent SSTL missions 
carry the capability of communicating to ground over user datagram protocol (UDP)/Internet 
Protocol (IP).
Reusability: An OBDH architecture should be capable of supporting more than one space mission 
by simply plugging in different payloads.
Scalability: A system, whose performance improves after adding hardware, proportionally to the 
capacity added, is said to be a scalable system.
Fault Tolerance: The system should be fault tolerant and should recover autonomously as far as 
possible, particularly for deep space missions. In order to provide 
autonomous recovery, the system will require non-volatile storage to hold back-up code, 
configurations etc. Floating gate memories are available in large densities, which make them 
attractive to serve this purpose.
5.2 Candidate LANs for Spaceflight
The avionics bus MIL-STD-1553 and its optical derivative, 1773, are commonly used between 
spacecraft components. The raw data rate is 1 Mbps. It is a master-slave architecture. For point- 
to-point connections that do not require the complexity of a 1553/1773 connection, a serial 
connection such as RS-422/23 with a bit rate around 1 Mbps is typically used.
SSTL has a long history of flying the CAN bus on board. It is a serial protocol with data rates 
up to 1 Mbps. It has been used for telemetry and telecommand purposes only. For fast data 
transfers between units, SSTL uses point-to-point LVDS links.
LAN options that are of particular interest for space use are compared in Table 5-1 [Buch-03, 
Stak-01].
Table 5-1 Comparison of LANs
Features | SpaceWire (IEEE 1355) | FireWire (IEEE 1394) | Ethernet (IEEE 802.3)
Data rate (Mbps) | 400 | 400 | 100
Topology | Point-to-point (full duplex) | Tree (half duplex) | Point-to-point (full duplex)
Cable length (m) | 10 | 4.5 | 10
Control | Router-nodes | Master-slave | Router-nodes
Power consumption | Low | Higher than SpaceWire, lower than Ethernet | Higher than both SpaceWire and FireWire
Scalable | Yes | No | Yes
Radiation tested | Yes | Yes | No
Fault tolerance | Parity error, escape sequence error, character sequence error, credit error, disconnect error, error end of packet (EEP) received, invalid destination address, link establishment, packet transmit, packet receive time-out errors | CRC error, parity error, arbitration and acknowledgement time-out errors | CRC error detection
This work has explored the SpaceWire standard because of its attractive features (such as high 
speed, low power consumption, scalability, reusability, router-based architecture, low packet 
overheads, fault tolerance features etc.) and its wide acceptance by the space industry.
5.2.1 SpaceWire
SpaceWire is based on two existing commercial standards, IEEE-1355 and LVDS that have been 
combined and adapted for use on-board spacecraft [Park-99]. SpaceWire provides for very fast 
data transfer (a minimum of 2 Mbps, with a capability up to 400 Mbps) at low power 
(~5mW/Mbps @ 100Mbps). It is scalable, and provides a high degree of isolation between 
systems -  avoiding problems such as powering up from interconnections between systems etc.
JPL have similarly identified SpaceWire, along with the IEEE 1394 FireWire to be of interest for 
their X2000 COTS-technology based data handling architecture [Chau-01].
SpaceWire is a full-duplex serial point-to-point network, in which nodes are interconnected by 
routing-switches. It provides a coherent interface to processors, mass memory units and sensors 
etc. Whilst there are similarities to the IEEE 1394 bus standard at the physical layer, the higher 
layers of the SpaceWire protocol are much simplified in comparison. This makes implementation 
easier (a basic SpaceWire link can be implemented in 5000 gates) and keeps overheads acceptable 
even on smaller packets (<35%), but it does preclude some useful functions, such as the ability to 
support broadcast or multicast.
5.2.2 Radiation Response of SpaceWire
The SpaceWire standard describes the hardware and software necessary for implementation of the 
protocol. The radiation response including SEUs and total dose effects would depend on the 
protocol implementation chips.
The proposed supervisory approach does not guard against destructive radiation effects such as 
SEL, SEGR and TID. Therefore, the choice of a FPGA to act as the interface FPGA will be 
driven by its robustness against these destructive effects for a particular mission type. For 
example, test results of the latest commercial Actel SX series have shown no destructive events 
up to an LET of 100 MeV·cm²/mg [Koga-00, Facc-00]. Actel reported a TID of 100 krad for this 
family of FPGAs [Cron-98]. Antifuse FPGAs from this series have been flown on Surrey’s DMC 
satellites to act as a coder decoder (CODEC) FPGA, which codes and decodes UDP/IP packets. 
For severe radiation environments or missions with increased robustness requirements, radiation 
tolerant or rad-hard devices can be used.
5.2.3 The SpaceWire Router
SpaceWire is a full-duplex, bi-directional, serial, point-to-point data link. It encodes data using 
two differential signal pairs in each direction. That is a total of eight signal wires, four in each 
direction [Park-99]. A router is therefore recommended to connect units on the network to reduce 
harness mass [Guas-99]. A router-based architecture has many advantages over a bus network 
topology [Walk-01]. However, it is prone to a single point failure and therefore, the router’s 
reliability is an important issue. Several rad-hard router designs have been developed [Esa-03].
For example, an Atmel ASIC-based SpaceWire router with 10 ports is reported to have a total 
dose performance of 300 krad, and SEU and latchup immunity up to 100 MeV. This router is 
capable of running at a maximum baud rate of 200 Mbps and its power consumption is 4 watts at 
maximum data rate [Fisc-04].
In addition to rad-hard designs, redundancy-based solutions have also been put forward. For 
example, a dual redundant SpaceWire network is presented in figure 5-1. Redundancy is provided 
in this example network by the use of redundant links and a pair of routing switches. If data are 
being sent from Sensor 1 to Memory 1 via Router 1 and the link between the sensor and the router 
fails then data can be sent from Sensor 1 to Memory 1 via Router 2 [Park-03].
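The route selection in such a dual redundant network can be sketched as a simple lookup. The link-status dictionary and the function name are assumptions made for illustration; the sketch does not describe how SpaceWire routers themselves detect or report link failures.

```python
def choose_router(source, destination, link_ok,
                  routers=("Router 1", "Router 2")):
    """Return the first router whose links to both end points are healthy."""
    for router in routers:
        if (link_ok.get((source, router), False)
                and link_ok.get((destination, router), False)):
            return router
    return None  # no complete path is available

# Hypothetical link status table: every unit has one link to each router.
links = {(unit, router): True
         for unit in ("Sensor 1", "Memory 1")
         for router in ("Router 1", "Router 2")}

nominal = choose_router("Sensor 1", "Memory 1", links)

links[("Sensor 1", "Router 1")] = False  # link between sensor and router fails
fallback = choose_router("Sensor 1", "Memory 1", links)
```

With all links healthy the primary router is chosen; after the sensor's link to Router 1 fails, the same transfer is routed via Router 2.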
Figure 5-1 A Dual Redundant SpaceWire Network, After [Park-03]
As described in chapter 4, dual redundancy has been used at Surrey as well to tolerate permanent 
faults in devices and interfaces.
5.3 Proposed Fault-Tolerant Architecture
An advanced architecture reflecting SOTA trends in constructing OBDH architectures and device 
technologies is shown in figure 5-2a. Inclusion of an intelligent rad-hard supervisor in this 
architecture makes this a novel solution to the reliability problems of the increasingly complex 
COTS-based spacecraft (Figure 5-2b). The final system will actually be a dual redundant system
as shown in figure 5-1. There will be two router devices, one primary and the other 
redundant. The supervisor will control these routers. For the sake of simplicity, the dual 
routers and connections are not shown in figure 5-2. The choice of router device again depends 
on the particular mission type. System level redundancy, i.e. layer 4 of the X2000 fault tolerance model, 
which has also been used at Surrey, can be included for higher reliability requirements. Again 
for simplicity, it is assumed in this work that there will be no permanent fault in any unit. In order 
to cope with permanent faults, the system will require at least cold redundant units. In such a 
situation, the supervisor will be updated to hold all system configurations.
Figure 5-2a (Left): High Performance OBDH Architecture, Figure 5-2b (Right): Proposed OBDH
Architecture
The system shown in figure 5-2 provides the core OBDH functions and comprises different 
device technologies. The supervisor will take care of the network interfaces for all units. The system 
nodes are either intelligent (containing a microprocessor within the unit, e.g. OBC and payload 
systems) or non-intelligent (memory units, e.g. SSDR and code store). In intelligent nodes, three 
components are considered as possible fault sources: the microprocessor, its program 
memory and the network link that connects the unit to the supervisor. In non-intelligent nodes, 
two components will be monitored: the memory devices and the network link. The health of a
network link will depend upon the interface nodes of both the supervisor and the unit in question. 
Other units, including attitude determination and control (ADC), transmitter/receiver and power 
system, are also categorized as non-intelligent units. The ADC and power system are based on 
various sensors, whereas the transmitter/receiver unit is responsible for ground communication. 
These units will be tested for the health of their network interfaces only.
As depicted in Figure 5-2b, the supervisor does not have a direct connection with OBDH devices. 
Instead, it communicates with the OBDH units by means of network 
packets. In order to collect health information about target devices, the supervisor will expect to 
receive periodic packets from the OBDH units. These packets need to carry information that 
the supervisor can use to detect and diagnose a problem in a unit. Because of its role, such a 
packet will be called a detection and diagnosis (DAD) packet.
5.3.1 Requirements for the Supervisor
The supervisor requires two types of information.
1. Designer inputs: These are the threshold values set by the designer of the 
architecture, together with the recovery methods and their priorities.
2. On-line data: This includes parameters that will be monitored during the operating period 
of the satellite. Mainly, it will include DAD packets from each unit. In addition, the 
supervisor can be made responsible for monitoring the radiation environment too (section 
5.5).
The supervisor needs to perform the following duties:
> Collecting periodic DAD packets from OBDH units
> Comparing parameters contained in DAD packets with designer inputs
> Based on these comparisons, issuing appropriate recovery commands
> Keeping records of the recoveries applied
> Coordinating the use of the code store among underlying units
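The duties listed above can be sketched as a single evaluation step. This is only a minimal illustration and not the supervisor design detailed in later chapters: the packet fields (`seu_count`, `current_ma`), the threshold values, the recovery names and the escalation rule (try the next recovery in priority order if a unit stays faulty) are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class DadPacket:
    """Hypothetical DAD packet contents (illustrative fields only)."""
    unit: str
    seu_count: int
    current_ma: float


@dataclass
class DesignerInputs:
    """Designer-set thresholds and recovery priority order (assumed values)."""
    max_seu_count: int = 10
    max_current_ma: float = 500.0
    recoveries: tuple = ("reset", "power_cycle")


@dataclass
class Supervisor:
    inputs: DesignerInputs
    log: list = field(default_factory=list)       # record of recoveries applied
    attempts: dict = field(default_factory=dict)  # unit -> recoveries tried so far

    def evaluate(self, packet):
        """Compare one DAD packet with the designer inputs; return a recovery or None."""
        healthy = (packet.seu_count <= self.inputs.max_seu_count
                   and packet.current_ma <= self.inputs.max_current_ma)
        if healthy:
            self.attempts[packet.unit] = 0
            return None
        # Escalate through the recovery list in priority order.
        step = min(self.attempts.get(packet.unit, 0), len(self.inputs.recoveries) - 1)
        recovery = self.inputs.recoveries[step]
        self.attempts[packet.unit] = step + 1
        self.log.append((packet.unit, recovery))
        return recovery
```

Coordination of the code store, the last duty in the list, is not covered by this sketch.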
Because of the supervisor’s capability of intervening with OBDH units (particularly its capability 
to invoke a unit’s reset or power cycling), it is necessary to ensure that the supervisor itself is 
not prone to catastrophic errors. Chapters 6 and 7 go into the details of the supervisor and 
demonstrate that it requires only modest computing resources. An 
8051 microprocessor is expected to meet the requirements of the supervisory features presented 
in this thesis. The following rad-hard microprocessor options can be 
considered for implementation of the supervisor.
5.3.1.1 Radiation Hardened Microprocessors
Radiation-tolerant and radiation-hardened MCS8051 instruction-compatible microprocessors 
have been reported in [Laud-05]. These microprocessors are implemented in Actel 
radiation-tolerant RTAX-S and Aeroflex radiation-hardened UT6325 FPGAs respectively. They 
include 256 bytes of TMR internal data RAM and 1,536 bytes of on-chip TMR 
extended data (XDATA) RAM. Both can be directly interfaced with Honeywell’s 64k x 
16 radiation-hardened random-access memory, yielding an unprecedented two-chip 
radiation-hardened/tolerant microcontroller solution capable of reliable operation in 300 krad (Si) TID 
environments. At its maximum operational frequency of 28 MHz, the radiation-tolerant computer 
consumes about 280 mW and can execute instructions at a rate of up to 7 MIPS. The rad-hard 
version has a maximum operating frequency of 16 MHz at about 350 mW and an instruction 
execution rate of up to 4 MIPS. Each unit is priced at about $8K.
The Dynex Semiconductor MA31750 is a single-chip 16-bit microprocessor that implements the 
full MIL-STD-1750A instruction set architecture. It is fabricated on silicon-on-sapphire (SOS), 
a hetero-epitaxial IC manufacturing process used primarily 
in aerospace and military applications because of its inherent resistance to radiation. In addition, the 
MA31750 has on-chip parity generation and checking to enhance system integrity. A 
comprehensive built-in self-test has also been incorporated, allowing processor functionality to 
be verified at any time. This rad-hard microprocessor consumes 400 mW and is priced at 
about £9K [Summ-06].
Another approach to building a rad-hard computer is through the use of fault tolerance techniques on 
commercial microprocessors. A single board computer built from TMR of three commercial 
PowerPC 750 processors, where the voter logic is implemented in an Actel radiation-tolerant FPGA, has been 
reported to have radiation performance comparable to or better than the rad-hard RAD6000 
microprocessor [Hill-03]. This approach is actually targeted at providing high performance 
rad-hard computing. It can therefore be adopted if one wishes to integrate more functions 
into the supervisor. Examples include increased checking of network interfaces using more frequent 
“I am okay” messages, reading back critical design sections of interface FPGAs and any 
reconfigurable FPGAs in the system and comparing them with correct values stored at the supervisor, 
applying a test pattern to SSDRs, and holding the configuration of the system.
5.3.2 Role of Interface Node
Each OBDH unit is connected to the SpaceWire network through an interface node. The interface 
node consists of a programmable device to handle the software aspects (protocol) of the SpaceWire 
network, whereas the hardware specifications for data transfers are met through LVDS drivers. 
This thesis does not go into the details of these hardware specifications, and the focus is therefore on the 
programmable part. In addition to the SpaceWire protocol, the programmable part is required to 
handle UDP/IP as well. Generally, an OTP FPGA, an application specific integrated circuit 
(ASIC) or a microcontroller is used to provide SpaceWire protocol handling. As mentioned 
earlier, SpaceWire does not require large computing (or FPGA gate) resources. Therefore, it 
may be possible to make a single programmable device responsible for both SpaceWire and 
UDP/IP. However, this is not done in cases where the network interface is provided by 
commercially available communication controllers/ASICs/FPGAs. For example, Surrey’s UK 
DMC satellite employs UDP/IP over high level data link control (HDLC) for communication with 
the ground station. A commercial microcontroller is used to handle HDLC and a separate OTP FPGA 
is used as the UDP/IP CODEC [Plan-05]. Another example is presented in chapter 7 of this thesis, 
where SMCSlite is used as the SpaceWire interface controller with a separate UDP/IP CODEC 
FPGA. In short, the interface node contains either an interface FPGA or a CODEC FPGA with an 
interface controller. From now onwards it is assumed that the interface node consists of a 
communication controller and a CODEC FPGA. Of these two components, the CODEC FPGA 
will be used for implementation of any special functions to assist the supervisory diagnosis and 
recovery.
The interface node is required to serve the following purposes:
> It must be able to interpret a network packet.
> For systems that lack a processor (and therefore are not naturally executing tasks), the 
CODEC FPGA will run a system-monitoring task involving, for example, the washing of 
memories to establish SEU rates.
> For intelligent nodes, it needs to interpret only those packets that are intended for it.
> It needs to interpret recovery requests from the supervisor and should have a physical 
connection through which to apply the actual recovery. For instance, it needs to be connected to the interrupt 
(INTR), non-maskable interrupt (NMI) and reset pins of the OBC processor. Details will 
be presented in the following chapters.
5.3.3 Role of Code Store
As mentioned in Chapter 1, one of the objectives of this research is to improve the availability of 
spacecraft data handling subsystem. Also, it was mentioned that system recovery after a SEFI 
usually requires reloading any configurations, operating system etc. In the case of an 
SSTL OBC, for example, the operating system and user tasks are stored in volatile RAM and need to be 
reloaded after each processor reset or power reboot. In order to ensure a speedy recovery, one
needs to provide this recovery information to the OBC as soon as possible. This requires the 
inclusion of on-board non-volatile storage. Because of their large densities, low power 
consumption and low prices, floating gate memories are a natural choice. However, these devices 
are prone to SEFIs and need to be monitored too.
Information stored in the code store is usually large. For instance, at SSTL the size of a typical 
OBC task is about 45 kbytes [Jack-05], and usually about 10 tasks are designed for the OBC, resulting 
in an information size of about 450 kbytes. Uploading this information from the ground after an OBC crash 
takes about 20 minutes; two ground passes will therefore be required to complete this 
process for a LEO spacecraft.
If 450 kbytes of information are to be transferred on a CAN bus at 1 Mbps, at least 3.6 s 
will be required to complete this process. In fact, SSTL uses an extended protocol called CAN for 
spacecraft use (CAN-SU), which has 40 bits of packetization overhead with a maximum data 
field of 32 bits (4 bytes) [Wood-04], which will result in 21.6 s for the same data transfer. On the 
other hand, a fast network such as SpaceWire, with a speed of 200 Mbps and a packet header of 1 
byte with an arbitrarily large packet, would take 18.04 ms for the same amount of data. The 
connection speed between the OBDH units and the code store is therefore an important issue.
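The orders of magnitude quoted above can be checked with a simple payload-over-line-rate model. The sketch below is only a back-of-envelope aid under stated assumptions: it ignores character encoding, arbitration and inter-frame gaps, and the function and parameter names are illustrative. It reproduces the payload-only CAN figure of 3.6 s and a SpaceWire figure of about 18 ms. It does not attempt to reproduce the CAN-SU figure, which depends on framing details beyond the 40 bits of packetization overhead quoted here; the per-packet parameters are provided so that such a model can be plugged in.

```python
def transfer_time_s(payload_bytes, link_bps,
                    data_bits_per_packet=None, overhead_bits_per_packet=0):
    """Time to move payload_bytes over a link, with optional per-packet overhead."""
    payload_bits = payload_bytes * 8
    if data_bits_per_packet is None:
        # One arbitrarily large packet: the overhead is paid once.
        total_bits = payload_bits + overhead_bits_per_packet
    else:
        packets = -(-payload_bits // data_bits_per_packet)  # ceiling division
        total_bits = payload_bits + packets * overhead_bits_per_packet
    return total_bits / link_bps


CODE_SIZE_BYTES = 450_000  # about 10 tasks of about 45 kbytes each

raw_can = transfer_time_s(CODE_SIZE_BYTES, 1_000_000)
spacewire = transfer_time_s(CODE_SIZE_BYTES, 200_000_000,
                            overhead_bits_per_packet=8)

print(f"CAN at 1 Mbps, payload only: {raw_can:.1f} s")
print(f"SpaceWire at 200 Mbps, 1-byte header: {spacewire * 1e3:.2f} ms")
```

Whatever overhead model is used, the ratio between the two link speeds dominates, which is the point the comparison makes.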
Either a centralized/global code store can be used, which naturally presents benefits of reusability 
and centralized fault tolerance, or it can be distributed within each unit according to its 
requirements. In the latter case, the DAD task running in each unit will be required to take care of the 
non-volatile memory too. Power cycling requirements associated with floating gate memories will 
result in more frequent cold rebooting of each OBDH unit. The former approach requires a fast 
data network between code store and other OBDH units, and has been adopted for the proposed 
SpaceWire based architecture.
5.3.4 Requirements for an OBDH Unit
This scheme will require other kinds of mitigation within each unit for the following reasons:
> The proposed mitigation scheme addresses SEFIs. It will not protect against ordinary 
SEUs, e.g. SEUs in memories, or against SEEs in which the current can instantaneously exceed the current rating of a 
device and for which periodic monitoring is therefore not acceptable.
> As the supervisor is not directly connected with the target devices, it will require a way of 
collecting the required information from each unit. Mitigation techniques adopted within 
a unit can be helpful in providing such diagnostic information. For instance, a SEFI in a 
memory device can be detected by monitoring its SEU count, as the number of upsets 
caused by a SEFI is huge compared to normal SEU behaviour. In order to produce an 
SEU count for memory devices, some kind of error detecting code is required.
In addition to the above mitigation techniques, each OBDH unit is required to have the following 
functionality:
> Each unit needs to support a DAD task, which will take care of all supervisor-related 
tests in that unit. This task will run on the processors of the computer-based units in the 
system, whilst in non-intelligent nodes the interface nodes will be required to perform this 
task.
> Each OBDH unit needs to be isolated from back power to make a power reboot of the unit 
effective.
> Because of the power cycling requirements associated with SEFI recovery, and to improve 
reusability and allow shared memories, it is recommended that the OBDH units be small. 
For instance, in NASA’s X2000 avionics architecture, the OBDH units comprise 
small computer and memory slots, which can be plugged into or unplugged from the data network 
according to the requirements of the mission.
5.4 Overview of the Supervisory Protocol
It is important to note that the supervisor will be monitoring sub-systems with different classes of 
device -  e.g. microprocessors, DRAMs, floating gate memories, FPGAs etc. - which may all 
exhibit several different signatures as the result of a SEFI. Recovery strategies will also be 
different in each case, and the priority accorded to different signatures might also differ.
The supervisory protocol is defined as the set of rules that make realization of the supervisory 
mitigation possible. It requires determining the sources of error in a target unit. It then looks into the 
possible fault signatures and ways of detecting them. Finally it presents possible recovery 
methods. Chapter 6 will elaborate the supervisory protocol concept with a case study, in which the 
monitoring of an OBC unit will be investigated.
5.4.1 OBC
Consider a typical multiprogramming environment, where all the software, including the 
operating system, is organized into a number of processes that can execute in parallel. Each 
process is loaded into EDAC-protected memory as a separate executable file. One of the 
processes will be a special process, called the DAD task. This process will contain a test 
sequence. The DAD task will be invoked periodically, which will execute this test sequence. The 
current consumption of the unit will be monitored during execution of the test sequence. The 
DAD task will also collect the SEU count for the program memory. All of these pieces of 
information will be sent to the supervisor where these will be tested against expected values. If 
there is a mismatch, the supervisor will start its recovery cycle.
5.4.2 Payload System
A payload system can be a commercial computer-based unit like the OBC. However, a 
reconfigurable computer-based payload is assumed here. The approach taken for the OBC can be 
adopted for this unit as well. A packet time-out/non-responsive FPGA will be indicative of a 
system upset or configuration upsets, whereas a mismatch of the test sequence results can be a 
result of configuration or user logic upsets.
The body of literature on mitigation techniques for reconfigurable FPGAs has focused almost 
exclusively on the paired tactics of TMR and configuration scrubbing. Xilinx argues that the costs 
of TMR and configuration scrubbing must be individually weighed against their benefits, taking 
into consideration a range of mission characteristics. Some scenarios will call for one mitigation 
technique, but not the other; still other scenarios will call for neither mitigation technique. Other 
less expensive mitigations such as periodic reset and periodic full reconfiguration should also be 
considered [Brid-05].
In this work, packet time-out/non-responsive FPGA and mismatch of test sequence will be used 
to detect a fault in a reconfigurable payload computer. After detecting a fault, a reset followed by
a full reconfiguration can be applied. Depending upon mission requirements, scrubbing can be 
used instead of full reconfiguration.
5.4.3 SSDR
SSDR will comprise SDRAM devices. The memories will be organised in banks with separate 
power switches, so that in the event that power has to be cycled through a device, only a 
proportion of data will be lost. As with previous Surrey satellites, these bulk memories will be 
protected by Reed-Solomon EDAC codes. An error count higher than expected will be used to 
detect a SEFI.
It is reported that the probability of SEFI occurrence can be reduced by periodically rewriting the 
mode register in SDRAM devices [Koga-01, Guer-04]. A useful measure can therefore be to 
configure the supervisor to periodically rewrite the mode registers in the memory banks.
5.4.4 Code Store
The code store will be based on floating gate memory and will hold the back-up code, the 
operating system for the OBC, checkpointed computer task states, and the 
configurations for any reconfigurable FPGAs.
The supervisor will be used to coordinate the use of the code store as well as to monitor time-out 
conditions for each transaction with the code store. Whenever a unit needs access to the code store, 
it will send a code store service request to the supervisor. The supervisor will allow access to the 
code store to only one unit at a time. After granting access to a unit, the supervisor will set up a 
timer corresponding to the maximum duration of a code store transaction. In normal conditions, the 
supervisor will receive a request clear packet from the unit that was granted access before the 
time-out occurs, indicating the end of the transaction, and can then allocate the code store to the 
next request in the queue. Otherwise, it will start its diagnosis and recovery on the code store. 
In this way, it will monitor the code store for non-responding conditions.
In addition to looking for time-out conditions, the code store’s SEU rate will also be collected and 
a SEFI will be declared for a count greater than a threshold. In order to produce the SEU count, an 
error detecting code will be used.
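The request, grant and time-out handling described in this section can be sketched as a small arbiter. This is an illustration under assumptions rather than the thesis implementation: the class and method names are hypothetical, time is passed in explicitly as `now` to keep the example deterministic, and the diagnosis and recovery triggered by a time-out are left to the caller.

```python
from collections import deque


class CodeStoreArbiter:
    """Grants the code store to one unit at a time and watches for time-outs."""

    def __init__(self, max_transaction_s):
        self.max_transaction_s = max_transaction_s  # designer-set limit (assumed)
        self.queue = deque()   # pending code store service requests
        self.holder = None     # unit currently granted access
        self.deadline = None   # time at which the current grant expires

    def request(self, unit, now):
        """Handle a service request; return the unit granted access, if any."""
        self.queue.append(unit)
        return self._grant_next(now)

    def request_clear(self, unit, now):
        """Handle a request clear message from the unit holding the code store."""
        if unit == self.holder:
            self.holder = None
            self.deadline = None
        return self._grant_next(now)

    def timed_out(self, now):
        """True if the current transaction has exceeded its maximum duration."""
        return self.holder is not None and now > self.deadline

    def _grant_next(self, now):
        # Grant access only when the code store is free and a request is waiting.
        if self.holder is None and self.queue:
            self.holder = self.queue.popleft()
            self.deadline = now + self.max_transaction_s
            return self.holder
        return None
```

A caller would poll `timed_out` periodically and start diagnosis of the code store when it returns true.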
5.5 Environment Monitoring
As well as monitoring the systems on the network, the supervisor will have access to a radiation 
environment monitor such as Surrey’s CRE/CEDEX payload [Unde-94, Unde-02]. These 
instruments are capable of producing counts of the incident particles on the spacecraft electronics. 
When the count rate exceeds some pre-set threshold, a flag is raised that informs the supervisor 
that the environment is hostile.
For a satellite in LEO, the probability of getting an SEE may be much higher when traversing the 
SAA (Figure 5-3), or when crossing the polar regions during a solar particle event. Similarly, for 
a satellite in GEO, a normally benign SEE environment might suddenly become much more 
severe due to a solar particle event. Thus, real-time knowledge of the environment can enable the 
supervisor to take special, adaptive, measures in harsh radiation conditions. Device health checks 
can be made more often (e.g. increased washing of memories) and other precautionary measures 
can be taken (e.g. the avoidance of read/write operations on flash memories).
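The adaptive behaviour described above reduces to a threshold test on the particle count rate. The sketch below is a hypothetical illustration only: the interval values, the function names and the rule for flash access are assumptions chosen to mirror the examples given in the text.

```python
def check_interval_s(count_rate, threshold,
                     normal_interval_s=60.0, hostile_interval_s=10.0):
    """Return how often device health checks should run for a given count rate."""
    hostile = count_rate > threshold  # flag raised by the environment monitor
    return hostile_interval_s if hostile else normal_interval_s


def flash_access_allowed(count_rate, threshold):
    """Precautionary rule: avoid flash read/write operations while the flag is raised."""
    return count_rate <= threshold
```

In a benign environment the supervisor would keep the longer interval; when the flag is raised it would check more often and defer flash operations until the count rate drops again.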
[Figure: KITSAT-1 Cosmic Ray Experiment map. Channel 2, LET range 59.5 - 119 MeV/(g/cm^2); 150 second integration time; horizontal axis: longitude; scale: counts.]
Figure 5-3a: Proton Environment Measured by the KITSAT-1 Cosmic-Ray Experiment (CRE) at 1330 
km Altitude, After [Unde-96]
[Figure: upsets plotted against longitude.]
Figure 5-3b: S80/T Program Memory Upsets at 1330 km Altitude, After [Unde-96]
5.6 Comparison with Other SEFI Mitigation Approaches
A system-level SEFI tolerance technique has been presented, which is best suited to high 
performance SOTA data handling architectures. An intelligent supervisor is responsible for 
monitoring the heterogeneous OBDH units, with diverse device technologies.
The proposed scheme offers an improvement on simple watchdog timers. Although it relies mainly 
on timers, this scheme expands the error detection capabilities of the system. In contrast to the 
periodic, non-concurrent fault coverage of the proposed scheme, lockstep or voting strategies offer 
concurrent fault masking. However, synchronization of commercial processors in lockstep and 
voting is a challenging task. Voting can be implemented at different levels. In order to avoid 
synchronization issues, it may well be possible to apply voting at task level, i.e. three or more 
processors will be assigned the same task and the results will be compared, instead of synchronizing 
their clocks and voting on the output of each instruction. However, the resource overhead will 
increase drastically with an increasing number of components/units in the OBDH architecture. 
Most importantly, not all mission scenarios require such a high level of fault coverage, and 
it will therefore be desirable to cut down the cost of mitigation.
SEFI mitigations for reconfigurable FPGAs are also a topic of ongoing research. As in the case of 
microprocessors, the focus is on two broad classes of mitigation: scrubbing of the configuration 
bitstream and inclusion of redundancy to make a design robust. This research presents the novel 
concept of monitoring a reconfigurable FPGA on the basis of its functionality. It proposes to run 
periodic DAD tasks to test the health status of the FPGAs. A recovery procedure will be invoked 
on detecting an anomaly.
In the context of data networks, the supervisor’s function corresponds to layer 2 of the JPL fault 
tolerance model. Having been equipped with the supervisor, the system will be able to cope with 
SEFIs in data networks. Depending upon the network topology, it may or may not require design 
diversity to support detection and diagnosis; e.g. if SpaceWire is considered, design diversity is 
not necessary, as a redundant link can be used if a link fails. Adoption of Layer 4, which consists 
of duplicating the data network, is necessary to assist the diagnosis and recovery of an interface 
node and to tolerate a permanent fault in the system.
CHAPTER 6
SUPERVISORY PROTOCOL FOR THE OBC
Chapter 5 presented the proposed fault tolerance strategy. The supervisor sits on a data network 
and expects to receive periodic packets from underlying units to evaluate their health. However, 
the packet type, format and contents are yet to be described. Also, how the collected information will 
be used to diagnose a fault, what recoveries will be applied and in what order need to 
be defined in order to evaluate the practicality of the proposed approach. This chapter addresses 
these issues. An OBC has been chosen to act as a case study for demonstration of the proposed 
scheme.
6.1 OBC Unit
The SSTL 386 OBC subsystem is shown in figure 6-1. It consists of a processing unit 
(including a 386 processor and a 387 coprocessor), 4 Mbytes of EDAC-protected program memory 
(in this case the EDAC is implemented using a TMR design), a software-EDAC-protected 128 
Mbyte data recorder, an EPROM-based bootloader, CAN and synchronous serial channels (SCC, 
to support the HDLC downlink) for external connectivity, and bus controllers [Sstl-05].
[Block diagram: 386EX processor and 387SL coprocessor, TMR program memory (2M*8 devices), 32K*8 EPROM, SCC + DMA channels, bus arbiter, SEB bus controller, CAN nodes and DC/DC converter.]
Figure 6-1: SSTL OBC-386, After [Sstl-05]
A modified and simplified OBC is shown in figure 6-2. The data recorder has been removed from 
the design and it is assumed that the system will consolidate all data recording in the shared SSDRs. 
The CAN and SCC interfaces have been replaced by a SpaceWire interface node.
Figure 6-2: The OBC Unit
6.1.1 Spacecraft Operating System
It is assumed that the OBC processor is running the spacecraft operating system (SCOS), which 
has previously been flown on SSTL missions. It is a multitasking operating system that is in 
control of the processor at the lowest level and controls its hardware interrupts and timers. The 
operating system performs system level functions such as operating the task scheduler, task 
switching and inter-task communication. It provides a pre-emptive 
multitasking environment with a pre-emption period of 10 ms and time slicing of 100 ms [Bekt-92].
6.2 Assumptions
• The SpaceWire standard does not specify a limit on the packet length. In order to bound 
communication delays, a maximum packet size of 1024 bytes is assumed.
• It is assumed that the supervisor is part of the router unit, and therefore potentially has a 
dedicated link to all other nodes in the system.
• The supervisor is assumed to be radiation-resilient and therefore there is negligible 
probability of getting an error on the supervisor.
• A UDP/IP over SpaceWire communication is assumed between the OBC and the 
supervisor.
• The interface node consists of a SpaceWire communication controller and a CODEC 
FPGA.
6.3 SEFI Detection Policy
In order to detect a SEFI the supervisor will require the following data:
• A periodic “I am okay” message from the OBC to check that the processing unit and 
SpaceWire link are still alive.
• Using screech and state recovery task to detect the crashing of an OBC task.
• Any diagnostic data, e.g. SEU count of the memory unit, indication of any calculation 
error etc.
• Current consumption of the OBC unit.
As mentioned earlier, the OBC runs a multitasking operating system where each task is executed 
for a time slice before switching to the next task. There are three schemes that can be used by the 
supervisory approach.
1. Sending an “I am okay” packet to the supervisor whenever the OBC starts a new task, and 
collecting some health information during execution of each task.
2. Adding a special task to the existing tasks that is specifically designed to perform 
diagnosis on the unit.
3. A hybrid scheme: an “I am okay” packet is sent whenever a task starts execution, and a 
special task is still added to perform diagnosis on the node.
All three schemes are compared in table 6-1.
Table 6-1 Comparison of the OBC Monitoring Schemes

| Scheme # 1 | Scheme # 2 | Scheme # 3 |
|---|---|---|
| ✓ Low detection latency | ✗ Higher detection latency | ✓ Low detection latency |
| ✗ Cannot detect calculation errors | ✓ Can detect calculation errors during execution of the special task | ✓ Can detect calculation errors during execution of the special task |
| ✗ Cannot detect current consumption variations (cannot be meaningful in a dynamically changing environment) | ✓ Can observe current consumption behaviour during execution of the special task (interrupts are disabled) | ✓ Can observe current consumption behaviour during execution of the special task (interrupts are disabled) |
| ✗ Each task must be modified to send a start of task packet whenever it starts execution on the processor | ✓ Does not require modification of the user tasks | ✗ Each task must be modified to send a start of task packet whenever it starts execution on the processor |
| ✗ Operating system task must be modified too | ✓ Does not require modification of the operating system task | ✗ Operating system task must be modified too |
Scheme # 1 is merely a watchdog timer, which needs to be refreshed whenever a task is 
scheduled for processor time. Scheme # 3 offers an enhanced fault detection capability at the 
cost of adding a special task to the system. However, modification of the user tasks to meet the 
requirements of the supervisory approach is complex and therefore undesirable, and modification 
of the operating system is almost impossible for users. Therefore scheme # 2 is favoured, i.e. 
adding a separate task to act as the DAD task.
The supervisory protocol requires the OBC to perform two types of activities.
1. Generation of periodic DAD packets
2. Writing updated OBC tasks to code store
These two activities can either be combined in one task or can be divided into two. The DAD 
packet generation is time critical because, in case of a time-out, the supervisor will intervene 
in the OBC. On the other hand, updating of the OBC tasks can take longer and it will be 
difficult to precisely bound delays associated with this procedure. Therefore, it is proposed that 
the OBC will be running two special tasks. These will be called the DAD and state recovery (SR) 
tasks respectively. The DAD task will be responsible for the generation of DAD packets, whilst 
SR will perform any coordination with the supervisor for use of the code store and will 
upload/download individual tasks from the code store.
6.3.1 Fault Sources and Associated Signatures
The possible sources of faults in the OBC unit are the processor, the memory, and the 
interface node, the last of which results in network problems.
The SEFI signatures for microprocessors have been presented in previous chapters. A 
microprocessor can also exhibit faulty behaviour if it executes an erroneous piece of code, and 
it is therefore important to take into consideration the error performance of the program memory. The 
housekeeping task running on the OBC performs memory washes periodically and establishes an 
error count. For a LEO spacecraft, the upset rate for a program RAM is 10^-6 SEU/bit/day, 
i.e. for a 1 Mbit memory there will be one SEU per day. This rate increases to 10^-4 SEU/bit/day 
while the spacecraft passes through the SAA [Unde-96]. As mentioned in chapter 3, a SEFI can 
significantly increase the number of upsets on a memory device. Therefore, an upset rate higher than 
expected has been adopted as an indication of fault conditions for the program memory.
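The expected upset count follows directly from the quoted rate, the memory size and the monitoring interval. The sketch below illustrates one way such a check could be expressed; the `margin` factor and the floor of one upset are hypothetical designer inputs introduced for illustration, not values taken from this work.

```python
SECONDS_PER_DAY = 86_400.0


def expected_upsets(rate_per_bit_day, memory_bits, interval_s):
    """Expected number of SEUs in a memory over one monitoring interval."""
    return rate_per_bit_day * memory_bits * interval_s / SECONDS_PER_DAY


def sefi_suspected(observed_upsets, rate_per_bit_day, memory_bits,
                   interval_s, margin=10.0):
    """Flag a possible SEFI when the observed count far exceeds the expected count.

    margin is an assumed designer input; the floor of one upset avoids
    flagging a single SEU observed in a short interval.
    """
    expected = expected_upsets(rate_per_bit_day, memory_bits, interval_s)
    limit = max(1.0, margin * expected)
    return observed_upsets > limit
```

For example, at 10^-6 SEU/bit/day a 1 Mbit memory gives an expected count of about one upset per day, so with the assumed margin only counts well above ten per day would be treated as suspicious. During SAA passes the higher rate would be supplied instead.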
In the context of data networks, the most common or critical failure modes 
identified by JPL/NASA [Chau-99] are summarized below:
• Invalid Packets: Packets that contain invalid data.
• Non-Responsive Node: An anticipated response to a message does not occur before it 
times out.
• Babbling: Communication among nodes is blocked or interrupted by an uncontrolled data 
stream.
• Conflict of Node Addresses: More than one node has the same identification.
This work uses these failure modes as the main sources of network errors and suggests possible 
remedies in context of the proposed architecture.
The SpaceWire standard describes error reporting and handling at three different levels, namely 
exchange level errors, network level errors, and application level error handling [Ecss-03]. The 
exchange level errors include disconnect error, parity error (parity errors occurring within data 
or control characters), escape sequence error, character sequence error and credit error. The credit 
error is related to flow control. An interface node keeps track of the number of characters that 
can be received. The allowed number of characters is referred to as the credit. An interface node 
sends flow control tokens (FCTs) to the other end to signal that it can receive more characters. An 
FCT allows the sending of eight more characters. Upon receiving an FCT, the sender adds eight to its
credit. Furthermore, there is an upper limit to the credit, namely 56. If the upper limit is exceeded 
or the interface receives characters when it expects none, a credit error has occurred.
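The credit rules described above can be illustrated with a minimal sketch. The class and method names are hypothetical and are not taken from the SpaceWire standard or from any CODEC implementation; only the values 8 and 56 come from the text.

```python
FCT_CREDIT = 8    # characters granted per flow control token
MAX_CREDIT = 56   # upper limit on the credit count


class CreditError(Exception):
    """Raised when one of the two credit error conditions occurs."""


class TransmitCredit:
    """Sender side: number of characters that may still be sent."""

    def __init__(self):
        self.credit = 0

    def on_fct_received(self):
        # Exceeding the upper limit is a credit error.
        if self.credit + FCT_CREDIT > MAX_CREDIT:
            raise CreditError("credit would exceed the upper limit")
        self.credit += FCT_CREDIT

    def on_char_sent(self):
        if self.credit == 0:
            raise RuntimeError("no credit left, wait for an FCT")
        self.credit -= 1


class ReceiveCredit:
    """Receiver side: number of characters it has authorised."""

    def __init__(self):
        self.expected = 0

    def send_fct(self):
        self.expected += FCT_CREDIT

    def on_char_received(self):
        # A character arriving when none is expected is a credit error.
        if self.expected == 0:
            raise CreditError("character received when none expected")
        self.expected -= 1
```

Seven FCTs bring the sender to the limit of 56; in this sketch an eighth FCT without any intervening transmission raises a credit error.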
A disconnect error occurs when there is no traffic on the link for the link disconnect time-out period 
of 850 ns. If there are no data or control characters to send, the interface will send NULL 
characters to keep the link alive. An escape error occurs when an escape character is received 
unexpectedly. All of these errors result in connection re-establishment.
Network level errors include error end of packet (EEP) received and invalid destination address 
error. A logical address whose routing table entry refers to a non-existent output port is regarded 
as an invalid address. Application level error handling will be described in section 6.7.2.
An invalid packet will be detected by the above-mentioned error conditions. Also, both the UDP and 
IP headers include checksums to ensure the integrity of the received packet.
The SpaceWire standard offers six addressing schemes [Ecss-03], which are summarized below:
• Path addressing: The destination address is specified as a sequence of router output port 
numbers used to guide the packet across the network
• Logical addressing: Each destination has a unique number or logical address associated 
with it. These numbers can be assigned arbitrarily to nodes. To support logical 
addressing, the router is provided with a routing table. This tells the router the output port 
corresponding to each logical address.
• Regional logical addressing and interval labelling are meant for transmission of packets 
to addresses in remote regions, which are connected to the source using two or more 
routers.
• Group adaptive routing is a means of routing packets to a requested destination over 
different paths through a network.
Logical addressing could help to avoid problems such as conflicting node addresses. In 
this scheme each node is assigned an address arbitrarily, which reduces the chance that a 
radiation-induced change turns the address of one node into the address of another node in the 
network. If radiation causes the address of a node to change, it will eventually result in an 
invalid destination address error and hence can be detected. SpaceWire allows 1 byte for the 
packet address, resulting in 256 possible combinations.
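A minimal sketch of how a sparsely populated routing table exposes a corrupted logical address is given below. The `Router` class, the port numbers and the addresses 0x40 and 0xA7 are hypothetical choices for illustration only.

```python
class InvalidDestinationAddress(Exception):
    """Network level error: no valid output port for the address."""


class Router:
    def __init__(self, output_ports, routing_table):
        self.output_ports = set(output_ports)
        # Maps a 1-byte logical address to an output port number.
        self.routing_table = dict(routing_table)

    def route(self, logical_address):
        if not 0 <= logical_address <= 255:
            raise ValueError("logical address must fit in one byte")
        port = self.routing_table.get(logical_address)
        # A missing entry, or an entry naming a non-existent port,
        # is treated as an invalid destination address.
        if port is None or port not in self.output_ports:
            raise InvalidDestinationAddress(hex(logical_address))
        return port


# Two nodes with arbitrarily chosen, widely separated addresses.
router = Router(output_ports=[1, 2], routing_table={0x40: 1, 0xA7: 2})
```

With only a few of the 256 addresses in use, a single bit flip in 0x40 (for example to 0x44) will most likely land on an unassigned address and be reported, rather than silently aliasing another node.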
Babbling is a situation where a node sends out packets in an infinite loop. As with the conflict 
of node addresses, babbling detection can be easier with the proposed architecture. The 
supervisor expects to receive periodic packets from all OBDH units. Babbling will cause 
communication to be blocked between two or more nodes, so that the expected packets time out at 
the supervisor.
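The time-out check at the supervisor might look like the following sketch. The class name, the unit names and the period and margin values are assumptions made for illustration; time is passed in explicitly so that the logic does not depend on a particular clock.

```python
class PacketMonitor:
    """Tracks when each OBDH unit last delivered its periodic packet."""

    def __init__(self, period_s, margin_s):
        # A unit is late once this much time has passed without a packet.
        self.max_gap_s = period_s + margin_s
        self.last_seen = {}

    def packet_received(self, unit, now_s):
        self.last_seen[unit] = now_s

    def timed_out_units(self, now_s):
        """Return the units whose expected packet is overdue."""
        return sorted(
            unit for unit, seen in self.last_seen.items()
            if now_s - seen > self.max_gap_s
        )


monitor = PacketMonitor(period_s=1.0, margin_s=0.5)
monitor.packet_received("OBC", 0.0)
monitor.packet_received("payload", 0.0)
monitor.packet_received("OBC", 1.0)   # the payload unit stays silent
```

A time-out alone does not say whether the cause is babbling, a non-responsive node or a processor fault, which is why the recovery procedures later in this chapter start by polling the interface node.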
6.3.2 Monitoring the Current Consumption of the OBC
At SSTL, fast-acting power switches are used to mitigate SELs. The current consumption 
of each OBDH unit is monitored against a preset threshold. If it is exceeded, the unit supply will 
be turned off. These switches are housed within the power system. The current consumption of 
the OBDH units is periodically recorded as a part of the spacecraft’s telemetry data. The OBC is 
responsible for collecting this data. Whenever the OBC needs to collect the current consumption 
of a unit, it sends a request to the power system controller via CAN. The power system controller 
then reads that value and sends it in a CAN packet to the OBC. It is assumed that the OBC will 
use the same mechanism to collect its own current consumption using SpaceWire. The current 
consumption of the OBC needs to be monitored while it is executing a test sequence.
A timer at the supervisor will be used to indicate the DAD invocation time at the OBC. The 
supervisor will send a request to the power system to monitor the OBC’s current consumption. On 
receiving this request, the power system controller will raise a flag to the OBC to indicate that it 
is ready. Once executing, the DAD task at the OBC will test this flag and, if it finds the 
power system controller ready, it will raise a second flag to indicate that it is about to start the test 
sequence. The two flags will be implemented using two pins directly connected between the OBC 
and the power system controller, and a buffer will be used to isolate both components from 
back-power.
There are three issues that need to be addressed regarding the above-mentioned coordination of 
the OBC and the power system.
>  A request packet from the supervisor places the power system microcontroller into the ready 
state, and it waits for the OBC flag. There can be a problem with the OBC, so 
the power system microcontroller should not be held waiting for long. Therefore, the 
power system microcontroller should be released when either it has received a flag from the 
OBC or a time-out has occurred.
>  Similarly, the OBC will not wait for the power system ready flag for longer than a threshold 
period. If this period is exceeded, it will continue its test sequence and will inform the 
supervisor of the time-out condition.
>  Although it is proposed to turn off interrupts at the OBC during execution of the test 
sequence, non-maskable interrupts are still possible. In order to ensure that no interrupt 
occurred during execution of the test sequence and hence, that the current monitored truly 
represents current consumption of the unit during execution of the test sequence, it is 
proposed to use a timer starting just before the start of test sequence and stopping just 
after completion of the execution.
6.4 OBC State Recovery
The OBC program memory consists of volatile RAM. Thus after each reset or power cycling of 
the OBC unit, software tasks need to be uploaded to the OBC from the code store. Traditionally 
for space missions, such a situation has resulted in loss of any computation done prior to the fault 
event, which required reset or power cycling for recovery.
As mentioned in chapter 4, roll forward techniques are based on hardware redundancy and 
therefore, rollback checkpointing will be explored for this work. This uses a non-volatile storage 
device to save recovery information periodically during failure-free execution. Upon failure, a 
failed process uses the saved information to restart the computation from an intermediate state, 
thereby reducing the amount of lost computation. The recovery information includes, at a 
minimum, the states of the participating tasks, called checkpoints. Upon a failure, checkpoint-based 
rollback recovery restores the system state to the most recent consistent set of checkpoints,
i.e. the recovery line.
The checkpointing mechanism takes a snapshot of the system state and stores the data on some 
non-volatile storage medium. Chapter 4 presented the TREMOR computer, which is capable of 
saving its state. However, in a uniprocessor system, the state cannot be saved (as it was in 
TREMOR) if there is a fault on the processor of the system. Taking a periodic snapshot of the system 
is not practical either. Whilst the state of a system is being checkpointed, it needs to be frozen so 
that a consistent and synchronized system can be restored from this state information. In real time systems, 
it is impractical to allow repeated long delays for state updates. Also, SCOS does not support 
rollback recovery and it will be extremely difficult for the user to ensure the integrity of the system 
with such a scheme. It is therefore proposed to bring checkpointing within user tasks.
6.4.1 OBC Software Architecture
Figure 6-3 depicts the OBC software architecture used at SSTL. Table 6-2 lists the function of 
each task. The flight software has been designed around a client-server model: a task that 
performs a specific function is called an application, and a task that performs a function 
for other tasks is called a server. A server also acts as an 
interface between hardware and software. For example, QAX25 is a server task that implements 
communication between the OBC and the ground station. It programs and controls an input and an 
output channel in each of the OBC’s SCC devices to operate in HDLC mode. This task can be 
accessed from the HIT, loader and FTLO tasks.
On a system reset, if this server is not initialized and is instead rolled back to an intermediate state, 
the system will assume that the SCC devices have been initialized, which will result in loss of 
synchronization in the system and is therefore highly undesirable. Instead, it is assumed that each 
application task using server tasks is required to ensure that it has completed its interaction with 
any server task before taking a checkpoint. It is reported that OBC tasks can typically be 
considered as loops running forever [Jack-05]. Therefore, it may be possible to place a checkpoint 
at the end of the loop. In order to elaborate on this idea, an example of a state recovery task is 
presented in section 6.4.3.
Figure 6-3 SSTL OBC Software Structure, after [Jack-05]
Table 6-2 Functions of OBC Tasks
Task     Function
QAX25    This task implements the protocol handler for the AX.25 communication protocol, 
         which uses HDLC framing at the physical layer level. AX.25 is a link layer protocol. 
         This task implements communication between the mission control centre and on-board 
         tasks. Each task has its own address on the uplink/downlink, and can be 
         separately addressed on the link. In past missions, the address features of the AX.25 
         protocol were used. More recently, the UoSAT-12 OMNI tests used UDP port 
         features for this purpose [Rash-00].
CAN      The CAN server controls the spacecraft telemetry and telecommand bus. The 
         CAN server allows access to spacecraft telemetry values. Other tasks can access 
         the current telemetry values by requesting a conversion for a specified number of 
         channels and, when complete, the server returns the channels with their values. The 
         server also allows other tasks to send telecommands.
MFile    It maintains a file-system structure and supports file operations on it. Other tasks 
         may create, open, read, write and delete files using this server.
HIT      The housekeeping integration task performs periodic program memory washes and 
         periodic telemetry collection and downlink to ground, and accepts telecommands 
         from ground using the QAX25 server.
Loader   The program loader is used to upload and start new OBC tasks.
ADCS     This task performs attitude determination and control for the spacecraft.
FTLO     This task is responsible for transferring files between the spacecraft and the 
         ground using the QAX25 server.
6.4.2 Checkpointing an OBC Task
Depending upon their role in the system and their checkpointing requirements, the OBC tasks can be 
divided into the following categories.
Non-checkpointing server tasks: each of these tasks is assumed to be responsible for interfacing 
with at least one hardware device. These will hold service request queues for the corresponding 
devices. Typical examples are communication servers. These will not be checkpointed. Instead, 
each individual checkpointing task is responsible for ensuring that it has completed its interaction 
with these servers (e.g. any send or receive has been accomplished, or if it was using a channel, 
that channel has been released) prior to requesting its checkpoint.
Checkpointing server tasks: these are server tasks that hold some information about the system 
that needs to be recorded for a consistent system state. For example, the MFile server holds a file 
directory, which will be updated when a new file has been created by a task. However, if such a 
server program interacts with hardware, it is necessary to ensure that it reinitializes the 
corresponding hardware after a reset.
Independent checkpointing tasks: these include application tasks that do not interact with any 
other checkpointing tasks, e.g. the loader task. The loader task uses the QAX25 communication 
server to upload new tasks. The new program is uploaded in blocks and the sending of blocks 
may be interrupted without resending blocks already uploaded. This task can use uncoordinated 
checkpointing.
Cooperating checkpointing tasks: this category can be further divided into two classes: 
frequently cooperating tasks, e.g. application tasks using the MFile server on a regular basis, and 
occasionally cooperating tasks, e.g. HIT, which passes telecommands to the FTLO and ADCS tasks.
When a task calls the state recovery (SR) task to take its checkpoint, SR may be busy with another task. In order 
to address this possibility, a handshake synchronization protocol is put forward (figure 6-4). When 
a task is ready for checkpointing, it will call the SR task. The SR task will respond to this task 
when it is ready to take a checkpoint of the requesting task. After receiving a signal from SR, the 
requesting task will flag to SR that it is about to go into sleep mode. After receiving this flag, SR will 
assume that the requesting task has been taken out of execution and will start making its copy.
In order to increase the availability of the SR task to the other OBC tasks and hence to reduce the 
blocking time of a participating task, a time-division-multiple-access (TDMA) approach is 
proposed. At the end of each SR loop, the SR task will calculate new time values for each 
participating task. The SR will compare each time stamp with its expected value to 
determine the delay experienced in the current round. The time stamp for the next round will be 
calculated as the sum of the task’s checkpoint time stamp in the current round, a fixed time 
corresponding to the expected execution time for that task, and the longest delay 
calculated.
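The slot calculation described above can be sketched as follows. The function name, the task names and the millisecond values are assumptions chosen for illustration; the delay is taken here as the actual time stamp minus the expected one.

```python
def next_round_time_stamps(expected, actual, exec_time):
    """Compute the checkpoint time stamp of each task for the next round.

    expected:  task -> time stamp assigned for the current round
    actual:    task -> time stamp at which the checkpoint really happened
    exec_time: task -> expected execution time of the task
    """
    delays = {task: actual[task] - expected[task] for task in expected}
    # The longest delay seen in this round is added to every task's slot.
    longest_delay = max(0, max(delays.values(), default=0))
    return {
        task: actual[task] + exec_time[task] + longest_delay
        for task in expected
    }


slots = next_round_time_stamps(
    expected={"HIT": 1000, "ADCS": 2000},   # ms
    actual={"HIT": 1050, "ADCS": 2020},
    exec_time={"HIT": 500, "ADCS": 800},
)
```

Adding the longest delay to every slot keeps the relative spacing of the tasks, at the cost of pushing all checkpoints later when a single task is slow.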
The advantage of taking a task out of execution is that its state will be frozen while it is being 
copied. In order to reduce the blocking time for a task, SR will hold temporary buffers in 
program memory. SR will immediately copy a target task into one of the temporary buffers and 
will release that task (if it is not a cooperating task). Then, SR will request a code store service
from the supervisor and will copy the checkpointed task into the code store. If it is a cooperating 
task, it has to be blocked until the other tasks related to this task have reached their checkpoints and 
have been saved. The number of temporary buffers therefore corresponds to the largest number of 
cooperating tasks.
[Figure 6-4 (diagram not reproduced); column labels: Checkpointing Task, State Recovery Task]
SCOS provides the qcf_wait instruction, which a task can use to put itself into sleep mode. 
A wakeup function (qcf_posti) is provided to wake a sleeping task. A wakeup call can be made 
by another task or by SCOS to flag completion of an operating system request or expiration of a 
software timer. When a task is placed into sleep mode for checkpointing, it will be programmed 
to wake up under two conditions. Usually, SR will set a shared variable (say x) to an agreed value 
(say 15) and will call qcf_posti(target task). In case of a problem with SR, a software timer will 
expire and the operating system will wake this task up. However, after a system reset it will always 
be the responsibility of the SR task to wake this task. The target task program will have the format:
while (software timer has not expired and x ≠ 15) 
    qcf_wait()
Therefore, it is necessary for SR to ensure that it completes checkpointing of the target task 
before the task times out and gets back into execution. A software timer can be used within SR to check 
whether it completed the copy of the target task within the specified time. If it could not meet 
the deadline, it will disregard this checkpoint.
It is proposed to allow space for two recovery lines in the code store. This is recommended for 
two reasons. Firstly, as mentioned above, SR sometimes may not be able to get a copy of
checkpointed task within the specified deadline and therefore will be forced to roll back to a 
previous checkpoint. If it is a cooperating task, all other tasks in the set have to roll back too. It 
may be that the problem occurred after one or more tasks in the set had 
already been moved to the code store. If there is no previous recovery line available, tasks have to 
be started from the initial state. Secondly, it may be that SR is working properly, has 
moved a few checkpointed tasks to the code store and is progressing with the rest when the processor 
crashes. In this case, the system will need to start from a previous consistent recovery line. 
Having space for two recovery lines ensures that the system will always have a previous recovery 
line available while it is constructing the most recent recovery line.
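The two-slot arrangement can be sketched as a simple double buffer. This is a minimal illustration under assumed names (`CodeStoreRecoveryLines`, `begin`, `save`, `abort`, `recover`); it does not reflect the actual code store layout.

```python
class CodeStoreRecoveryLines:
    """Two slots: one completed recovery line, one under construction."""

    def __init__(self):
        self.slots = [None, None]
        self.committed = None   # index of the last complete recovery line
        self.building = None    # index of the line being constructed

    def begin(self, tasks):
        # Always build in the slot that does not hold the committed line.
        target = 0 if self.committed != 0 else 1
        self.slots[target] = {"expected": set(tasks), "saved": {}}
        self.building = target

    def save(self, task, state):
        if self.building is None:
            raise RuntimeError("no recovery line under construction")
        line = self.slots[self.building]
        line["saved"][task] = state
        # The new line replaces the old one only when every task is saved.
        if set(line["saved"]) == line["expected"]:
            self.committed = self.building
            self.building = None

    def abort(self):
        if self.building is not None:
            self.slots[self.building] = None
            self.building = None

    def recover(self):
        """State to restore after a failure, or None for the initial state."""
        if self.committed is None:
            return None
        return dict(self.slots[self.committed]["saved"])
```

If the processor crashes after only some tasks of the new line have been saved, `recover()` still returns the previous complete line, which is the situation the second reason above describes.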
6.4.2.1 Checkpointing with Occasionally Cooperating Tasks
Application programs in the OBC software are designed to perform their specific tasks and 
therefore there may not be frequent interaction between these tasks. However, there will be an 
interaction under certain circumstances. For example, HIT is used to command FTLO to allow all 
users to access its facilities, or to close it for command station use only. In order to ensure that the 
system state remains consistent, it is necessary that if a command has been sent from HIT in a 
checkpoint, the FTLO state is saved when it has received that command.
As mentioned in preceding section, the SR task will allocate time slices to each task at the 
beginning of the round. The SR task does not have any information as to whether an application 
task will interact with any other application or not. Therefore, it is the responsibility of the 
involved applications to send this information to SR.
The fixed time spacing of checkpoints for any application is based on two parameters: firstly, the 
length of the program loop execution (assuming a checkpoint is made at the end of the program 
loop) and secondly, any margin added to reduce overheads. For instance, one may wish to 
checkpoint a task after two or more loops instead of at the end of each loop.
Whenever an application wants to call another application that is included in the recovery line, the 
calling application needs to provide SR with the target task number prior to calling that target. 
Once SR has received this information, it knows these two tasks are cooperating and it will 
recalculate the checkpoint time for both, attempting to bring their checkpoint times closer to 
each other. The new times will be posted to both tasks.
Assuming that one of the tasks has called SR and its copy has been made, it can be resumed with 
the assumption that it will not interact again with the other task until the other task has been 
checkpointed. However, if it does so, it will inform SR as usual. SR will abort the current 
recovery line. If a failure occurs before the next state update, the system has to be rolled back to 
the previous recovery line.
6.4.2.2 Checkpointing with Frequently Cooperating Tasks
If two or more tasks interact on a regular basis (say in each execution loop), they all need to be 
checkpointed before moving to the next round. In the current discussion, the MFile task is likely 
to be called from more than one task in each round. The procedure will remain the same as 
explained in the preceding section, with two exceptions. Firstly, this time SR already knows that 
there will be cooperating tasks and therefore it will allocate them checkpoint times that are as 
close together as possible. Secondly, MFile is the task that is likely to be changed by other tasks and therefore it 
will be checkpointed after the other tasks in question have been checkpointed.
6.4.3 State Recovery Task
The SR task is aimed at providing OBC state recovery. This task is responsible for periodic state 
updates and it assists system recovery to the last good recovery line. It interacts with all 
other application programs to collect their state information.
The state of SR needs to be saved too, and must be consistent with the rest of the system. In this 
section, details of this task will be provided. A flow chart for the task is shown in figure 6-5. 
This task holds three tables to trace the system’s state. These are the ‘send’ and ‘sent’ 
tables, which record the tasks in the current recovery line, i.e. those being checkpointed in the 
current round, and a ‘previous recovery line’ table, which holds information on the previous completed 
recovery line. These tables will be used by the SR task on each system restart to upload 
checkpointed tasks from the code store and to ensure a consistent system state. The ‘send’ table 
entries will also be used by the supervisor to locate a crashed task (section 6.4.5).
Table 6-3 Send Table Entries
Task number
Cooperating task number (if any)
Record of calls made to cooperating task/tasks
Time assigned = Previous checkpoint time + task loop execution time + margin (multiple of execution time) 
Time when task was ready for checkpointing 
Time when copy into temporary buffer completed 
Status (Copied into temporary buffer)
Buffer number
Code store service request sent
Code store grant received
Number of KB transmitted
Status (transmission to code store completed)
Table 6-4 Sent Table Entries
Task number
Time stamp or recovery line number
Code store buffer number
Recovery line status (in progress or completed)
Table 6-5 Record of Previous Recovery Line
Task number
Time stamp or recovery line number 
Code store buffer number
6.4.3.1 Incremental Checkpointing of State Recovery Task
The flow chart for the state recovery task is shown in figure 6-5. As the state recovery task holds 
recovery line information, it needs to be saved too. Between full snapshots, or even in place of all 
but the first complete snapshot, only the state that has changed may be saved. This is known as 
incremental checkpointing [Plan-96, DeVa-99]. In the case of the state recovery task, the only 
information that gets updated consists of the send, sent and previous recovery line tables. 
Therefore, instead of copying the complete task, only these tables need to be saved.
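A minimal sketch of incremental checkpointing applied to the three tables is shown below. The dictionary representation and the function names are assumptions for illustration; the sketch only tracks entries that are added or changed, not entries that are deleted.

```python
_MISSING = object()


def incremental_checkpoint(previous, current):
    """Return only the entries of `current` that differ from `previous`."""
    return {
        key: value for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }


def restore(full_snapshot, increments):
    """Rebuild the latest state from one full snapshot plus increments."""
    state = dict(full_snapshot)
    for increment in increments:
        state.update(increment)
    return state


first = {"send": [1], "sent": [], "previous_recovery_line": []}
later = {"send": [1, 2], "sent": [], "previous_recovery_line": []}
delta = incremental_checkpoint(first, later)
```

Here only the `send` table has changed, so `delta` contains that table alone, and `restore(first, [delta])` reproduces the later state.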
[Flow chart not reproduced; recoverable label: completion of state recovery task is flagged to DAD task]
Figure 6-5 Flow Chart for SR Task
6.4.4 Initialization of the System after a Reset
After a reset, the bootloader will first load the operating system, server programs and state 
recovery task from the code store. The state recovery task will use a flag to determine 
whether it needs to update its tables with saved information. Once these tables are restored, the 
state recovery task will use this information to load the last good recovery line and hence to 
restore the state of the system.
This technique provides protection against transient faults. Typically, upon state restoration 
the system will continue processing in an identical manner as it did previously. This will tolerate 
any transient fault; however, if the fault was caused by a design error, then the system will 
continue to fail and recover endlessly.
In this thesis, checkpointing of the other units in the OBDH architecture has not been described. 
However, if there are any cooperating tasks running on two units, both will be required to be 
checkpointed for a global consistent state.
6.4.5 Monitoring the State Recovery Task
The state recovery task is a software process and may also crash. The supervisor will be used to 
monitor this task and hence, to increase system fault tolerance.
Chapter 7 will present latency bound calculations of the supervisory scheme. It will be shown that 
the DAD packet generation and processing will require about a second. Getting a DAD packet to 
the supervisor in time is crucial as a delay will result in an intervention from the supervisor.
The state recovery task execution time depends on the execution time for application tasks that 
will participate in checkpointing. The dynamic real time environment can cause variations in its 
execution time and therefore, the designer will need to allow a margin in its execution time 
estimate. More importantly, checkpointing the system often will impose overheads on system 
mission performance. Therefore, it may not be possible to produce this information in every DAD 
invocation period.
We propose using a flag in the DAD packet to indicate signalling of the state recovery task, i.e. that the state 
recovery task has completed a loop of execution. The supervisor will use this flag to monitor the 
execution time for the state recovery task. It will hold a threshold value for the state recovery task
execution time and will compare it with the time elapsed since it received the last state recovery signalling 
flag. If it does not receive an indication of another signalling of the state recovery task before it 
times out, it will request the send table entries from the state recovery task. Either the state recovery 
task has crashed or one of the checkpointing tasks is blocking it. If the state recovery task has not 
crashed, it will send back the requested information. The supervisor will look into this table to 
locate the task number that has caused this delay. Assuming that the threshold provided to the 
supervisor is sufficiently large to cover any possible delays in the system, this timeout will be 
considered as an indication of a crashed task. Therefore, the supervisor will request the OBC 
operating system to reload the faulty task.
6.5 Sequence of Events in the DAD Task
During its execution, the DAD task is required to perform the following functions:
1. Wait for the power system flag
2. Raise a flag to indicate that the OBC is starting the test sequence
3. Disable interrupts
4. Set up a timer
5. Execute the test sequence
6. Stop the timer
7. Enable interrupts
8. Check the timer value to estimate the time elapsed between steps 4 and 6
9. Check whether the SEU count of the program memory is available. A flag will be used 
for coordination of the DAD task and the housekeeping task, which establishes the SEU 
count of the program memory.
10. Read the SEU count
11. Generate the DAD packet
12. Command to send the DAD packet
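The ordering of these steps can be sketched as below. The hardware interface is a stand-in: `FakeHardware` and all of its method names are hypothetical, and the packet is represented as a dictionary rather than in its real format. The bounded wait in step 1 follows the time-out behaviour described in section 6.3.2.

```python
import time


class FakeHardware:
    """Stand-in for the OBC hardware interface (all names hypothetical)."""

    def __init__(self, seu_count=3):
        self.log = []
        self.seu_count = seu_count

    def power_system_ready(self):
        return True

    def raise_test_start_flag(self):
        self.log.append("start_flag")

    def disable_interrupts(self):
        self.log.append("irq_off")

    def enable_interrupts(self):
        self.log.append("irq_on")

    def seu_count_available(self):
        return self.seu_count is not None

    def read_seu_count(self):
        return self.seu_count

    def send_packet(self, packet):
        self.log.append("send")


def run_dad_task(hw, test_sequence, wait_timeout_s=0.1, clock=time.monotonic):
    packet = {}
    deadline = clock() + wait_timeout_s
    ready = hw.power_system_ready()                 # step 1 (bounded wait)
    while not ready and clock() < deadline:
        ready = hw.power_system_ready()
    packet["power_system_ready"] = ready
    hw.raise_test_start_flag()                      # step 2
    hw.disable_interrupts()                         # step 3
    start = clock()                                 # step 4
    packet["test_result"] = test_sequence()         # step 5
    elapsed = clock() - start                       # step 6
    hw.enable_interrupts()                          # step 7
    packet["elapsed_s"] = elapsed                   # step 8
    if hw.seu_count_available():                    # step 9
        packet["seu_count"] = hw.read_seu_count()   # step 10
    else:
        packet["seu_count"] = None
    hw.send_packet(packet)                          # steps 11 and 12
    return packet
```

The timer is stopped before interrupts are re-enabled so that the measured interval covers only the test sequence itself.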
6.5.1 DAD Packet Format
The DAD packet format is shown in figure 6-6.
Field          SpaceWire Header   IP Header   UDP Header   Flags   Diagnostic Health Data
Size (bytes)   1                  20          8            1       3
Figure 6-6 The Packet Format
The SpaceWire header specifies the hardware addresses of the supervisor and the OBC units. 
Each component in a unit will be associated with an IP address. The UDP port numbers establish 
communication between two programs running on two IP sources. For example, they are used to pass the 
received data to the right task in the supervisor, i.e. to determine whether a packet requires DAD packet 
processing or is a packet from the OBC SR task that needs access to the code store. Diagnostic 
health data (DHD) contains the EDAC error count and test sequence results. A summary of the 
flags field is presented in table 6-6.
Table 6-6 Description of the Flags Field
Flags     Description
Bit 0     DAD or screech packet
Bit 1     SEU count to be found in DHD field
Bit 2     Test sequence completed in the predefined time
Bit 3     State recovery task signalled
Bit 4     Power system controller signalled within the threshold time-period
Bit 5-7   Reserved
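Packing and unpacking of the flags byte might be sketched as follows. Two points are assumptions not stated in the table: bit 0 is taken as the least significant bit, and a value of 1 in bit 0 is taken to mean a screech packet. The flag names are also illustrative.

```python
# Bit positions from table 6-6 (bit 0 assumed to be the least significant).
FLAG_BITS = {
    "screech": 0,              # assumed: 1 = screech packet, 0 = DAD packet
    "seu_count_present": 1,
    "test_in_time": 2,
    "sr_signalled": 3,
    "power_ctrl_in_time": 4,
}


def pack_flags(**flags):
    """Build the one-byte flags field; reserved bits 5-7 stay zero."""
    value = 0
    for name, is_set in flags.items():
        if is_set:
            value |= 1 << FLAG_BITS[name]
    return value


def unpack_flags(value):
    """Decode the flags byte, ignoring the reserved bits."""
    return {name: bool((value >> bit) & 1) for name, bit in FLAG_BITS.items()}


flags_byte = pack_flags(seu_count_present=True, sr_signalled=True)
```

With these assumptions, a DAD packet that carries an SEU count and reports state recovery signalling has bits 1 and 3 set, giving the value 0b01010.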
6.6 Screech Packet
SCOS provides a process, which is called screech. A screech is a report of an unusual error or 
fatal program trap, one where recovery is impossible or undesirable. A screech will cause the 
operating system to place an error code in a known location relative to the task header [Bekt-92]. 
Currently in SSTL missions, a screech causes the OBC to enter the bootload read only memory 
(ROM), permitting a dump of memory to ground station for examination. The spacecraft is placed 
into ‘safe’ mode until the operating system and other user tasks are uploaded from the ground 
station. However, it might well be possible to suspend the affected task, while permitting other 
tasks to continue.
It is proposed in this thesis that the operating system reports any screech to the supervisor, 
suspends execution of the affected task and waits for the supervisor’s response. The code store will hold 
storage for three copies of the program memory: the original copy, which was written 
initially to the OBC memory; the checkpointed user tasks; and storage for the 
memory that has been dumped by the operating system in the case of a screech. Once the 
supervisor is informed of a screech, it will send a command to the OBC to dump memory into the
code store. After completion of the memory dump, the supervisor will replace the affected task in 
the OBC with a copy of that task, which will be read from the code store.
6.7 Recovery Procedures
The most common method of recovering the processor from an SEFI is to power cycle the entire 
computer after detecting the system is hung. A serious drawback of this solution is that the whole 
system is inoperable for the entire duration from the SEFI, through detection until power is cycled 
and the processor is rebooted.
The programmable nature of the supervisor makes it possible to apply more than one recovery 
procedure, depending upon the fault signature observed. Once the fault has been identified, the 
appropriate recovery method can be invoked. After restoring basic functionality of the system, the 
information stored in the code store will be used to attempt recovery of the state of the system to 
that prior to the fault occurring. Table 6-7 lists fault modes against their recovery procedures.
Table 6-7 Fault Types and Recovery Procedures
Fault Type                                         Recovery #     Recovery Procedure
Crashing of an OBC task                            Recovery # 1   Section 6.7.1
Packet time-out (network problems)                 Recovery # 2   Section 6.7.2
Packet time-out (processor problem)                Recovery # 3   Section 6.7.3
Current consumption exceeding expected threshold   Recovery # 4   Power cycling the OBC unit, followed by state recovery
SEU count exceeding threshold                      Recovery # 5   Resetting the OBC and state recovery
Test sequence result mismatch                      Recovery # 6   Resetting the OBC and state recovery
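Because the supervisor is programmable, the mapping in table 6-7 could be held as a simple lookup. The sketch below is illustrative only: the fault-type keys and action names are hypothetical labels, and recoveries 1 to 3 are left as references to their sections rather than spelled out.

```python
# Fault signature -> recovery number, following table 6-7.
RECOVERY_NUMBER = {
    "task_crash": 1,
    "packet_timeout_network": 2,
    "packet_timeout_processor": 3,
    "overcurrent": 4,
    "seu_count_exceeded": 5,
    "test_sequence_mismatch": 6,
}

# Recovery number -> ordered list of actions (hypothetical action labels).
RECOVERY_ACTIONS = {
    1: ["see_section_6.7.1"],
    2: ["see_section_6.7.2"],
    3: ["see_section_6.7.3"],
    4: ["power_cycle_obc", "state_recovery"],
    5: ["reset_obc", "state_recovery"],
    6: ["reset_obc", "state_recovery"],
}


def select_recovery(fault_type):
    """Return (recovery number, actions) for an identified fault type."""
    number = RECOVERY_NUMBER[fault_type]
    return number, RECOVERY_ACTIONS[number]
```

Keeping the table as data rather than code would let the mapping be changed by upload without modifying the supervisor logic.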
6.7.1 Recovery # 1
When a screech packet is received, the supervisor will send back a packet to the OBC to initiate a 
memory dump to the code store. The supervisor will hold the following designer input, which will 
be used to make a decision.
Table 6-8 Inputs for Recovery of the OBC after a Screech
Task ID   Screech Error Code   Recovery Priority (Original / Updated / Complete Reloading)
1         A                    Yes
          B                    Yes
          C                    Yes
2         X                    Yes
          Y                    Yes
          Z                    Yes
SCOS provides several traps that are reported using the screech mechanism; the facility is also 
available to the user. The designer of the OBC software can choose which kind of recovery is 
wanted for a particular task and error code. One may choose to replace the faulty task in the 
OBC memory with the original or updated copy of the task. A complete reload of the OBC 
memory with original or updated tasks can also be chosen. Similarly, a task crash observed through 
the SR task can be recovered.
6.7.2 Recovery # 2
The supervisor can detect a fault on the network connection in two ways:
Case 1:
As mentioned earlier, the SpaceWire standard defines error detection at two levels of the 
protocol. In addition the CODEC FPGA at the supervisor will decode this packet and will be able 
to detect erroneous UDP or IP checksums. If the OBC starts transfer of the DAD packet and any 
of the errors is reported to the supervisor, it will initiate its recovery designed for the network.
Case 2:
When a packet time-out occurs, the supervisor can suspect both interface nodes (network link) 
and the OBC processor. The first suspicion is the network interface.
As soon as an invalid packet is received or a packet time-out occurs, the supervisor will poll the 
OBC interface node to send back an “I am okay” packet within a time slot. If it does not receive a 
response, or again it receives an erroneous packet, it needs to reset the faulty interface node.
Recovery # 2a
When the supervisor attempts to open a link, it will set a time-out period for link connection. If 
the link connection cannot be established within the specified timeout period, it will be assumed
that the link between the two nodes is blocked because of babbling or a non-responsive OBC interface. 
The supervisor will switch to the redundant SpaceWire link.
In order to cope with a non-responding interface node, a redundant link has been provided in the 
system. Chapter 7 uses an SMCSlite SpaceWire interface microcontroller for the OBC interface 
node. The same microcontroller is assumed for this discussion. The SMCSlite provides only one 
SpaceWire link, so another microcontroller will be required for redundancy. The SMCSlite 
microcontroller can be programmed either by a host microprocessor (the OBC-processor) or via the 
SpaceWire link [Chri-01]. One of the microcontrollers will be programmed by the OBC-processor, 
while the redundant one will be programmed via the link and hence by the supervisor. An 
interrupt signal from this redundant microcontroller will be tied to the OBC-processor.
An interface node consists of two programmable components: an SMCSlite microcontroller (to 
code and decode SpaceWire packets) and a CODEC FPGA to handle UDP/IP details. A network 
problem can be caused by either of these components. The OBC-processor is therefore required to 
reset both of them. None of Actel's current FPGA devices contains a built-in reset pin that 
can be applied to the register cells in the user’s design. However, a user-defined reset procedure can 
be implemented [Acte-01].
If the interface node has been recovered successfully by a reset, a connection will be established with the 
supervisor to send an “I am okay” packet. If this does not work, the affected link will 
be disabled and the redundant SMCSlite microcontroller will be programmed to act as the 
main/running link. The SMCSlite microcontroller provides an 8-bit semaphore register [Chri-01]. 
The supervisor will write a value in it and, on application of the interrupt, the OBC will read this 
value and will know whether it needs to reset the interface node or to reprogram the redundant 
microcontroller.
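The use of the semaphore register as a small command channel can be sketched as below. The command codes, the class and the returned action strings are hypothetical; the text does not specify which values the supervisor writes, only that the OBC reads the register when interrupted.

```python
# Hypothetical command codes; the values actually written are not specified.
CMD_RESET_INTERFACE = 0x01
CMD_USE_REDUNDANT = 0x02


class SemaphoreRegister:
    """Stand-in for the 8-bit semaphore register of the microcontroller."""

    def __init__(self):
        self.value = 0

    def write(self, value):
        self.value = value & 0xFF  # the register holds 8 bits only

    def read(self):
        return self.value


def obc_interrupt_handler(register):
    """Decide the OBC action from the value left by the supervisor."""
    command = register.read()
    if command == CMD_RESET_INTERFACE:
        return "reset SMCSlite and CODEC FPGA"
    if command == CMD_USE_REDUNDANT:
        return "reprogram redundant microcontroller as main link"
    return "ignore"
```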
Recovery # 2b
When the supervisor tries to write a packet, it will set a time-out period for the transmission of the 
packet. If the complete packet is not transferred during the time-out period, then the supervisor 
interface is assumed to be blocked. The link interface of the supervisor is then disabled to cause a 
disconnect error and reset of the link, and then it is enabled again to allow the link to start and 
reconnect.
If the above recovery procedure does not repair communication, then the affected link will be 
disabled and the redundant link will need to be enabled. The same procedure will be repeated 
if the supervisor starts reception of the “I am okay” packet but cannot finish before it is timed 
out.
6.7.3 Recovery # 3
As mentioned earlier, the supervisor will attempt to poll the OBC interface node once a DAD 
packet timeout has occurred. After receiving an “I am okay” packet from the interface, it will 
send a command to the interface node to interrupt the OBC processing unit and will expect to 
receive a response within some time t. If it receives a packet from the OBC processor, it will reset the 
DAD timer. However, if it does not receive the expected packet, it will command the OBC 
interface to issue a non-maskable interrupt (NMI) to the processor. If the processor does not recover, it will be reset, 
followed by reloading of the OBC memory with its last known good recovery line in the code store. 
If the next expected DAD packet is timed out, the supervisor will power cycle the OBC 
and then load the OBC memory copy from the code store into the OBC memory. If the 
DAD packet after that also times out, it will place the OBC in safe hold mode and will wait for 
ground intervention.
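The escalation described above can be sketched as a ladder that is climbed one step at a time until the OBC responds. The step names and the `obc_responds_after` callable are assumptions made for illustration; in the real system the callable would correspond to waiting for the expected packet, or for the next DAD packet, after each recovery.

```python
RECOVERY_LADDER = [
    "maskable_interrupt",
    "non_maskable_interrupt",
    "reset_and_reload",
    "power_cycle_and_reload",
]


def recover_obc(obc_responds_after):
    """Apply recoveries from mild to severe until the OBC responds.

    obc_responds_after -- callable(step) -> bool, a stand-in for waiting
    for the expected packet after each recovery step.
    Returns the list of steps applied; it ends in "safe_hold" if no
    recovery step produced a response.
    """
    applied = []
    for step in RECOVERY_LADDER:
        applied.append(step)
        if obc_responds_after(step):
            return applied
    applied.append("safe_hold")  # wait for ground intervention
    return applied
```

Returning the list of applied steps also gives the supervisor a simple record of which recoveries it has already tried.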
6.8 Availability of the OBC Node
It is important to note that in spacecraft systems a few tasks can be critical, i.e. these tasks have 
stringent completion deadline requirements. These tasks include maintenance of attitude, firing of 
thrusters, communication with the ground stations etc. The detection and diagnosis performed by 
the supervisor can create some conflicts. For example, the DAD test sequence requires disabling 
interrupts for a while. The OBC might need to perform a more urgent task in that time. All these 
factors must be taken into consideration while designing the supervisory protocol. The 
following measures can be adopted to avoid such situations.
6.8.1 Prioritize interrupts
The simplest priority-based system has two modes in which a CPU can run: with interrupts 
enabled, meaning that a requested interrupt will be invoked immediately after the completion of 
the current instruction, and with interrupts disabled, meaning that no interrupt requests are 
accepted. More flexibility can be obtained by having several modes, which are organized 
hierarchically, based on a priority structure. When the processor executes at a certain priority, the 
current and all lower levels are disabled, while all higher levels are enabled.
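The hierarchical rule in the last sentence can be illustrated with a short sketch. The numbering of levels from 0 (lowest) upwards and the function names are assumptions of this example.

```python
def accepted_levels(current_priority, num_levels):
    """Return the interrupt levels still enabled at the current priority.

    Levels are numbered 0 (lowest) to num_levels - 1 (highest); this
    numbering is an assumption made for the example.
    """
    return [lvl for lvl in range(num_levels) if lvl > current_priority]


def interrupt_accepted(request_level, current_priority):
    """An interrupt is taken only if it is above the running priority."""
    return request_level > current_priority
```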
6.8.2 Validate Execution Time of the Test sequence
The current consumption of the OBC unit is monitored during execution of the test sequence. 
Using a priority interrupt approach, there is a possibility that an interrupt may occur during 
execution of the test sequence. Also, a non-maskable interrupt cannot be masked and can occur at 
any time. Therefore, it is important to validate that the current consumption was actually 
measured when the OBC was running the test sequence. The DAD task can include a timer to 
measure the interval from the moment of disabling the interrupt to the time when execution of test 
sequence is completed. If the measured time is higher than expected, the supervisor will be 
informed by a flag in the DAD packet.
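A minimal sketch of this check is given below, assuming the DAD task can read a timer at the start and end of the test sequence. The field names, the microsecond units and the optional tolerance are hypothetical and serve only to illustrate the idea.

```python
def build_dad_flags(start_time_us, end_time_us, expected_us, tolerance_us=0):
    """Return the timing fields a DAD packet might carry (hypothetical).

    All times are in microseconds. The test is marked as interrupted if
    it took longer than expected, allowing for an optional tolerance.
    """
    measured = end_time_us - start_time_us
    overrun = measured > expected_us + tolerance_us
    return {"measured_us": measured, "test_interrupted": overrun}


def current_sample_valid(dad_flags):
    """Supervisor side: trust the current measurement only if no overrun."""
    return not dad_flags["test_interrupted"]
```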
6.9 Recovery Recommendations
The following recommendations are put forward to reduce risk of unnecessary recovery 
applications.
>  The processor recovery goes from mild to severe recoveries, i.e. it starts with a maskable 
interrupt, followed by a non-maskable interrupt, a reset and finally power cycling. The OBC should be 
allowed sufficient time before application of the succeeding recovery. For example, it is 
possible that the OBC is processing an interrupt request and, because the CPU turns off any further 
interrupts when it enters an interrupt service routine (ISR), the 
supervisor interrupt request has been lost. Therefore, the time-out period for the maskable interrupt should 
be long enough to allow the OBC to complete any ongoing interrupt request.
>  In case of a processor reset and power cycle, the OBC should be allowed sufficient time 
for reinitialization and stabilization of the processor and I/Os.
>  The supervisor needs to keep a record of the recoveries applied. For instance, after 
application of a reset request for the OBC, the supervisor will expect to receive a normal 
DAD packet in the next DAD invocation period. If it still finds a deviation from the expected 
results, this time it needs to apply a power cycle.
>  Consecutive recovery cycles need to be avoided. For example, after the application of a 
complete recovery cycle starting from maskable interrupt to power cycling, the 
supervisor will expect to receive a normal DAD packet in the next invocation period. If it 
finds deviation from expected results, it needs to place the spacecraft in safe hold mode 
and needs to wait for ground intervention.
CHAPTER 7
LATENCY BOUND OF THE SCHEME
The proposed SEFI tolerance scheme is based on the concept of periodic testing, which falls 
under the category of nonconcurrent test strategy that trades between fault-detection latency and 
performance overheads. The scheme requires the OBC to run a special DAD task to support 
supervisory fault detection and diagnosis. In such a case, the DAD task is another process that has 
to compete with user processes for system resources, CPU cycles, memory and communication 
I/O. To alleviate the system’s overhead, the DAD task should run in the minimum possible 
number of clock cycles. Fault detection latency depends on the time interval between two 
consecutive DAD-task executions, as well as on the DAD task execution time. The time interval is 
specified as a trade-off between user-program performance and fault detection latency, while 
preserving the capability of the system to detect SEFIs.
This chapter addresses these issues. The amount of money and time required for the development 
of a real hardware test bed was large, so the system was modelled theoretically. In the following 
sections of this chapter, the device technologies adopted in the system model are described first, 
followed by the latency bound calculations of the scheme.
7.1 System Model
The OBC subsystem is shown in figure 6-2. It is reproduced in this section with the interface FPGA 
replaced by an existing SpaceWire interface controller, the SMCSlite, and the router 
replaced by the SMCS332 [Chri-99, Chri-01]. This device technology has been chosen 
because the information required for this analysis is available for it. Figure 7-1 shows the 
system model. The supervisor uses the router to communicate with the OBC.
Figure 7-1 System Model (block diagram; legible labels: OBC, DPRAM, SMCS332 router, code store)
7.1.1 Intel 386EX CPU
A CPU is a multipurpose, programmable, clock-driven, register-based electronic device that reads 
instructions from the memory unit, accepts data from input and processes data according to those 
instructions, and provides results as output.
The Intel 386EX, introduced in August 1994, is a highly integrated, 32-bit static CPU, i.e. it may 
run as slowly (and thus, power efficiently) as desired down to full halt, and it is optimised for 
embedded multitasking operating systems. It has been flown on many SSTL missions and has 
also been adopted for the NASA FlightLinux project [Answ-05]. Depending on the version of the 386EX, 
a frequency from 20 to 33 MHz can be obtained [Inte-96]. This work will be based on a 
frequency of 25 MHz, which is the current maximum for the SSTL 386 OBC [Sstl-05].
The clock signal, which is produced by a simple oscillator on the computer board, sequences 
processor operations to give the circuits time to complete one operation before proceeding to the 
next. Logic takes a while to do anything. The system clock ensures each operation finishes before 
the next takes place. A 25 MHz oscillator gives 40 nanoseconds per clock cycle. Each clock 
period (the time required for one complete cycle) is called a "T state", and is the basic unit of 
processor timing. Nothing completes in less than a T state, though propagation times through 
individual components will generally be faster than the T state time. An entire memory read/write
cycle requires 2 or more T states, depending on the processor selected. A "machine cycle/bus 
cycle" is the entire time - 2, 3, or more T states - required to perform a single read or write. The 
386 is a 2-T state machine [Gans-05].
An external oscillator must provide an input signal to CLK2, which provides the fundamental 
timing for the 386 processor. The clock generation circuitry divides the CLK2 clock by 
2 and the resultant clock signal is fed to the core and peripherals [Inte-96]. This results in 
an effective T-state of 80 ns.
Memory access time is defined as the time taken by a memory device to locate a single piece of 
information and make it available to the computer for processing [Webp-05]. The system will 
require insertion of wait states if the memory, which is interfaced to the microprocessor, is slower 
than the microprocessor. A wait state is an extra clocking period, which is inserted to lengthen the 
bus cycle.
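The timing figures above can be reproduced with a short calculation. The sketch below assumes, as stated in the text, that CLK2 is divided by 2, that a bus cycle takes two T states, and that each wait state adds one further T state; the function names are chosen for this example only.

```python
def t_state_ns(clk2_hz, divider=2):
    """T-state duration in ns: CLK2 is divided by 2 to clock the core."""
    return 1e9 * divider / clk2_hz


def bus_cycle_ns(clk2_hz, wait_states=0, base_t_states=2):
    """Bus cycle duration: two T states plus any inserted wait states."""
    return (base_t_states + wait_states) * t_state_ns(clk2_hz)
```

With CLK2 at 25 MHz this gives the 80 ns T-state quoted above, a 160 ns bus cycle with no wait states, and 240 ns when one wait state is inserted.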
The 386 microprocessor includes an on-chip direct memory access (DMA) controller. The DMA 
controller improves system performance by allowing external or internal peripherals to directly 
transfer information to or from the system. The DMA controller can transfer data between any 
combination of memory and I/O, with any combination of data path widths (8 or 16 bits) [Inte-96]. 
A DMA process is the execution of a programmed DMA task from beginning to end. Each 
DMA process requires initial programming by the Intel386 EX processor. DMA transfers data 
between a requester and a target. An external device or an internal peripheral request is serviced 
by activating a channel’s request input (DREQn). The requester is the one that activated DREQn. 
The requester can be in external device I/O space, in internal peripheral I/O space, or memory 
mapped I/O. The requester either deposits data to or fetches data from the target. There are two 
bus cycle options for transferring data: fly-by and two-cycle.
Fly-by allows data to be transferred in one bus cycle. It requires that the requester be in external 
I/O and the target be in memory. The fly-by option performs either a memory write or a memory 
read bus cycle. A write cycle transfers data from the requester to the target (memory), and a read 
cycle transfers data from the target (memory) to the requester. When a data transfer is initiated, 
the DMA places the memory address of the target on the bus and selects the requester by 
asserting the DACKn# signal. The requester then either deposits the transfer data on the data bus 
or fetches the transfer data off the data bus, depending on the transfer direction. Since the
requester is selected via the DACKn# signal, the requester address is not meaningful in a fly-by 
mode transfer. Support logic (either external or built in to the I/O device) must be designed to 
monitor the DACKn# signal and accordingly generate the correct control signals to the I/O 
device, since all processor signals are used to access memory. This means that if it is an I/O to 
memory transfer, this logic generates an I/O read cycle and the processor generates the memory 
write cycle. If it is a memory to I/O transfer, the logic generates an I/O write cycle and the 
processor generates the memory read cycle.
The two-cycle option allows data to be transferred between any combination of memory and I/O 
through the use of a four-byte temporary buffer. The amount of data and the data bus widths 
determine the number of bus cycles required to transfer data.
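To give a feel for the difference between the two options, the following sketch counts bus cycles under a deliberately simplified model: fly-by moves one bus-width item per cycle, while two-cycle needs separate read and write cycles on the source and destination buses. The functions and the counting rule are assumptions of this example and do not reproduce the exact behaviour of the Intel386 EX DMA controller (for instance, the ordering of fills and drains of the temporary buffer is ignored).

```python
import math


def fly_by_cycles(num_bytes, bus_width_bytes):
    """Fly-by: each bus cycle moves one bus-width item directly."""
    return math.ceil(num_bytes / bus_width_bytes)


def two_cycle_cycles(num_bytes, src_width_bytes, dst_width_bytes):
    """Two-cycle: data is read into the temporary buffer, then written out.

    Simplified count: one read cycle per source-width item plus one
    write cycle per destination-width item.
    """
    reads = math.ceil(num_bytes / src_width_bytes)
    writes = math.ceil(num_bytes / dst_width_bytes)
    return reads + writes
```

Under this model, moving 64 bytes over a 16-bit path takes 32 fly-by cycles, whereas a two-cycle transfer from an 8-bit source to a 16-bit destination takes 96 cycles.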
7.1.2 Operating System
An operating system provides the environment within which programs are executed. This section 
will first describe basic concepts of interest for this discussion followed by a description of SCOS 
[Kurz-84, Silb-98, Bekt-92].
Broadly, operating systems can be divided into two categories: serial batch processing and 
multiprogramming. The serial batch processing system is characterized by the fact that only one 
user’s program may be executing within the computer system at a given time. The objective of 
multiprogramming is to have some process running at all times to maximize CPU utilization. Time 
sharing is used to switch the CPU among processes so frequently that users can interact with each 
program while it is running. Therefore, current day computer systems allow multiple programs to 
execute concurrently. This is not to say that simultaneous operation is possible; parallel 
processing can only occur where there is a possibility of more than one instruction being executed 
at the same time, such as in multiprocessing systems. However, contending programs may very 
easily be in the midst of execution at the same time as they alternately use the CPU.
Multiprogramming operating systems are based on processes. Informally, a process is a program 
in execution. A process is more than the program code. A process generally includes the current 
activity of the process, the process stack (containing temporary data such as subroutine 
parameters, return address, and temporary variables), and a data section containing global 
variables. The process states are shown in the figure below.
Each process is represented in the operating system by a process control block (PCB). It 
includes process ID, process state, contents of program counter and other CPU registers, CPU 
scheduling information (process priority, pointers to scheduling queues and any other scheduling 
parameters), memory management information and I/O status information (the list of I/O devices 
allocated to this process, list of open files and so on).
As processes enter the system, they are put into a job queue. The processes that are residing in 
main memory and are ready and waiting to execute are kept on a list called the ready queue. 
There are also other queues in the system. When a process is allocated the CPU, it executes for a 
while and eventually quits, is interrupted, or waits for the occurrence of a particular event, such as 
the completion of an I/O request. The list of processes waiting for a particular I/O device is called 
a device queue.
A process migrates between the various scheduling queues throughout its lifetime. The operating 
system must select, for scheduling purposes, processes from these queues in some fashion. The 
selection process is carried out by an appropriate scheduler.
Whenever the CPU becomes idle, the operating system must select one of the processes in the 
ready queue. A ready queue may be implemented with various scheduling algorithms. 
Conceptually, however, all the processes in the ready queue are lined up waiting for a chance to 
run on the CPU. The records in queues are generally PCBs of the processes. The CPU scheduling 
decisions may take place under the following four circumstances:
1. When a process switches from the running state to the waiting state
2. When a process switches from the running state to the ready state (for example, when 
an interrupt occurs)
3. When a process switches from the waiting state to the ready state (for example, 
completion of an I/O)
4. When a process terminates
For circumstances 1 and 4, there is no choice in terms of scheduling. A new process must be 
selected for execution. There is a choice, however, for circumstances 2 and 3. If scheduling takes 
place only under circumstances 1 and 4, the scheduling scheme is called nonpreemptive; 
otherwise the scheduling scheme is preemptive.
SCOS provides a pre-emptive multitasking environment. A small unit of time, called a time 
quantum, or a time slice, is defined. A time quantum is generally from 10 to 100 ms [Silb-98]. 
SCOS has a time slice of 100 ms with a scheduler tick period of 10 ms. It associates a priority 
with each process. At each 10 ms tick, the operating system will check to see if a task of higher 
priority than the current task is waiting for CPU time. If no higher priority task is waiting for CPU 
time, the operating system allows each task to run for 100 ms before switching to the next waiting 
task. A task can (and generally will) give up the remainder of the current time slice, in which case 
the operating system will allocate it to the next task. If there are no tasks waiting for CPU time, the 
operating system will execute the idle instruction, thus placing the OBC into a low power state 
[Jack-05].
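The tick-driven behaviour described above can be sketched as a single decision function that is called every 10 ms. This is an illustrative model and not SCOS code: the task representation, the convention that larger numbers mean higher priority, and the rule that a slice expiry only hands over to a task of equal priority are all assumptions of this example.

```python
TICK_MS = 10
SLICE_MS = 100


def schedule_tick(current, ready, slice_used_ms):
    """Decide what runs after one scheduler tick.

    current       -- (name, priority) of the running task, or None
    ready         -- list of (name, priority) tasks waiting for CPU time
    slice_used_ms -- time the current task has already used in its slice
    Returns (next_task, new_slice_used_ms); next_task None means idle.
    """
    if current is None and not ready:
        return None, 0                      # idle: low power state
    used = slice_used_ms + TICK_MS
    best = max(ready, key=lambda t: t[1]) if ready else None
    if current is None:
        return best, 0
    if best is not None and best[1] > current[1]:
        return best, 0                      # pre-empt for higher priority
    if best is not None and best[1] == current[1] and used >= SLICE_MS:
        return best, 0                      # slice expired: next waiting task
    return current, used % SLICE_MS         # keep running the current task
```

The caller is assumed to maintain the ready list, for example by returning a pre-empted task to it.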
This kind of scheduling adds real time functionality to the operating system. It offers higher 
priority to critical tasks over less critical ones. Adding real time functionality to a time sharing 
system may cause an unfair allocation of resources and may result in longer delays, or even 
starvation, for some tasks. Generally most of the tasks are allocated the same priority to avoid 
such a situation.
Some of the attributes of SCOS, which have relevance to this discussion, are described below.
• Sleep/wakeup: A task can put itself to sleep when it has nothing to do. This allows 
other tasks in ready queue to run. A wake up function is provided to wake a sleeping 
task.
• Timers: The operating system provides “countdown” type timers. These timers can be 
set to expire some number of timer ticks in the future, or at a particular coordinated
universal time (UTC). The user task can request any number of timers, starting 
and stopping them independently. When a timer expires, an interrupt routine 
specified by the user is called.
7.1.3 SMCS332 Communication Controller
The SMCS332 provides an interface between a data strobe (DS) link, according to the IEEE 
standard 1355 specification, and a data processing node consisting of a CPU and communication 
memory [Chri-99]. It supports features such as reduced power consumption at a reduced 
transmit data rate.
Each DS pair carries a data/control token and an encoded clock. Data tokens are 10 bits long: 
a parity bit, a flag, which is set to 0 to indicate a data token, and 8 bits of data. 
Control tokens are 4 bits long: a parity bit, a flag and 2 bits to indicate the type of the token.
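The token sizes translate directly into a serialisation overhead on the link. The sketch below estimates the number of bits and the transfer time for one packet, assuming a one-byte header and a single 4-bit control token as the end-of-packet marker; flow control tokens and link start-up are ignored, and the function names are chosen for this example.

```python
DATA_TOKEN_BITS = 10     # parity + flag + 8 data bits
CONTROL_TOKEN_BITS = 4   # parity + flag + 2 type bits


def packet_bits(payload_bytes, header_bytes=1):
    """Bits on the link: data tokens plus one end-of-packet control token."""
    return (header_bytes + payload_bytes) * DATA_TOKEN_BITS + CONTROL_TOKEN_BITS


def transfer_time_us(payload_bytes, link_rate_mbps, header_bytes=1):
    """Serialisation time in microseconds, ignoring FCTs and link start-up."""
    return packet_bits(payload_bytes, header_bytes) / link_rate_mbps
```

For example, a 100-byte payload occupies 1014 bits, which takes about 5.07 µs at 200 Mbps.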
Token level flow control is performed in each DS link unit, and the additional flow control tokens 
used are not visible to the higher-level packet protocol. The token level flow control mechanism 
prevents a sender from overrunning the input buffer of the receiving link. Each receiving link 
input contains a buffer for at least 8 tokens (SMCS actually provides buffering for 20 tokens). 
Whenever the link input has sufficient buffering available to consume a further 8 tokens, a flow control token (FCT) is 
transmitted on the associated link output, and this FCT gives the sender permission to transmit a 
further 8 tokens. Once the sender has transmitted a further 8 tokens it waits until it receives 
another FCT before transmitting any more tokens. The provision of more than 8 tokens of 
buffering on each link input ensures that in practice the next FCT is received before the previous 
block of 8 tokens has been fully transmitted, so the token-level flow control does not restrict the 
maximum bandwidth of the link. The SMCS332 architecture includes the following functional 
blocks.
3 bidirectional link channels: each comprises a DS macro cell (transmit and receive) and a 
protocol processing unit for the interprocessor communication protocol. Each channel is capable of 
full duplex communication at up to 200 Mbps in each direction.
Communication memory interface (COMI): performs autonomous accesses to the 
communication memory of the unit to store data received on the link or to read data to be 
transmitted on the link. The COMI address bus is 16-bit wide to allow a dual port RAM 
(DPRAM) of 64Kbytes to be used as the communication memory. The COMI consists of
individual memory address generators for the receive and transmit direction of each DS link 
channel. The access to memory is controlled via an arbitration unit providing a fair arbitration 
scheme.
Host control interface (HOCI): gives the controlling CPU read and write access to the SMCS configuration registers 
and to the DS channels. Packets can be transferred directly from/to the 
HOCI, in which case the DPRAM is not required. However, the packet size should then be limited to 
avoid frequent CPU interaction. The data bus width is again flexible (8/16/32).
Protocol command interface is used for implementation of interprocessor communication 
protocol, which is not of interest for this discussion and therefore will not be described.
The SMCS controller can be configured as a system node or as a router. In the router mode, it 
performs wormhole routing. Each of the three links and the SMCS itself can be assigned an eight-bit 
address. When routing is enabled, the first byte of the received packet will be treated as the 
destination address. It will be checked against the addresses of the other two channels and the SMCS 
address. If it matches, the packet is forwarded to the appropriate channel or internal FIFO. If it does not 
match an address in the system, it will be written to the internal FIFO and an error interrupt 
(maskable) will be raised. Anything after the header (destination address) is treated as the packet 
body until an end of packet (EOP) marker has been received. This enables the SMCS332 to transmit 
packets of arbitrary length.
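The routing decision described in this paragraph can be sketched as a small function. The channel names, the dictionary of addresses and the returned tuple are hypothetical and only illustrate the matching rule; the real device performs this in hardware.

```python
def route_packet(packet, channel_addresses, smcs_address):
    """Return (destination, raise_error_interrupt) for a received packet.

    packet            -- bytes; the first byte is the destination address
    channel_addresses -- dict mapping the other channels' names to addresses
    smcs_address      -- address assigned to the SMCS itself
    """
    header = packet[0]
    for channel, address in channel_addresses.items():
        if header == address:
            return channel, False
    if header == smcs_address:
        return "internal_fifo", False
    # Unknown address: store in the internal FIFO and flag an error.
    return "internal_fifo", True
```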
The SpaceWire standard supports the wormhole routing concept, in which a routing decision is 
taken as soon as the routing information that is contained in the packet header has been received. 
Once the channel corresponding to the destination of the packet has been determined, the SMCS332 
will receive the packet bytes and pass them on to that channel. Wormhole routing is 
invisible as far as the senders and receivers of packets are concerned. It has been introduced to 
minimize the latency in message transmission. But it has a downside too. As it allows a packet of 
arbitrary length, a packet can occupy two channels (being received on one channel and 
transmitted on the other) for an arbitrarily long period. In order to make network delays more 
predictable and to ensure that each node in the system receives a fair amount of network 
bandwidth, a maximum size of packet can be defined in a system. If a packet arrives on a port of 
the router and its required destination port on the router is busy then the packet is buffered.
In order to program an SMCS332 device (read/write appropriate registers), a local CPU, 
microcontroller or FPGA is required, or it can be controlled remotely (being configured using one 
of its links). In the proposed system configuration, the router and the supervisor are combined in 
one node and therefore, the supervisor microprocessor controls this device.
The SMCS332 device takes two clock inputs, CLK and CLK10. The maximum frequency for the 
CLK input is 25 MHz and this clock specifies the COMI timing. CLK10 provides the clock for the DS 
links (typical value is 10 MHz); it can be reconfigured and is application specific.
7.1.4 SMCSlite
The SMCS332 with its three links may not be required on each node in the system. Thus a 
smaller device (one link only) with more peripherals (timers, ADC, DAC, general purpose I/O, host 
interface and so on) was introduced and was named SMCSlite. It provides one IEEE 1355 serial 
communication link with data transfer rates up to 200 Mbps.
A host interface is required to program and control the controller by a local host processor or it 
can be programmed remotely through its network link. When a specific interrupt is enabled 
(corresponding bit set to one) by interrupt mask registers, the signal HINTR of the host interface 
will be activated.
The SMCSlite can be operated in two modes: with an external FIFO and without an external 
FIFO. In the former mode, the SMCSlite FIFO controller reads and writes packets 
received/transmitted over the IEEE 1355 link to an external FIFO. This FIFO is shown as dual 
port RAM (DPRAM) in figure 7-1. The FIFO interface provides the control signals full and write 
(write received data from the link to the FIFO), and empty and read (read data from the FIFO to 
transmit on the link). Data received from the FIFO interface is sent over the IEEE 1355 link 
grouped in packets. A 16-bit internal counter is provided to hold the length of a packet (in bytes).
In the latter mode, an internal host FIFO is used to read and write data from the host to 
transmit/receive on the IEEE 1355 link. It requires the host to write 8 bits of data at a time to the 
transmit data register for transmission on the IEEE 1355 link, resulting in a character-oriented 
communication between the host and the communication controller. Another transmit register is provided 
at the SMCSlite, which must be programmed by the host when it wishes to send an EOP
marker. Similar to transmission, there are two 8-bit host FIFO registers for reception on the IEEE 1355 
link [Chri-01].
7.1.5 Interface between OBC and Communication Controller
A device communicates with a computer system via a connection point termed a port. If one or 
more devices use a common set of wires, the connection is called a bus. A controller is a 
collection of electronics that can operate a port, a bus, or a device. The controller has one or more 
registers for data and control signals. One way that this communication can occur is through the 
use of special I/O instructions that specify the transfer of a byte or word to an I/O port address. 
Alternatively, the device controller can support memory-mapped I/O. In this case, the device-control 
registers are mapped into the memory address space of the processor. Some systems use 
both techniques. In this system model, I/O port addresses have been adopted to access 
SMCSlite communications. The CODEC FPGA is mainly responsible for handling 
communication between the processor and the SMCSlite. The CODEC FPGA has been assigned this 
role for two reasons:
1. In case of a SEFI at the processor, the FPGA is still able to communicate with the 
network and can therefore send back an “I am okay” packet on a polling request from the 
supervisor.
2. In this way, the system can benefit from the one-cycle fly-by DMA operation.
Figure 7-3 shows main connections between devices on the OBC unit.
Figure 7-3 Interface between the OBC and SMCS332
Computers operate on a great many kinds of devices. The approach here involves abstraction, 
encapsulation, and software layering. Specifically, the detailed differences can be abstracted away
by identifying a few general kinds. Each of these general kinds is accessed through a standardized 
set of functions -  an interface. The actual differences are encapsulated into operating system units 
called device drivers. Each device driver handles one device type or one class of closely related 
devices [Silb-98].
The SCOS provides I/O handler tasks to support I/O applications. Two types of communication 
interfaces are supported: frame oriented and character oriented. SpaceWire communication 
actually requires character oriented communication. In this case, the CPU can use the internal FIFO 
of the SMCSlite. However, this is inefficient in terms of CPU usage. Inclusion of a DPRAM 
will allow a packet or frame oriented interface between the CPU and the communication node. 
Currently, SCOS does not include an I/O handler specifically designed for SpaceWire. However, 
it presents a sample HDLC I/O handler, which can be modified to support a wide range of 
applications. Details of an HDLC I/O handler will be presented here to act as an example of a 
frame oriented communication interface.
The I/O handler task has 3 main responsibilities:
• Initialisation of the I/O driver (the desired operating parameters such as baud rate, parity, 
clocking options and encoding methods are selected by the values in the initialisation 
structure. Once the initialisation request structure is complete it is passed to the I/O 
driver, which performs the actual initialisation. After a successful initialization procedure, 
the I/O driver will be ready to begin I/O). The I/O driver is loaded with empty buffers to 
begin the data input process.
• Buffer maintenance for the I/O driver. Several empty buffers are usually provided to the 
I/O handler to permit fast switching to the next frame.
• Interface to application and I/O driver. After the I/O handler has completed initialisation 
and has provided the I/O driver with a few empty frames, the I/O handler goes to sleep 
and waits for one of the following three events to wake it.
1. The I/O driver completes the reception of a new frame
2. The I/O driver finishes transmitting a frame and returns the used buffer to the 
sent queue
3. The application sends data for transmission
The basic data element used by the SCOS frame-oriented I/O driver is the frame. The frame is 
managed by use of a data structure called the frame control block (FCB). The FCB contains all of
the information necessary to completely describe a frame, such as the location and size of each of 
its buffers, its priority, its status and its relation to other frames. The data buffer described by the 
FCB ranges in size from zero bytes to 64 kbytes. The size fields of the FCBs used for receiving 
are initially set by the handler task to the size of the empty buffer when the buffer is allocated. 
The receive FCB’s size field is modified by the I/O driver to reflect the actual number of bytes 
contained in the buffer after the buffer is filled. The size fields of FCBs used for transmitting are 
set by the I/O handler task to indicate the number of bytes of data for the I/O driver to transmit.
The driver works by receiving FCBs on one queue and returning them on another queue. The 
queues used are summarized below:
• Empty queue: This queue is a list of FCBs which point to empty buffers. These buffers
are to be used by the I/O driver’s receive routine to store data received from the
communication port.
• Full queue: This queue is a list of FCBs which contain received data.
• Send queue: This queue holds FCBs which are to be transmitted.
• Sent queue: This queue holds FCBs which have been transmitted by the I/O driver. 
Typically, each of the queues described above has a length of 5 FCBs in the SSTL OBC [Jack-05].
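The movement of FCBs between the four queues can be illustrated with a minimal sketch. The class and field names below are hypothetical and are not taken from SCOS; only the four queue roles, the 64 kbyte buffer limit and the queue length of 5 come from the description above.

```python
from collections import deque
from dataclasses import dataclass

MAX_BUFFER_BYTES = 64 * 1024  # buffers range from zero bytes to 64 kbytes
QUEUE_LENGTH = 5              # typical queue length quoted for the SSTL OBC


@dataclass
class FCB:
    """Hypothetical frame control block; field names are illustrative only."""
    buffer: bytearray
    size: int          # buffer capacity when empty, byte count once filled
    priority: int = 0
    status: str = "empty"


class FrameQueues:
    """The four FCB queues exchanged between the I/O handler and the driver."""

    def __init__(self, buffer_size: int = 1024):
        if not 0 <= buffer_size <= MAX_BUFFER_BYTES:
            raise ValueError("buffer size out of range")
        self.empty = deque(
            FCB(bytearray(buffer_size), buffer_size) for _ in range(QUEUE_LENGTH)
        )
        self.full = deque()
        self.send = deque()
        self.sent = deque()

    def receive(self, data: bytes) -> FCB:
        """Driver side: fill the next empty buffer and move its FCB to 'full'."""
        fcb = self.empty.popleft()
        if len(data) > len(fcb.buffer):
            self.empty.appendleft(fcb)
            raise ValueError("frame larger than buffer")
        fcb.buffer[:len(data)] = data
        fcb.size = len(data)  # size now reflects the bytes actually received
        fcb.status = "full"
        self.full.append(fcb)
        return fcb

    def submit(self, data: bytes) -> FCB:
        """Handler side: queue a frame for transmission."""
        if len(self.send) >= QUEUE_LENGTH:
            raise OverflowError("send queue is full")
        fcb = FCB(bytearray(data), len(data), status="send")
        self.send.append(fcb)
        return fcb

    def transmit_next(self) -> FCB:
        """Driver side: transmit the head of 'send' and return its FCB on 'sent'."""
        fcb = self.send.popleft()
        fcb.status = "sent"
        self.sent.append(fcb)
        return fcb
```

In this sketch the receive path rewrites the size field after the buffer is filled, mirroring the way the receive FCB's size field is updated by the I/O driver.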
7.1.6 8051 Architecture
The 8051 series of microcontrollers was originally developed by Intel in 1980. The 8051 
architecture presents an 8-bit microprocessor. The 8051 microprocessor architecture does not 
exist on its own. Instead, there are several microcontrollers built around this architecture known 
as 8051 derivatives. All 8051 derivatives utilize the 8051 core as their CPU and operate using the 
8051 instruction set. The basic architecture of 8051 core is shown in figure 7-4.
Figure 7-4 Block Diagram of the Intel 8051 Family of Microcontrollers, After [Inte-94]
The 8051 architecture has an on-chip oscillator, which can be used as the clock source for the 
CPU. A machine cycle consists of a sequence of 6 states. Each state lasts for two oscillator 
periods. Thus a machine cycle takes 12 oscillator periods, or 1 µs if the oscillator frequency is 12 
MHz. Each instruction in the instruction set requires from 1 to 4 machine cycles for its execution.
The 8051 architecture can address up to 64 Kbytes of data memory external to the chip. The 
“MOVX” instruction is used to access the external data memory. The 8051 has 128 bytes of 
on-chip RAM plus a number of special function registers (SFRs). The lower 128 bytes of RAM can 
be accessed either by direct addressing (MOV data addr) or by indirect addressing (MOV @Ri).
The 8051 has two timer/counter registers: timer 0 and timer 1. In the “Timer” function, the register 
is incremented every machine cycle. One can think of it as counting machine cycles. Since a 
machine cycle consists of 12 oscillator periods, the count rate is 1/12 of the oscillator frequency. 
As the count rolls over from all 1s to all 0s, it sets the timer interrupt flag [Inte-94].
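The timing relationships above can be checked with a short calculation. This is a minimal sketch; the function names are illustrative and only the 12 oscillator periods per machine cycle and the 12 MHz example come from the text.

```python
OSC_PERIODS_PER_MACHINE_CYCLE = 6 * 2  # 6 states, each lasting 2 oscillator periods


def machine_cycle_time_s(f_osc_hz: float) -> float:
    """Duration of one 8051 machine cycle in seconds."""
    return OSC_PERIODS_PER_MACHINE_CYCLE / f_osc_hz


def timer_count_rate_hz(f_osc_hz: float) -> float:
    """Timer increments per second in the 'Timer' function (one per machine cycle)."""
    return f_osc_hz / OSC_PERIODS_PER_MACHINE_CYCLE


cycle_s = machine_cycle_time_s(12e6)
print(f"machine cycle at 12 MHz: {cycle_s * 1e6:.1f} us")
print(f"instruction time range: {cycle_s * 1e6:.1f} to {4 * cycle_s * 1e6:.1f} us")
print(f"timer count rate: {timer_count_rate_hz(12e6):.0f} counts/s")
```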
7.2 Definitions and Illustrations
CC: Clock cycles
CCT: Time required for 1 clock tick
DAD-PS: DAD packet size in words
f: Frequency
LS: Network link speed
P: The period of DAD packet generation.
PDD: Packet decoding delay
PED: Packet encoding delay
PS: Maximum packet size in words
PSC: Pipeline stall cycles
PT: Execution time of a CPU
PTc: Processing time at the CODEC FPGA.
PTs: Processing time at the supervisor.
PTo: Processing time at the OBC.
PTo (INTR/NMI): Processing time at the OBC for an INTR or NMI request 
MSC: Memory stall cycles
MTT: Maximum time to travel, i.e. the interval that elapses from the time a packet is sent from the OBC 
microprocessor to the time it begins execution at the supervisor.
MITs: Maximum incoming message turnaround time, i.e. maximum amount of time that elapses 
from the arrival of a packet to the input queue of the supervisor to the time at which the 
supervisor actually starts its processing.
MOTo: Maximum outgoing packet turnaround time of a node, i.e. maximum amount of time that 
elapses from the time of arrival of an item at the send queue of the OBC device driver to the time 
at which device driver starts sending it out.
MTMR: Maximum time for memory reload, i.e. the maximum time required for reloading 
memory from the code store to the OBC memory unit.
NPS: Number of pipeline stages
RQW: Ready queue wait time, i.e. the interval from the instant when the DAD task is submitted to the ready queue to the 
instant when it is switched into the running state by the operating system.
SQL: Send queue length
DAD1 or DAD turnaround time: Turnaround time is the interval from the time of submission of 
a process to its completion. It is the sum of periods spent waiting in the ready queue, executing on 
the CPU and doing I/O. In this work, it includes RQW, PTo and MOTo.
T1: Time-out period for a DAD packet.
T2: Link establishment time-out 
T3: Packet transmit timeout
T4: Packet receive time-out
T5: Time-out period for response of the supervisor’s polling request to the OBC CODEC FPGA 
T6: Time-out period for response of an INTR request for the OBC processor.
T7: Time-out for NMI request 
WCFDL: Worst case fault detection latency 
WCRL: Worst case recovery latency
Figure 7-5 Illustration of Definitions
7.3 Latency Bound Calculations
This section uses information described in the preceding sections to present an estimate of latency 
bounds associated with this scheme and highlights the factors affecting these latencies. It is assumed 
that the FCB queue lengths presented in the SCOS discussion will be used for this analysis.
7.3.1 DAD Turnaround Time
The DAD turnaround time can be calculated as follows.
$\mathrm{DAD1} = \mathrm{RQW} + \mathrm{PT_o} + \mathrm{MOT_o}$ (Equation No. 7.1)
As mentioned earlier in this chapter, the operating system running on the OBC is a priority-based 
pre-emptive operating system. Generally, all tasks are run at the same priority. The DAD 
task can be assigned either the same priority or a higher priority. In the former case, RQW would be the 
number of tasks multiplied by the SCOS time slice. This scheme would significantly increase the 
DAD turnaround time and hence its invocation period, although the OBC tasks are generally 
not all in the ready queue at the same time and this delay can therefore be reduced. Secondly, the 
purpose of a priority-based pre-emptive operating system is to offer priority to time-critical tasks 
and to ensure their timely execution. Therefore, it is assumed that the DAD task has priority 
over other tasks and will be executed with no delay. When a timer (hardware/software) expires, it 
causes an interrupt to the CPU. The CPU forcibly suspends the currently executing task 
and enters an ISR. A wake-up call is made to the DAD task and the CPU returns from 
the interrupt. The DAD task is now in the ready queue. A process switch is then made to bring the 
highest priority task into execution, which switches the DAD task from the ready to the running state. 
Therefore, the RQW time will be negligible. If there is any other process at the same or a higher 
priority than the DAD task, the RQW will need to take that into account.
Chapter 6 describes the tasks to be performed by the DAD task. To alleviate the OBC’s overhead, 
the DAD task should run in the minimum possible number of CPU clock cycles. Ideally, it should 
not exceed one time slice.
At the start of its execution, the DAD task is required to test the power system microcontroller 
flag. It needs to wait for a while before continuing with its execution. In the current SSTL design, an 
Infineon 515 CAN controller is used as the power system microcontroller [Keil-06, Plan-05]. In 
this work, it is assumed that an SMCSlite microcontroller will be responsible for this. At the DAD 
task invocation time, the supervisor will send a request to the power system SMCSlite. 
Section 7.3.4 calculates the longest time that may elapse before the supervisor can send its 
packet to any of the OBDH units. This time is required because the requested port of the router 
may be engaged in a packet transfer and the supervisor may need to wait until this port is free. 
This time is found to be 0.061 ms. Once the requested port is available, a short request from the 
supervisor will reach the power system quickly. Taking into account the programming time for 
the SMCSlite ADC peripheral, a 1 ms time interval is proposed as the DAD task waiting time for 
the power system flag.
The execution of the test sequence constitutes the next step of the DAD task execution. 
Software-based self-test (SBST) algorithms can be used for the periodic testing of the OBC processor 
[Pasc-05]. The SBST technique tests the processor components using the processor’s instruction 
set. The key concept of the SBST is the generation of an efficient self-test program that achieves 
high fault coverage without processor modifications. The SBST algorithms are based on the 
divide-and-conquer approach: processor components and the corresponding component operations 
are identified. Then, for every component under test within the processor and for every operation 
of that component, test patterns are generated. After that, the test patterns are transformed into 
self-test routines (consisting of processor instruction sequences). All self-test routines together 
constitute a test sequence.
An SBST technique is generally involved. Alternatively, a simple test program, such as a CRC, can 
be used. The selection of the test sequence is based on a trade-off between task simplicity, 
execution time and fault coverage. In this work, a CRC calculation program was developed in the C 
language. The generator polynomial used is based on CRC-16 (x^16 + x^12 + x^5 + 1). The 
execution time for the program was found to be 10 µs. The test PC in this case is an eMachines 
etower 400i, which contains an Intel PII processor with a 400 MHz clock frequency. This machine 
contains 128 KB of cache memory. The Intel PII has a 12-stage pipelined architecture [Inte-97]. A 
program’s execution time can generally be described by equation no. 7.2 [Pasc-05].
$\mathrm{PT} = \mathrm{CCT} \times (\mathrm{CC} + \mathrm{PSC} + \mathrm{MSC})$ (Equation No. 7.2)
This equation shows that the CPU execution time depends not only on the clock time but also on 
existence of pipelines and cache in the architecture. In a pipelined architecture, the pipelining can 
improve performance by the depth of pipeline stages provided there is no pipeline stall [Prab-05]. 
The inclusion of cache memory in the architecture increases its speed if the cache memory is 
faster than the main memory. If the main RAM is fast enough that the CPU does not need to insert a 
wait state while accessing this memory, any additional memory speed will not improve performance, 
because the CPU is already running as fast as it can. In this analysis, it is assumed that the main 
memory is fast enough for a 386EX with an 80 ns bus cycle time and therefore MSC can be 
ignored.
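For reference, the kind of CRC test sequence whose execution time is analysed here can be sketched as a bitwise CRC using the generator polynomial x^16 + x^12 + x^5 + 1 (0x1021). This is an illustrative Python sketch, not the C program used in this work; the initial register value and the most-significant-bit-first processing order are assumptions, since the text does not state them.

```python
GENERATOR_POLY = 0x1021  # x^16 + x^12 + x^5 + 1 (the x^16 term is implicit)


def crc16(data: bytes, init: int = 0x0000) -> int:
    """Bitwise CRC over 'data', MSB first, no reflection, no final XOR."""
    crc = init
    for byte in data:
        crc ^= byte << 8                  # bring the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:              # top bit set: shift and subtract the polynomial
                crc = ((crc << 1) ^ GENERATOR_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


result = crc16(b"123456789")
high_byte, low_byte = result >> 8, result & 0xFF  # the two CRC bytes of a 16-bit result
print(hex(result), hex(high_byte), hex(low_byte))
```

A test sequence of this form produces a two-byte result that can be compared against a known value, which is why a mismatch is a usable fault signature.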
As information on PSC is not available, an ideal pipeline is assumed, and the CPU clock 
cycles for the CRC program are therefore calculated in equations no. 7.3 and 7.4. The 386 microprocessor’s 
execution time for this CRC sequence is calculated using equation no. 7.2 (both MSC and PSC 
have been ignored) and the result is presented in equation no. 7.5.
$\mathrm{PT(PII)} = 10\ \mu\mathrm{s} = \frac{1}{400 \times 10^{6}} \times \mathrm{CC}$ (Equation No. 7.3)
$\mathrm{CC} = 4000$ (Equation No. 7.4)
$\mathrm{PT(386)} = \frac{1}{25 \times 10^{6}} \times 4000 = 160\ \mu\mathrm{s}$ (Equation No. 7.5)
As the 386EX does not have a pipelined architecture, it will be slower than the PII. The effect of the 
pipeline stages of the PII processor has been included in equation no. 7.6 to produce the final 
result.
$\mathrm{PT(386)} = 160 \times \mathrm{NPS(PII)} = 160 \times 12 = 1920\ \mu\mathrm{s} = 1.92\ \mathrm{ms}$ (Equation No. 7.6)
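The scaling from the measured PII execution time to the 386EX estimate can be reproduced with a few lines of arithmetic. The sketch below works in microseconds and megahertz so that the products are exact; the variable names are illustrative.

```python
PT_PII_US = 10     # measured CRC execution time on the PII test PC
F_PII_MHZ = 400    # PII clock frequency
F_386_MHZ = 25     # 386EX clock frequency
NPS_PII = 12       # number of PII pipeline stages


def execution_time_us(cct_us: float, cc: float, psc: float = 0, msc: float = 0) -> float:
    """Equation 7.2: PT = CCT * (CC + PSC + MSC), with PSC and MSC ignored by default."""
    return cct_us * (cc + psc + msc)


cc = PT_PII_US * F_PII_MHZ                        # Equations 7.3 and 7.4: 4000 clock cycles
pt_386_us = execution_time_us(1 / F_386_MHZ, cc)  # Equation 7.5: 160 us
pt_386_scaled_us = pt_386_us * NPS_PII            # Equation 7.6: 1920 us = 1.92 ms

print(cc, pt_386_us, pt_386_scaled_us / 1000)
```

Multiplying by the full pipeline depth treats the PII as gaining the ideal speed-up of one instruction per stage, which makes the 386EX figure a pessimistic bound.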
The execution of the test sequence constitutes the major part of the DAD task execution on the 
OBC processor and therefore, it seems reasonable to assume 10 ms for the DAD execution.
When the DAD task requests to send a DAD packet to the supervisor, it will be allocated an 
FCB, which will be placed in the send queue. The maximum number of FCBs for a send queue is 
5 [Jack-05]. Assuming a worst-case scenario, there are four FCBs ahead of it, all with the maximum 
packet length of 1024 bytes. The OBC uses DMA to transfer data. Assuming a fly-by transfer 
mode, 1 bus cycle/machine cycle would be required to complete one transfer between memory 
and the CODEC FPGA. The CODEC FPGA will pass these data on to the communication DPRAM. 
The CODEC FPGA is assumed to run at a frequency of 25.175 MHz. The reason for choosing 
this frequency is that a Celoxica example program was run on a test bed [Celo-04], which will be 
described in chapter 8. A coding and decoding time for an IP packet was estimated using this 
program and was found to be 258 µs. A 3D Plus radiation tolerant MMDP16480606JCCD device 
with an access time of 30 ns is assumed as the DPRAM. Both the CODEC FPGA (25.175 MHz) and 
the SMCSlite (5 MHz) will require 1 clock cycle to read/write 16 bits of data from/to the DPRAM. The 
time required by four packets with maximum length is calculated as follows:
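$\mathrm{MOT_o} = (\mathrm{SQL} - 1) \times \mathrm{PED} + \mathrm{PS} \times (\mathrm{SQL} - 1) \times \left(\frac{1}{f_{\mathrm{OBC}}} \times 4 + \frac{1}{f_{\mathrm{FPGA}}} + \frac{1}{f_{\mathrm{SMCSlite}}}\right)$ (Equation No. 7.7)
$\mathrm{MOT_o} = 4 \times 0.258 + \frac{1024}{2} \times 4 \times (160 + 40 + 200) \times 10^{-6} = 1.851\ \mathrm{ms}$ (Equation No. 7.8)
The CODEC FPGA passes a packet data item to the other side while waiting for the next item and 
therefore there is an overlapping period. Thus the above equation can be modified to equation no. 7.9.
$\mathrm{MOT_o} = 4 \times 0.258 + \frac{1024}{2} \times 4 \times (200 + 160) \times 10^{-6} = 1.769\ \mathrm{ms}$ (Equation No. 7.9)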
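The two MOTo figures can be reproduced as follows. This is a sketch of the arithmetic only: per-word times are in nanoseconds (160 ns for four OBC clocks at 25 MHz, 40 ns for one FPGA clock, 200 ns for one SMCSlite clock), and the factor 10^-6 converts nanoseconds to milliseconds. The names are illustrative.

```python
PED_MS = 0.258            # packet encoding delay
SQL = 5                   # send queue length
PS_WORDS = 1024 // 2      # maximum packet size in 16-bit words
T_OBC_NS = 160            # 4 clock cycles at 25 MHz
T_FPGA_NS = 40            # 1 clock cycle at 25.175 MHz (rounded)
T_SMCSLITE_NS = 200       # 1 clock cycle at 5 MHz
NS_TO_MS = 1e-6


def moto_ms(per_word_ns: float) -> float:
    """Delay caused by the (SQL - 1) maximum-length packets queued ahead."""
    packets_ahead = SQL - 1
    return packets_ahead * PED_MS + PS_WORDS * packets_ahead * per_word_ns * NS_TO_MS


moto_serial_ms = moto_ms(T_OBC_NS + T_FPGA_NS + T_SMCSLITE_NS)  # Equation 7.8
moto_overlapped_ms = moto_ms(T_SMCSLITE_NS + T_OBC_NS)          # Equation 7.9

print(round(moto_serial_ms, 3), round(moto_overlapped_ms, 3))
```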
This time delay can be improved using the priority field of the FCB, as it will allow a better 
scheduling of the I/O requests based on their priorities. However, it is currently not in use and 
therefore, this discussion sticks with MOTo calculated above.
Therefore, DAD turnaround time is
$\mathrm{DAD1} = \mathrm{PT_o} + \mathrm{MOT_o} = 10 + 1.769 = 11.769\ \mathrm{ms}$ (Equation No. 7.10)
7.3.2 Communication Delay Experienced by a DAD Packet
The communication delay experienced by a DAD packet or MTT is the time required by a DAD 
packet to travel from the OBC to the supervisor and it can be calculated as follows:
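$\mathrm{MTT} = \mathrm{PED} + \mathrm{DAD\text{-}PS} \times \left(\frac{1}{f_{\mathrm{OBC}}} \times 4 + \frac{1}{f_{\mathrm{SMCSlite}}}\right) + \frac{\mathrm{PS} \times 16}{\mathrm{LS}} + \mathrm{DAD\text{-}PS} \times \left(\frac{1}{f_{\mathrm{SMCS332}}} + \frac{1}{f_{\mathrm{supervisor}}} \times 12\right) + \mathrm{PDD}$ (Equation No. 7.11)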
Where packet data size is 4 bytes and UDP/IP header will require 28 bytes.
$\mathrm{MTT} = 0.258 + 16 \times (160 + 200) \times 10^{-6} + \frac{32 \times 8}{200 \times 10^{6}} + 16 \times (40 + 1000) \times 10^{-6} + 0.258 = 0.541\ \mathrm{ms}$ (Equation No. 7.12)
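The individual contributions to MTT can be evaluated term by term. In this sketch the link term is converted from seconds to milliseconds, and PDD is assumed to equal the 0.258 ms coding and decoding time used for PED. Evaluated this way, the terms sum to roughly 0.540 ms, within about 0.002 ms of the 0.541 ms quoted in equation no. 7.12.

```python
PED_MS = 0.258                 # packet encoding delay
PDD_MS = 0.258                 # packet decoding delay (assumed equal to PED)
DAD_PS_WORDS = 16              # 4 data bytes + 28 header bytes = 32 bytes = 16 words
PACKET_BITS = 32 * 8
LS_BPS = 200e6                 # link speed
NS_TO_MS = 1e-6

obc_side_ms = DAD_PS_WORDS * (160 + 200) * NS_TO_MS          # OBC bus cycle + SMCSlite clock
link_ms = PACKET_BITS / LS_BPS * 1e3                         # serialisation time on the link
supervisor_side_ms = DAD_PS_WORDS * (40 + 1000) * NS_TO_MS   # SMCS332 clock + 8051 machine cycle

mtt_ms = PED_MS + obc_side_ms + link_ms + supervisor_side_ms + PDD_MS
print(round(mtt_ms, 4))
```

The encoding and decoding delays account for more than 95% of the total, so the result is insensitive to the exact link and bus timings.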
7.3.3 Supervisor Processing Time
A DAD packet from the OBC will be processed as follows. The CODEC FPGA, and hence the
communication DPRAM, acts as a memory-mapped I/O device. The 8051 microprocessor
uses the MOVX command to communicate with external data memory and will use the same
command for the transfer of packet data to and from the CODEC FPGA.
MOVX A,@DPTR            // 16-bit addressable external RAM data byte to accumulator
MOV  direct,A           // Move accumulator to direct internal RAM
// Repeat these two commands N times to move N DAD data bytes from DPRAM to internal RAM
MOV  A,direct0          // Move first byte of data to accumulator
ANL  A,#01H
JZ   Screech            // Is it a DAD or Screech packet?
MOV  A,direct0
ANL  A,#04H
JNZ  current-flag       // If the test sequence was not executed in the expected time, jump to
                        // current-flag to set a flag to disregard the current consumption
                        // value received from the power system controller
MOV  A,direct0          // Move first byte of data to accumulator
ANL  A,#08H
JZ   SR-test            // Check how long it has been since SR was last signalled
MOV  A,direct0          // Move first byte of data to accumulator
ANL  A,#10H
JZ   current-flag1      // Did the power system controller raise its flag before time-out?
                        // If it did not, test its health
MOV  A,direct1
CJNE A,direct(CRC-result-high-byte),Recovery1
MOV  A,direct2
CJNE A,direct(CRC-result-low-byte),Recovery1
MOV  A,direct0
ANL  A,#02H
JNZ  SEU-Check          // If the packet holds an SEU count, jump to SEU-Check
SEU-Check: MOV A,direct3
CLR  C
SUBB A,direct(SEU-Threshold)
JC   Recovery2
The DAD packet consists of 4 bytes: the ‘flags’ field is 1 byte long, 2 bytes hold the CRC
result and 1 byte holds the SEU count. Table 7-1 below lists all the commands which have been used in
the above code, their functions and their execution periods in terms of the number of machine cycles
required. It results in a supervisor processing time of 41 machine cycles or 41 µs (as 1 machine
cycle completes in 1 µs).
Table 7-1 8051 Instructions Details
| Instruction | Function | Number of Times Appeared in the Program | Machine Cycles |
|---|---|---|---|
| ANL A, immediate-byte | AND Accumulator with immediate byte | 5 | 1 |
| CJNE A, direct, Label | Compare A with direct (internal) RAM byte and jump if not equal | 2 | 2 |
| CLR C | Clear carry bit | 1 | 1 |
| JC Label | Jump if carry bit is set | 1 | 2 |
| JNZ Label | Jump if A is not zero | 2 | 2 |
| JZ Label | Jump if A is zero | 3 | 2 |
| MOV A, direct | Move direct RAM byte to A | 8 | 1 |
| MOV direct, A | Move A to direct RAM byte | 5 | 1 |
| MOVX A,@DPTR | Move 16-bit addressable external data RAM byte to A | 5 | 2 |
| SUBB A, direct | Subtract the direct byte from A with borrow (carry) | 1 | 1 |
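The 4-byte DAD packet layout and the flag tests performed by the supervisor listing can be illustrated with a small decoder. The bit masks are those used by the ANL instructions in the listing; the constant names, the class and the function are hypothetical, and the meaning attached to each bit is inferred from the listing comments rather than from a packet specification.

```python
from dataclasses import dataclass

# Bit masks from the ANL instructions; names and interpretations are assumptions.
FLAG_DAD = 0x01         # set: DAD packet, clear: Screech packet
FLAG_SEU_COUNT = 0x02   # set: the packet carries an SEU count
FLAG_TEST_LATE = 0x04   # set: test sequence not executed in the expected time
FLAG_SR = 0x08          # bit tested before the SR-test branch
FLAG_POWER = 0x10       # set: power system controller raised its flag before time-out


@dataclass
class DadPacket:
    flags: int
    crc: int
    seu_count: int

    def has(self, mask: int) -> bool:
        """True if the given flag bit is set."""
        return bool(self.flags & mask)


def parse_dad_packet(raw: bytes) -> DadPacket:
    """Split a 4-byte DAD packet into flags, 16-bit CRC result and SEU count."""
    if len(raw) != 4:
        raise ValueError("a DAD packet is exactly 4 bytes long")
    flags, crc_high, crc_low, seu_count = raw
    return DadPacket(flags, (crc_high << 8) | crc_low, seu_count)


packet = parse_dad_packet(bytes([0x13, 0x31, 0xC3, 7]))
print(packet.has(FLAG_DAD), hex(packet.crc), packet.seu_count)
```

The byte order follows the listing, which compares the second packet byte with the high byte of the expected CRC result and the third with the low byte.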
7.3.4 Timers Required at the Supervisor
This section aims at estimating different time-out conditions, which need to be established for 
latency bound calculations of the scheme. The DAD time-out period is based on parameters, 
which have been described in preceding sections, and it can be calculated as follows 
$T_1 = \mathrm{DAD1} + \mathrm{MTT} + \mathrm{PT_s} = 12.351\ \mathrm{ms}$ (Equation No. 7.13)
As mentioned in Chapter 6, the SpaceWire standard recommends the use of timers at the 
application level to monitor health of a link. These timers include link-establishment, packet 
transmit and packet receive. A link should be initialised in 100 µs [Park-05]. Therefore 
$T_2 = 100\ \mu\mathrm{s} = 0.1\ \mathrm{ms}$ (Equation No. 7.14)
Both the packet transmit and packet receive time-out periods will depend upon the application 
(expected packet length) [Park-05]. As the SMCS332 device does not include any internal timers, all 
3 time-out conditions need to be monitored at the supervisor. Interrupts supported by the SMCS332 
will allow interrupting the supervisor when a packet written by the supervisor into the 
communication memory has been transmitted or a packet received from the link has been written 
into the communication memory by the SMCS332. This implies that the supervisor needs to take 
care of the link initialization and packet transmit time-out conditions. When it needs to write a 
polling request packet, it can first check the status of the corresponding link (say the supervisor is 
connected to the OBC on channel 1). It will read the route control status register to check whether 
channel 1 is empty, has a connection with the internal FIFO (communication memory), or 
is connected to port 2 or port 3. If channel 1 is occupied, the supervisor will need to wait until it 
is free. Assuming that a packet of maximum size is being transmitted/received on channel 1, a 
time-out period can be calculated as follows:
$T_8 = \frac{\mathrm{PS} \times 16}{\mathrm{LS}} + \frac{\mathrm{PS}}{f_{\mathrm{SMCS332}}} = 0.061\ \mathrm{ms}$ (Equation No. 7.15)
Once it is free, the supervisor will program it to send the polling request and will set timer T2. After 
the time-out, it will test the channel 1 status register to check whether the link is transmitting or not. 
If it finds it running, it will set the packet transmit timer T3. If it is not interrupted by the SMCS332 to 
report transmission of the packet before this timer expires, it will assume that the packet could not 
be transmitted in T3 time and hence recovery will be invoked.
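$T_3 = \frac{\mathrm{PS} \times 16}{\mathrm{LS}} + \frac{\mathrm{PS}}{f_{\mathrm{SMCS332}}} = \frac{29 \times 8}{200 \times 10^{6}} + 15 \times \frac{1}{25 \times 10^{6}} = 0.002\ \mathrm{ms}$ (Equation No. 7.16)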
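Both channel time-outs follow the same pattern: the time to serialise the packet on the link plus the time for the SMCS332 to move its words. The sketch below evaluates them in microseconds, using bits divided by Mbit/s and words divided by MHz; the function name is illustrative.

```python
LS_MBPS = 200         # link speed in Mbit/s
F_SMCS332_MHZ = 25    # SMCS332 clock frequency


def channel_time_us(bits: int, words: int) -> float:
    """Link serialisation time plus one SMCS332 clock per word, in microseconds."""
    return bits / LS_MBPS + words / F_SMCS332_MHZ


# T8: a maximum-size packet (512 words of 16 bits) occupying the channel
t8_us = channel_time_us(bits=512 * 16, words=512)
# T3: the 29-byte polling request, handled as 15 words
t3_us = channel_time_us(bits=29 * 8, words=15)

print(round(t8_us / 1000, 3), round(t3_us / 1000, 3))  # values in ms
```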
It is not possible to set a packet receive timer if the unit is using communication memory with the 
SMCS332. This is because the SMCS332 will interrupt the supervisor only 
when it has received a complete packet, which has also been written into communication memory. 
However, the timer may be used when there is no communication memory and the SMCS332 is 
programmed to pass packet bytes on to the supervisor as soon as it receives them. The communication 
memory is recommended for two reasons:
1. With large packets, the system would otherwise require the intervention of the controlling CPU very often.
2. The communication memory will be required to buffer an incoming packet if its 
destination port is busy.
The time-out period for the response of the supervisor’s polling request to the OBC CODEC 
FPGA can be calculated as follows:
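$T_5 = \mathrm{MTT} + 2 \times \mathrm{PED} + 2 \times \mathrm{PS} \times \left(\frac{1}{f_{\mathrm{FPGA}}} + \frac{1}{f_{\mathrm{SMCSlite}}}\right) = 1.064\ \mathrm{ms}$ (Equation No. 7.17)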
Similar to the supervisor’s polling request, the supervisor will need to send an INTR or NMI 
request to the OBC unit. The time-out for an INTR/NMI request will include the processing time at the 
OBC and the MOTo for the response packet, and it is calculated in equations no. 7.18 and 7.19.
$T_6 = T_7 = T_5 + \mathrm{PT_o(INTR/NMI)} + \mathrm{MOT_o}$ (Equation No. 7.18)
As mentioned in chapter 6, the OBC will be allowed 10 ms before the next recovery attempt, i.e. 
PTo(INTR/NMI) = 10 ms, and hence
$T_6 = T_7 = 12.833\ \mathrm{ms}$ (Equation No. 7.19)
The number of timers required at the supervisor seems to be a bottleneck for the choice of an 
appropriate processing unit to act as the supervisor. For instance, for monitoring the OBC, the 
supervisor needs to monitor 7 time-out conditions. However, it may be possible to serialize the use 
of these timers in a time slice, e.g. a timer will first be used to monitor the T1 condition; once the DAD 
packet has been received or the time-out has occurred, the same timer will be used to monitor the next time-out 
condition in the sequence. In fact, no two timers are required in parallel for monitoring the 
OBC. By analogy, if more than one unit is to be monitored by the supervisor, one timer 
will be required for each unit. In this work, the supervisor does not run any operating system. 
However, it may well be possible to adopt an operating system which supports software timers. 
Alternatively, a simple microprocessor such as the 8051 can be used in combination with a 
programmable interval timer such as the 8254. If the supervisor is monitoring the underlying units in 
series, it can use the same timer to monitor any one unit at a time and therefore, the number of 
timers is not an issue.
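The serialized use of a single timer can be sketched as a loop that re-arms the same timer for each time-out condition in turn. This is an illustrative model only: the function, its arguments and the example response times are hypothetical, and a real supervisor would arm a hardware timer rather than compare recorded delays.

```python
def monitor_in_series(conditions, response_times_ms):
    """Check time-out conditions one after another with a single timer.

    conditions: ordered list of (name, timeout_ms) pairs.
    response_times_ms: observed delay for each condition; a missing entry
    means the expected event never arrived.
    Returns the name of the first condition that timed out, or None.
    """
    for name, timeout_ms in conditions:
        elapsed_ms = response_times_ms.get(name, float("inf"))
        if elapsed_ms > timeout_ms:
            return name  # the single timer expired; recovery would start here
    return None


obc_conditions = [("T1", 12.351), ("T2", 0.1), ("T3", 0.002), ("T5", 1.064)]

healthy = monitor_in_series(
    obc_conditions, {"T1": 11.9, "T2": 0.05, "T3": 0.0018, "T5": 1.0}
)
no_dad_packet = monitor_in_series(obc_conditions, {"T2": 0.05})
print(healthy, no_dad_packet)
```

Because each condition is only checked after the previous one has been resolved, one timer per monitored unit is sufficient in this model.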
7.3.5 Supervisor Time Slice
The supervisor time slice (STS) is defined as the maximum time required for the DAD packet 
generation and its processing, plus the maximum time required by the supervisor for application of 
the longest recovery in one supervisory cycle/invocation period.
In order to improve the fault detection latency of the system, the supervisor is required to collect 
DAD packets more often. The supervisor will apply more than one recovery depending upon the 
signatures observed. Recoveries which involve resetting of the CPU or communication controller 
and power cycling of the OBC unit can take longer, as these will require reinitialization of the 
device/devices in the OBC unit. For instance, after a reset or power cycling of the OBC, the 
system will require reloading of the program memory from the code store and reinitialization 
of all hardware devices. The exact time required for this system model is not known. 
An estimate has been made that will be presented in section 7.3.8. It is therefore proposed that the 
supervisor will offer a diagnostic of the OBC up to sending a reset command for the OBC processor 
in any one invocation period. The STS can therefore be calculated as follows:
$\mathrm{STS} = T_1 + T_2 + T_3 + T_5 + T_6 + T_7 + T_8 = 39.244\ \mathrm{ms} \approx 40\ \mathrm{ms}$ (Equation No. 7.20)
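The STS figure can be rebuilt from the intermediate results of this chapter. The sketch below uses the values quoted in equations 7.10 to 7.19; the variable names are illustrative.

```python
DAD1_MS = 11.769       # DAD turnaround time (Equation 7.10)
MTT_MS = 0.541         # maximum time to travel (Equation 7.12)
PTS_MS = 0.041         # supervisor processing time (41 machine cycles)
MOTO_MS = 1.769        # outgoing packet turnaround time (Equation 7.9)
PTO_INTR_NMI_MS = 10.0 # time allowed to the OBC for an INTR/NMI request

t1 = DAD1_MS + MTT_MS + PTS_MS           # Equation 7.13
t2 = 0.1                                 # Equation 7.14
t3 = 0.002                               # Equation 7.16
t5 = 1.064                               # Equation 7.17
t6 = t7 = t5 + PTO_INTR_NMI_MS + MOTO_MS # Equations 7.18 and 7.19
t8 = 0.061                               # Equation 7.15

sts_ms = t1 + t2 + t3 + t5 + t6 + t7 + t8  # Equation 7.20
print(round(t1, 3), round(t6, 3), round(sts_ms, 3))
```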
7.3.6 Invocation Period
In order to have a better fault detection latency, the invocation of the supervisory round should be 
as frequent as possible. However, the following two factors will directly affect this period.
1. As the execution of a DAD task requires OBC computational as well as communication 
resources, the invocation period will be
P = STS + margin
The margin is required to trade for an acceptable OBC performance overhead. For this 
system model, it is assumed that this margin should be equal to the number of OBC tasks 
multiplied by an operating system time slice. Assuming 9 tasks excluding the DAD task 
and a time slice of 100 ms, the invocation period will be 940 ms ≈ 1 s. Thus, each OBC 
task will be allowed to execute at least once before the DAD task is invoked in the next 
cycle. Also, a 1 s interval is currently being used on SSTL satellites to refresh a watchdog 
timer [Wood-05].
2. This analysis has considered a two-node system, where the supervisor is monitoring the 
OBC. However, this mitigation scheme is actually meant to address the possibility 
of SEFIs in more than one data handling unit. Therefore, in an actual implementation the 
supervisor will be monitoring more than one unit. The invocation period needs to take 
into account the effect of multiple nodes.
The supervisor communicates with the network nodes using simple DAD packets. As can be 
seen from the OBC DAD packet, little processing is required at the supervisor. Most of the time, 
the supervisor is required to monitor different time-out conditions. The supervisor time slices can 
either be serialized or overlapped. Both schemes are compared in table 7-2.
Table 7-2 Time Sharing vs Overlapped STS
| Time Sharing | Overlapped |
|---|---|
| Simpler supervisor code | Complexity will arise if the supervisor receives another DAD packet while it is processing the previous packet |
| The supervisor will receive a packet while it is expecting a packet, and therefore no packets are lost | The supervisor will be interrupted on arrival of each packet. A packet may arrive while the supervisor is in the ISR for the previous packet, and therefore the interrupt signalling arrival of the new packet will be lost and the packet can be overwritten |
| Requires timers to mark the start of the time slices | Requires support for a multitasking environment |
| Requires fewer timers. Timers used in one time slice can be used in the next time slice | Requires more timers, as underlying units are being monitored in parallel |
| Results in longer detection and recovery latencies | Shorter latencies |
| P > N × STS, where N is equal to the number of nodes in the system and STS is assumed to be the same for all units | P > longest STS in the system |
| Less efficient in terms of usage of the supervisor computing resources | More efficient |
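The invocation period and the two lower bounds in table 7-2 can be compared numerically. This is a minimal sketch; the function names are illustrative, and the three-node example is a hypothetical configuration rather than one analysed in this work.

```python
def invocation_period_ms(sts_ms: float, n_tasks: int, time_slice_ms: float) -> float:
    """P = STS + margin, with the margin equal to n_tasks operating system time slices."""
    return sts_ms + n_tasks * time_slice_ms


def min_period_time_sharing(sts_per_node_ms) -> float:
    """Serialized time slices: P must exceed the sum of all STSs (N x STS if equal)."""
    return sum(sts_per_node_ms)


def min_period_overlapped(sts_per_node_ms) -> float:
    """Overlapped time slices: P must exceed only the longest STS."""
    return max(sts_per_node_ms)


p_ms = invocation_period_ms(sts_ms=40, n_tasks=9, time_slice_ms=100)
three_nodes = [40, 40, 40]  # hypothetical: three units with the same STS
print(p_ms, min_period_time_sharing(three_nodes), min_period_overlapped(three_nodes))
```

With identical time slices the serialized bound grows linearly with the number of nodes, while the overlapped bound stays at one STS.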
7.3.7 Fault Detection Latency
The fault detection latency (FDL) is defined as the time interval between the occurrence of a fault 
at the OBC and its detection at the supervisor. As shown in figure 7-6, a fault can occur 
anywhere between points A and C′. The latency will be lowest if the fault occurs between points A 
and B, because it will be detected by the supervisor at point C. In a worst case scenario, a fault 
will occur at the OBC right after sending a DAD packet, i.e. at point B. The supervisor will 
not be able to detect this fault until it misses the next DAD packet at point C′. Therefore,
$\mathrm{WCFDL} = P - \mathrm{DAD1} + T_1 = 1000.582\ \mathrm{ms} = 1.000582\ \mathrm{s} \approx 1\ \mathrm{s}$ (Equation No. 7.21)
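The worst case figure follows directly from the values derived earlier. The sketch below evaluates equation no. 7.21 with P = 1 s; the variable names are illustrative.

```python
P_MS = 1000.0      # invocation period
DAD1_MS = 11.769   # DAD turnaround time (Equation 7.10)
T1_MS = 12.351     # DAD packet time-out (Equation 7.13)

# A fault just after point B waits for the rest of the period (P - DAD1)
# and is then detected when the next DAD packet times out (T1).
wcfdl_ms = P_MS - DAD1_MS + T1_MS
print(round(wcfdl_ms, 3))
```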
Figure 7-6: Worst Case Detection Latency (timeline showing Point A: DAD invocation at the OBC, Point B: DAD packet sent from the OBC and Point C: DAD packet time-out, followed by the corresponding points A′, B′ and C′ of the next invocation, with the intervals DAD1 and T1 marked)
Therefore, the FDL depends on the invocation time of the DAD task as well as on the DAD task 
execution time. Based on the discussion presented in sections 7.3.1 to 7.3.6, the fault 
detection latency is proportional to the following factors:
$\mathrm{FDL} \propto P \propto \mathrm{STS}, N$ (Equation No. 7.22)
The STS consists of DAD packet generation and processing, and depends on the length of the string of recoveries which 
need to be addressed in one round. DAD packet generation and processing is mainly made up of the 
DAD1 execution time, as the packet length is small, leading to a short time to travel on the network, and 
little processing is required at the supervisor. If the DAD1 time remains within 10 ms, it represents a 
good execution time, as the general range of a quantum for operating systems is 10-100 ms [Silb-98], 
and it implies that DAD1 lies within one quantum. The length of the recovery strings which need to be 
addressed in one STS presents a trade-off between detection and recovery latency.
The number of nodes in the system may significantly increase the FDL where the STSs are serialized. 
Its effect will be less prominent with overlapped STSs.
7.3.8 Recovery Latency
Recovery attempts will start from the expiration of T1. The worst case recovery latency (WCRL) will 
be experienced if a SEFI occurs right after sending a DAD packet and it requires power cycling 
to recover, with no observable current consumption signatures. In this case 
$\mathrm{WCRL} = P - \mathrm{DAD1} + P\ (\text{issuance of reset command}) + P\ (\text{issuance of power cycling command}) + \mathrm{MTMR}$ (Equation No. 7.23)
An OBC reset or power reboot command will require a complete reload of program memory from the 
code store to the OBC, and initialization and programming of all hardware devices within the 
OBC unit. Memory reload can be considered the most time-consuming procedure and therefore 
WCRL is calculated on the basis of this time period. SSTL currently uses floating gate memory 
with a read access time of 129 ns [Wood-05], and the same access time will be used for this 
analysis. The CODEC FPGA has a frequency of 25.175 MHz, thus resulting in a clock period of 
40 ns. Access to the above-mentioned floating gate memory with this FPGA will require 4 wait 
states to be inserted at the FPGA. Each network packet can have a maximum size of 1 kbyte. A 
450 kbyte program memory will require at least 450 network packets.
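$\mathrm{MTMR} = 450 \times \left(\mathrm{PED} + \mathrm{PS} \times \left(\frac{1}{f_{\mathrm{FPGA}}} \times 4 + \frac{1}{f_{\mathrm{SMCSlite}}}\right) + \frac{\mathrm{PS} \times 16}{\mathrm{LS}} + \mathrm{PS} \times \left(\frac{1}{f_{\mathrm{OBC}}} \times 4 + \frac{1}{f_{\mathrm{SMCSlite}}}\right) + \mathrm{PED}\right) = 416.52\ \mathrm{ms}$ (Equation No. 7.24)
$\mathrm{WCRL} = 3 \times P - \mathrm{DAD1} + \mathrm{MTMR} = 3.405\ \mathrm{s} \approx 3.4\ \mathrm{s}$ (Equation No. 7.25)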
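The memory reload time and the resulting worst case recovery latency can be reproduced as follows. The sketch assumes 512-word packets, 160 ns for four clock cycles at both the FPGA and the OBC, and 200 ns per SMCSlite clock; the variable names are illustrative.

```python
PED_MS = 0.258        # packet encoding (and decoding) delay
PS_WORDS = 512        # 1 kbyte packet in 16-bit words
LS_BPS = 200e6        # link speed
N_PACKETS = 450       # 450 kbytes of program memory, 1 kbyte per packet
NS_TO_MS = 1e-6

code_store_side_ms = PS_WORDS * (4 * 40 + 200) * NS_TO_MS  # FPGA with 4 wait states + SMCSlite
link_ms = PS_WORDS * 16 / LS_BPS * 1e3                     # serialisation on the link
obc_side_ms = PS_WORDS * (160 + 200) * NS_TO_MS            # OBC bus cycle + SMCSlite

per_packet_ms = PED_MS + code_store_side_ms + link_ms + obc_side_ms + PED_MS
mtmr_ms = N_PACKETS * per_packet_ms                        # Equation 7.24

P_MS = 1000.0
DAD1_MS = 11.769
wcrl_ms = 3 * P_MS - DAD1_MS + mtmr_ms                     # Equation 7.25
print(round(mtmr_ms, 2), round(wcrl_ms / 1000, 3))
```

The three invocation periods dominate the result, so the recovery latency is governed mainly by the choice of P rather than by the reload time.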
7.4 Conclusions
As the proposed scheme trades lowered fault coverage for a reduction in mitigation cost, fault 
detection and recovery latency is an important criterion for comparing its effectiveness against other 
schemes. The latency analysis reveals that the DAD execution time at the OBC is fairly short when using a 
CRC as the test sequence. It is shown that the OBC processor will be occupied by the DAD task 
for about 10 ms in an invocation period of 1 s. Owing to its short length, the DAD packet requires 0.541 
ms for its transfer from the OBC to the supervisor in the same invocation period. This 
demonstrates that execution of the DAD task on the OBC will not pose a large overhead on the OBC 
computing and network resources.
During an STS, the supervisor spends most of its time monitoring time-out conditions. 
Processing of the DAD packet and issuing recovery commands do not require long execution 
times. The supervisory approach has actually been put forward to monitor more than one OBDH 
unit. In order to use the supervisor’s resources more efficiently and to have a shorter invocation 
period, overlapped supervisor time slices are favoured.
The number of timers required to monitor the different time-out conditions seems to be a problem. 
However, the supervisor monitors these time-out conditions in series and therefore will require 
only one timer for an STS.
The scheme introduces a latency in the detection of a SEFI. The worst case detection latency is estimated 
to be 1 s, which is similar to the watchdog timer interval at the SSTL OBC unit. Thus, the 
proposed scheme presents an extended SEFI detection capability with the same detection latency 
as the current watchdog timers, at a performance overhead of 1% (10 ms in every 1 s) for the OBC computing 
resources and an overhead of 0.0541% (0.541 ms in every 1 s) on the communication network. The worst case recovery latency 
is estimated to be 3.4 s. This is a large improvement on the present SSTL spacecraft, which 
requires two ground passes to recover the OBC after a crash.
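The overhead figures can be checked by expressing the busy time per invocation period both as a fraction and as a percentage. This is a sketch of the arithmetic only; note that a fraction of 0.01 corresponds to 1%, and a fraction of 0.000541 to 0.0541%.

```python
P_MS = 1000.0          # invocation period
DAD_CPU_MS = 10.0      # OBC processor time used by the DAD task per period
DAD_NETWORK_MS = 0.541 # network time used by the DAD packet per period (MTT)


def overhead(busy_ms: float, period_ms: float):
    """Return the overhead as (fraction, percentage)."""
    fraction = busy_ms / period_ms
    return fraction, fraction * 100


cpu_fraction, cpu_percent = overhead(DAD_CPU_MS, P_MS)
net_fraction, net_percent = overhead(DAD_NETWORK_MS, P_MS)
print(cpu_fraction, cpu_percent, net_fraction, net_percent)
```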
CHAPTER 8 
SYNCHRONIZATION BETWEEN THE OBC AND THE 
SUPERVISOR
The supervisor and the OBC represent two computing nodes on a data network. These two will 
not share the same program memory and clock. As the supervisor needs to monitor the OBC, 
coordination is required between the two nodes. Clocks in different nodes in a distributed system 
tend to drift apart from each other with the passage of time, because they typically do not tick at 
exactly the same rate. Synchronization bounds this drift: there exists some δ such that at any instant the two nodes agree 
on the current time to within δ. Synchronization is necessary to establish a global ordering of
events. Real time systems also use this global time base to allow each node to determine both
when it must complete its own tasks, and when other nodes must complete theirs.
Synchronization may be accomplished by having all nodes use the same external time source, e.g. 
coordinated universal time (UTC). Alternatively, it may also be possible to allow each node to 
use its own clock, and to limit differences between them with a synchronization algorithm.
Synchronization algorithms may be classified according to the methods used by nodes to inform 
one another of their current clock values. Hardware algorithms use a dedicated network to 
broadcast each clock signal. Network algorithms send messages across the existing 
communications network. Hardware algorithms provide tight synchronization, but are expensive 
to implement: the number of extra communication lines needed is on the order of n² for an 
n-node system. Network algorithms require no additional hardware, but the tightness of 
synchronization is limited by uncertainty in communication delay. They also place an additional 
load on the communications network [Olso-95].
However, the nature of the proposed supervisor approach does not impose tight synchronization 
requirements. The OBC and the supervisor need to exchange information periodically (period 
will be 1 s). A margin can be added in latency to account for any differences in clock frequency. 
In this part of the thesis, two simple coordination methods to coordinate activities of the 
supervisor and the OBC are presented and a synchronization margin is estimated for two 
computing units connected on a data network.
8.1 Coordination Methods
Two coordination methods are considered to coordinate the OBC and the supervisor.
8.1.1 Using One Timer Only
A simple way of coordinating two network nodes is polling. In this scheme, the supervisor will 
always expect a DAD packet in response to its polling request. Therefore, the OBC does not need 
any timer to observe an invocation period. After initialization, the DAD task will start 
listening on the network until it receives a packet from the supervisor requesting it to produce a DAD packet.
After sending a polling request, the supervisor will expect to receive a DAD packet before a 
time-out condition. The time-out in this case corresponds to the maximum time required for 
generation of the DAD packet plus its travel time.
This scheme is simple. The overhead comes in the form of an extra network packet. As the packet size is 
small and the invocation period is not too short, it presents an acceptable solution.
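The polling scheme can be illustrated with a minimal sketch. It is not the thesis implementation (which used Visual C++ and an FPGA interface); it assumes two UDP sockets on the local loopback interface, and the payloads `POLL` and `DAD`, the function names and the time-out values are all hypothetical.

```python
import socket
import threading

POLL = b"POLL"  # hypothetical polling request payload
DAD = b"DAD"    # hypothetical DAD packet payload


def obc_dad_task(sock):
    """OBC side: no timer is used; block until a poll arrives, then answer."""
    request, supervisor_addr = sock.recvfrom(1024)
    if request == POLL:
        sock.sendto(DAD, supervisor_addr)


def supervisor_poll(sock, obc_addr, timeout_s):
    """Supervisor side: send one poll and wait for the DAD packet or a time-out."""
    sock.settimeout(timeout_s)
    sock.sendto(POLL, obc_addr)
    try:
        reply, _ = sock.recvfrom(1024)
        return reply == DAD
    except socket.timeout:
        return False  # time-out: the OBC did not answer in time


def demo():
    obc = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    obc.bind(("127.0.0.1", 0))
    sup = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sup.bind(("127.0.0.1", 0))
    task = threading.Thread(target=obc_dad_task, args=(obc,), daemon=True)
    task.start()
    answered = supervisor_poll(sup, obc.getsockname(), timeout_s=1.0)
    # The OBC task has now exited, so a second poll goes unanswered.
    missed = supervisor_poll(sup, obc.getsockname(), timeout_s=0.2)
    task.join(1.0)
    obc.close()
    sup.close()
    return answered, missed
```

Calling `demo()` exercises both outcomes: the first poll is answered, while the second one times out because nothing is listening any more, which is the condition the supervisor would treat as a possible fault.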
8.1.2 Using Two Timers
Polling packet overhead can be avoided, if there is a way to synchronize timers at the supervisor 
and the OBC. One way of achieving this synchronization is the use of UTC timers. The GPS 
consists of a constellation of 24 satellites at an altitude of 20,000 km and can be used for 
positioning on land, at sea, in air or in space. An on-board GPS receiver can autonomously 
provide position, velocity and orbit determination, and accurate time synchronisation. SSTL flew 
the first microsatellite GPS receiver on PoSAT-1 in 1993. On request from a unit, the GPS 
receiver provides current UTC time, which will be updated every second with an interrupt from 
the GPS receiver [Plan-05]. It is therefore possible to set a timer to expire at a particular UTC 
time. In such a situation the supervisor will not poll the OBC and will wait for a periodic packet 
from the OBC within a time period which is equal to OBC-DAD-task invocation period + 
maximum possible DAD generation and travel time. The OBC timer will cause the generation of 
a DAD packet.
In a situation where a global time source is not available, synchronization between the two timers can 
be achieved by increasing the time-out period at the supervisor by a margin corresponding to δ. The overhead in this 
scheme is thus this increase in the time-out period, leading to correspondingly longer detection and 
recovery latencies.
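As a small worked sketch of the time-out used in this method, the function below adds the three contributions named in the text: the invocation period, the maximum DAD generation and travel time, and the synchronization margin δ. The function name and the numeric values for the latency and the margin are hypothetical; only the 1 s invocation period comes from the text.

```python
def supervisor_timeout_s(invocation_period_s, max_dad_latency_s, sync_margin_s=0.0):
    """Time the supervisor waits for a periodic DAD packet before declaring a time-out.

    sync_margin_s plays the role of delta: it is zero when both timers follow a
    global time source and positive when they run on free-running clocks.
    """
    return invocation_period_s + max_dad_latency_s + sync_margin_s


# Hypothetical figures: 1 s period, 0.1 s worst-case generation and travel time.
with_global_time = supervisor_timeout_s(1.0, 0.1)
without_global_time = supervisor_timeout_s(1.0, 0.1, sync_margin_s=0.05)
```

The difference between the two results is exactly the margin, which is the overhead this scheme pays in detection latency.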
8.2 Test Bed
A test bed was built to evaluate the above-mentioned coordination methods. In order to obtain 
meaningful results from these experiments, it was necessary to have a test bed with 
significant analogies to the actual system, particularly in the context of the experimental 
objectives.
In order to simulate coordination methods, it was required to establish a periodic packet 
communication between two processor-based systems that can act as the OBC and the supervisor. 
As mentioned previously, this research proposal favours the use of SpaceWire but it can be 
implemented on any fairly fast data network. For the test bed, first the feasibility of purchasing 
two SpaceWire interface nodes was investigated but the cost involved was too high to be afforded 
within the project funding. Therefore, it was decided to use an Ethernet network instead. Similar to 
SpaceWire, Ethernet covers the lower two layers of the OSI model. Also, this thesis proposes the use 
of UDP/IP over SpaceWire for the proposed OBDH, and the adoption of UDP/IP makes the 
underlying network transparent to the application program. The physical topology of the 
two networks is not the same, but this is not an issue in a network with two nodes.
In the OBC unit, two components contribute to packet communication with the supervisor: 
the OBC computer and the interface node. Therefore, the OBC can be thought of as two 
individually programmable parts, the computer system and the network I/O. A standard PC architecture 
is more involved. Although it does provide an Ethernet connection and the means for implementing 
communications over it, it does not allow the user to program anything except the CPU. Also, 
the software supported by a PC represents a high level of abstraction. For example, a PC running the 
Windows XP operating system, which contains the TCP/IP protocol suite, uses sockets to implement 
UDP communication between two machines. This software structure handles all aspects of 
communication and requires minimal intervention from the user. In order to have a system closer to 
the actual proposal, an FPGA-based board (Celoxica RC 203) was purchased to act as the interface 
FPGA [Celo-04].
This board is capable of providing communication on Ethernet and offers other interfaces, such as 
parallel port and RS-232 serial port, which can be used to connect it to a host PC. Another major 
advantage is the software environment provided with the board. This software includes the DK 
design suite, which is a Microsoft Windows based tool for the design and creation of FPGA 
designs using the Handel-C language. The DK design suite provides an integrated development
environment (IDE), which allows a user to edit and compile Handel-C source code files to 
produce EDIF netlists. The DK IDE includes a graphical user interface (GUI), Handel-C 
compiler, which translates Handel-C source code into hardware description that is further 
compiled into executable code. In addition to the DK compiler, Celoxica provides the platform 
developer’s kit (PDK), which is composed of tools and libraries to support DK. Two of the 
PDK’s components that were used during the design of this test bed are the platform specific library 
(PSL) and the platform abstraction layer (PAL). The PSL is a general I/O abstraction specific to a 
board. The PAL is a standardized application program interface (API), which provides access to 
the board via the PSL implementation or device I/O features. The cost of the board was high, so it was 
not possible to have a board for both the OBC and the supervisor. Instead, it was decided that this 
board would be used as an Ethernet interface for the OBC-processor.
As mentioned in section 7.1.5, the CODEC FPGA is connected to the OBC-processor on its 
memory data bus and is addressed and controlled using I/O pins, allowing 16 bits of data to be 
transferred in 1 OBC-processor’s clock cycle. The RC 203 board provides serial and parallel port 
interfacing with a computing unit. Software provided with the board makes it suitable to be used 
with a standard PC. A parallel port interface was adopted because of its higher speed compared 
to RS-232 communication. Hence, it was decided to use a PC to act as the OBC. The supervisor 
was simulated on another PC, chosen for its availability and ease of programming. A 
Microsoft Visual C++ software environment was adopted to program the machines for the 
required functionality. The UDP/IP protocol was adopted to provide communication between two 
programs running on the OBC-PC and the supervisor-PC. The experimental test bed is shown in 
Figure 8.1.
The test PCs belong to the DELL Optiplex GX110 product series. Both PCs are similar, are 
based on Intel Pentium III processing units and operate at a frequency of 664 MHz. A 3Com 
Linkbuilder FMS TP hub connects these PCs and the RC203 board on an Ethernet network running 
at a speed of 10 Mbps.
Figure 8-1 Experimental Test Bed (labelled components: Celoxica RC203, Ethernet Hub)
The RC203 is a platform for evaluation and development of high-performance FPGA-based 
applications. Figure 8.2a shows the rear of the board and marks all key devices available on the 
board. Figure 8.2b shows the front side of the board.
Figure 8-2a Celoxica RC 203 (Rear of the Board)
Figure 8-2b Celoxica RC 203 (Front of the Board)
8.3 Functional Description
The development of this test bed required three programs: the OBC program, the 
supervisor program and the interface FPGA program. This section provides an overall functional 
description of the test bed.
The OBC program has been developed on top of an example that was provided with the RC 203 
board. The program first configures the FPGA and initializes the send protocol. This protocol is 
part of the Celoxica software and provides macros to transfer data between the host PC (OBC-PC) 
and the FPGA on the parallel port.
The test PCs ran the Windows XP operating system, which contains the TCP/IP protocol suite. This 
protocol suite allows the adoption of either TCP or UDP to connect two programs running on two 
machines. The supervisor program uses the socket API to establish communication over the UDP/IP 
protocol [Come-04]. This API is an interface between the application program and the communication 
protocol using operating system services. The supervisor contains a thread which, for coordination 
method 1, sends a packet to the OBC on Ethernet and then waits on the receive-from socket call; for 
coordination method 2, it provides receive functionality only. The WaitForSingleObject 
function waits for this thread. If the thread has not completed during the specified time, this
function causes a time-out and returns control to the main program. The WaitForSingleObject 
function takes two arguments: a handle to an object (here, the thread) and a time-out interval in milliseconds. 
The function blocks execution of the program until the specified object is in the signalled state (i.e. it 
has completed its execution) or the specified time interval has elapsed. The value returned by 
the function can be used to determine what caused the function to return. If an error occurs, 
the function returns a wait-failed condition (WAIT_FAILED), but this never happened in our experiments. If a time-out 
period of INFINITE is specified, the function returns only when the specified object is in the 
signalled state.
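The wait-with-time-out pattern described above can be sketched in Python as an analogue of the Win32 call, not as the thesis code. Here `Thread.join(timeout)` stands in for WaitForSingleObject, `timeout_ms=None` stands in for INFINITE, and the function names and delays are hypothetical.

```python
import threading
import time


def wait_for_thread(thread, timeout_ms=None):
    """Wait for a thread to finish; report whether it finished or the wait timed out."""
    timeout_s = None if timeout_ms is None else timeout_ms / 1000.0
    thread.join(timeout_s)
    return "timeout" if thread.is_alive() else "signalled"


def send_receive(delay_s):
    # Stands in for sending a packet and blocking until the reply arrives.
    time.sleep(delay_s)


# The reply arrives well inside the time-out period.
fast = threading.Thread(target=send_receive, args=(0.01,), daemon=True)
fast.start()
fast_result = wait_for_thread(fast, timeout_ms=1000)

# The reply is late, so the main program regains control through the time-out.
slow = threading.Thread(target=send_receive, args=(0.5,), daemon=True)
slow.start()
slow_result = wait_for_thread(slow, timeout_ms=50)

# With no time limit the wait returns only when the thread has completed.
unlimited_result = wait_for_thread(slow)
```

The last call corresponds to the configuration used for the error measurements in this chapter, where the send-receive thread was allowed to complete without any time limit.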
The FPGA program uses two resources on the board, Ethernet and parallel port. The program 
starts with initialization, running and enabling the required resources. The software provided with 
the board defines macros for communication on the Ethernet network. It requires the user to 
provide Ethernet destination address, Ethernet type of the packet and data byte count. The user is 
not required to perform CRC calculations and packet encoding/decoding for the Ethernet. 
However, the program is required to take care of the UDP/IP aspects of the communication. Two 
types of packets were encountered during development of the test bed including address 
resolution protocol (ARP) packets and UDP packets.
8.3.1 Address Resolution Protocol
As mentioned previously, the PC-based programs use the socket API to communicate on the 
network. It requires specifying an IP address and a UDP port number to communicate with a 
remote program. However, IP addresses are maintained by software. The underlying 
network hardware does not understand IP addresses, so a destination IP address alone does not tell it 
which hardware address to use to send the packet. Thus, before protocol software can send a packet across the 
physical network, the software must translate the IP address of the destination computer into an 
equivalent hardware address. Mapping between a protocol address and a hardware address is 
called address resolution [Come-04]. There are three techniques for address resolution.
1. The table lookup technique, where mappings are stored in memory.
2. The closed-form computation technique, where the protocol address assigned to a 
computer is chosen carefully so the computer’s hardware address can be computed from 
the protocol address using basic Boolean and arithmetic operations.
3. The message exchange technique where computers exchange messages across the 
network to resolve an address. One computer sends a message that requests an address
binding (translation), and another computer sends a reply that contains the requested 
information.
Which technique is used to translate a protocol address into a hardware address depends on the 
protocol and hardware-addressing scheme. The TCP/IP protocol stack can use any one of the three 
methods; the method chosen for a particular network depends on the addressing scheme used by 
the underlying hardware. For a LAN with static addressing, the message exchange method is used. 
To guarantee that all computers agree on the exact format and meaning of the message used to 
resolve the address, the TCP/IP protocol suite includes an address resolution protocol (ARP). The 
ARP standard defines two basic message types: a request and a response. A request message 
contains an IP address and requests the corresponding hardware address; a response contains both the 
IP address sent in the request and the hardware address.
Whenever a PC, which contains TCP/IP suite, receives a packet from the application software, it 
broadcasts an ARP request message to all nodes in the network. A packet destined for a remote 
destination cannot be delivered until an ARP response has been received from the remote 
destination. Therefore, it was extremely important to enable the interface FPGA to recognize an ARP 
request and send back an ARP response. Whenever the software places an ARP request or reply 
in an Ethernet frame, it needs to specify 0x0806 as the Ethernet type. Hence, examining the 
Ethernet type field reveals whether a frame carries an ARP message.
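The ARP handling described above can be sketched as follows. This is an illustrative Python sketch rather than the Handel-C program running on the interface FPGA. It assumes the frame is presented without preamble and CRC (which the board's Ethernet macros handle), uses the standard 28-byte ARP layout for Ethernet and IPv4, and the function names and addresses are hypothetical.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ARP_REQUEST, ARP_REPLY = 1, 2
# htype, ptype, hlen, plen, oper, sender MAC, sender IP, target MAC, target IP
ARP_FORMAT = "!HHBBH6s4s6s4s"


def ethertype(frame):
    """The Ethernet type sits in bytes 12-13, after the two 6-byte hardware addresses."""
    return struct.unpack("!H", frame[12:14])[0]


def arp_reply_for(frame, my_mac, my_ip):
    """Return an ARP reply frame if `frame` is an ARP request for my_ip, else None."""
    if len(frame) < 42 or ethertype(frame) != ETHERTYPE_ARP:
        return None
    htype, ptype, hlen, plen, oper, sha, spa, _tha, tpa = struct.unpack(
        ARP_FORMAT, frame[14:42])
    if oper != ARP_REQUEST or tpa != my_ip:
        return None
    # The reply swaps the roles: this node is the sender, the requester is the target.
    body = struct.pack(ARP_FORMAT, htype, ptype, hlen, plen, ARP_REPLY,
                       my_mac, my_ip, sha, spa)
    return sha + my_mac + struct.pack("!H", ETHERTYPE_ARP) + body


# Hypothetical addresses for a supervisor PC and the interface board.
PC_MAC, PC_IP = bytes.fromhex("020000000001"), bytes([192, 168, 0, 1])
BOARD_MAC, BOARD_IP = bytes.fromhex("020000000002"), bytes([192, 168, 0, 2])

request = (b"\xff" * 6 + PC_MAC + struct.pack("!H", ETHERTYPE_ARP)
           + struct.pack(ARP_FORMAT, 1, ETHERTYPE_IPV4, 6, 4, ARP_REQUEST,
                         PC_MAC, PC_IP, b"\x00" * 6, BOARD_IP))
reply = arp_reply_for(request, BOARD_MAC, BOARD_IP)
```

A frame whose Ethernet type is 0x0800 is not answered by this function; it would instead be passed to the UDP/IP handling.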
8.3.2 User Datagram Protocol
A protocol that allows an individual application program to serve as the end-point of 
communication is known as an end-to-end or transport protocol. The UDP is a transport protocol. 
The UDP uses a connectionless communication paradigm. That is, an application using UDP does 
not need to establish communication before sending data, nor does the application need to 
terminate communication when finished. Furthermore, the UDP allows an application to delay an 
arbitrarily long time between the transmissions of two packets. More importantly, if two 
applications stop sending data, no other packets are exchanged. That is, UDP does not have any 
control messages; communication consists only of the data messages themselves. Hence, it has a 
low overhead.
Ethernet frame: Header | Data (ARP message) | CRC
Figure 8.3a ARP Packet, Ethernet Type is 0x0806
Ethernet frame: Header | Data (IP header + IP data) | CRC, where IP data = UDP header + UDP data
Figure 8.3b UDP Packet, Ethernet Type is 0x0800
Both coordination methods are depicted in figure 8.4, where an invocation period of 1 s is 
assumed.
Figure 8-4 Left: Coordination method 1, Right: Coordination method 2
Flow charts for all three components of the test bed are shown in figures 8.5 to 8.7.
Figure 8.5 Flow Chart for Interface FPGA Program
Figure 8-6 Flow Chart for the Supervisor Program (steps recoverable from the chart: start; execute a while (1) loop; call the WaitForSingleObject function)

8.4 Error Performance of the Test Bed
An error is a bound on the precision and accuracy of the result of a measurement. Errors can be 
classified into two types: statistical errors and systematic errors. Statistical errors are caused by 
random (and therefore inherently unpredictable) fluctuations in the measurement approach, 
whereas systematic errors are caused by an unknown but nonrandom effect.
In order to evaluate the error performance of the test bed, a set of measurements was taken. A 
continuous two-way communication was established between the OBC and the supervisor 
programs. The supervisor send-receive thread was allowed to complete its execution without any 
time limit. This was achieved by setting the time-out period in the WaitForSingleObject function to 
INFINITE.
8.4.1 Components of the System
For experimental measurement, the Ethereal GUI network protocol analyzer was used to monitor 
traffic on the network. It allows the user to browse packet data from a live network or from a 
previously saved capture file. It shows IP source and destination, protocol type, and source and 
destination port for all captured packets. Selecting a packet from the list of captured packets 
shows total bytes captured on the network medium, Ethernet source and destination addresses, 
and number of data bytes in the packet. It is also possible to view complete contents of the packet. 
The tool also allows viewing time elapsed between each captured packet.
An Ethereal capture was started prior to execution of the OBC and the supervisor programs. The 
execution of these programs resulted in packet communication, and all of these packets were 
captured by Ethereal. Once a sufficient number of packets had been obtained, the capture was 
stopped and measurements were made on the capture file. Time was measured from the moment 
Ethereal captured a packet sent by the supervisor to the point when it captured the return packet from the 
OBC. One of the capture files is shown in figure 8-8.
Figure 8-8 Ethereal Graphical User Interface
Coordination method 1 was tested in two configurations, which are shown in figure 8-9.
Figure 8-9a: Configuration 1
Figure 8-9b: Configuration 2
As mentioned previously, the packet data content does not matter to any program in the test bed. 
The IP header consists of 20 bytes and the UDP header requires another 8 bytes. The minimum length 
of the data field of an Ethernet frame is 46 bytes and the maximum is 1500 bytes.
For a UDP/IP packet with a length of 46 bytes, there are 28 bytes of header and 18 bytes of data. 
Packets of similar length were sent from both the supervisor and the OBC program using the test 
bed. The test was initiated with 18 bytes of data. This number was then increased to 36, 72, 100 
and finally 500 data bytes. In each test run, each program was allowed to send about 50 packets, 
resulting in a total of about 100 packets, and an average time was calculated between a supervisor 
packet and an OBC response packet.
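The packet-length arithmetic used for these test runs can be written out as a short sketch. The header sizes and Ethernet limits are the ones quoted in the text; the function name is hypothetical.

```python
IP_HEADER_BYTES = 20
UDP_HEADER_BYTES = 8
ETHERNET_MIN_DATA = 46    # minimum Ethernet data field
ETHERNET_MAX_DATA = 1500  # maximum Ethernet data field


def udp_ip_packet_length(data_bytes):
    """Length of the UDP/IP packet carried in the Ethernet data field."""
    length = IP_HEADER_BYTES + UDP_HEADER_BYTES + data_bytes
    if length > ETHERNET_MAX_DATA:
        raise ValueError("packet does not fit in one Ethernet frame")
    # Shorter packets would be padded up to the Ethernet minimum.
    return max(length, ETHERNET_MIN_DATA)


# Data sizes used in the test runs.
lengths = {n: udp_ip_packet_length(n) for n in (18, 36, 72, 100, 500)}
```

With 18 data bytes the packet is exactly 46 bytes long, which is why 18 bytes was the starting point: it is the smallest UDP payload that needs no padding.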
In order to estimate statistical errors, the time required for 1 byte to travel through the test bed was 
calculated for multiple test runs. As mentioned earlier, the test set was started with 18 bytes of 
data. In the second experiment, the number of data bytes was increased to 36. The average time between a 
pair of packets in the first experiment was subtracted from the time obtained in the second experiment. This 
difference was then divided by 18. In summary, the increase in time from one experiment to the next was 
divided by the increase in the number of packet bytes. Results are summarized in table 8-1.
Table 8-1: Experimental Results: Configuration 1

Data bytes   Average time between a pair of packets (ms)   Time required for 1 byte (µs)
18           101.065                                        (101.264-101.065)/18 = 11.05
36           101.264                                        (101.822-101.264)/36 = 15.5
72           101.822                                        (102.229-101.822)/28 = 14.53
100          102.229                                        (103.677-102.229)/400 = 14.85
500          103.677
8.4.2 Systematic Errors
As can be seen from the results, the average time between a pair of packets is 101.065 ms for a 
packet with 18 data bytes. This time increased slightly for a packet with 36 data bytes, and so on. 
This implies that the system has presented systematic errors in measurements. In this section, the 
possible causes of these errors will be discussed.
As mentioned earlier, the interface FPGA receives the packet on Ethernet and then the packet is 
passed on to the OBC program via parallel port. The OBC program uses RC200GetBlockStall 
function to receive data from the board. This function requires the user to specify how many 
bytes are to be received. Therefore, the OBC program first expects to receive two bytes, which 
are used to calculate length of the packet, and then this count is specified in next call of the 
function to collect a packet from the interface FPGA. A call to RC200GetBlockStall function 
causes the program to block until the number of bytes, specified in the call, has been received. On 
the interface FPGA side, PAL macro procedures are used to read and write on both parallel port 
and Ethernet. These macros further incorporate PSL macro procedures to access board specific 
I/Os. The time taken by a macro can be variable, e.g. an Ethernet read request 
(EthernetReadBegin macro) checks if there is any packet ready to be read, and if there is no 
packet it will wait until a packet is available. There is a time-out of 0.5 s if there is no packet 
available by that time.
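The two-step receive described above (first two bytes for the length, then a blocking read of that many bytes) can be sketched as follows. This is a Python illustration over an in-memory stream, not the RC200GetBlockStall call itself; the helper names are hypothetical and the big-endian byte order of the length field is an assumption, since the text does not state it.

```python
import io


def read_exact(stream, count):
    """Keep reading until exactly `count` bytes have arrived, as a stalling read would."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream closed before all bytes arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def receive_packet(stream):
    """Read a 2-byte length field, then the packet body of that length."""
    length = int.from_bytes(read_exact(stream, 2), "big")  # byte order is an assumption
    return read_exact(stream, length)


# Two length-prefixed packets back to back on the same link.
link = io.BytesIO((5).to_bytes(2, "big") + b"hello" + (2).to_bytes(2, "big") + b"ok")
first = receive_packet(link)
second = receive_packet(link)
```

Because each call returns only when the requested byte count is complete, the time spent inside it depends on when the other side delivers the data, which is the variable delay discussed in this section.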
Two further experiments were performed to locate the part of the program, which is mainly 
responsible for the systematic errors seen in these experiments. Firstly, a program was loaded to 
the interface FPGA that performs read and write on Ethernet. It does not provide any
communication on the parallel port. A ping command was executed on the supervisor PC to ping 
the IP address of the interface FPGA. This resulted in an ARP request to the board. An ARP 
response was received from the board 246 µs after transmission of the ARP request. After 
these ARP messages, 8 ICMP ping ‘request’ and ‘reply’ packets were exchanged 
between the supervisor PC and the board. On average, the board took 270 µs to read a ping 
request, process it and write a ping response packet on the network. This demonstrates that the board 
takes a fairly short time for reading and writing on Ethernet.
In the second experiment, the OBC program was tested. The system was set up in 
configuration 1, shown in figure 8-9a. The RC200GetBlockStall function was 
replaced by the RC200GetBlock function. Both functions perform the same job, except that the 
former blocks execution of the program until valid data have been received. On average, the time 
between the supervisor packet entering the system and reception of the OBC packet at 
the supervisor was significantly reduced. However, in contrast to previous experiments where a 
consistent time was observed, in this experiment a jump from 1175 µs to 61.327 ms was 
observed between two consecutive pairs of packets. The RC200GetBlockStall function was 
required because of the nature of the OBC program, and therefore the rest of the discussion will 
include this function. These experiments verify that communication on the parallel port is largely 
responsible for the observed systematic errors.
The multiprogramming environment of the OBC PC has possibly caused these nonrandom 
systematic delays. This is particularly so in the case of the RC200GetBlockStall function. The 
execution of this function will cause the program to be switched into a wait state until the 
requested number of data bytes is available. Once these data bytes have arrived at the OBC PC, 
the OBC program will be placed in the ready queue by the operating system, and this ready-queue 
wait time will add delay to the OBC program execution.
8.4.3 Statistical Errors
The random fluctuations are reflected through the different byte travel times shown in table 8-1. 
The possible reasons for these statistical errors include the effects of the multiprogramming 
environment (e.g. when the OBC program has been placed into the ready queue, the number of 
programs that are allocated CPU time before the OBC program is switched into execution may 
vary in a PC environment, causing variable delays) and network traffic
(e.g. in this test bed, the Ethernet has been used. One PC may have to wait if the network link is 
in use by another node in the network).
In order to minimize statistical errors, two measures were taken. First, it was ensured that the test 
PCs were not running any other user programs during execution of the test. Secondly, 
unnecessary network traffic was minimized, i.e. the OBC node was not sending out network 
packets until it was required to do so. This resulted in a 95% confidence interval of 10.81 to 17.16 
µs for the mean time required for 1 byte.
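The quoted interval can be checked with a short calculation over the four per-byte times listed in Table 8-1. The sketch assumes a two-sided Student's t interval with 3 degrees of freedom; the text does not state which method was used, so this is an assumption, although it is consistent with the quoted bounds. The critical value 3.182 is the standard table value, and the function name is hypothetical.

```python
import math
import statistics

# Per-byte travel times in microseconds, as listed in Table 8-1.
per_byte_us = [11.05, 15.5, 14.53, 14.85]

# Two-sided 95% critical value of Student's t for n - 1 = 3 degrees of freedom.
# Using a t-interval is an assumption about the method behind the quoted figures.
T_CRIT_95_DF3 = 3.182


def confidence_interval(samples, t_crit):
    """Return (low, high) bounds of the confidence interval for the mean."""
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(len(samples))
    half_width = t_crit * std_err
    return mean - half_width, mean + half_width


low, high = confidence_interval(per_byte_us, T_CRIT_95_DF3)
print(f"95% confidence interval for the mean: {low:.2f} to {high:.2f} us")
```

With only four samples the interval is wide relative to the mean of about 14 µs, which reflects how few test runs contribute to the estimate.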
8.5 Experimental Results
The experimental results are organized into two parts. This section is a continuation of section 
8.4, i.e. the experimental set up remains the same. In section 8.6, a sleep instruction will be 
included in programs.
The coordination method 1 was tested in both configuration 1 and 2. A packet length of 46 was 
adopted, i.e. 18 bytes of data. A continuous packet communication was established between the 
OBC and the supervisor programs. These packets were captured using the Ethereal capture 
facility. In addition to calculation of an average time between two packets in a pair, the standard 
deviation of each test run was calculated.
Standard deviation is defined as a measure of how much the data in a certain collection are 
scattered around the mean or average value. A low standard deviation means that the data are 
more clustered around the mean; a high standard deviation means that they are widely scattered. 
Standard deviation of the measurements was calculated and is presented in tables 8-2 and 8-3 
[Stat-05]. It is used to indicate the variability of each coordination method. Results are 
summarized in table 8-2.
Table 8-2 Results Summary for Coordination Method 1

Experiment                                          Time measured between a supervisor packet and     Standard Deviation (µs)
                                                    a response packet from the OBC (µs)
Configuration 1 with RC200GetBlockStall function    101065                                            65.2
Configuration 1 with RC200GetBlock function         6667                                              1.813E+04
Configuration 2 with RC200GetBlockStall function    101844                                            1.865E+03
For coordination method 2, both OBC and interface programs were much simplified because only 
unidirectional communication was developed. The OBC was configured to send continuous 
packets, which were received on the supervisor. The time between two consecutive packets was 
measured. Results are presented in the table below for a packet with 18 bytes of data.
Table 8-3 Results for Coordination Method 2

Experiment        Time measured between two consecutive OBC packets (µs)   Standard Deviation (µs)
Configuration 1   636                                                      3.63
8.5.1 Discussion
The following observations can be made from experimental results presented in this section.
1 A packet travelling between the OBC and the communication network traverses two 
stages. Stage 1 constitutes the packet transfer between the OBC and the interface FPGA, while stage 
2 consists of the packet travelling on the network between the OBC unit and the supervisor. The 
experimental results show that stage 1 takes longer and causes more deviation from the 
mean value in a 2-node system.
2 The way in which a program is written can affect both delay and standard deviation.
3 A unidirectional program, e.g. the OBC program in the coordination method 2, is simpler, 
has a lower standard deviation and holds communication link for shorter time compared 
with a bidirectional program.
Observation # 1
In this test bed, stage 1 consists of the packet transfer on the parallel port between the OBC PC 
and the RC 203 board. A protocol called SendProtocol is provided by Celoxica to perform data 
transfers between the host PC and the FPGA on the parallel port. There are many layers of protocol 
between the host and the FPGA when performing SendProtocol reads and writes. The timing of a 
transfer will also be affected if there is any other application running on the host PC.
In the system model, stage 1 performs the packet transfer from the OBC to the OBC interface 
node. When a data send is requested by a task, the operating system will create a corresponding 
FCB and it will be passed on to the device driver send queue. The transmission of the packet will 
be delayed until the device driver is scheduled for execution on the CPU and will also depend on the 
number of FCBs ahead of the requested FCB. Therefore, a variable delay is likely to be seen. This 
situation can be improved by allocating priorities to FCBs, hence allowing less delay for higher 
priority packets.
In a multi-node system, an Ethernet network can exhibit an unbounded communication delay. 
The same is true for a SpaceWire network (here the arbitrary length of packets can cause two nodes 
of the network to occupy two ports of the router for an arbitrarily long period). In such a situation, 
stage 2 can also cause variable, long delays. However, in the proposed architecture, an assumed 
maximum network packet length and the supervisor’s periodic monitoring of network nodes will 
make the network behaviour more deterministic. The adoption of a maximum length for a 
network packet will ensure that two communicating nodes do not remain engaged for longer than a 
bounded time period (time period = maximum packet length / link data rate). The supervisor monitoring will guarantee 
that no network node is blocked, because of a fault, for longer than the longest recovery 
applied by the supervisor (as it is assumed that the supervisor will examine all underlying units, 
and hence their network interfaces, once in a supervisory round and will be able to get any faulty 
node back, at the latest, within the longest recovery time).
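The effect of a maximum packet length on how long two nodes can stay engaged can be illustrated with a small calculation. It assumes that the engagement time is the transmission time of one packet, that is, the packet length in bits divided by the link data rate; protocol overheads are ignored. The function name is hypothetical, and the example figures are the Ethernet maximum data length and the 10 Mbps link speed of the test bed.

```python
def max_engagement_time_s(max_packet_bytes, link_rate_bps):
    """Upper bound on the time one packet occupies the link, ignoring overheads."""
    return max_packet_bytes * 8 / link_rate_bps


# 1500-byte maximum packet on the 10 Mbps test-bed network.
bound_s = max_engagement_time_s(1500, 10_000_000)
print(f"a maximum-length packet occupies the link for {bound_s * 1e3:.1f} ms")
```

Without a maximum length this bound does not exist, which is why arbitrary packet lengths lead to arbitrarily long occupation of router ports.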
Observation # 2
There are two types of I/O calls, blocking and non-blocking. When an application issues a 
blocking I/O call, the execution of the application is suspended. The GetBlockStall function 
represents an example of a blocking call. The GetBlock function, on the other hand, returns 
quickly, with a return value that indicates how many bytes were received and therefore represents 
a non-blocking I/O call. The selection of blocking or non-blocking calls will therefore affect the 
delay experienced. However, the nature of the coordination method 1 requires blocking I/O.
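The difference between the two kinds of call can be sketched with a Python queue standing in for the parallel-port link. This is an analogue only, not the Celoxica API: `get_block` and `get_block_stall` are hypothetical names, and the non-blocking variant here returns `None` when nothing is available instead of a byte count.

```python
import queue
import threading


def get_block(link):
    """Non-blocking receive: return at once, with None if nothing has arrived."""
    try:
        return link.get_nowait()
    except queue.Empty:
        return None


def get_block_stall(link):
    """Blocking receive: suspend the caller until data is available."""
    return link.get(block=True)


link = queue.Queue()

# Nothing has been sent yet, so the non-blocking call returns immediately.
nothing_yet = get_block(link)

# A packet arrives 50 ms later; the blocking call waits for it.
threading.Timer(0.05, link.put, args=(b"packet",)).start()
stalled_result = get_block_stall(link)
```

A request-response exchange such as coordination method 1 needs the blocking form, because the program cannot usefully continue until the request has actually arrived.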
Observation # 3
A computing unit needs a communication interface either to transmit or send data to the external 
world or to collect or receive data. There is a natural difference between a send and receive call. 
An application will always call a send function if it has data ready to be sent. On the other hand, 
the receive function might not be able to return some data immediately; instead it involves 
waiting for data to be available.
In coordination method 1 (configuration 1), the time measured between a supervisor packet and a 
response OBC packet with the GetBlock function was found to be 1201 µs, which is close to the 
expected double of the time for coordination method 2 (configuration 1). As mentioned previously, the 
implementation of coordination method 1 requires the functionality of GetBlockStall, and in this case 
the time measurement is significantly higher than the time required for coordination method 2.
8.4 Resynchronization after an OBC-Program Crash
As mentioned in chapter 6, some SEFI recoveries may require longer than an invocation period. 
In such a situation, the supervisor will not be able to receive a DAD packet in the same 
invocation period and must specify a time-out condition corresponding to the expected recovery 
time before intervening in the OBC again. With this test bed, it was possible to compare which of the 
two coordination methods results in quicker resynchronization after an OBC crash.
For the test run, the OBC and supervisor programs were allowed to run in 
synchronization for a few seconds. A SEFI was then simulated on the OBC by manually 
closing the execution window of the OBC program. A second event was created by clearing the 
interface FPGA using the Celoxica file transfer utility (FTU) program (Figure 8-10), which 
allows the FPGA to be configured or cleared.
Figure 8-10 File Transfer Utility
Clearing the FPGA while the OBC program is running crashes the OBC program, which must then 
be reinitialized to bring it back into operation.
Generating either SEFI resulted in a time-out condition at the supervisor. As soon as the 
“time-out” message was printed in the supervisor program’s execution window, the OBC program was 
manually reinitialized.
8.5.1 Coordination Method 1
Instead of running timers on both subsystems, this scheme puts the OBC program into a wait 
state for a packet. Once a packet is received, the program checks its source; if the packet is 
from the supervisor, it sends a packet back to the supervisor.
If the supervisor program enters its send/receive thread and the thread is not signaled within the 
expected time, the WaitForSingleObject call times out, the thread is terminated and the main 
program declares a time-out condition. Immediately after the time-out appeared on the display 
screen of the supervisor program, the OBC program was manually reinitialized and both programs 
came back into synchronization. The time from pressing the execute button of the OBC 
program to the point where both the interface FPGA and the OBC program were ready to 
communicate with the supervisor was measured manually with a stopwatch and found to 
be 9 s. The time required for the programs to get back into synchronization was measured from the 
Ethereal file, from the last packet sent by the OBC before the fault event to the first 
packet received from the OBC program after its reinitialization. When the OBC program 
execution window was closed and the program then reset, the resynchronization time was measured 
to be 12 s, 299 ms and 644 µs. When the interface FPGA was cleared and the OBC program then 
reset, it was found to be 15 s, 339 ms and 501 µs.
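The behaviour of coordination method 1 can be sketched as follows. This is an illustration under stated assumptions, not the test-bed code: Python threads and queues stand in for the two PCs and the network link, the packet layout (a source label plus a payload) is hypothetical, and a queue time-out plays the role that WaitForSingleObject plays at the supervisor.

```python
import queue
import threading

SUPERVISOR = "supervisor"  # hypothetical source identifier

def obc_program(rx: queue.Queue, tx: queue.Queue, stop: threading.Event):
    """OBC side: wait for a packet, check its source, answer the supervisor."""
    while not stop.is_set():
        try:
            source, _payload = rx.get(timeout=0.05)
        except queue.Empty:
            continue
        if source == SUPERVISOR:
            tx.put(("obc", b"DAD"))

def poll_once(to_obc: queue.Queue, from_obc: queue.Queue, timeout_s: float):
    """Supervisor side: send a polling request, then wait with a time-out."""
    to_obc.put((SUPERVISOR, b"POLL"))
    try:
        return from_obc.get(timeout=timeout_s)
    except queue.Empty:
        return None  # time-out condition

def demo():
    to_obc, from_obc = queue.Queue(), queue.Queue()
    stop = threading.Event()
    obc = threading.Thread(target=obc_program, args=(to_obc, from_obc, stop))
    obc.start()
    healthy = poll_once(to_obc, from_obc, timeout_s=0.5)
    # Simulate an OBC crash by stopping the OBC thread, then poll again.
    stop.set()
    obc.join()
    crashed = poll_once(to_obc, from_obc, timeout_s=0.2)
    return healthy, crashed
```

While the OBC thread is running, the poll is answered with a response packet; once it has stopped, the same poll ends in a time-out at the supervisor.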
8.5.2 Coordination Method 2
The OBC program was configured to send a packet every 1 s. When the OBC program execution 
window was closed and the program then reset, the resynchronization time was measured to be 
11 s, 316 ms and 272 µs. When the interface FPGA was cleared and the OBC program then reset, 
it was found to be 14 s, 971 ms and 559 µs.
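A minimal sketch of coordination method 2 is given below, again with Python threads and a queue standing in for the two PCs and the link. The period and margin are scaled-down, illustrative values chosen so that the example runs quickly; they are not the 1 s period and whole-millisecond time-out used on the test bed.

```python
import queue
import threading
import time

def obc_sender(tx: queue.Queue, period_s: float, packets: int):
    """OBC side: send a DAD packet, then sleep for one invocation period."""
    for seq in range(packets):
        tx.put(("obc", seq))
        time.sleep(period_s)

def supervise(rx: queue.Queue, timeout_s: float) -> int:
    """Supervisor side: count DAD packets until one fails to arrive in time."""
    received = 0
    while True:
        try:
            rx.get(timeout=timeout_s)
        except queue.Empty:
            return received  # time-out condition declared
        received += 1

def demo(period_s=0.02, margin_s=0.2, packets=5):
    link = queue.Queue()
    # The sender stops after a fixed number of packets, which stands in
    # for an OBC crash as seen from the supervisor.
    threading.Thread(target=obc_sender, args=(link, period_s, packets)).start()
    return supervise(link, timeout_s=period_s + margin_s)
```

The supervisor sends nothing in this scheme; it only waits, with a time-out equal to the invocation period plus a margin, and declares a time-out once the packets stop.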
8.6 Comparison of Coordination Methods
The results demonstrate that coordination method 2 can resynchronize up to one invocation 
period sooner than coordination method 1. This is because, in coordination method 2, the 
OBC program sends a DAD packet as soon as it is back in operation. 
In coordination method 1, on the other hand, it first waits for a packet from the supervisor. It is 
possible that the OBC program becomes ready just after missing a packet from the supervisor, 
in which case it cannot send a DAD packet until it receives a supervisor packet in the next 
invocation period.
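As a worked check, the resynchronization times reported in sections 8.5.1 and 8.5.2 can be compared directly. The sketch below assumes a 1 s invocation period for both methods (the period stated for coordination method 2); because the measurements involved manual reinitialization, the differences are indicative only.

```python
INVOCATION_PERIOD_S = 1.0  # assumed: the OBC packet period of method 2

# Resynchronization times reported in sections 8.5.1 and 8.5.2, in seconds.
method_1 = {"window closed": 12.299644, "FPGA cleared": 15.339501}
method_2 = {"window closed": 11.316272, "FPGA cleared": 14.971559}

def resync_penalty(event: str) -> float:
    """Extra time coordination method 1 needed for the given crash event."""
    return round(method_1[event] - method_2[event], 6)

penalties = {event: resync_penalty(event) for event in method_1}
```

Under these assumptions the extra time for coordination method 1 is about 0.98 s for the closed-window event and about 0.37 s for the cleared-FPGA event, both within one invocation period.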
The implementation of coordination method 2 was based on the latter approach presented in 
section 8.1.2. The OBC and supervisor programs did not use a global time source; instead, 
the two programs communicated with a margin in the time-out period at the supervisor. 
Section 8.4.2 reveals that, on average, the generation of a packet and its transfer to the 
supervisor was completed in 636 µs. Adding a sleep instruction at the OBC caused this delay to 
increase. After sending each packet, the OBC program slept for 1 s. The interval between 
consecutive OBC packets was measured from the Ethereal file. The maximum interval was found to be 
1.001836 s and the minimum 1.001010 s, with most points clustered around 1.001440 s. The OBC 
program for coordination method 2 differed from the program used for the measurements in section 
8.4.1 only by the addition of a sleep instruction; the difference between 636 µs and 1010 µs can 
therefore be thought of as a margin required to compensate for the two clock sources running at 
the two nodes of the network. The time-out period set at the supervisor was 1.002 s (the 
WaitForSingleObject function used at the supervisor accepts only a whole number 
of milliseconds as the time-out period). Thus, 1 ms can be thought of as the margin for 
synchronization.
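The choice of 1.002 s can be reproduced with a short calculation. The sketch below assumes that the time-out is simply the longest observed inter-packet interval rounded up to a whole number of milliseconds; the function name is hypothetical, and integer microseconds are used to avoid floating-point rounding.

```python
def whole_ms_timeout(max_interval_us: int) -> int:
    """Smallest whole-millisecond time-out that covers the given interval."""
    return -(-max_interval_us // 1000)  # ceiling division

MAX_OBSERVED_US = 1_001_836  # longest gap between consecutive OBC packets

timeout_ms = whole_ms_timeout(MAX_OBSERVED_US)
headroom_us = timeout_ms * 1000 - MAX_OBSERVED_US
```

With the measured maximum of 1.001836 s this gives a time-out of 1002 ms, which leaves 164 µs of headroom above the longest interval observed.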
Like coordination method 2, coordination method 1 does not have a common clock or time 
source. In theory this should not be a problem, as only one timer is used, which initiates 
polling of the OBC. Experimentally, however, coordination method 1 required a 9 ms margin to 
avoid false time-out conditions. The time between a polling request and the OBC 
response packet varied from a minimum of 101448 µs to a maximum of 110175 µs, with most 
data points lying around 101480 µs. The range of variation is larger for coordination method 1 
(101448 to 110175 µs, standard deviation 2.367E+03 µs) than for coordination 
method 2 (1010 to 1836 µs, standard deviation 106 µs).
An increase in standard deviation was observed for both schemes when a sleep 
instruction was included. This can be explained by the fact that both PCs were running a 
multitasking operating system and that, by definition, the sleep function causes a task to 
relinquish the remainder of its time slice and become unrunnable for at least the specified 
number of milliseconds, after which the task is ready to run [Msdn-05]. As mentioned in chapter 7, 
a task in the ready queue is not guaranteed to run immediately. Consequently, the task may not run 
until some time after the specified interval has elapsed, which increases the jitter seen in the 
packet communication.
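The effect can be observed with a small measurement sketch. It uses Python's time.sleep rather than the Win32 Sleep function cited above, so it only illustrates the general behaviour; the request length and repeat count are arbitrary, and the values obtained depend on the machine, operating system and load.

```python
import time

def sleep_overshoots(request_s: float, repeats: int):
    """Return how much longer than requested each sleep actually took."""
    overshoots = []
    for _ in range(repeats):
        start = time.perf_counter()
        time.sleep(request_s)
        overshoots.append(time.perf_counter() - start - request_s)
    return overshoots

samples = sleep_overshoots(0.01, 20)
jitter_s = max(samples) - min(samples)  # spread of the overshoot
```

Each sample is the extra time the task spent waiting to be scheduled after its sleep interval had elapsed; the spread of these samples is the kind of jitter that appears in the inter-packet times.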
8.7 Implications of the Experimental Results
As mentioned in section 8.4, the experimental results are subject to systematic and statistical 
errors. These errors depend strongly on the speed of the target machines, the operating system 
running on these machines, the number of programs in execution during the test runs, the network 
standard and the traffic conditions. In addition, the calculation of the resynchronization period 
after an OBC crash involved human intervention. All these factors make the results qualitative 
in nature.
8.8 Conclusions
When calculating the time-out for a DAD packet at the supervisor, a synchronization margin is 
required to account for clock drift. Clock synchronization is an involved task for 
lockstepping-based schemes. The supervisory scheme, however, has fairly relaxed synchronization 
requirements. Two simple coordination methods were demonstrated on an experimental test bed. 
In coordination method 1, only one timer (a sleep instruction) was used, located at the 
supervisor, and the supervisor PC periodically polled the OBC program. In coordination method 2, 
a second timer was used at the OBC in addition to the supervisor’s timer, so the supervisor 
simply waited for a packet from the OBC until it timed out.
The experimental results have shown that coordination method 2 performs better than 
coordination method 1. It was simpler to develop, as it needs only unidirectional communication. 
It holds the communication link for far less time than coordination method 1: on average, 
sending an OBC packet to the supervisor required 636 µs with coordination method 2, 
compared with 101065 µs with coordination method 1. 
It also exhibited a lower standard deviation, i.e. a packet experiences less variation around the 
expected (mean) time. When a delay was added (using the sleep instruction), the standard 
deviation of both schemes increased. Coordination method 2 required a 1 ms synchronization 
margin to account for the above-mentioned factors, whereas coordination method 1 required 
9 ms.
Two OBC crash events were simulated: closing the OBC program execution window and 
clearing the FPGA using the FTU. In both events, coordination method 2 was able to get back into 
synchronization earlier than coordination method 1.
CHAPTER 9
CONCLUSIONS AND FUTURE WORK
This thesis has presented the results of an investigation into fault-tolerant, SOTA, heavily 
COTS-based OBDH architectures. A novel SEFI tolerance approach has been presented, its 
requirements have been established and its practicality has been investigated by addressing its 
latency bounds and synchronization issues. This chapter summarizes the discussion presented 
in this thesis and outlines possible future directions.
9.1 Concluding Overview
The adoption of COTS devices and standards has become a prevailing practice for space 
applications. Over the last 20 years, COTS technology has been flown in space systems with a 
reasonable expectation of success, as exemplified by Surrey’s 20+ COTS-based space missions. 
However, the complexity of COTS device technology has been increasing and functional 
interrupts have become serious threats to reliable COTS-based on-board computing.
SEFIs are distinguished from other SEEs by the inability to isolate the exact location of the 
fault and by their widespread effect on system performance. A SEFI in a device means that the 
device is potentially no longer available to the system. This situation can completely disrupt a 
mission, depending on the role played by the affected device. Although it is not possible to 
isolate the location of the fault, its detection is still possible. A SEFI can exhibit itself 
through many different signatures.
Device-level mitigations may not be very effective and therefore cannot be trusted while a device 
is suffering from a SEFI. Work to date has almost exclusively addressed SEFI mitigation at unit 
level, with one target device in question. Developing dedicated mitigations that target one device 
technology at a time has resulted in heavily redundancy-based solutions, particularly 
when these are combined in a complete OBDH architecture.
However, one may argue that not all mission types require the level of availability and fault 
coverage that heavy-redundancy solutions provide. Many current Earth-orbiting satellites use 
watchdog timers to detect a SEFI on microprocessors, leading to a reset or power cycling of the 
unit on a time-out. No on-board SEFI mitigation is provided for memories. Reconfigurable computing 
platforms and LANs are not commonly used. Such architectures lack many desirable performance 
features and are prone to long down times.
9.2 Achievements of This Research
In order to reduce the cost associated with SEFI mitigation, a system-level approach has been 
adopted. A rad-hard microprocessor acts as the supervisor and monitors heterogeneous OBDH 
units. Instead of traditional approaches to SEFI mitigation, this thesis uses a single packet 
from a unit to indicate the health status of more than one device.
This scheme is particularly useful for small, high-performance satellite OBDH architectures, 
where SOTA technology is desirable to meet the performance requirements of the mission and 
the system-level supervisor brings reliability. The combination of both aspects resulted in a 
reusable and scalable SEFI-tolerant OBDH architecture.
A case study of the supervisory scheme was demonstrated, in which an OBC unit was monitored by 
the supervisor. Traditionally, some form of mitigation has been adopted against SEUs and SELs. 
The same mitigation resources were used to produce diagnostic information for SEFI detection. 
A simple DAD task was added as an OBC task to collect the required information and to produce 
periodic DAD packets for the supervisor.
In order to increase the availability of the OBC, two measures were taken. Firstly, an on-board 
code store was adopted to hold any recovery information in non-volatile storage. In the case of 
an OBC crash, this information was used to bring the OBC back into operation. Secondly, a state 
recovery methodology was put forward to avoid any loss of computation after a SEFI. The state 
recovery was also helpful in extending the detection capability of the supervisor, enabling it 
to detect any OBC task crash.
The latency bound of the supervisory scheme was analyzed on a system model. It was found that 
the scheme is capable of detecting a SEFI with a worst-case latency of 1 s. This detection 
latency is similar to SSTL’s current watchdog interval for on-board computers; however, the 
supervisory scheme offers enhanced diagnosis. The performance overheads were found to be 
nominal, which makes this scheme an attractive candidate for future small satellite missions.
In addition to better fault coverage, the supervisory scheme broadens the set of recovery 
procedures. In contrast to the traditional reset approach, the supervisory scheme offers multiple 
recoveries. A particular recovery procedure is adopted according to the device under 
consideration, the fault signature observed and the recovery record of the unit involved. The 
worst-case recovery latency was found to be 3.4 s, which is a huge improvement on the current 
recovery methods adopted at SSTL, which require at least two ground passes to recover an OBC 
after a crash.
An experimental test bed was built and evaluated to establish the synchronization and coordination 
requirements of the supervisory scheme. Two coordination methods that can be used with such a 
scheme were explored. The first method consists of a periodic polling request from 
the supervisor to the OBC unit to invoke the DAD task. The second method also uses a timer at the 
OBC, so the supervisor does not need to send a polling request. As the OBC 
and the supervisor represent two units connected on a data network that do not share a common 
clock, some synchronization is required to account for any difference between the two clocks and/or 
any variability in packet generation and communication. Instead of adopting an involved 
synchronization algorithm, a synchronization margin was calculated, which must be 
added to the supervisor’s DAD packet time-out (T1) condition. It was found that 
coordination method 2 required a lower synchronization margin and exhibited lower variance.
In contrast to the strict synchronization requirements of N-modular redundancy-based designs, 
the supervisory scheme has flexible and easily achievable coordination and synchronization 
requirements. A synchronization margin of 1 ms was found sufficient to synchronize two units 
on a data network with a packet invocation period of 1 s.
9.3 Future Work
The research presented in this thesis can be extended in the following directions:
> Building the system model presented in this thesis, subjecting it to radiation testing or space 
flight, and comparing the performance of signature-based SEFI detection with simple 
watchdog SEFI detection and a TMR-protected OBC to determine the fault coverage of the 
scheme.
> This thesis details the supervisory protocol for the OBC. Only broad concepts are 
proposed for the payload computer, SSDR and code store; a detailed supervisory protocol is 
required for all these units.
> Likewise, only broad concepts are given for the OBC state recovery task. This task 
needs to be developed to fully explore its usefulness and any implications 
associated with it.
REFERENCE
[Acte-97] Actel Application Note, “Design techniques for radiation hardened FPGAs”, 1997.
[Acte-01] Actel Application Note, “Power-Up and Power-Down Behavior of 54SX and RT54SX Devices”, January 2001.
[Acte-03] Actel Documentation, “SX-A Automotive Family FPGAs”, September 2003.
[Acte-05] Actel Application Note, “RTAX-S radiation tolerant feature and mitigation techniques”, 2005.
[Ahme-90] R. E. Ahmed, Robert C. Frazier, and Peter N. Marinos, “Cache-Aided Rollback Error Recovery (CARER) Algorithms for Shared-Memory Multiprocessor Systems”, Proceedings of the 20th International Symposium on Fault-Tolerant Computing Systems, pp. 82-88, June 1990.
[Aker-95] L. D. Akers, “Microprocessor technology and single event upset susceptibility”, Proceedings of the AIAA/Utah Small Satellite Conference, September 1995.
[Angi-04] R. Angilly, “TREMOR: a triple modular redundant flight computer and fault tolerance test bed for the WPI pansat nanosatellite”, Proceedings of the 18th Annual AIAA/USU Conference on Small Satellites in Logan, Utah, August 2004.
[Answ-05] Answers.com, “List of Intel microprocessors”, http://www.answers.com/topic/list-of-intel-microprocessors
[Asen-98] V. Asenek, “Predicting the Reliability of Electronic Subsystems and Commercial-Off-The-Shelf Microprocessors on Low-Cost Small Satellites”, PhD Thesis, University of Surrey, 1998.
[Aviz-97] Avizienis, A., “Towards Systematic Design of Fault-Tolerant Systems”, IEEE Computer Magazine, April 1997.
[Bekt-92] BekTek, “SCOS reference manual, AMSAT-NA microsat and UoSat OBC 186”, December 1992.
[Bogr-05] H. Bogrow, “The continued evolution of reconfigurable FPGAs for aerospace and defense strategic applications”, Proceeding of the MAPLD International Conference, 2005.
[Brid-05] B. Bridgford and C. Carmichael, “SEU mitigation in reconfigurable FPGAs: picking the right tool for the job”, Proceedings of the MAPLD International Conference, 2005.
[Buch-03] S. Buchner, “Evaluation of commercial communication network protocols for space applications”, Presented at SEEWG, Los Angeles, November 2003. http://radhome.gsfc.nasa.gov/radhome/papers/SBEWG03_Network.pdf
[Buch-05] S. Buchner and E. Rodrigurz, “Pulsed laser test of Atmel chip”, http://nepp.nasa.gov/DocUploads/C4AlCBBE-8DAF-4845-943B1E851ADAFF3O/NRL 112103 ATMEL.pdf, accessed on 15-03-05.
[Carm-99] Carmichael, C. et al., “SEU Mitigation Techniques for Virtex FPGAs in Space Applications”, Proceeding of the MAPLD International Conference, 1999.
[Carm-01] C. Carmichael et al., “Proton testing of SEU mitigation methods for the Virtex FPGA”, Proceedings of the MAPLD International Conference, 2001.
[Carm-04] C. Carmichael et al., “Proton Testing of SEU Mitigation Methods for the Virtex FPGA”, WWW Document, http://klabs.org/richcontent/MAPLDConO 1/Papers/P/P6 Carmichael P.pdf, 13-06-04.
[Carm-05] C. Carmichael et al., “SEE validation of SEU mitigation methods for FPGAs”, Proceeding of the MAPLD International conference, 2005.
[Celo-04] Celoxica, “Platform developer’s kit, PAL”, 2004.
[Celo-05] Celoxica, “Platform developer’s kit, RC 200 hardware and PSL manual”, 2005.
[Cesc-03] M. Ceschia et al., “Identification and Classification of Single-Event Upsets in the Configuration Memory of SRAM-Based FPGAs”, IEEE Trans. on Nucl. Sci., Vol-50, no. 6, pp. 2088-2094, Dec 2003.
[Chau-99] Savio N. Chau, “Design of a fault-tolerant COTS-based bus architecture”, IEEE Trans. Nucl. Sci., Vol-48, Issue-4, pp. 351-359, December 1999.
[Chau-01] S. N. Chau, J. Smith, T. Tai, “A design-diversity based fault-tolerant COTS avionics bus network”, Proceedings of IEEE Pacific Rim International Symposium on Dependable Computing, pp. 35-42, December 2001.
[Chri-99] A. Z. Christen, “SMCSlite user manual”, Issue 1.1, 1999, http://www.spacewire.esa.int/tech/spacewire/products/index.htm accessed in June 2005.
[Chri-01] A. Z. Christen et al., “SMCS332 user manual”, Issue 2, 2001, http://www.spacewire.esa.int/tech/spacewire/products/index.htm accessed in June 2005.
[Cisc-05] Cisco Systems, “Internetworking technology handbook”, Nov. 2005, http://www.cisco.com/univercd/home/home.htm
[Come-04] D. E. Comer, “Computer networks and Internet”, 4th Ed., Prentice Hall, 2004.
[Cond-99] Conde, R. et al., “Adaptive Instrument Module - A Reconfigurable Processor for Spacecraft Applications”, The MAPLD Conference, 1999.
[Coss-99] J. R. Coss et al., “Device SEE Susceptibility Update: 1996-1998”, IEEE NSREC Data Workshop Record, 1999.
[Cris-91] F. Cristian and F. Jahanian, “A Timestamp-Based Checkpointing Protocol for Long-Lived Distributed Computations”, Proceedings of IEEE Symposium on Reliable Distributed Systems, pp. 12-20, 1991.
[Cron-98] B. Cronquist et al., “Modifications of COTS FPGA Devices for Space Applications”, Proceedings of the MAPLD International Conference, 1998.
[Czaj-05] David R. Czajkowski et al., “SEFI mitigation technique for COTS microprocessors-proton testing demonstration”, Proceedings of the MAPLD International Conference, 2005.
[Darg-05] T. Dargnies et al., “Radiation tolerant and intelligent memory for space”, Proceeding of the MAPLD International conference, 2005.
[DeVa-04] J. DeVale, “Traditional reliability”, WWW Document, http://www.ece.cmu.edu/~koopman/des s99/traditional reliability/, 06-04-2004.
[Dodd-01] P.E. Dodd, M.R Shaneyfelt, E. Fuller, J.C Pickel, F.W. Sextan and P.S. Winokur, “Impact of Substrate thickness on Single-Event Effects in Integrated Circuits”, IEEE Trans. Nucl. Sci. (2001).
[Dres-98] Paul V. Dressendorfer, “Basic Mechanisms for the New Millennium”, IEEE NSREC Short Course, July 1998.
[Dufo-92] C. Dufou et al., “Heavy ion induced single hard errors on submicronic memories”, IEEE Trans. on Nucl. Sci., Vol. 39, No. 6, pp. 1693-1697, December 1992.
[Dyer-03] C. S. Dyer and G. R. Hopkinson, “Space Radiation Effects For Future Technologies and Missions”, Technical Report, 2003. http://reat.space.qinetiq.com/Reat/wp 1 In/Document text.html
[Ecss-03] ECSS, “SpaceWire - Links, nodes, routers and networks”, ECSS-E-50-12A, 2003, http://www.ecss.nl/forums/ecss/ templates/default.htm?targel=http.7/www.ecss.nl/forums/ecss/dispatch.cgi/standards/docProfile/100203/d20030221102504/No/tl 00203 .htm
[Eia-96] EIA/JEDEC Standard, “Test Procedures for the Measurement of Single Event Effects in Semiconductor Devices from Heavy Ion Irradiation”, Electronic Industries Association, Engineering Department, Arlington, 1996.
[Elno-99] M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, “A Survey of Rollback-Recovery Protocols in Message-Passing Systems”, Technical Report CMU-CS-99-148, Carnegie Mellon University, June 1999.
[Esa-03] ESA SpaceWire, “Products and technologies”, 2003, http://www.estec.esa.nl/tech/spacewire/products/#routers
[Estr-93] Estreme, F. et al., “SEU and Latchup Results for SPARC Processors”, IEEE Radiation Effects Data Workshop, 1993.
[Facc-00] Faccio, F., “COTS for the LHC Radiation Environment: The Rules of the game”, The 6th Workshop on Electronics for LHC Experiments, Krakow, Poland, 2000.
[Fisc-87] T. A. Fischer, “Heavy-ion-induced gate rupture in power MOSFETs”, IEEE Trans. on Nucl. Sci., Vol. 34, pp. 1786, 1987.
[Fisc-04] S. Fischer, S. Parkes and G. Kempf, “Spacewire router”, ESA Microelectronics Presentation Days, Noordwijk, Holland, 4-5 February 2004.
[Full-00] Fuller, E. et al., “Radiation Characterization, and SEU Mitigation, of the Virtex FPGA for Space-Based Reconfigurable Computing”, IEEE NSREC Conference, 2000.
[Gais-00] J. Gaisler, “LEON-1 Processor - First Evaluation Results”, Proceedings of the European Space Components Conference, March 2000.
[Gais-02] Gaisler, J., “Suitability of Reprogrammable FPGAs in Space Applications”, ESA Technical Report, 2002.
[Gans-05] J. Ganssle, “Bus cycles”, http://www.ganssle.com/articles/abuscvc.htm, 2005.
[Guas-99] J. R. Guasch and S. Parkes, “From IEEE 1355 high-speed serial link to SpaceWire”, Proceedings of DASIA Conference, 1999.
[Goor-89] A. J. van de Goor, “Computer architecture and design”, Addison-Wesley, 1989.
[Guer-04] S.M. Guertin, J.D. Patterson and D.N. Nguyen, “Dynamic SDRAM SEFI detection and recovery test results”, IEEE Radiation Effects Data Workshop, pp. 62-67, 2004.
[Hill-03] R. Hillman et al., “Space Processor Radiation Mitigation and Validation Techniques for an 1800 MIPS Processor Board”, Proceedings of RADECS Conference, 2003.
[Hens-99] Henson, B.G. et al., “SDRAM Space Radiation Effects Measurements and Analysis”, IEEE Radiation Effects Data Workshop Record, 1999.
[Hodg-99] Hodgart, M.S. and Tiggeler, H., “Introduction to error Correcting Codes in Satellite Applications”, Technical Report, Surrey Space Centre, 1999.
[Howa-01] Howard, J.W. et al., “Total Dose and Single Event Effects Testing of the Intel Pentium III (P3) and AMD K7 Microprocessors”, NASA Publication, 2001.
[Hunt-87] D.B. Hunt and P.N. Marinos, “A General Purpose Cache-Aided Rollback Error Recovery (CARER) Technique”, Proceedings of the 17th International Symposium on Fault-Tolerant Computing Systems, 1987.
[Inte-94] Intel, “8051 user manual”, 1994.
[Inte-96] Intel, “386 EX User Manual”, 1996.
[Inte-97] Intel, “Pentium II processor developer’s manual”, 1997.
[Irom-02] Irom, F. et al., “Single-Event Upset in Commercial Silicon-On-Insulator PowerPC Microprocessors”, IEEE Trans. on Nucl. Sci., Vol. 49, 2002.
[Irom-03] Irom, F. et al., “Single-Event Upset in Evolving Commercial Silicon-On-Insulator PowerPC Microprocessors”, IEEE NSREC Conference, 2003.
[Jack-05] C. Jackson, SSTL, Private Communication.
[John-98] A.H. Johnston, “Radiation Effects in Advanced Microelectronics Technologies”, IEEE Trans. Nucl. Sci., 1998.
[John-00] Johnston, A.H., “Scaling and Technology Issues for Soft Error Rates”, The 4th Annual Research Conference on Reliability, 2000.
[JSC-03] Johnson Space Centre Technical report, “Fault-Detection, Fault-Isolation and Recovery (FDIR) Techniques”, accessed on September, 2003. www.estec.esa.nl/gpqwww/traininq/materials/dfe7.pdf
[Katz-97] Katz, et al., “Radiation effects on Current Field Programmable Technologies”, IEEE NSREC Conference, 1997.
[Katz-99] Katz, R. et al., “Logic design Pathology and Space Flight Electronics”, MAPLD International Conference Proceedings, 1999.
[Kaya-03] S. Kayali, “Space Radiation Effects on Microelectronics”, JPL/NASA Technical Report, http://parts.ipl.nasa.gov/docs/Radcrs Final.pdf
[Koga-85] Koga, R. et al., “Techniques of Microprocessor Testing and SEU rate Prediction”, IEEE Trans. on Nucl. Sci., 1985.
[Koga-91] R. Koga et al., “On the suitability of non-hardened high density SRAMs for space applications”, IEEE Trans. on Nucl. Sci., Vol. 38, No. 6, pp. 1507-1513, December 1991.
[Koga-97] Koga, R. et al., “Single Event Functional Interrupt (SEFI) Sensitivity in Microcircuits”, The 4th European Conference on Radiation and its Effects on Components and Systems (RADECS) Proceedings, 1997.
[Koga-00] R. Koga et al., “SEE Sensitivity of FPGAs with Amorphous Silicon Antifuse”, Presented at the 12th SEE Symposium, Manhattan Beach, Ca, April 2000.
[Koga-01] R. Koga et al., “Permanent Single Event Functional Interrupts (SEFIs) in 128- and 256-megabit Synchronous Dynamic Random Access Memories (SDRAMs)”, IEEE Radiation Effects Data Workshop, 2001.
[Koo-87] R. Koo and S. Toueg, “Checkpointing and Rollback-Recovery for Distributed Systems”, IEEE Trans. on Software Engineering, pp. 23-31, 1987.
[Ksch-91] Kaschmitter, J.L. et al., “Operation of Commercial R3000 Processors in the LEO Space Environment”, IEEE Trans. on Nucl. Sci., Vol. 38, 1991.
[Kurz-84] S.A. Kuzban, T.S. Heines and A. P. Sayers, “Operating system principles”, 2nd Edition, ISBN 0-442-25734-1, Van Nostrand Reinhold Company Inc., 1984.
[Labe-92] Label, K. et al., “SEU Tests of 80386 Based Flight-Computer/Data Handling System and of Discrete PROM and EEPROM Devices and SEL tests of Discrete 80386, 80387, PROM, EEPROM, and ASICs”, IEEE Radiation Effects Data Workshop Record, 1992.
[Labe-96a] Label, K., “Single Event effect Criticality Analysis”, NASA, 1996.
[Labe-96b] Label, K. et al., “Radiation Effect Characterization and Test Methods of Single Chip and Multi Chip Stacked 16Mbit DRAMs”, IEEE Trans. on Nucl. Electronics”, IEEE NSREC Conference, 1996.
[Labe-97] Label, K., “Current Single Event Effect Test Results for Candidate Spacecraft Electronics”, IEEE NSREC Data Workshop, 1997.
[Labe-98] Label, K. et al., “Anatomy of an In-Flight Anomaly: Investigation of Proton-Induced SEE Test Results for the Stacked IBM DRAMs”, IEEE Trans. on Nucl. Sci., Vol 45, 1998.
[Laco-03] Lacoe, R.C., “CMOS Scaling, design Principles and Hardening by Design Methodologies”, IEEE NSREC Short Course, 2003.
[Laud-05] S. Laude, “Customizable microelectronics for aerospace, military and industry”, 2005, http://www,si 1 iconlaude.com/products.html
[Layt-03] Layton, P., “SEE Radiation Test Report”, Maxwell Technologies, 2003.
[Leli-96] A. J. Lelis et al., “Radiation response of advanced commercial SRAMs”, IEEE Trans. on Nucl. Science, Vol. 43, No. 6, pp. 3103-3108, December 1996.
[Lese-05] A. Lesea et al., “The Rosetta Experiment: Atmospheric Soft Error Rate Testing in Differing Technology FPGAs”, IEEE Trans. on Device and materials Reliability, vol. 5, no. 3, pp. 317-328, Sept 2005.
[Lima-02] Lima, F., “Designing Single Event Upset Mitigation Techniques for SRAM-Based FPGAs Devices”, Thesis Proposal, 2002, www.inf.ufrqs.br/~fqlima/lima Iatw03.pdf
[Litz-97] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, “Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System”, Technical Report 1346, Computer Sciences Department, University of Wisconsin-Madison, April 1997.
[Liu-02] J. Liu, K.H. Kim and M.H. Kim, “An analysis of fault detection latency bounds of the SNS scheme incorporated into an Ethernet based middleware system”, IEEE Symposium on reliable Distributed Computing, 2002.
[Ma-89] T.P. Ma and P.V. Dressendorfer, “Ionizing radiation effects in MOS devices and circuits”, Wiley Inter-science, New York, 1989.
[Maki-00] A. Makihara et al., “Analysis of single ion multiple bit upset in high density DRAMs”, IEEE Trans. on Nucl. Sci., Vol. 47, 2000.
[Maqb-04] S. Maqbool and C. Underwood, “An investigation into the suitability of COTS technology for aerospace missions: robust COTS-based architectures - experience and approaches from space-flight”, Spaesrane Project Report No. 2, June 2004.
[Mass-96] Massengill, L.W, “Cosmic and Terrestrial Single Event Radiation Effects in Dynamic Random Access Memories”, Trans. on Nucl. Sci, Vol. 42, 1996.
[Matt-01] Mattsson, S., “Single Event Upset tests of Commercial FPGA for Space Applications”, Workshop on Electronics for LHC Experiments, 2001.
[Miya-98] Miyahira, T. and Swift, G., “Evaluation of radiation effect in flash memories”, Proceedings MAPLD Conference, 1998.
[Miya-03] Miyahira, T. and Swift, G., “Evaluation of Radiation Effect in Flash Memories”, JPL Publication, 2003. http://klabs.org/richcontent/MAPLDCon98/Papers/c4 miyahira.pdf
[Mora-95] Moran, A. et al., “Single Event Effect testing of the Intel 80386 Family and the 80486 Microprocessor”, The 3rd European Conference on Radiation and its Effects on Components and Systems (RADECS) Proceedings, 1995.
[Msdn-05] Microsoft MSDN Library, “Sleep Function”, accessed on 06-01-06, http://msdn.microsoft.com/librarv/default.asp?url=/librarv/en-us/dllproc/base/slecp.asp
[Nasa-03] NASA, “SDRAM Technology”, accessed in July 2003, http ://klabs .or g/riehcontent/Tutorial/new modules/sdram.ndf
[Nasa-05] NASA Office of Logic Design, “A scientific study of the problems of the digital engineering for spaceflight systems, with a view to their practical solution”, accessed in June 2005, http://klabs.org/historv/historv docs/sp-8070/ch2/2n2/2p2pl system organization.htm
[Nguy-02] Nguyen, D.N. and Scheick, L.Z., “SEE and TID of Emerging Non-Volatile 
Memories”, JPL Publication, 2002.
[Nguy-99] Nguyen, D.N. et al., “Radiation effects on advanced Flash Memories”, IEEE 
Trans. on Nucl. Sci., Vol. 46, 1999.
[Ocho-81] A. Ochoa, Jr., and P.V. Dressendorfer, “A discussion of the role of distributed 
effects in latchup,” IEEE Trans. Nucl. Sci., 1981.
[Oldh-03] T. R. Oldham, “How device scaling affects single event effects sensitivity,” 
IEEE NSREC Short Course, July 2003.
[Olso-95] A. Olson, K.G. Shin and B.J. Jambor, “Fault tolerant clock synchronization 
for distributed systems using continuous synchronization messages”, 25th 
International Symposium on Fault-Tolerant Computing, pp. 154-163, 1995.
[Page-05] T. E. Page, Jr. and J. M. Benedetto, “Extreme Latchup Susceptibility in 
Modern Commercial-off-the-Shelf (COTS) Monolithic 1M and 4M CMOS 
Static Random-Access Memory (SRAM) Devices”, IEEE Trans. on Nucl. 
Sci., December 2005.
[Park-99] S. Parkes, “SpaceWire: The Standard”, The DASIA Proceedings, 1999.
[Park-03] S. Parkes, “SpaceWire router requirements specifications”, ESM-006-Specs-1, 
2003.
[Park-05] S. Parkes, University of Dundee, email communication.
[Pasc-05] A. Paschalis and D. Gizopoulos, “Effective software-based self-test strategies 
for on-line periodic testing of embedded processors”, Trans. on Nucl. Sci., 
vol. 24, no. 1, pp. 88-99, January 2005.
[Pate-04] D. Patel et al., “Space qualified radiation hardened FPGAs: a successful 
collaboration continues”, Proceedings of the MAPLD International 
Conference, 2004.
http://klabs.org/mapld04/presentations/session_p/p165_patel_s.pdf
[Pete-82] E. L. Petersen et al., “Calculations of cosmic rays induced soft upset and 
scaling in VLSI devices”, IEEE Trans. on Nucl. Sci., Vol. 29, No. 6, pp. 
2055-2063, December 1982.
[Pete-92] E. L. Petersen et al., “Rate prediction for single event effects - a critique”, 
IEEE Trans. on Nucl. Sci., Vol. 39, No. 6, pp. 1557-1599, December 1992.
[Plan-98] J. S. Plank, K. Li, and M. A. Puening, “Diskless Checkpointing”, IEEE 
Trans. on Parallel and Distributed Systems, October 1998.
[Plan-05] T. Plant, SSTL, Private Communication.
[Poiv-98] C. Poivey et al., “Radiation characterisation of commercially available 
1Mbit/4Mbit SRAMs for space applications”, NSREC Data Workshop, 1998.
[Prab-05] G. M. Prabhu, “Computer architecture tutorial”, 
http://www.cs.iastate.edu/~prabhu/Tutorial/title.html, 2005.
[Rama-88] P. Ramanathan and K.G. Shin, “Checkpointing and Rollback Recovery in a 
Distributed System Using Common Time Base”, Proceedings of the 7th 
Symposium on Reliable Distributed Systems, pp. 13-21, Oct 1988.
[Rash-00] J. Rash, C. Jackson and H. Price, “Internet access to spacecraft”, Proceedings 
of AIAA/USU, http://ipinspace.gsfc.nasa.gov/documents/Small-Sat2000Paper-html/, 2000.
[Rodg-01] Rodgers, J. et al., “Advanced Memories for Space Applications”, IEEE 
Microelectronics Reliability & Qualification Workshop, 2001.
[Rose-05] J. Rosello and S. Parkes, “Animated presentation to SpaceWire”, 
http://www.estec.esa.nl/tech/spacewire/, 2005.
[Skla-88] Sklar, B., “Digital communications: fundamentals and applications”, 
Prentice-Hall, 1988.
[Saks-84] N.S. Saks, M.G. Ancona and J.A. Modola, “Radiation Effects in MOS 
Capacitors with Very Thin Oxides at 80K”, IEEE Trans. Nucl. Sci., 1984.
[Schn-00] R. Schnurr, “GSFC SOMO technology development program, Space 
Internet work area overview”, September 2000, accessed in May 2003, 
http://ipinspace.gsfc.nasa.gov/documents/Schnurr_SOMO_Peer_review.ppt#285,1,Richard Schnurr Electrical Systems Center FY01 Annual Review 27-28 September, 2000.
[Schr-80] J.E. Schroeder, A. Ochoa, Jr., and P.V. Dressendorfer, “Latchup elimination 
in bulk CMOS LSI circuits,” IEEE Trans. Nucl. Sci., 1980.
[Schw-97] Schwartz, H.R. et al., “Single Event Upset in Flash Memories”, IEEE Trans. 
on Nucl. Sci., Vol. 44, 1997.
[Seif-02] Seifert, N. et al., “Impact of Scaling on Soft-Error Rates in Commercial 
Microprocessors”, IEEE Trans. on Nucl. Sci., Vol. 49, 2002.
[Serl-84] O. Serlin, “Fault-Tolerant Systems in Commercial Applications”, IEEE 
Computer, pp. 19-30, August 1984.
[Shi-04] Ying Shi, “Fault tolerance computing - Draft”, WWW Document, 
http://www.ece.cmu.edu/~koopman/des_s99/fault_tolerant/, 06-04-2004. 
[Shir-01] Philip P. Shirvani, “Fault-Tolerant Computing for Radiation Environments”, 
Center for Reliable Computing, Stanford University, California, Technical 
Report, June 2001, http://crc.stanford.edu/crc_papers/CRC-TR-01-6.pdf
[Shiv-02] Shivakumar, P. et al., “Modeling the Effect of Technology Trends on Soft 
Error Rate of Combinational Logic”, International Conference on 
Dependable Systems and Networks (DSN), 2002.
[Sied-95] C. Seidleck, “Single event effect flight data analysis of multiple NASA 
spacecraft and experiments; implications to spacecraft electrical designs”, 
Proc. RADECS, pp. 581-587, September 1995.
[Sied-02] C. Seidleck et al., “Test methodology for characterizing the SEE response of 
a commercial IEEE 1394 serial bus (FireWire)”, IEEE Trans. on Nucl. Sci., 
vol. 49, no. 6, 2002.
[Silb-98] A. Silberschatz and P. Galvin, “Operating system concepts”, 5th Ed., Addison 
Wesley Longman Inc., 1998.
[Sori-02] D.J. Sorin, “Using lightweight checkpoint/recovery to improve the 
availability and designability of shared memory multiprocessors”, PhD 
Thesis, University of Wisconsin, Madison, USA, 2002, 
http://www.cs.wisc.edu/multifacet/theses/daniel_sorin_phd.pdf
[Sstl-05] SSTL Website, “On-board data handling: OBC 386”, 
www.sstl.co.uk/documents/On-board%20Data%20Handling%20OBC%20386.pdf, 2005.
[Stak-01] P. H. Stakem, “Flight Linux project”, Technical Report, QSS, Inc, May 2001, 
accessed in January 2004, 
http://flightlinux.gsfc.nasa.gov/docs/onboard_LAN.html 
[Swif-01] Swift, G.M. and Guertin, S.M., “In-flight Observations of Multiple Bit Upset 
in DRAMs”, IEEE Trans. on Nucl. Sci., Vol. 47, 2001.
[Summ-06] G. Summers, Dynex Semiconductors, email communication, February 2006.
[Toma-05] J. E. Tomayko, “Computers in Spaceflight: The NASA Experience”, 
NASA Contractor Report, Updated: July 15, 2005.
[Unde-94] C.I. Underwood, D.J. Brock, P.S. Williams, S. Kim, R. Dilao, P.R. Santos, 
M.C. Brito, C.S. Dyer, and A.J. Sims, “Radiation environment 
measurements with the cosmic-ray experiments on-board the KITSAT-1 and 
PoSAT-1 micro-satellites”, IEEE Trans. on Nucl. Sci., vol. 41, no. 6, pp. 
2353-2360, December 1994.
[Unde-99] C.I. Underwood, “Observations on the reliability of COTS-device-based 
SSDRs operating in low earth orbit”, Fifth European Conference on 
Radiation and its Effects on Components and Systems (RADECS), pp. 387-393, 
September 1999.
[Unde-96] C. I. Underwood, “Single Event Effects in Commercial Memory Devices in 
the Space Radiation Environment”, PhD Thesis, University of Surrey, 1996.
[Unde-02] C.I. Underwood, A. da Silva Curiel, and M.N. Sweeting, “In-orbit 
monitoring of ‘Space Weather’ and its effects on commercial-off-the-shelf 
(COTS) electronics - a decade of research using micro-satellites”, Paper 
presented at the 53rd International Astronautical Congress, October 10th - 19th, 
2002.
[Vela-92] Velazco, R. et al., “SEU Testing 32-bit Microprocessors”, IEEE Radiation 
Effects Data Workshop Record, 1992.
[Wake-01] Wakerly, J.F., “Digital design Principles & Practices”, 3rd Edition, Prentice 
Hall, 2001.
[Walk-01] P. Walker, “Fault-Tolerant FPGA-Based Switch Fabric for SpaceWire: 
Minimal loss of ports and throughput per chip lost”, Proceedings of the 
MAPLD International Conference, 2001.
[Wang-95] Y. M. Wang et al., “Checkpointing and Its Applications”, Proceedings of the 
25th International Symposium on Fault-Tolerant Computing Systems, pp. 22-31, 
June 1995.
[Wang-97] Y. M. Wang, E. Chung, Y. Huang, and E.N. Elnozahy, “Integrating 
Checkpointing with Transaction Processing”, Proceedings of the 27th 
International Symposium on Fault-Tolerant Computing Systems, pp. 304-308, 
June 1997.
[Webp-05] Webopedia, “What is access time”, access-time, 
www.webopedia.com/TERM/a/access_time.html
[Woer-98] D. Woerner, “X2000 systems and technologies for missions to the outer 
planets”, 49th International Astronautical Congress, Melbourne, Australia, 
September 1998.
[Wood-04] A. Woodroffe and P. Madle, “Application and Experience of CAN as a Low 
Cost OBDH Bus System”, Proceedings of the MAPLD International 
Conference, 2004.
[Wood-05] A. Woodroffe, SSTL, Private Communication.
[Wu-90] K. Wu, W. K. Fuchs, and J. H. Patel, “Error Recovery in Shared Memory 
Multiprocessors Using Private Caches”, IEEE Trans. on Parallel and 
Distributed Systems, pp. 231-240, April 1990.
[Xili-00] Xilinx Documentation, “JTAG programmer guide”, 2000.
[Xili-04] Xilinx, “XTMR Tool, the first TMR development tool for reconfigurable 
FPGAs”, 2004, 
http://www.xilinx.com/esp/mil_aero/collateral/tmrtool_sellsheet_wr.pdf
APPENDIX A: LIST OF PUBLICATIONS
Papers
1. S. Maqbool and C. Underwood, “A Cost-Effective Robust Data Handling Architecture 
for Space Missions” Poster presented at IEEE Nuclear and Space Radiation Effects 
Conference, USA, July 2004
2. S. Maqbool and C. Underwood, “Mitigating SEFIs in Data Handling Architectures: A 
Case Study”, Proceedings of the Data Systems In Aerospace (DASIA) Conference, 
June 2005
3. S. Maqbool and C. Underwood, “System-Level Mitigation of SEFIs in Data Handling 
Architectures, A Solution for Small Satellites”, Proceedings of the 19th Annual 
AIAA/USU Conference on Small Satellites, USA, August 2005. This paper was 
awarded an honorable mention with a prize of 2000 USD.
4. S. Maqbool and C. Underwood, “Evaluation of System-Level Supervisory Approach 
for SEFIs Mitigation”, Proceedings of the MAPLD International Conference, USA, 
September 2005
5. S. Maqbool and C.I. Underwood, “Checkpointing state recovery for on-board 
computers”, Accepted for the MAPLD International Conference, USA, September 2006
6. S. Maqbool and C. Underwood, “Latency Bound Estimation for the System Level 
Supervisory SEFI-Tolerance Strategy”, To be presented at IEEE Aerospace 
Conference, USA, March 2007
Technical Reports
1. S. Maqbool and C. Underwood, “An Investigation into the Suitability of COTS 
Technology for Aerospace Missions: Device Trends and Mitigation Strategies”, 
Spaesrane Project Report No. 1, March 2004
2. S. Maqbool and C. Underwood, “An Investigation into the Suitability of COTS 
Technology for Aerospace Missions: Robust COTS-Based Architectures -  Experience 
and Approaches from Space-Flight”, Spaesrane Project Report No. 2, June 2004
3. C. Underwood and S. Maqbool, “Using Commercial Off The Shelf (COTS) Digital 
Integrated Circuits (ICs) in a Low Earth Orbit (LEO) Radiation Environment”, Final 
Report for QinetiQ Contract CU009-037428, September 2004
4. S. Maqbool, “Technologies and Issues”, Surrey Space Centre Internal Report, 2003
5. S. Maqbool, “Investigation into Cost-Effective Robust Data Handling Architectures for 
Small Satellites”, Surrey Space Centre Internal Report, 2003
APPENDIX B: SOFTWARE CODE
[Portion of the scanned code listing is illegible.]
static unsigned 1 Packet = 0;
unsigned 8 Data1;
unsigned 11 Index1, Index2;
unsigned 16 crc;

}
while (1)
{
    if (Data1 == 0xff)
[Portion of the scanned code listing is illegible.]
/* Send Packet Bytes */
macro expr ParallelPort = PalRC200ParallelPort;
macro expr Ethernet = PalEthernetCT (0);
PalEthernetWriteBegin (Ethernet, Destination, Type, DataByteCount, &Error);
if (Error == 0)
[Portion of the scanned code listing is illegible.]
PalEthernetWrite (Ethernet, Data, &Error);
Data1 = 0;
Counter0++;
Index1 = 0;
}
[Portion of the scanned code listing is illegible.]
DataByteCount = Index2;
Counter1 = 0;
Destination = 0x00b0d065eba6;
Type = 0x0800;

/*
 * Textual version of messages returned by RC200TestPort()
 */
static TCHAR *RC200TestPortMsg[5] =
{
[Portion of the scanned code listing is illegible.]
CloseHandle (my_thread);
[Portion of the scanned code listing is illegible.]
