Future planetary exploration missions demand significant advances in on-board computing capabilities over current avionics architectures based on a single-core processing element. The state-of-the-art multi-core processor provides much promise in meeting such challenges while introducing new fault tolerance problems when applied to space missions. Software-based schemes are being presented in this paper that can achieve system-level fault mitigation beyond that provided by radiation-hard-by-design (RHBD). For mission and time critical applications such as the Terrain Relative Navigation (TRN) for planetary or small body navigation, and landing, a range of fault tolerance methods can be adapted by the application. The software methods being investigated include Error Correction Code (ECC) for data packet routing between cores, virtual network routing, Triple Modular Redundancy (TMR), and Algorithm-Based Fault Tolerance (ABFT). A robust fault tolerance framework that provides fail-operational behavior under hard real-time constraints and graceful degradation will be demonstrated using TRN executing on a commercial Tilera® processor with simulated fault injections.
I. Introduction
Planetary exploration missions conducted in the past four decades were implemented using single-core processing elements within the avionics architecture to meet the in-flight computational needs. The functionality implemented for such missions includes on-board vehicle guidance, navigation, control, uplink commanding, downlink telemetry processing, and many other infrastructure services for flight applications. This traditional approach, generally based on a multi-threaded software architecture on a single-core processor, is well proven by the many successful missions using this design. However, this design is rapidly approaching a key technical branch point with advances in several technologies that can offer significant benefits over the current approach. One such technology for flight applications is the multi-core processor such as the commercially available Tilera® (8x8) processor and the radiation-hardened version called the Maestro (7x7) processor. Other competing technologies consisting of in-flight reprogrammable FPGAs and ASICs integrated in a heterogeneous avionics architecture using standard compute elements are also strong contenders (though we will not address them in this paper). We will focus the discussions on the results of our research in applying multi-core processors for flight.
II. Multi-core Processor Architecture
The commercially available Tilera® processor is described in detail in the vendor-published documentation [1]. It is a full-featured processor consisting of an 8x8 array of processing tiles, which allows users to run existing C and C++ software on any individual tile under a variety of operating systems such as Linux and VxWorks. Each tile in the array consists of a 32-bit Processing Element (PE) and a data router, or switch, engine with five channels for routing data packets between tiles and the I/O subsystems. A graphical representation of the Tile64 processor is provided in Fig. 1. The five channels are labeled Static Tile Network (STN), User Dynamic Network (UDN), Memory Dynamic Network (MDN), I/O Dynamic Network (IDN), and Tile Dynamic Network (TDN). The UDN is the only user-accessible network for routing information between tiles. Each data packet contains a header consisting of the destination tile address, the packet size, and a tag word, as in Fig. 2. The tag word specifies routing of packets to different message queues at the receiver PE.
III. Technical Challenges of Future Planetary Missions
The challenges of future planetary missions involving entry-descent-landing as well as small body (e.g., asteroids) proximity operations are multi-faceted. These challenges include the need for on-board image data processing for precision guidance and navigation, fail operational requirements during critical mission phases, fail operational with graceful degradation for the less critical functions, energy optimization, fault tolerance design to achieve overall robustness of the system, and enhanced autonomy with minimal ground operations support. Each of these attributes can be satisfied by the application of the multi-core processor as the primary compute element within the avionics architecture.
In theory, a multi-core processor offers processing power that scales with the number of computing nodes, so an application that can be deployed and processed in parallel can take advantage of the architecture to improve performance substantially. The Terrain Relative Navigation (TRN) application (Section IV), currently being developed at JPL for vehicle guidance and navigation during the planetary entry, descent, and landing (EDL) phase, is designed with this feature in mind. An equally important requirement of the EDL function is fail-operational behavior. Because the vehicle dynamics change rapidly during descent, the EDL function must continuously track and control the vehicle states, which implies that the software must be robust enough to operate without failing throughout this critical mission phase. Catastrophic failure can result if this criterion is not met. The multi-core processor provides redundancy by using several cores to execute identical EDL algorithms concurrently, achieving a Triple Modular Redundancy (TMR) design. The fail-operational criterion can thus be satisfied without the mass and power penalty incurred by a single-core processor design. Fail-operational behavior with graceful degradation is also supported by the parallel processing of the imaging data. If a single core encounters an error while processing a sub-frame of the imaging data, that result can be discarded on the expectation that the remaining cores will continue to provide valid and sufficient data for landmark identification in the current image frame. Thus, varying degrees of redundancy, depending on the criticality of the software functions at different mission phases, can be achieved through appropriate software design.
Another benefit of multi-core is the feasibility of shutting down unused cores and the corresponding algorithms to conserve energy during the less computationally demanding phases of the mission. An energy management function can be envisioned to achieve this capability.
As a caveat, the above discussion in fault tolerance design is applicable only for errors occurring at the core level. Systematic errors occurring at the chip or board level will not be mitigated by the approaches as presented. For such faults, redundancy at the chip or board level will be required. This scenario will not be addressed in this paper.
IV. Terrain Relative Navigation Application Descriptions
Terrain Relative Navigation (TRN) is an application that estimates spacecraft position relative to a target that can be modeled in an a priori reference map, as shown below (Fig. 3 is extracted from Ref. 4). Determining the spacecraft position relative to the landing site is an enabling function for planetary landing and autonomous primitive body exploration, since the information can be used both to support landing maneuvers and to avoid known hazards. TRN is implemented as a standalone sensor that integrates a wide field of view (FOV) camera, an Inertial Measurement Unit (IMU), a high performance multi-core processor, and data processing algorithms. The TRN Sensor combines the gyro and accelerometer measurements from the IMU and the images from the camera (correlated to the a priori reference map) to determine, relative to the reference map, the six degree of freedom (6-DOF) position and attitude as well as the map relative velocity. For the needed accuracy, the TRN image correlation function selects and matches numerous landmarks (40 to over 200) from the map and images and, as a result, drives the processor requirements. Fortunately, the image correlation function has a "natural" parallelization, with different map regions and selected image features processed independently on different cores (our application requires roughly 40 Tilera® cores for image processing). The IMU data and the results from the image correlations are processed by a navigation filter to estimate the 6-DOF solution and map relative velocity, as well as to propagate the 6-DOF solution using IMU-only data between image updates or over very short image data outages. The navigation filter has computationally modest requirements and, on the multi-core processor, requires only a single core. Because the navigation filter requires only a single core but is critical for maintaining state information, it was a natural candidate for TMR, as illustrated in Fig. 4.
TRN is typically used in mission critical phases, such as landing on Mars, where processor induced resets or errors are fatal to the mission success, making fault tolerance a requirement. Fig. 4 shows how landmarks and image correlation are split among multiple cores and the navigation filter can be implemented in Triple Modular Redundancy. Each landmark-to-map correlation will result in a camera unit vector (w) and camera-to-map vector (V, in map coordinates). Each core will process several (2 to 10) landmarks per image. 
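The split of landmarks across cores, and the graceful degradation it permits, can be sketched as follows. This is an illustrative sketch only: the function names, the round-robin assignment, and the `min_landmarks` threshold are assumptions made for the example and are not taken from the TRN implementation.

```python
def assign_landmarks(landmarks, cores):
    """Distribute landmarks round-robin over the available cores (assumed policy)."""
    plan = {core: [] for core in cores}
    for i, landmark in enumerate(landmarks):
        plan[cores[i % len(cores)]].append(landmark)
    return plan


def collect(results, min_landmarks):
    """Merge per-core correlation results, discarding cores that reported an error.

    `results` maps core -> list of matches, or None if that core failed.
    Returns the merged list, or None if too few matches survive
    (min_landmarks is a hypothetical threshold).
    """
    good = [match
            for per_core in results.values() if per_core is not None
            for match in per_core]
    return good if len(good) >= min_landmarks else None
```

With ten landmarks on three cores, losing one core still leaves seven matches, which passes a threshold of five but not one of eight.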
V. Software Based Fault Tolerance Methods
A prior study on a multi-core processor (Ref. 2) identified several shortcomings that can impact overall system reliability when the processor is used for applications such as TRN. To achieve the robustness required for flight missions, our research focuses on defining software methods that can be applied to mitigate such deficiencies. The software-based fault tolerance methods for the multi-core fault modes are summarized as follows.
A. Network Routing Faults
A fault occurring in the switching network and buffers can result in data packets being routed to the wrong destination or queue, as well as the data content being corrupted during transit. To mitigate this problem, a checksum is added to the first two words of each packet for routing protection. This is intended to protect the information regarding the destination of the packet. The receiving core will process the first two words for error detection and, if necessary, send an appropriate message to the sending core requesting a retry. An additional checksum word is also added as the last word of each data packet to protect the data content; the receiving core can request a resend if an error is detected.
This software based error checking scheme is applied only to the User Dynamic Network (UDN) for data routing. The MDN and the TDN which are controlled and managed entirely by the system hardware cannot be protected by this proposed software method.
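A minimal sketch of the two-checksum idea follows. The paper does not specify the checksum algorithm or the exact word layout, so the additive 32-bit checksum, the packet layout (two routing words, one routing checksum word, payload, one data checksum word), and all function names are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def checksum(words):
    """Assumed 32-bit additive checksum: the words plus the checksum sum to zero."""
    return (-sum(words)) & MASK32


def wrap(dest, length_tag, payload):
    """Build [dest, length_tag, routing checksum, payload..., data checksum]."""
    routing = [dest & MASK32, length_tag & MASK32]
    return routing + [checksum(routing)] + list(payload) + [checksum(payload)]


def check(packet, my_id):
    """Classify a received packet as 'ok', 'retry_routing', or 'retry_data'."""
    routing, routing_sum = packet[:2], packet[2]
    payload, data_sum = packet[3:-1], packet[-1]
    if checksum(routing) != routing_sum or routing[0] != my_id:
        return "retry_routing"   # destination information cannot be trusted
    if checksum(payload) != data_sum:
        return "retry_data"      # content corrupted in transit
    return "ok"
```

The receiver would translate either retry result into a resend request to the sending core.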
B. Failed Network Switching Engine
When a switching engine fails permanently, the native data routing scheme between any source and destination core with the failed core on the routing path is not functional. A virtual network routing scheme is proposed that will bypass the failed core to achieve routing. In addition, a utility tool is proposed that can be executed in the background for detecting failed operations of the switching engines in order to invoke the virtual networking scheme.
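One way to picture the bypass is as a path search over the tile mesh that excludes tiles known to have failed. The paper does not describe how the virtual route is chosen, so the breadth-first search below, the `(x, y)` tile coordinates, and the `route` function are assumptions for illustration; in practice each intermediate tile on the returned path would forward the packet in software.

```python
from collections import deque


def route(src, dst, failed, size=8):
    """Shortest tile-to-tile path on a size x size mesh that avoids failed tiles.

    Returns a list of (x, y) tiles from src to dst, or None if no path exists.
    """
    if src in failed or dst in failed:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        tile = queue.popleft()
        if tile == dst:
            path = []
            while tile is not None:       # walk back to the source
                path.append(tile)
                tile = parent[tile]
            return path[::-1]
        x, y = tile
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in failed and nxt not in parent):
                parent[nxt] = tile
                queue.append(nxt)
    return None
```

For example, with tile (1, 0) failed, a packet from (0, 0) to (2, 0) takes a five-tile detour through the neighboring row instead of the direct three-tile path.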
C. Failed Processing Core
To detect a failed processing core, a utility tool executing a specific algorithm in the background at each core is proposed. The resulting output is compared against its immediate neighbors by a voting scheme for error detection. A consistently failed core will be recorded and taken out of operation. A redundant core can then be initialized with the data routing path re-configured for replacement.
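The neighbor comparison can be sketched as a local majority vote over self-test outputs. The data structures and the strict-majority rule below are assumptions for the example; the paper does not specify the test algorithm or the voting rule.

```python
from collections import Counter


def suspect_cores(results, neighbors):
    """Flag cores whose self-test output loses a strict majority vote
    among itself and its immediate neighbors.

    results:   core -> self-test output
    neighbors: core -> list of adjacent cores
    """
    suspects = set()
    for core, value in results.items():
        votes = Counter(results[n] for n in neighbors[core] if n in results)
        votes[value] += 1
        winner, count = votes.most_common(1)[0]
        # Only flag when the disagreeing value is clearly outvoted (no ties).
        if value != winner and count > sum(votes.values()) // 2:
            suspects.add(core)
    return suspects
```

To match the "consistently failed" criterion in the text, a caller could count how many consecutive rounds a core appears in the returned set before retiring it.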
The above are failure mechanisms, and solutions, introduced by multicore systems. We use similar methods to provide fault tolerance for applications running on appropriately protected multicore machines. The methods we will use are as follows.
D. Algorithm Based Fault Tolerance Methods
The image processing function of the TRN application, which correlates the stored map against the imaging data, must perform large matrix operations. An algorithm-based fault tolerance method for error detection and correction of matrix operations is essential to enhance the robustness of such an application. The algorithm reported in Ref. 3 is well suited to this purpose and is being implemented in this design.
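To illustrate the general idea of algorithm-based fault tolerance for matrix operations, the sketch below uses the well-known row and column checksum scheme for matrix multiplication. It is not claimed to be the algorithm of Ref. 3; the function names and the `fault` injection parameter are assumptions, and integer matrices are used so that checksums compare exactly (floating-point data would need a tolerance).

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def abft_multiply(a, b, fault=None):
    """Multiply A and B with checksums; detect and correct one bad element of C.

    fault: optional (row, col, delta) applied to the result to simulate an upset.
    Returns (C, error_detected).
    """
    a_c = a + [[sum(col) for col in zip(*a)]]      # A with a column-sum row appended
    b_r = [row + [sum(row)] for row in b]          # B with a row-sum column appended
    full = matmul(a_c, b_r)                        # C bordered by its checksums
    if fault:
        i, j, delta = fault
        full[i][j] += delta
    n, m = len(a), len(b[0])
    bad_rows = [i for i in range(n) if sum(full[i][:m]) != full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(full[i][j] for i in range(n)) != full[n][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:  # single error located at (i, j)
        i, j = bad_rows[0], bad_cols[0]
        full[i][j] += full[i][m] - sum(full[i][:m])
    return [row[:m] for row in full[:n]], bool(bad_rows or bad_cols)
```

The intersection of the one inconsistent row and the one inconsistent column locates the corrupted element, and the row checksum supplies the correction.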
E. Triple Modular Redundant Methods (TMR) for Fail Operational
The navigation filter of the TRN application is responsible for estimating vehicle states during the critical EDL phase of the mission. To meet the fail-operational requirement, the TMR method is implemented by executing three identical filters concurrently on three separate cores, all processing the same input data. The three outputs are compared via a voting scheme for error detection. A mismatched output identifies the failed core, which is then taken out of operation. Vehicle control can be maintained by the remaining two cores, which produce matching output. A redundant core can then be initialized and the data path reconfigured to re-establish the TMR configuration.
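The voting and replacement steps can be sketched as below. The function names, the exact-equality comparison, and the representation of cores as list entries are assumptions for illustration; a real filter would compare floating-point states within a tolerance.

```python
def tmr_vote(outputs):
    """Majority-vote three replica outputs.

    Returns (voted value, index of the disagreeing replica or None).
    """
    a, b, c = outputs
    if a == b == c:
        return a, None
    if a == b:
        return a, 2
    if a == c:
        return a, 1
    if b == c:
        return b, 0
    raise RuntimeError("no two replicas agree")


def tmr_step(filter_fn, data, active, spares):
    """Run one filter update on three cores, vote, and swap in a spare if needed.

    filter_fn(core, data) stands in for running the filter on a given core.
    """
    value, failed = tmr_vote([filter_fn(core, data) for core in active])
    if failed is not None and spares:
        active[failed] = spares.pop(0)   # re-establish the TMR configuration
    return value
```

The voted value is still produced in the step where the fault occurs, which is the fail-operational property; the spare only restores full redundancy for later steps.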
F. Hierarchical Fault Management Architecture
The basic principle of this fault management architecture is that the processes at each layer are arranged hierarchically, each with a set of child processes at its disposal to enable or invoke a defined scope of fault detection and response methods. A child process that can no longer resolve a fault will trigger the next upper-level process to take appropriate action within its domain by invoking an alternative, available fault mitigation response. Conversely, a request for increased reliability arriving at an upper-level manager causes that manager to direct its child processes to implement an appropriate strategy. This hierarchical architecture provides a structured approach to managing fault detection, fault isolation, and fault response.
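The upward escalation path can be sketched as follows. The `Supervisor` class, its fields, and the representation of responses as callables returning True on success are hypothetical and serve only to illustrate the principle.

```python
class Supervisor:
    """Hypothetical fault manager with local responses and an optional parent."""

    def __init__(self, name, responses, parent=None):
        self.name = name
        self.responses = responses   # callables: fault -> True if resolved
        self.parent = parent

    def handle(self, fault):
        """Try local responses in order; escalate to the parent if none succeeds.

        Returns the name of the supervisor that resolved the fault, or None.
        """
        for respond in self.responses:
            if respond(fault):
                return self.name
        if self.parent is None:
            return None
        return self.parent.handle(fault)
```

A filter-level supervisor might resolve a transient bit flip itself but pass a dead core up to the application-level supervisor, which has spare cores at its disposal.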
VI. Detailed Design Descriptions of Fault Tolerance Methods

A. Augmenting Data Packets to Correct Errors in Data and Network Routing
The Tile Processor™ from Tilera® connects an 8 x 8 grid of processor cores with five networks. The User Dynamic Network (UDN) is provided for applications to efficiently exchange data between cores. As mentioned in the previous section, an event could corrupt the data in a packet or the destination in the packet header, in which case the UDN would deliver the data to the wrong core. We implemented software-based error correction using Hamming codes for handling a single event affecting the data or destination.
UDN packet headers include x-y core destination coordinates, the length of the packet, and a tag. In order to protect this data, we wrap the C API for UDN messaging to add a word including a Hamming error correction code (ECC). Because the header is stripped away by the receiving core, we add duplicate destination and data length fields (along with the ECC) in our own header as part of the UDN packet payload data as shown in Fig. 5 . Because error correction depends on the length of the data, we must provide a separate ECC for the data length.
The first two words are the original UDN header (Fig. 2 ) that is stripped away upon delivery to the destination tile. The next three words are our header used for error correction. An index is used instead of coordinates for the destination to provide abstraction for the available subset of cores. With this format a packet will be delivered to its intended destination assuming only a single bit error in the packet. If the destination coordinates in the original header are corrupt, and the packet is delivered to a different core, a listener corrects the packet using the ECCs, sees that the destination index does not match the index of the core, and resends the packet to the intended core.
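A minimal sketch of single-error correction on one header word follows. The paper does not give the exact code parameters or field layout, so the 32-bit word size, the separately stored ECC value, and the function names (including `misdelivered_to`) are assumptions. The code is a standard Hamming construction in which the ECC is the XOR of the codeword positions of all set data bits.

```python
def _positions(nbits):
    """1-based codeword positions that hold data bits (powers of two are parity)."""
    pos, p = [], 1
    while len(pos) < nbits:
        if p & (p - 1):              # skip 1, 2, 4, 8, ...
            pos.append(p)
        p += 1
    return pos


def hamming_ecc(word, nbits=32):
    """ECC value: XOR of the positions of all set data bits."""
    ecc = 0
    for i, p in enumerate(_positions(nbits)):
        if (word >> i) & 1:
            ecc ^= p
    return ecc


def hamming_correct(word, ecc, nbits=32):
    """Repair a single flipped bit in either the word or its ECC."""
    syndrome = hamming_ecc(word, nbits) ^ ecc
    if syndrome == 0:
        return word, ecc
    pos = _positions(nbits)
    if syndrome in pos:                       # a data bit was flipped
        return word ^ (1 << pos.index(syndrome)), ecc
    return word, ecc ^ syndrome               # a parity bit was flipped


def misdelivered_to(my_index, dest_index, dest_ecc):
    """Return the index to forward to, or None if the packet is for this core."""
    dest, _ = hamming_correct(dest_index, dest_ecc)
    return None if dest == my_index else dest
```

A listener on the receiving core would call `misdelivered_to` on the duplicated destination field and resend the packet when the result is not None. Multi-bit errors are outside the scope of this sketch.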
As discussed next, we use software to determine if a core has failed. There are several methods available for detecting failed cores. On a tiled multicore, where each core C generally has two or more neighbors, those neighbors can ask C to compute a task and check the result. Voting on the result should give a reliable indication of which core has failed, assuming independent failures.
E. Hierarchical Fault Management Architecture
The above sections cover facets we have built and are working. The current section is our plan and the direction we are pursuing. The research work is still in progress.
Resilience is the ability to continue in spite of setbacks. To illustrate, a failure implies a fault somewhere in the system. When the attitude control system supervisor discovers a failure, it selects an alternative method of estimating attitude. This may involve changing sensors or combinations of sensors to avoid the problem. This is an example of finding an alternative way to carry out the original work. A supervisor may have many alternative methods and it is the supervisor's job to work through the alternatives to find one that brings success.
Recall that TMRing a thread function requires repeatability, and that "no shared memory" ensures that. This same property aids in fault protection as well, since it means that when an error occurs, and the supervisor responds, the error propagates no further. This is fault containment.
We adopt from Ref. 5 a hierarchy of threads, but apply it to implement policy-based commanding (Fig. 5 ). For example, power control and reliability control are elements of the system that supervisors are assigned to enforce. The top-level supervisor for the TRN application is in charge of two lower-level supervisors, which in turn are in charge of the vision threads and the filter threads. When a reliability constraint arrives at the top-level supervisor, that supervisor in turn asks the filter supervisors to, say, increase their reliability. There may be several means of doing so, but for our purposes here we suppose the supervisor's response is to change the filter from a single thread to a TMR of filter threads. That supervisor will create the new threads and hook them up (connect their inputs and outputs appropriately) and assign the new threads to processors. The supervisor's final step is to stop the single filter and install the TMR'd filter in its place. The replacement occurs in real-time. This is the scenario we are headed towards.
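The scenario above, in which a reliability request flows down the hierarchy and the filter supervisor replaces a single filter with a TMR set, might look like the following sketch. Since this part of the work is still planned, every class, method, and the "high" reliability level here is hypothetical.

```python
class FilterSupervisor:
    """Hypothetical supervisor that owns the navigation filter threads."""

    def __init__(self, free_cores):
        self.free_cores = list(free_cores)
        self.active = [self.free_cores.pop(0)]       # start with a single filter

    def set_reliability(self, level):
        """Switch to TMR when 'high' reliability is requested; report success."""
        if level == "high" and len(self.active) == 1:
            if len(self.free_cores) < 3:
                return False                          # not enough cores for TMR
            tmr = [self.free_cores.pop(0) for _ in range(3)]  # create and place threads
            retired, self.active = self.active, tmr   # install TMR in place of the single filter
            self.free_cores.extend(retired)           # the old core becomes a spare
        return True


class TopSupervisor:
    """Hypothetical top-level supervisor that forwards policy to its children."""

    def __init__(self, children):
        self.children = children

    def set_reliability(self, level):
        return all([child.set_reliability(level) for child in self.children])
```

Building the TMR set before retiring the single filter mirrors the ordering described in the text, where stopping the original filter is the supervisor's final step.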
VII. Remaining and Future Work
The material reported in this paper represents interim results of the research which is still on-going at JPL. The methods described in section F will be fully designed and implemented for adaptation in the TRN application. A key objective of this research is to conduct a performance measurement and quantitative analysis of the design when implementation is completed. Various performance metrics will be measured and analyzed using the profiling tool provided by the Tilera® development environment and by code instrumented in the software. Such performance analysis will benefit future missions in system design when using the multi-core processor for this class of missions. Software simulated faults will be injected in the system during demonstration for triggering the intended response by the methods.
