Abstract
Missions involving robotic space flight typically have a way to change the software that controls the flight system, or some part of it, such as an instrument, after launch. Usually this is accomplished by uplinking small sets of binary machine instructions and writing them to known locations in memory. We present an approach, used on the Aquarius mission, that involves replacing running components of, or adding components to, the running software at a higher logical level, specifically at the software architecture level, and on the C++ rather than machine-language level. This approach provides significant advantages in flexibility, robustness, reliability, and testability. We present the component-based flight software (FSW) design features that enable these capabilities. We then discuss the approach used to verify the robustness and reliability of these techniques, and finally describe usages to date.
INTRODUCTION
Robotic space missions need to modify FSW after launch for various reasons: to fix a software bug that is discovered in operations, to accommodate hardware changes or failures that necessitate different FSW behavior, to take advantage of unanticipated science opportunities, and frequently because the FSW was simply not complete at launch.
Modifications can be made to the running software, or to a copy in spacecraft persistent storage, in which case they take effect only when the modified version is loaded during a reset or reboot. In this paper, we focus on modifications to the running software.
1-4244-1488-1/08/$25.00 © 2008 IEEE. IEEEAC paper #1450, Version 13, Updated November 27, 2007.
Modifications to the running software can be made in most spacecraft flight systems and also in the FSW of instruments that have CPUs. These modifications are usually accomplished by uplinking files containing binary machine instructions, typically a modified version of a routine that is already present in the running FSW, and then overwriting the original version of the function.
Such binary code patches are non-persistent; if the spacecraft or instrument reboots, the original version of the FSW is reloaded into RAM from some form of persistent memory, and the modifications are lost. This is also true of the approach on Aquarius which we present here.
Our FSW architecture is an object-oriented, component-based architecture, implemented in C++. Each component in the design plays a specific role, such as processing all commands and telemetry to/from a particular subsystem, or providing any needed interfaces to a particular piece of hardware. Some components play an infrastructure type of role, such as dispatching commands to other components, or collecting telemetry from other components and formatting it for transmission.
A run-time modification in our approach means either replacing one of the running components with a new version, or adding an entirely new component, one that doesn't fulfill any of the existing roles of the architecture. A replacement component may be only slightly different from the old version, or it can be entirely different in terms of the behavior it exhibits, possibly containing many new classes and functions, and handling new commands, producing new telemetry, or spawning new threads. This flexibility makes FSW modification a tremendously powerful technique for handling unexpected situations.
Our patching approach does not require the ops team to have knowledge of the layout of the code in the flight computer memory, because we use the operating system's loader to place the new machine code in memory and to resolve references that the new code makes to functions in the existing on-board code base. The patching process, and the new component itself, can therefore be tested on target machines with different memory layouts or sizes, or even on Unix workstations. This makes the patching process much more robust and reliable. At least one mission has suffered serious consequences because of poking an incorrect address, and this would be virtually impossible with our approach.
Aquarius is a compound instrument consisting of a radiometer, scatterometer, and command and data handling subsystem, the heart of which is a PowerPC processor that runs the FSW discussed in this paper. The requirements on the software include communicating with the spacecraft through 1553 and high-speed serial buses, commanding, monitoring, and receiving data from the two subinstruments, radar data reduction, and temperature monitor and control. We present the FSW design in general terms, leaving out any detail that isn't directly related to the modification capability. A detailed discussion of the FSW architecture in general is given in [1]. See [2] for an overview and mission-level details of Aquarius.
The next two sections are aimed at FSW engineers who are interested in understanding the details of how our techniques work. Those more interested in the operational and system aspects of this patching approach may prefer to skip to the Verification or Applications section.
Unified Modeling Language (UML) diagrams are used throughout. The diagrams are fairly intuitive, but readers wanting to understand the fine points of this paper should be familiar with the UML. A concise and readable introduction to the language is given in [4].
MODIFIABLE FSW ARCHITECTURE
Understanding our running FSW modification approach depends not on the details of the Aquarius FSW architecture, but on the concepts and features of the architecture that are driven by the modification requirements. We present a view of the architecture that concentrates on those aspects. We begin with a run-time view of the FSW in operation, shown in Figure 1. In this diagram, we have an Input component that plays a role of routing incoming data (not shown) to a Command component, which then distributes commands to other components. Some of the components produce telemetry, which is sent to the Telemetry component, which then sends some form of the collected telemetry data to the Output component, from which it leaves the system (not shown).
Now we move to a similar view of the running system, but introduce some type information, in Figure 2. In this figure, the connections between the components are shown as typed interfaces. The Input component requires an implementation of the interface MessageSink, through which the data flows to the Command component, which provides that interface. Commands are distributed to the components that handle them through the Queue interface, suggesting that commands may be processed on separate threads, as they are in most cases.
Figure 2 - In Operation Mode
The disconnection and reconnection are accomplished using a special set of interfaces designed for that purpose. These are shown in Figure 3, which shows a different run-time depiction of the same set of components. In this figure, the component labeled CmpNVersion is shown providing the TaskOwner, CommandProcessor, TelemetrySource, and Linkable interfaces.
By providing these interfaces, the component itself provides the logic necessary to connect itself to the rest of the architecture, or to disconnect itself from the architecture. In order to provide these interfaces, the component requires
Scheduling Interfaces
Figure 5 shows the interfaces and classes involved in making and breaking scheduling connections among components. Our architecture features three kinds of thread run logic interfaces: Periodic, QueueReader, and Sporadic. (Readers familiar with [3] will recognize some of these interface names; our interfaces are loosely similar to the corresponding interfaces in the RTSJ.)
As with the command processing interfaces, the Component base class provides a no-op implementation of the TaskOwner interface, and only components that need to schedule threads need override that interface. At disconnect time, the descheduleTasks operation is called on the component, which in turn calls removeFromSchedule for each task that it scheduled previously.
Only Periodic tasks actually receive CPU time from the Scheduler implementation in operational mode. However, the Scheduler implementation in our design provides the services of creation, initialization, and finalization of the thread objects that actually run the given interface implementations. The addToSchedule operations create and initialize threads for the given run logic object, and the removeFromSchedule operations finalize and destroy that thread object. These interactions are not shown in our diagrams. As with any of these architectural interfaces, the modification process depends on the removeFromSchedule operation undoing any of the effects of the addToSchedule operation.
The operation addToSchedule for Periodic tasks allows the component to specify the period, in clock ticks, and the offset from the first tick, at which the Periodic's doCycle operation is to be invoked. The implementation of Periodic need only override that one operation.
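The period-and-offset contract described above can be illustrated with a small sketch. This is not the Aquarius Scheduler: the class TickScheduler, its onTick driver, and the CountingTask example are hypothetical, and the sketch assumes ticks are counted from zero and that tasks are invoked directly rather than on their own threads.

```cpp
#include <cassert>
#include <vector>

// Hypothetical run-logic interface: an implementation overrides only doCycle.
class Periodic {
public:
    virtual ~Periodic() = default;
    virtual void doCycle() = 0;
};

// Toy tick-driven scheduler (an assumption, not the flight Scheduler).
class TickScheduler {
public:
    // period and offset are both expressed in clock ticks.
    void addToSchedule(Periodic& task, unsigned period, unsigned offset) {
        assert(period > 0);
        entries_.push_back({&task, period, offset});
    }
    // Inverse of addToSchedule: afterwards the task is never invoked again.
    void removeFromSchedule(Periodic& task) {
        for (auto it = entries_.begin(); it != entries_.end();) {
            it = (it->task == &task) ? entries_.erase(it) : it + 1;
        }
    }
    // Called once per clock tick; invokes every task that is due on this tick.
    void onTick(unsigned tick) {
        for (auto& e : entries_) {
            if (tick >= e.offset && (tick - e.offset) % e.period == 0) {
                e.task->doCycle();
            }
        }
    }
private:
    struct Entry { Periodic* task; unsigned period; unsigned offset; };
    std::vector<Entry> entries_;
};

// Example task that simply counts how often it was run.
class CountingTask : public Periodic {
public:
    int cycles = 0;
    void doCycle() override { ++cycles; }
};
```

With a period of 4 ticks and an offset of 1, the task in this sketch runs on ticks 1, 5, 9, and so on, and stops running as soon as removeFromSchedule has been called.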
In our design, the Scheduler interface implementation is provided by the System component.
Telemetry Processing Interfaces
Our architecture supports two main styles of telemetry production: active and passive. As shown in Figure 6, a component that wants to produce telemetry must override the Component base class' trivial implementation of the TelemetrySource interface. In another example of interface implementation hand-off, the parameter list of each of the operations of the TelemetrySource interface consists of an implementation of the TelemetryManager interface.
Components that are active producers of telemetry must implement the register/deregisterActiveProducers methods. Again our design choice to use a Queue for passing telemetry from active producers to the Telemetry component shows in the interface: the QueueAttributes parameter to the registerActiveProducer method is used to create a Queue implementation, returned by the call, to which the active producer sends its data. The TelemetryManager assumes that this data takes the form of objects of a class that implements the TelemetryItem interface. Our TelemetryManager implementation supports two styles of queue draining on behalf of active producers: a "keep all" style, and a "keep only the latest" style. In the case of an active producer of error notifications, the "keep all" style is used, while for more typical producers of subsystem engineering telemetry, the latter style is appropriate.
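The difference between the two draining styles can be sketched as follows. The names DrainStyle, TelemetryQueue, and the string payload are hypothetical stand-ins chosen for illustration; the actual Queue and QueueAttributes types are not described in enough detail here to reproduce them.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical drain policy, standing in for part of QueueAttributes.
enum class DrainStyle { KeepAll, KeepLatest };

// Stand-in for an object implementing the TelemetryItem interface.
struct TelemetryItem { std::string payload; };

class TelemetryQueue {
public:
    explicit TelemetryQueue(DrainStyle style) : style_(style) {}

    // Called by the active producer whenever it has new data.
    void send(TelemetryItem item) { items_.push_back(std::move(item)); }

    // Called by the telemetry manager: returns what should be reported
    // and empties the queue.
    std::vector<TelemetryItem> drain() {
        std::vector<TelemetryItem> out;
        if (!items_.empty()) {
            if (style_ == DrainStyle::KeepAll) {
                out.assign(items_.begin(), items_.end());  // every item, in order
            } else {
                out.push_back(items_.back());              // only the newest item
            }
            items_.clear();
        }
        return out;
    }

private:
    DrainStyle style_;
    std::deque<TelemetryItem> items_;
};
```

In this sketch an error-notification producer would be given a KeepAll queue so that no notification is lost, while a producer of periodic engineering telemetry would use KeepLatest so that stale samples are dropped.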
Passive producers provide the TelemetryManager with an implementation of the TelemetryProducer interface. During operations, it is the TelemetryManager that decides when to query the producer object.
At connect time, the two register operations on the component are invoked. The component in turn calls the appropriate register operation on the TelemetryManager interface. At disconnect time, the two unregister operations are invoked on the component, resulting in calls to the unregister methods on the TelemetryManager implementation by the component.
The Linkable Interface
The architecture framework allows the definitions of other interfaces among components in addition to the command, scheduling, and telemetry interfaces provided by the architecture framework. The framework provides support for other interfaces using the Linkable interface. As we see in Figure 7 , the Linkable interface gives the component that implements it the opportunity to link to, or unlink from, the component given as the argument to the call.
As with the other architectural interfaces, the Component base class provides a trivial, no-op implementation of the Linkable interface.
Components that need special connections to some other component must override that implementation. Implementations of the methods of Linkable typically use the RoleFiller interface to decide whether the component provided as the argument can provide an interface that this component needs. In the Aquarius design, there are several cases of the Linkable interface being used to provide special connections between components. For example, we have a component that provides an interface to the 1553 bus. The Telemetry component needs that interface, so it implements the linkToComponent operation with logic to see if the role of the given component is the 1553 interfacing component, and if so to ask it for the interface implementation. In another example, a science data formatting and storage component needs to connect to a component that is responsible for communications with the radiometer in order to get an implementation of an interface that provides that instrument's data.
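A minimal sketch of the 1553 example may help make the pattern concrete. The Role enumeration, the busInterface accessor, and the integer error-count return values are assumptions made for illustration; the text names the RoleFiller interface but does not give its operations.

```cpp
#include <cassert>

// Hypothetical role identifiers standing in for the RoleFiller interface.
enum class Role { Bus1553, Telemetry, Other };

// Stand-in for the interface that the 1553 component hands out.
class Bus1553Interface {
public:
    virtual ~Bus1553Interface() = default;
    virtual int send(int word) = 0;
};

// Base class with trivial, no-op implementations of the Linkable operations.
class Component {
public:
    virtual ~Component() = default;
    virtual Role role() const = 0;
    virtual Bus1553Interface* busInterface() { return nullptr; }
    virtual int linkToComponent(Component&) { return 0; }      // returns an error count
    virtual int unlinkFromComponent(Component&) { return 0; }
};

class BusComponent : public Component, public Bus1553Interface {
public:
    Role role() const override { return Role::Bus1553; }
    Bus1553Interface* busInterface() override { return this; }
    int send(int word) override { return word; }
};

class TelemetryComponent : public Component {
public:
    Role role() const override { return Role::Telemetry; }
    // Link only to the component that fills the 1553 role; ignore all others.
    int linkToComponent(Component& other) override {
        if (other.role() != Role::Bus1553) return 0;
        bus_ = other.busInterface();
        return bus_ ? 0 : 1;
    }
    // Complete inverse of linkToComponent.
    int unlinkFromComponent(Component& other) override {
        if (other.role() == Role::Bus1553) bus_ = nullptr;
        return 0;
    }
    bool hasBus() const { return bus_ != nullptr; }
private:
    Bus1553Interface* bus_ = nullptr;
};

class OtherComponent : public Component {
public:
    Role role() const override { return Role::Other; }
};
```

Because the link decision is based on the role rather than on the concrete class, a replacement 1553 component of a different class can be linked in the same way as the original.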
As with the other architectural interfaces, the unlinkFromComponent operation must be the complete inverse of the linkToComponent operation.
The Component Class Hierarchy
Having gained some insight into the architectural interfaces, it's helpful to see an overview of the component class hierarchy, shown in Figure 8.
Figure 11. In addition to the new classes needed to modify the component's behavior, the patch contains a class that is derived from the class ComponentFactory. In the diagram, the example given is class Cmp1Version2Factory. The patch file will also contain an object of type Cmp1Version2Factory, and an integer variable sideEffect, described below.
The patch is compiled and linked. The patch file itself, in Executable and Linking Format (ELF), contains the machine code and data segments for the new code only. It can refer to objects and functions that are already part of the code repository in flight, but it must not redefine those objects or functions (that would result in multiply-defined symbol errors at load time).
Also, when compiling and linking the patch file we explicitly prevent the instantiation of C++ templates, except for those that are defined only in the modified component
The FSW assembles the pieces of the module as a file in a RAM-hosted file system. The command script ends with a special command that tells the FSW that all pieces of the module have been received and provides the checksum of the module. Then, in the ValidateModule step in Figure 10, the FSW computes a checksum over the entire module, compares it to the ground-provided value, and rejects and discards the patch if the checksums are not equal.
If the module is valid, then when the ground sends an installPatch command, the FSW performs the LoadAndCreate step shown in the diagram. If that succeeds, the FSW has created (in the C++ sense, as a dynamically allocated object) an instance of the new component class (class Cmp1Version2 in the example shown in Figure 10), and it can then proceed to install the new component instance into the architecture, shown as the InstallPatch step in Figure 10.
Zooming in on the LoadAndCreate step, we see in Figure 12 that the System component calls the OS-provided load function, the argument being the name of the ELF patch module file. The OS loader then reads the file, resolving references made by the code in the file to functions that are already part of the code base, and also initializing static variables defined in the load module. In assigning a value to the variable sideEffect, the loader calls the static function registerFactory on class FactoryRegistrar, which has the effect of registering the module's component factory (of type Cmp1V2Factory) with the FactoryRegistrar. Thus the factory object becomes available to the FSW via the getFactory function.
Figure 12 - Loading and Instantiation
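The registration-by-static-initializer idiom can be sketched in a single translation unit. The signatures of registerFactory and getFactory, the string role key, and the version accessor are assumptions made for illustration, and the example classes are written here as Cmp1Version2 and Cmp1Version2Factory. In flight, the section marked as patch module contents would be in a separately loaded ELF file, and its static initializers would run when the OS loader loads it rather than at program start.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

class Component {
public:
    virtual ~Component() = default;
    virtual int getErrors() const { return 0; }  // errors found by the constructor
    virtual int version() const = 0;             // hypothetical, for demonstration
};

class ComponentFactory {
public:
    virtual ~ComponentFactory() = default;
    virtual std::unique_ptr<Component> makeInstance() = 0;
};

// Registry that is part of the resident code base.
class FactoryRegistrar {
public:
    // Returns a value so it can be used to initialize a static variable.
    static int registerFactory(const std::string& role, ComponentFactory* factory) {
        table()[role] = factory;
        return 1;
    }
    static ComponentFactory* getFactory(const std::string& role) {
        auto it = table().find(role);
        return it == table().end() ? nullptr : it->second;
    }
private:
    // Function-local static avoids initialization-order problems.
    static std::map<std::string, ComponentFactory*>& table() {
        static std::map<std::string, ComponentFactory*> t;
        return t;
    }
};

// ---- contents of the patch module (assumed layout) ----
class Cmp1Version2 : public Component {
public:
    int version() const override { return 2; }
};

class Cmp1Version2Factory : public ComponentFactory {
public:
    std::unique_ptr<Component> makeInstance() override {
        return std::unique_ptr<Component>(new Cmp1Version2());
    }
};

static Cmp1Version2Factory theFactory;
// Initializing this variable is what registers the factory at load time.
static int sideEffect = FactoryRegistrar::registerFactory("Cmp1", &theFactory);
```

The resident code never names the new classes; it only asks the registrar for a factory and calls makeInstance, which is what allows a patch to introduce classes that did not exist when the resident code was built.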
Having obtained the factory instance, the FSW calls makeInstance on that object to create an instance of the new component. The makeInstance method of Cmp1V2Factory runs the constructor of the class Cmp1Version2. The constructor may encounter errors, which the FSW must check for by calling the getErrors method on the new component. Every step of preparing for and installing a patch is checked for errors, though we do not show all of that detail in our diagrams.
After these steps, the FSW is ready to install the patch, which means connecting the new component into the architecture and then letting it run. This process is shown in the sequence diagram in Figure 13. Figure 14 shows the sequence of operations made by the connectComponent method. Here we see the architectural interface implementations of the new component being exercised, as well as the Linkable interface on all of the components of the architecture. When the connection sequence is finished, the endChange method is called, which announces to all the components that the system is transitioning from architecture change mode to operational mode.
The final step is to invoke the resume method on the System component, which re-enables interrupts and starts the system running in operational mode. During the entire uplink, validation, load and install sequences, the FSW emits telemetry to keep the ops team informed of the progress and status of the modification operation.
Recovery from a Failed Patch Process
Any step of the patching upload and installation process can fail, and the FSW must be able to handle these errors. We have not shown error handling in the sequence diagrams in order to keep them simple. Following is a discussion of the errors that can happen and the autonomous recovery steps that the FSW takes in response to each.
Errors during uplink of the patch are detected by sequence numbers in each message and by a checksum over the entire patch. A corrupted patch will be discarded, and the patch operation aborted.
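The two uplink checks can be sketched as follows. The message layout, the class name ModuleAssembler, and in particular the additive byte-sum checksum are assumptions chosen for brevity; the checksum algorithm actually used is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical uplink message: a sequence number plus a piece of the module.
struct PatchMessage {
    std::uint32_t sequence;
    std::vector<std::uint8_t> bytes;
};

class ModuleAssembler {
public:
    // Appends the piece if it arrives in order; otherwise discards the
    // partially assembled module and aborts the operation.
    bool accept(const PatchMessage& m) {
        if (aborted_ || m.sequence != next_) { discard(); return false; }
        module_.insert(module_.end(), m.bytes.begin(), m.bytes.end());
        ++next_;
        return true;
    }

    // Compares a checksum over the whole module with the ground-provided value.
    bool validate(std::uint32_t groundChecksum) {
        if (aborted_ || checksum(module_) != groundChecksum) { discard(); return false; }
        return true;
    }

    // Placeholder algorithm (assumption): 32-bit sum of all bytes.
    static std::uint32_t checksum(const std::vector<std::uint8_t>& data) {
        std::uint32_t sum = 0;
        for (std::uint8_t b : data) sum += b;
        return sum;
    }

    std::size_t size() const { return module_.size(); }

private:
    void discard() { module_.clear(); aborted_ = true; }

    std::vector<std::uint8_t> module_;
    std::uint32_t next_ = 0;
    bool aborted_ = false;
};
```

A missing or reordered piece is caught as soon as it arrives, while corruption inside a piece is caught only by the final checksum comparison, so both checks are needed.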
Errors in the module itself, such as undefined symbols, or a failure of the module to register a component factory at load time, are detected by the FSW, and it discards the module and aborts the patch operation.
If the constructor of the replacement or new component detects an error, the new instance is deleted, and the patch operation aborted.
The next step for a replacement is to remove the component to be replaced. This is done using the unlinking and uninstalling operations of the architectural interfaces. If any of these calls fail, the FSW attempts to re-connect the old component to the architecture, and aborts the patch operation.
If the old component is successfully removed, the next step is to link the new component into the architecture, using the linking and installing operations of the interfaces. If any of these operations fail, the FSW takes the disconnection steps with the new component, and then re-connects the original component (which is not deleted unless no errors occur) back into the architecture.
The FSW reports success or failure of each step to give the ground insight into the process. If any of the errors occur, the architecture could be in a partially-functional state: e.g., unable to handle certain commands, and the ground would need to carefully evaluate whether a reboot might be prudent at that point.
Whenever the FSW attempts backup steps to recover from an error, it finishes the patch operation by putting the architecture back into operational mode and re-enabling interrupts, regardless of whether or not the backup steps themselves encountered errors. This is safe in Aquarius' case because the FSW cannot damage the hardware, and so we can allow the FSW to stay running in a degraded state in the hope of getting information out to the ground about what went wrong.
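The recovery logic for a replacement can be summarized in a short sketch. The Component and System types here are hypothetical test doubles with injectable failures, and the report strings are illustrative; the real sequence involves each architectural interface separately rather than single connect and disconnect calls.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Test double: connect/disconnect stand in for the full set of
// link/install and unlink/uninstall operations, returning an error count.
class Component {
public:
    Component(bool failConnect, bool failDisconnect)
        : failConnect_(failConnect), failDisconnect_(failDisconnect) {}
    int connect() {
        if (failConnect_) return 1;
        connected_ = true;
        return 0;
    }
    int disconnect() {
        connected_ = false;
        return failDisconnect_ ? 1 : 0;
    }
    bool connected() const { return connected_; }
private:
    bool failConnect_;
    bool failDisconnect_;
    bool connected_ = false;
};

struct System {
    bool operational = true;
    std::vector<std::string> report;  // stands in for progress telemetry
};

// Returns true if newC replaced oldC. The old component is never deleted
// here, so it stays available for re-connection on failure.
bool replaceComponent(System& sys, Component& oldC, Component& newC) {
    sys.operational = false;  // enter architecture change mode
    bool installed = false;
    if (oldC.disconnect() != 0) {
        sys.report.push_back("remove failed");
        oldC.connect();       // best-effort re-connection of the old component
    } else if (newC.connect() != 0) {
        sys.report.push_back("install failed");
        newC.disconnect();    // undo any partial connection of the new component
        oldC.connect();       // then restore the original
    } else {
        sys.report.push_back("installed");
        installed = true;
    }
    sys.operational = true;   // always return to operational mode
    return installed;
}
```

Note that the return to operational mode is unconditional, mirroring the choice to keep the FSW running, possibly degraded, so that it can report what went wrong.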
In the next section, we describe how we test all of the failure scenarios we've just described.
VERIFICATION
The FSW modification process that we've described is highly automated, and so involves many steps and conditional branches in FSW logic. And since it involves disabling the interrupt that allows the FSW to communicate through the 1553 bus, it could leave the system in a nonresponsive state if something went wrong. Therefore it was essential that we verify the correctness and robustness of this process as exhaustively as possible.
We approached this task with several techniques, chief among them testing, but also with code checkers, code coverage analysis, and detailed code reviews. Our testing included component- and architecture-level white box testing, automatic generation of long sequences of modification operations with injected faults, and system-level tests that replace every patchable component and introduce several entirely new components. We insisted on code coverage of 100% for the parts of the System component and other classes directly involved in architecture modifications.
Component-Level White Box Testing
On the level of an individual component, we needed to verify that each component class implemented its architectural interfaces correctly, and also that the methods implementing the architectural interfaces responded appropriately to faults that could occur. For each component class, we wrote a stand-alone test program that exercised all of that component's architectural interfaces, and caused them to encounter every error that they could encounter.
To accomplish this, we developed special test component classes called fault injectors, depicted in Figure 15 . Consider for example the operation registerActiveProducerFaultIn on the fault injector class FaultInjectorTlmManager: the callCount parameter specifies the number of calls to the method registerActiveProducer that will occur before that operation will return a non-zero error count, at which time the component attempting to register the producer is forced to handle that error. The call count logic is necessary because most components make more than one call to a register or install operation, and we needed to make each one of those distinct calls produce an error.
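The call-count mechanism can be sketched as follows. The operation signatures, the integer producer identifier, and the TwoProducerComponent class are assumptions for illustration. The sketch also assumes that only the single call following the first callCount calls fails, which is one reasonable reading of the description.

```cpp
#include <cassert>

// Minimal stand-in for the TelemetryManager interface (signature assumed).
class TelemetryManager {
public:
    virtual ~TelemetryManager() = default;
    virtual int registerActiveProducer(int producerId) = 0;  // returns an error count
};

class FaultInjectorTlmManager : public TelemetryManager {
public:
    // Arm a fault: the first callCount calls succeed, the next one fails.
    void registerActiveProducerFaultIn(int callCount) {
        armed_ = true;
        remaining_ = callCount;
    }
    int registerActiveProducer(int /*producerId*/) override {
        if (armed_) {
            if (remaining_ == 0) {
                armed_ = false;
                return 1;  // injected fault
            }
            --remaining_;
        }
        return 0;
    }
private:
    bool armed_ = false;
    int remaining_ = 0;
};

// Hypothetical component under test that registers two active producers.
class TwoProducerComponent {
public:
    int registerActiveProducers(TelemetryManager& tm) {
        registered_ = 0;
        for (int id = 0; id < 2; ++id) {
            int errors = tm.registerActiveProducer(id);
            if (errors != 0) return errors;  // stop at the first failure
            ++registered_;
        }
        return 0;
    }
    int registered() const { return registered_; }
private:
    int registered_ = 0;
};
```

Running the component once for each call count from zero up to one less than the number of register calls it makes exercises every distinct error path in that method.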
A component-level white box test program creates instances of each of the fault injector classes, and then runs a series of 
Automatic Test Scenario Generation
We developed a patching test harness with the goal of determining the robustness of the patching system with respect to faults while validating the connection and disconnection logic for each component. Our approach makes it possible to specify combinations of patchable components and fault injections and to perform software patching under these fault conditions.
The test harness enables the user to choose permutations of components picked from the set of patchable components, where repetitions are allowed. Patching the same set of components with different orderings exposes any dependency relations between connection and disconnection operations. These dependencies can be in the form of memory allocation and de-allocation issues, or unsafe assumptions about the lifetime of a shared object.
The fault combinations are generated as follows: for every method specified in the fault injection component (see Figure 16), we specify a fault or no-fault condition. The test harness generates a user-specified number of test cases, where each test case represents a random set of faults. The set of patchable components is then patched with those faults. Thus thousands of patch operations are made in a randomly selected order.
We specialize the FaultInjector component (see Figure 16) with a patchable component type. A call to a FaultInjector component method results in a call to the specializing component's corresponding method. This method is called with or without the inclusion of a fault, depending on whether a fault was specified for the particular method a priori.
The inclusion of faults during patching tests the ability of the patching system to safely recover from faults associated with the initialization and finalization of flight software subsystems. It also exposes the dependencies between the methods of a particular component.
For example, if installCmdHandlers had a fault and it was called before linkToComponent, can linkToComponent execute correctly given a fault in installCmdHandlers? These dependencies are extracted more readily by executing different fault injection scenarios. As a result, the flight software system's resilience to unsafe patching operations is verified.
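The scenario generation described in this section can be sketched with the standard random number facilities. The TestCase structure, the generateCases function and its parameters, and the even fault probability are all assumptions for illustration, not the interface of the actual harness.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// One generated scenario: which components to patch, in what order
// (repetitions allowed), and which fault-injectable methods are faulted.
struct TestCase {
    std::vector<std::string> patchOrder;
    std::vector<bool> faults;  // faults[m] is true if method m is given a fault
};

std::vector<TestCase> generateCases(const std::vector<std::string>& components,
                                    std::size_t methodCount,
                                    std::size_t caseCount,
                                    std::size_t patchesPerCase,
                                    unsigned seed) {
    assert(!components.empty());
    std::mt19937 rng(seed);  // fixed seed makes a failing run reproducible
    std::uniform_int_distribution<std::size_t> pick(0, components.size() - 1);
    std::bernoulli_distribution faulty(0.5);  // assumed fault probability

    std::vector<TestCase> cases(caseCount);
    for (TestCase& tc : cases) {
        for (std::size_t m = 0; m < methodCount; ++m) {
            tc.faults.push_back(faulty(rng));
        }
        for (std::size_t p = 0; p < patchesPerCase; ++p) {
            tc.patchOrder.push_back(components[pick(rng)]);
        }
    }
    return cases;
}
```

Seeding the generator explicitly is a design choice of this sketch: it lets any scenario that uncovers a problem be regenerated exactly for debugging.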
System-Level Testing
We have a suite of system-level patching tests in which we replace every patchable component and install several new components.
The suite includes successful patch operations, as well as all of the failures that can be generated when running the system as a black box, with only the flight interfaces (the 1553 bus and the high-speed serial bus). To date, we have run these tests in two different environments: one in which the only hardware we have is a breadboard PowerPC that has a serial port, and a separate board that provides Ethernet connectivity. In this configuration, software simulations for the other boards that make up the instrument's command and data subsystem are run as part of the FSW image (by constructing the architecture with simulation versions of the components that interface to these boards.) The 1553 and serial buses are simulated by Ethernet socket messages.
A variant of this stand-alone environment is running the FSW on a Sun workstation. All OS-dependent constructs are wrapped in C++ classes, and the differences among OS's are hidden in the implementation of these classes. This technique enables us to run a FSW image on the Sun that exercises at least 90% of the code in the FSW. In particular, none of the patching logic is OS-dependent (except for the OS's loader, and some subtleties of thread behavior). Patch modules in the Sun configuration are shared libraries.
The second environment is in the integrated command and data subsystem. In that environment, a modified version of the same commanding tool we use for simulation of the 1553 bus in the stand-alone environment sends commands to a real 1553 bus. So we can use the same command scripts in either the stand-alone environment or the integrated command and data subsystem environment. We've run these tests on engineering models as well as the flight version of the subsystem.
As of this writing, the complete flight instrument has not yet been integrated, though it will be within a few months. When it is, we will run patching tests on that system.
APPLICATIONS
To date, the run-time modification capability has been extremely useful in testing situations. It has enabled tests on a system level that would normally be possible only in a white-box context, allowed rapid prototyping of design modifications and problem fixes, and served other utility purposes. Following is a brief discussion of some of the applications of this capability to date.
(1) Writing 
CONCLUSIONS
It is natural to question the necessity of the kind of flexibility that our modification technique provides. After all, once launched, a system can hardly change so much that it would need major behavior modifications. With crossed fingers, we hope this will hold for Aquarius, and that we will never have to patch the FSW in flight. But should the need to change our FSW in flight arise, in however unpredictable a way, we are confident that our modification capability gives us ample ability to adapt. Moreover, there are a number of advantages to our patching method over the usual poking of machine instructions directly into memory:
(1) The flexibility of our patching process greatly facilitates test and development, as our list of applications in the previous section demonstrates. We have institutional requirements that FSW be readily testable, and that it allow the efficient diagnosis of unexpected conditions and faults. We think our patching approach has allowed us to meet these requirements in a more complete way than has been done before.
(2) The relative ease of patching that our approach enables, by allowing the development of patches at a higher logical level than patching one function and by automating the entire process, has allowed us to use patches more widely and frequently than we could easily do with traditional patching. This in turn has helped us verify the capability much more extensively than we might otherwise have been able to do.
