Abstract. The MARS system has been developed at the Technical University of Vienna during the last decade with the goal of managing highly critical distributed control applications. It relies on a periodic, time-triggered structure and the fail-silent behaviour of processing nodes. To validate the MARS concept we have developed hardware components and an operating system, both tailored to the specific needs of the system. This paper presents our approach towards achieving deterministic timing behaviour and the fail-silence property of processing nodes. The integral development of hardware and its associated operating system offered us the opportunity to implement specific features either in hardware or in software. This facilitated a very efficient design that has proved useful in practice for more than a year.
Introduction
Distributed computing systems are widely used for control applications. These are often characterized by hard deadlines within which the system has to react to external stimuli. A system that has to meet such timing requirements is called a real-time system. An important group of control applications can further be called critical, because a malfunction of the system can have catastrophic consequences. Since completely fault-free operation can never be assumed, a computing system for critical applications must be able to tolerate faults, i.e., it must continue to provide its service even in the presence of a fault.
The demand for real-time behaviour on the one hand and fault tolerance on the other is a challenge for the design of both hardware and software. This challenge is often underestimated: off-the-shelf components that are suited neither to each other nor to the system demands are combined. Expensive, sophisticated software is executed on standard hardware, so that error detection coverage is compromised by a lack of suitable hardware devices. Finally, the timing behaviour of software is determined without detailed knowledge of the timing behaviour of the underlying hardware.
From our point of view a very straightforward and promising design approach is a top down design for the time network and communicate by the exchange of messages. The MARS system is the realization of such an approach. It is the target system for the design discussed in this paper. In MARS all active and passive components are replicated in order to prevent a single failure of such a component from causing a system failure†. Up to three processing nodes execute identical software and form a Fault-Tolerant Unit (FTU). The FTUs communicate via the real-time network that is implemented as a replicated broadcast channel (see figure 1).
Fault tolerance
MARS uses a two-layered mechanism to achieve fault tolerance: the bottom layer (node layer) is responsible for error detection and error confinement (i.e., node shutdown on error). These functions are carried out by the processing nodes, which are therefore called 'fail-silent'. The task of fail-silent nodes is to detect all internal errors and to prevent their propagation. Therefore, the top layer (system layer) need not care about erroneous data; it only has to provide enough redundancy to tolerate (silent) failures of parts of the system. The major functions of the top layer are handling of redundant data and reconfiguration of the system in case of a node failure.
To achieve deterministic timing behaviour even in the presence of faults, the MARS system uses active redundancy for all processing and communication activities: each process is executed simultaneously at all nodes of an FTU and each message is transmitted quasi-simultaneously on each of the broadcast channels. Due to the fail-silence property the results of all three nodes of an FTU are assumed to be correct and may be used interchangeably. Since we need only two nodes to tolerate a single failure of a fail-silent node (i.e., the loss of a message), the optional third node, the shadow node, does not transmit any message on the real-time network as long as both active nodes are operational. Only if an active node fails does the shadow node immediately start to transmit its results, thus restoring the initial degree of redundancy.
Timeliness
MARS is strictly time triggered; in other words, each relevant action of the system is scheduled before operation.
This property allows us to guarantee the proper timing behaviour of the system already during the design phase of an application. The actions to be scheduled include:
• The points in time when a node is allowed to send a message (following the time-triggered paradigm, MARS uses a TDMA‡ protocol for communication) and the types of messages which may be sent at specific points in time.
• The start times and deadlines of all processes.
• The points in time when sensor values are read and actuator values are written.
• The processing steps for recovery and reintegration of failed nodes. Although these actions are rarely performed, they have to be incorporated into the schedule to prevent a properly detected fault from affecting the correct timing behaviour of the system.
It is not necessary to consider error handling at single-node level in the timing analysis because the detection of an error causes a reset of the processing node. No dedicated recovery actions are performed within a node (see section 5.2).
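As an illustration of what such a pre-computed schedule might look like, the following minimal Python sketch represents one period of a node as a static table of actions and checks, before run-time, that the actions do not overlap, meet their deadlines and fit into the period. The `Action` type, the entry names and all times are hypothetical and are not taken from the MARS design tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str         # hypothetical label of the scheduled action
    start_us: int     # start time relative to the beginning of the period
    wcet_us: int      # assumed worst-case duration of the action
    deadline_us: int  # latest permitted completion time within the period


def check_schedule(actions, period_us):
    """Return True if the actions are disjoint in time, meet their
    deadlines and complete within one period."""
    previous_end = 0
    for action in sorted(actions, key=lambda a: a.start_us):
        end = action.start_us + action.wcet_us
        if action.start_us < previous_end:
            return False  # overlaps with the preceding action
        if end > action.deadline_us or end > period_us:
            return False  # misses its deadline or leaves the period
        previous_end = end
    return True


# Hypothetical schedule for one 4 ms period of a single node.
SCHEDULE = [
    Action("read_sensor", 0, 200, 500),
    Action("control_task", 500, 1500, 2500),
    Action("send_message", 2500, 300, 3000),
    Action("write_actuator", 3000, 200, 4000),
]
```

Because the table is fixed before operation, a check of this kind can be run once at design time rather than by the node itself.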
The existence of a precise global time base is a basic requirement for a time-triggered system, because the actions within different processing nodes of the system must be synchronized. Global time is maintained by a distributed, fault-tolerant clock synchronization algorithm (see section 4.1). The static structure of MARS forces the system designer to investigate the timing behaviour of all parts of the system carefully: we have to know time bounds for all processing and communication activities in order to assign time slots to them. In particular, the following timing parameters must be known:
• The maximum execution time of each process, considering the architecture and speed of the processor, pipelining, caching, and memory wait states (Puschner and Koza 1989).
• The operating system overhead.
• The overhead of hardware activities influencing the timing behaviour of the node.
The processing node

Hardware
The real-time and fault tolerance requirements have already been taken into account during the design phase of the hardware (Steininger and Reisinger 1991), which distinguishes our board from a 'general purpose' board. The processor board consists of two complete, independent processing units, the 'application unit' and the 'communication unit'. The application unit is based on a 68070 CPU clocked at 15 MHz. The core of the 68070 is very similar to the 68000, but the chip also includes, among other things, a memory management unit and a two-channel DMA controller.
The communication unit is also based on the 68070. This greatly simplifies development and maintenance of hardware and software. The communication unit comprises an EPROM, an SRAM, a parallel I/O-port and two Ethernet controllers (LANCE) which form the important links to the other nodes in the system. To maintain a global time base, each Ethernet controller is coupled with a Clock Synchronization Unit (CSU, see section 4.1). An I2C-bus connects the processor, an I/O-port, an SRAM, and an EEPROM for storing system parameters and setup information. The only interconnection points to the application unit are the internal FIFO, the periodic clock interrupt line, the power supply, and the common reset line.
† The terms fault, error, and failure are used in accordance with the definitions given in Laprie (1992).
‡ Time Division Multiple Access.
A detailed rationale for the choice of this architecture, based on the aspects of predictable timing behaviour and fault tolerance, will be given in sections 4 and 5.
Operating system
The MARS operating system (Reisinger 1992) has to deal with the fact that there are two processors with different responsibilities at each processing node. A microkernel-based operating system (Mullender et al 1990, Cheriton 1988) allows system designers as well as application programmers to adapt the operating system to their needs easily, since the kernel's dependence on the processor's peripherals is minimal. With our microkernel-based operating system for MARS we are able to execute identical copies of the kernel on both processors of the node. System processes, running on top of the kernel, adapt the operating system to the needs of the given environment.
The tasks of our microkernel are reduced to a minimum: loading of system and application programs, process management, and error handling. All other system functions, such as message passing, handling of peripheral devices, clock synchronization, or redundancy management are performed by processes outside the kernel.
The microkernel communicates with user-level processes by means of a few system messages that are located at certain positions within the message base, a globally readable memory area at each processing unit. The message base is also used for communication between processes of different system or application programs. Only processes of the message handler are allowed to write to the message base; all other processes may only read from it.
In a time triggered system we can predict all actions that a node will perform. On the other hand, we also know all operations that will never be executed by the node. Therefore, we do not have to load the code required for these operations, thus reducing the size of the system software. Especially in the field of embedded systems a small size of the system software is an important issue.
The modularity of a microkernel-based operating system is of particular advantage during system design. It allows different groups of software developers to design, implement, and test system services independently, i.e., without affecting the other groups. The flexibility of the system is increased because application programmers can implement services themselves if the provided services do not meet their requirements.
Supporting deterministic timing behaviour
A processing node of a distributed real-time system, especially of a time-triggered system, must have some characteristics which are not required of conventional processing nodes:
• Maintenance of a global time base.
• Support of a bus access protocol which guarantees that message transmission times do not exceed a specified maximum.
• Protection of the application software from asynchronous events in the environment.
• Predictability of the hardware timing behaviour.
• Strictly bounded operating system overhead.
While a global time base is the basic requirement to coordinate the actions of processing nodes of a distributed time-triggered system, the other features are needed to bound the duration of each action within a node, thus allowing pre-run-time scheduling of all processes.
Global system time
Clock synchronization is based on the fault-tolerant average (FTA) algorithm (Kopetz and Ochsenreiter 1987) which is able to synchronize the clocks of all correctly working processing nodes of a system even in the presence of (a specified number of) faults. The FTA needs to know the deviation of its own local clock from all others. The mean of this set of deviations, excluding the k highest and k lowest deviations (k is the maximum number of faulty clocks that can be handled by the protocol), is used to correct the local clock. Because the quality of the clock synchronization heavily depends on the accuracy of the measured clock deviations, the processing nodes provide a hardware mechanism to determine these deviations: both Ethernet controllers of each node (LANCEs) are coupled to a Clock Synchronization Unit (CSU) (Kopetz and Ochsenreiter 1987), which maintains the global time base, generates the periodic clock interrupt, and provides outgoing and incoming messages with accurate time-stamps. When transmitting a message, the LANCE reads the current time from an internal register of the CSU and appends it to the message. Immediately after receipt of the message, the LANCE of the receiver node sends a 'stamp'-signal to the CSU. This signal causes the CSU to latch the current time into a time-stamp register that can be read by the software later. Due to the hardware implementation of the time-stamp mechanism all time intervals between reading the current time from the sending node's CSU and generating the 'stamp'-signal at the receiving node are well known. Therefore, the deviation of each pair of clocks within the system can be determined with very high accuracy (about 2 µs). While hardware support is needed to generate accurate time-stamps, the more complex synchronization algorithm is implemented in software. This tandem approach allows clock synchronization with a precision of about 10 µs.
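The averaging step of the FTA can be sketched in a few lines of Python. This is a minimal illustration of the rule described above (discard the k highest and k lowest deviations, average the rest), not the MARS implementation; the function name, the unit (microseconds) and the sample values are assumptions, and the sketch leaves open how the node's own clock enters the set and with which sign the correction is applied.

```python
def fta_correction(deviations_us, k):
    """Fault-tolerant average: drop the k largest and k smallest measured
    clock deviations and return the mean of the remaining ones."""
    if len(deviations_us) <= 2 * k:
        raise ValueError("need more than 2*k deviation measurements")
    ordered = sorted(deviations_us)
    trimmed = ordered[k:len(ordered) - k]
    return sum(trimmed) / len(trimmed)


# Hypothetical measurements: four plausible deviations and one clock
# that is far off (e.g. a faulty node).
measured = [-3.0, 1.0, 2.0, 0.0, 250.0]
correction = fta_correction(measured, k=1)  # averages 0.0, 1.0 and 2.0
```

With k = 1 the outlier of 250 µs has no influence on the result, whereas a plain mean over all five values would be dominated by it.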
Tasks on the application processor and on the communication processor are synchronized to the global system time by an interrupt ('clock tick') provided by the CSU with an adjustable period (0.5 to 8 ms).
Message passing
The protocol for exchanging messages between processing nodes must allow communication times to be bounded even in the case of message losses. If the worst case timing behaviour is of concern (as in time-triggered hard real-time systems), the cheapest way to achieve reliable message transmission is to send each message twice. Experiments have shown (Kopetz et al 1991) that the probability of losing both instances of a message is much lower than the probability of a node failure. Therefore, the overall system reliability will not benefit considerably from more reliable communication. The processing nodes support this message passing mechanism in a number of ways:
• The operating system allows the precise points in time to be specified when messages have to be sent. This avoids delays or message losses caused by contention or collision at the communication channel.
• The Ethernet controller is configured to suppress delay or retransmission of messages if the communication channel is blocked. This helps to prevent a single transmission failure from crashing the whole channel by causing excessive retransmissions.
• The processing node supports access to two redundant broadcast channels. Therefore, the two redundant instances of a message may be sent on different buses.
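The effect of sending each message twice can be illustrated with a short Python sketch. The receiver simply takes the first instance that arrived intact; the loss estimate additionally assumes that the two instances are lost independently of each other, which is a simplifying assumption of this sketch and not a claim made by the experiments cited above. All names and numbers are hypothetical.

```python
def select_message(instances):
    """Return the first instance that arrived intact, or None if every
    instance was lost (a lost instance is represented by None)."""
    for payload in instances:
        if payload is not None:
            return payload
    return None


def loss_probability(p_single, instances=2):
    """Probability that all instances are lost, assuming (for this sketch
    only) that losses are statistically independent."""
    return p_single ** instances


# Hypothetical case: the instance on the first bus is lost, the one on
# the second bus arrives.
received = select_message([None, b"valve=open"])
```

Because the number of instances is fixed, the time needed for the transfer is the same whether or not an instance is lost, which is what keeps the communication time bounded.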
Shielding application software from asynchronous events
Real-world events and the corresponding data streams are asynchronous by their nature, and each system that is embedded in a realistic environment has to cope with this fact. A periodic system structure is a very effective but not a complete solution to this problem, because we cannot shield the system completely from asynchronous events. The dual-processor architecture, however, provides additional protection against uncertainties of application run-time by preventing external events from affecting the timing behaviour of processes that execute on the application unit. The communication processor services messages arriving at the Ethernet, extracts the relevant information and puts it into the internal FIFO, which is large enough to provide reasonable flexibility for the transfer of messages to the application processor's memory. This mechanism allows application processes to access messages the same way they access main memory; the temporal uncertainties of message transmission (which are present despite the deterministic nature of message transmission) are hidden from the application processes.
This also applies for the transmission of results from the application processes to the real-time network.
Another interface is provided for the application processes to allow direct communication with the process environment (e.g., sensors and actuators). Again, its implementation as a FIFO was chosen to shield the application from temporal uncertainties of the environment.
The other communication channels do not influence the timing behaviour significantly: the I2C-bus is under full control of the processor and therefore does not affect determinism. The RS232 interface is intended only for auxiliary functions such as testing and not as a main communication path.
As a consequence of this careful architectural design, the application software is embedded in a fully deterministic environment. All main data streams are buffered by FIFOs and under full control of the application software. Timing analysis for the application software can be performed in isolation, reducing effort significantly. This is important because application software often changes, whereas the basic functions of the communication unit stay the same, allowing more elaborate determination and testing of its timing behaviour.
Predictability of hardware timing
By its nature the timing behaviour of the processor itself is predictable. However, predictability is limited by asynchronous events like interrupts (especially if nesting occurs), by bus arbitration between multiple bus masters, and by refresh cycles for the DRAM.
To increase the node's predictability, our hardware only allows two essential interrupts: one is the 'clock tick', which is required for synchronization and, owing to its periodic nature, does not result in timing uncertainties. The other interrupt signals that an error has been detected, in which case emergency handling (at the node level!) is activated and real-time constraints become meaningless anyhow. In all other cases buffer sizes and polling intervals are matched such that no interrupts are necessary and a questionable, complex analysis of a nested interrupt structure (Kopetz and Kim 1990) can be avoided.
Although it would have been a more consistent approach, bus arbitration has not been completely avoided for several practical reasons.
The LANCE is only capable of managing its input and output buffers by DMA (Direct Memory Access). Because DMA transfers occur asynchronously to the execution of processor statements, they cause unpredictable (but bounded) delays during the execution of processes or operating system routines. The resulting non-deterministic timing behaviour is hidden from the application by the communication processor and therefore does not influence application timing at all. However, the (bounded) timing uncertainty has to be accounted for by the communication software.
The data transfer between FIFO and memory is performed by DMA. This speeds up the transfer by a factor of approximately 2 since single addressing can be used. Duration and data size of each transfer are known in advance and the transfer is fully controlled by software. So, the arbitration overhead can be estimated tightly and will be outweighed by the reduction of transfer time.
Refresh cycles of the DRAM controller are nondeterministic from a microscopic point of view, e.g., in comparison with an instruction cycle or a memory access.
However, from a more global point of view the refresh occurs periodically every 16 microseconds and lasts for 4 clock cycles (267 nanoseconds). In the worst case each refresh will result in a delay in memory access causing an overhead of 1.68%. The uncertainty induced by DRAM refresh is thus tightly limited and does not impose any problem.
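The order of magnitude of this overhead can be checked with a short calculation. The Python sketch below assumes that the four refresh cycles are cycles of the 15 MHz processor clock mentioned in the hardware description (this matches the stated 267 ns) and that, in the worst case, every refresh delays a memory access by its full duration. It arrives at roughly 1.7 %, in line with the figure quoted above; the constant and function names are hypothetical.

```python
CPU_CLOCK_HZ = 15_000_000   # processor clock from the hardware description
REFRESH_PERIOD_S = 16e-6    # one refresh every 16 microseconds
REFRESH_CYCLES = 4          # duration of one refresh in clock cycles


def refresh_overhead():
    """Return (duration of one refresh in seconds, worst-case fraction of
    time lost to refresh), assuming every refresh delays an access."""
    refresh_time_s = REFRESH_CYCLES / CPU_CLOCK_HZ
    return refresh_time_s, refresh_time_s / REFRESH_PERIOD_S


duration_s, fraction = refresh_overhead()
print(f"refresh lasts {duration_s * 1e9:.0f} ns, overhead {fraction:.2%}")
```

A bound of this kind can be added as a fixed percentage to the maximum execution time of every process instead of analysing individual refresh cycles.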
Boundable operating system overhead
The microkernel approach allows the operating system to be partitioned into a set of small modules. The methods for determining the worst case execution times of these modules are similar to the methods used for application processes (see section 2.3). Since each module performs only one well-defined function and there are few alternatives and loops with variable iteration counts, the worst case execution time is close to the actual one. The system modules are treated as ordinary application processes by the pre-run-time scheduler. Therefore, no special methods are needed to bound the overhead caused by these modules.
Apart from the overhead caused by system modules, we must consider the overhead of the microkernel itself. Since there are no interrupts in the system except the periodic clock interrupt, we only have to take into account the execution time of the periodic clock interrupt handling routine. This routine maintains the local time and dispatches processes. Both tasks are efficient; handling of local time is supported by the CSU and dispatching is easy due to the pre-run-time scheduling of applications. Our experience with the implementation has shown that the execution time of the clock interrupt handler is nearly constant and reasonably low.
Achieving fail-silent behaviour
In order to be fail-silent, a node must provide comprehensive error detection and proper error handling. Some of the mechanisms used to detect and handle errors do not prevent the transmission of incorrect messages but rather provide the messages with enough redundancy to make the errors detectable at the receiver's side. So we extend the definition of proper behaviour of our node from 'fail-silent' in a narrow sense to the following: the node either sends correct messages that can be verified as being correct by all non-faulty receivers, or sends detectably corrupt messages (Shrivastava et al 1992) that are discarded by each non-faulty receiver, or sends no messages at all. We now present the mechanisms that are implemented in the processing nodes of the MARS system to achieve these goals.
Error detection
To achieve a high degree of self-checking coverage, many error detection mechanisms have been provided and implemented in both hardware and software. While the focus of the hardware mechanisms lies on the detection of permanent errors and errors preventing further software execution, software error detection by its nature concentrates on subtle faults like transients that cause a slight deviation of the program flow or data errors. The mechanisms that have been implemented are presented in the following.
Data and control flow errors. Experiments (Karlsson et al 1989b, Schuette et al 1986) have shown that a majority of faults affecting a system results in deviations from the correct control flow or in unintended changes of data. Although such errors have a rather high probability of being detected on the behavioural level, the implementation of several simple error detection mechanisms on a lower level improves error detection coverage and latency without compromising availability. The following mechanisms have been chosen:
• Main memory and FIFOs are parity protected (one bit per byte). Additionally, pull-up resistors ensure that a high-impedance state on the bus is read as a parity error (e.g., in case the processor reads from an unused address range or an empty FIFO).
• The standard low level protocol of the Ethernet controller detects transmission errors by checking, for example, CRC and framing.
• The standard error detection mechanisms of the MC68000-family (e.g., illegal op-code, zero divide, bus error, ...) have proved to be very efficient (Damm 1986, Schuette et al 1986).
The design of a fail-silent processing node for the predictable hard real-time system MARS
Monitoring properties such as the bus access behaviour and the execution interval of a certain task can provide a very effective means of error detection (Schmid et al 1982). The strictly deterministic, periodic structure of our system allows a convenient implementation of powerful high level error detection mechanisms:
• The 68070 provides a memory management unit which allows efficient error detection on the basis of memory access behaviour. Illegal address references are also detected by a bus time-out logic.
• A watchdog timer monitors the 'aliveness' of the communication processor. It performs a node reset unless toggled periodically.
• The application and communication units are autonomous and strictly separated. Therefore the probability of a simultaneous error in both units is kept at a minimum. Consequently, mutual checks can properly be used for error detection. While the dual-processor architecture is a feature provided by hardware, the particular way of implementing these checks is left to the system software.
• A Time Slice Controller (TSC) monitors the access behaviour of the node to the broadcast channels. During a learning phase the TSC adapts its internal model of the TDMA protocol by determining time, duration and period of the write accesses. After completion of the learning phase the TSC activates an error signal if the node attempts to transmit outside its legal time slot. From the hardware point of view the TSC is an autonomous single-chip microcontroller. In order to avoid correlated errors, it is only loosely coupled to the communication unit by the I2C-bus.
• The high degree of determinism in the MARS system allows other checks for illegal system states. For example, an overflow of the Ethernet receive buffer or a FIFO can be treated as an error, because it must not occur during normal operation.
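The behaviour of the Time Slice Controller described above can be sketched as follows. This Python fragment is a simplified model under stated assumptions: the learning phase is reduced to deriving period and offset from a few observed, exactly periodic write accesses, clock tolerances are ignored, and all names (`SlotModel`, `learn_slot`, `transmission_allowed`) and times are hypothetical. The real TSC is an autonomous microcontroller whose algorithm is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotModel:
    period_us: int    # length of the TDMA round as observed
    offset_us: int    # start of the node's slot within the round
    duration_us: int  # length of the node's slot


def learn_slot(observed_starts_us, duration_us):
    """Learning phase: derive the slot model from observed write accesses,
    which are assumed here to be exactly periodic."""
    if len(observed_starts_us) < 2:
        raise ValueError("need at least two observations")
    period = observed_starts_us[1] - observed_starts_us[0]
    for earlier, later in zip(observed_starts_us, observed_starts_us[1:]):
        if later - earlier != period:
            raise ValueError("observed accesses are not periodic")
    return SlotModel(period, observed_starts_us[0] % period, duration_us)


def transmission_allowed(model, start_us, length_us):
    """Monitoring phase: a transmission is legal only if it lies entirely
    inside the node's slot of the current round."""
    phase = (start_us - model.offset_us) % model.period_us
    return phase + length_us <= model.duration_us


model = learn_slot([1000, 5000, 9000], duration_us=300)
```

In this model a transmission starting at 13000 µs with a length of 300 µs is accepted, whereas one starting 100 µs later would run past the end of the slot and raise the error signal.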
5.1.3. Faults caused by the environment. For proper operation the node needs a proper environment. That is why monitoring of environmental conditions appears to be a basic requirement. On the other hand, a rigorous mapping of adverse operating conditions (like over-temperature, radiation, etc) into system failures leads to an oversensitive system (Steininger and Schweinzer 1991) with highly correlated node shut-downs. For this reason only the most essential mechanisms have been chosen for error detection on this level:
• A power supply monitor issues a warning in case of low supply voltage. Additionally, the 'power good' signal provided by many power supply units can be monitored, which usually allows a power failure to be detected early enough to take emergency measures. Moreover, each node is provided with an individual backup battery to reduce the probability of correlated power supply failures.
• For exchange of error information with peripheral devices there is an error input line and an error output line. This allows the inclusion of optional peripherals into the error handling concept.
5.1.4. Errors in processing and transmitting application data. The previously described error detection mechanisms focus on the various types of faults that are usually anticipated within a node and try to provide a proper mechanism for each class of such faults. Since it is almost impossible to provide a detection mechanism for each kind of fault, we use a second, completely different strategy to achieve fail-silent behaviour. The corresponding mechanisms are entirely implemented in software. They consider only the flow of application data through the system and try to protect both the transmission and the processing of this data by using redundancy. These mechanisms rely on the existence of the above-mentioned mechanisms because there are errors that cannot be detected by software mechanisms at all (e.g., behavioural errors) or at least not with a sufficient probability (e.g., permanent errors that affect both instances of redundant data or both redundant computations in the same way).
Basically, two mechanisms are necessary to detect errors during processing and transmission of application data (see figure 3):
• Each application message is provided with an end-to-end checksum (CRC) to protect it against any accidental change while being transmitted or stored in memory.
• Messages are processed twice in time redundancy by application processes of the same node. Each of the redundant processes verifies the checksums of all its input messages and finally creates output messages with valid checksums in the absence of failures. These checksums of the output messages are compared afterwards by a comparator process and any difference leads to a reset of the node.
A valid checksum guarantees for the reading process that the message was valid for some process at some time since the startup of the system, but does not guarantee that it is the message which the process expects to read. To avoid an accidental confusion of (valid) messages, each message is provided with a unique key. The key itself is not part of the message, but the sender and the receiver of the message know the key and incorporate it into the calculation of the message's checksum. Because we can assign a key only to each type but not to each instance of a message (we cannot limit the number of instances!), we have introduced an additional mechanism to be able to distinguish different instances of a message: the creation time of the message is incorporated into the checksum calculation in addition to the key. The scheduled invocation time of the process that creates the message is used as the creation time, since the operating system guarantees that this time is equal for both redundant instances of a process. The time-triggered nature of MARS allows the difference of the invocation times of sender and receiver to be calculated. Therefore, the receiver can determine the creation time of the message from its own invocation time. This value is then used to calculate the reference checksum.
Figure 3. Protecting the path of application data through the system.
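The checksum scheme described above can be sketched in Python as follows. The sketch assumes CRC-32 (via the standard `zlib` module) and a fixed binary encoding of key and creation time; the text only states that a CRC is used, so the polynomial, the field widths and all function names here are assumptions made for illustration.

```python
import struct
import zlib


def message_checksum(payload: bytes, key: int, creation_time: int) -> int:
    """CRC-32 over the payload followed by the message-type key and the
    creation time. Key and time are not transmitted; sender and receiver
    both know them. (Assumed encoding: 32-bit key, 64-bit time.)"""
    return zlib.crc32(payload + struct.pack(">IQ", key, creation_time))


def verify(payload, checksum, key, receiver_invocation, time_offset):
    """The receiver derives the expected creation time from its own
    scheduled invocation time and the statically known offset to the
    sender, then compares against the reference checksum."""
    expected_creation = receiver_invocation - time_offset
    return message_checksum(payload, key, expected_creation) == checksum


def node_must_reset(checksum_a: int, checksum_b: int) -> bool:
    """Comparator: the two redundant executions must yield identical
    output checksums; any difference leads to a reset of the node."""
    return checksum_a != checksum_b


# Hypothetical message of type 0x1234, created at scheduled time 10000.
crc = message_checksum(b"speed=42", key=0x1234, creation_time=10_000)
```

A receiver invoked at time 12000 with a known offset of 2000 accepts this message, whereas a receiver expecting a different message type or a later instance of the same type computes a different reference checksum and discards it.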
Error handling
The layered structure of system fault tolerance and the fail-silent assumption on the node level allow straightforward error handling. First of all, the error must be prevented from propagating to the environment (error confinement). Then the acquisition of data about the cause of the error must be possible to allow detection of permanent failures and collection of statistical data about the different error sources (fault diagnosis).
Error confinement.
Upon detection of an error, the erroneous node must be shut off from the rest of the system instantaneously. Since proper execution of software cannot be ensured in the event of an error, a hardware solution appears to be more reliable. Therefore, error confinement has been based mainly on hardware and the options of software for handling errors have been strongly restricted. The following mechanisms have been implemented to provide comprehensive error confinement:
• In order to minimize reaction delay, hardware ensures that each activation of an error detection mechanism (enabled by DIP-switches) immediately generates an interrupt. Software configuration capabilities are not provided, to avoid the risk of accidental masking.
• The Ethernet controller forms the only connection to the real-time network. It is locked by hardware as soon as any error is detected (Error signal in figure 4). Transmission is enabled only after the node is reset.
• The Time Slice Controller (TSC-OK signal in figure 4) blocks all transmissions to the real-time network unless they are made in the legal time slot. This ensures that undefined behaviour of the node resulting from undetected errors cannot upset the TDMA bus protocol.
• Behavioural errors of the application unit hardly affect the system because the communication unit represents an intelligent interface to the real-time network and detects them. The strict hardware separation of the application and communication units maximizes the probability that the communication unit will work properly and protect the system in this case.
Figure 4. Bus access control.
5.2.2. Fault diagnosis. Two levels of fault diagnosis can be distinguished: system level diagnosis (which node failed?) and node level diagnosis (why did the node fail?). From the system's point of view, node level diagnosis is important to get information about permanent node failures or frequent occurrences of some type of error at a specific node. For this reason the processor can read the state of the error detection mechanisms to decide which of them has been triggered. In order to provide this function even in case of a processor hang-up, the communication processor also has access to the error status of the application unit and vice versa. To allow convenient reading, the error signal is latched until the whole node is reset, regardless of the actual duration of the fault. This information together with the current system time can be stored in non-volatile memory (the EEPROM) and form a base for comparison with succeeding entries. By this means, the frequency of error occurrence can be determined and permanent node failures can be detected.
Another method that was implemented to detect permanent node failures is the execution of extensive self test routines both at startup time and in the background during run-time of the nodes.
Conclusion
During the design and implementation of the MARS processing nodes it became evident that an integral development of hardware and operating system is most advantageous in areas where a close cooperation of hardware and software mechanisms is needed to achieve a specific goal. Examples are time-stamping of messages, the time slice controller, the deterministic timing behaviour resulting from the joint timing analysis of hardware and software, and the supplementary mechanisms for error detection.
The processing nodes have been operating for more than a year at the Department of Real-Time Systems in Vienna. A number of students developed real-time applications, which helped us to verify our concepts, to find bugs and weak points in hardware and software, and to test our system in practice. It became apparent that a time-triggered system requires the support of sophisticated development tools, because a lot of work (e.g., scheduling) must be done before run-time. Our design tools (e.g., the scheduler, the maximum execution time analysis, the compiler) and the run-time environment are currently being integrated. This will considerably reduce the amount of work needed to design applications. Fault-injection experiments are being planned to evaluate the fail-silence property of the nodes. They will be conducted in cooperation with Chalmers University in Gothenburg, Sweden and with LAAS in Toulouse, France. These experiments will use the following mechanisms:
• electromagnetic interference,
• irradiating integrated circuits with heavy ions (Karlsson et al 1989a),
• pin level fault injection (Arlat et al 1990).
Based on the results of these experiments, we will improve the error detection mechanisms of our nodes. Some preliminary experiments have already shown a weak point within our software error detection mechanisms. This led to the new checksum generation mechanism, which incorporates a unique key and the message creation time into the checksum calculation.
