Abstract: The paper reports an industrial experiment performed jointly by LAAS-CNRS and Électricité de France (EdF for short) to assess the application of a formal method to the reverse engineering of (a part of) a fault-tolerant monitoring system designed for the control room of French N4 nuclear power plants. More specifically, the experiment is devoted to the formal specification and verification of the distributed scheduling algorithm managing the hot redundancy between the two computers composing the system, a single fault hypothesis being assumed for this function. The formal method used for the experiment is RT-LOTOS, a temporal extension of the LOTOS standard Formal Description Technique (FDT for short). The main motivation behind the experiment was to get a better understanding of the fault-tolerant features of the scheduling algorithm by means of both simulation and formal verification.
Introduction
The paper describes an experiment dealing with the formal specification and validation of a fault-tolerant monitoring system designed for the control room of French N4 nuclear power plants. The monitoring system has been designed to be transparent to a single failure, and it is composed of two computers in hot redundancy. Both machines, master and slave, process the same application inputs and monitor their internal errors. The master is in charge of scheduling the application processes and of emitting the application messages. In case of a single error, the faulty computer is isolated and the other one becomes master. A distributed scheduling algorithm has been devised for implementing this hot redundancy scheme.
Although non-critical for the safety of the nuclear power plant, the monitoring system has been recognized by EdF as representative enough for starting and supporting an experiment aiming at assessing the use of formal methods in the reverse engineering of a part of this system, namely the scheduling algorithm. The main expectations for the project were twofold: (i) to assess the feasibility of the reverse engineering process [CDH+96] starting from an analysis of the monitoring system source code, written in Ada, which has been implemented by a third party following the (informal) requirements of EdF; (ii) to better understand and assess the fault-tolerant capabilities of the scheduling algorithm under several faulty conditions. Three main requirements have been expressed by EdF for selecting a particular formal description technique for this study:
To have executable specifications to facilitate the reverse engineering process.
To represent the physical distribution of the monitoring system components, these components running asynchronously with respect to one another.
To specify a large and complex system made of several components; the method should therefore provide facilities for composing large specifications from simpler and possibly reusable components.
The previous requirements led to the choice of a process algebra. Assuming also that explicit time constraints had to be expressed, the availability of a complete environment, the RTL software tool, for validating (simulating and verifying) formal specifications was the main reason which led to the choice of RT-LOTOS (Real-Time LOTOS, a temporal extension of the LOTOS FDT).
The paper is organized as follows: Section 2 presents the monitoring system informal description. Section 3 presents how the reverse engineering process has been carried out. Section 4 describes the main capabilities of the RTL software tool. Section 5 presents some of the achieved validation results. Finally, some conclusions are drawn in a last section.
System informal specification
This section presents a brief and informal description of a part of the fault-tolerant monitoring system designed for the control room of French N4 nuclear power plants. This informal description relies on different documents (informal text and diagrams) provided by EdF for describing the system functionality. The textual documentation includes an overview of the monitoring system and an informal description of its functional modules. The diagrams illustrate the system functional decomposition, and include many state graphs, as well as Ada task flow charts.
System Overview
For plant availability reasons, the monitoring system has been designed to tolerate a single failure; it is made up of two computers, called CPC 1 (footnote 1) and CPC 2, in hot redundancy. One of these computers is considered to be in the master mode, since it is in charge of scheduling the application processes running on both computers and of supervising the messages exchanged by the application processes with their environment; the other computer is in the slave mode. The occurrence of a single failure in either of the two computers leads to the isolation of the faulty computer (it enters the isolated mode), the other computer entering the single master mode.
Synchronization messages are exchanged between both computers through bidirectional High-Speed Data channels (HSD for short). These messages make it possible for a computer to notify the other of a change in its mode. Other messages are exchanged through a dedicated network (the N2 network, not shown in Figure 1) between the computers and the environment. These messages are: (i) the stimuli received from the environment by both computers, and (ii) the event notifications sent by the computers to the environment.
The monitoring system functional decomposition is presented in Figure 1, where five main modules (Processes, Stimuli Manager, Scheduler, Service Manager and Watchdog) have been defined.
The Environment
The environment sends stimuli to the stimuli managers in both computers, each stimulus conveying information to be used for scheduling the application processes. This information includes the stimulus priority, the stimulus creation date, the identification of the application process to which the stimulus relates and the stimulus type (Activation Stimulus, Resume Stimulus or Allocation Stimulus). The master computer sends consistently, through a HSD channel, the stimulus it elected (i.e. the one corresponding to an execution [...]
Footnote 1: Central Processing Computer.
The event notifications sent to the environment are the following: (i) Normal Termination, sent by an application process when it terminates, (ii) Self Suspension, sent by an application process when it suspends itself, and (iii) Service Allocation, sent by the service manager when it allocates a service requested by an application process.
The environment basic behavior may be described as follows; for any application process:
An Activation Stimulus is sent after some random delay, following the reception of a Normal Termination event notification of this process.
A Resume Stimulus is sent after a fixed delay, following the reception of a Self Suspension event notification of this process.
An Allocation Stimulus is sent after a fixed delay, following the reception of a Service Allocation event notification of the service manager.
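As an illustration only, the environment behavior listed above can be sketched as a mapping from event notifications to stimuli. The label strings, the function name and the two delay parameters are assumptions introduced for this sketch; the EdF documents do not give their values here.

```python
import random

# Hypothetical labels: the notification and stimulus names are illustrative.
RESPONSES = {
    "normal_termination": "activation",
    "self_suspension": "resume",
    "service_allocation": "allocation",
}

def environment_response(notification, rng, fixed_delay=1.0, max_random_delay=5.0):
    """Return (stimulus_type, delay) for one event notification.

    fixed_delay and max_random_delay are assumed parameters of this sketch.
    """
    stimulus = RESPONSES[notification]
    if stimulus == "activation":
        # Activation Stimulus: sent after some random delay.
        delay = rng.uniform(0.0, max_random_delay)
    else:
        # Resume and Allocation Stimuli: sent after a fixed delay.
        delay = fixed_delay
    return stimulus, delay

rng = random.Random(0)
print(environment_response("self_suspension", rng))
print(environment_response("normal_termination", rng))
```

Passing an explicit random generator keeps simulated runs reproducible, which is convenient when comparing traces.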
System Functional Decomposition

The Processes
The application processes run on both computers in order to implement the hot redundancy scheme. Any application process is composed of two synchronizing Ada tasks: (i) a task called Process implementing application functions (it may suspend itself either at some predefined suspension point or when it requests some service from the service manager), and (ii) a task called Interface interfacing the process with the scheduler.
The Stimuli Manager
Each stimuli manager manages the stimuli by processing two queues. The Candidate Queue contains all the stimuli received from the environment awaiting to be elected; they are sorted by priority and age. The Elected Queue contains elected stimuli that have not yet been scheduled; on the master computer, it contains only the last elected stimulus; on the slave computer, it contains the stimuli elected by the master not yet scheduled by the slave. The master elects the oldest stimulus among those with the highest priority, i.e. the stimulus located at the head of its Candidate Queue. Whenever the slave receives a stimulus elected by the master, it puts it at the end of its Elected Queue and removes it from its Candidate Queue.
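The handling of the two queues can be sketched as follows. This is a minimal sketch, not the Ada implementation: the class and field names are hypothetical, a larger integer is assumed to mean a higher priority, and the slave is assumed to tolerate an elected stimulus that has not yet reached its own Candidate Queue.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Stimulus:
    priority: int      # assumption: larger value = higher priority
    created: float     # creation date
    process_id: int
    kind: str          # "activation", "resume" or "allocation"

class StimuliManager:
    def __init__(self, is_master):
        self.is_master = is_master
        self.candidates = []     # Candidate Queue, sorted by priority then age
        self.elected = deque()   # Elected Queue

    def receive(self, stimulus):
        """Store a stimulus coming from the environment."""
        self.candidates.append(stimulus)
        # Highest priority first; among equal priorities, oldest first.
        self.candidates.sort(key=lambda s: (-s.priority, s.created))

    def elect(self):
        """Master side: elect the head of the Candidate Queue."""
        assert self.is_master
        stimulus = self.candidates.pop(0)
        self.elected.clear()     # the master keeps only the last elected stimulus
        self.elected.append(stimulus)
        return stimulus

    def on_elected_by_master(self, stimulus):
        """Slave side: record a stimulus elected by the master."""
        assert not self.is_master
        if stimulus in self.candidates:   # assumption: may not have arrived yet
            self.candidates.remove(stimulus)
        self.elected.append(stimulus)

master, slave = StimuliManager(True), StimuliManager(False)
a = Stimulus(1, 0.0, 1, "activation")
b = Stimulus(2, 1.0, 2, "activation")
c = Stimulus(2, 0.5, 3, "resume")
for s in (a, b, c):
    master.receive(s)
    slave.receive(s)
first = master.elect()
slave.on_elected_by_master(first)
print(first.process_id, len(slave.candidates))
```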
The Scheduler
The scheduling mechanism elects one stimulus at a time and the scheduler starts one execution thread in a process; then, the scheduler idles and waits for the process termination or suspension, which finally leads to removing the stimulus from the Elected Queue.
The Service Manager
A service is a procedure that has only one executable instance, for instance a disk access. Several services are available. Requests to these services are managed by the Service Manager which is not further detailed here.
The Watchdog
The Watchdog task waits for the occurrence of events (either internal or external) that may lead to modifying the scheduling operating mode. Only internal events corresponding to failures detected inside the monitoring system have been considered. These failures are: (i) the dis-symmetry failure, when both computers do not receive the same stimuli from the N2 network, recovery being made by the Isolate action, (ii) the CPU overflow failure, recovery being made by the Isolate action, and (iii) the HSD channel failure, recovery being made by the HSD Failure action.
By the Isolate action, the computer, where the failure has been detected, orders its peer, through a HSD channel, to switch to the single master operating mode; it then disconnects itself from the N2 network and the HSD channels.
By the HSD Failure action, the master switches its operating mode to single master and orders its peer, through the N2 network, to switch to the isolated operating mode; the slave then waits some time for a message coming from the master; if there is no message, then it switches to the single master operating mode.
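The two recovery actions can be summarized by a small mode-switching sketch. The computer names, failure labels and function names below are hypothetical, and message transit over the HSD channels and the N2 network is abstracted away; the slave time-out is reduced to a boolean.

```python
from enum import Enum

class Mode(Enum):
    MASTER = "master"
    SLAVE = "slave"
    SINGLE_MASTER = "single master"
    ISOLATED = "isolated"

# Failures recovered by the Isolate action (labels are illustrative).
ISOLATE_FAILURES = {"dissymmetry", "cpu_overflow"}

def isolate(modes, detected_on):
    """Isolate action: the peer becomes single master, the detecting computer isolates itself."""
    return {
        name: Mode.ISOLATED if name == detected_on else Mode.SINGLE_MASTER
        for name in modes
    }

def hsd_failure(modes):
    """HSD Failure action, assuming the master's order reaches the slave in time."""
    return {
        name: Mode.SINGLE_MASTER if mode is Mode.MASTER else Mode.ISOLATED
        for name, mode in modes.items()
    }

def slave_mode_after_hsd_failure(message_received):
    """Slave side: without any message from the master, it becomes single master."""
    return Mode.ISOLATED if message_received else Mode.SINGLE_MASTER

def recover(modes, failure, detected_on):
    if failure in ISOLATE_FAILURES:
        return isolate(modes, detected_on)
    if failure == "hsd_failure":
        return hsd_failure(modes)
    raise ValueError(f"unknown failure: {failure}")

initial = {"CPC1": Mode.MASTER, "CPC2": Mode.SLAVE}
print(recover(initial, "cpu_overflow", "CPC1"))
```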
Required Properties
Two types of properties may be identified, depending on whether one considers the nominal behavior (without any internal failure) or a single failure situation.
Case of the nominal behavior
The property characterizing the correct expected behavior of the scheduling algorithm is expressed as follows:
Property 1 The stimuli sequence elected on the slave computer is identical to the one of the master, with the possible exception of the last stimulus elected on the master (which may not yet have been scheduled by the slave).
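Read literally, Property 1 is a predicate over the two election sequences. A minimal sketch of that predicate is given below; the function name is an assumption, and stimuli are represented by arbitrary comparable values.

```python
def satisfies_property_1(master_elected, slave_elected):
    """Check Property 1 on two election sequences.

    The slave sequence must equal the master sequence, or the master
    sequence without its last element (not yet scheduled by the slave).
    """
    master = list(master_elected)
    slave = list(slave_elected)
    if slave == master:
        return True
    return len(master) > 0 and slave == master[:-1]

print(satisfies_property_1(["s1", "s2", "s3"], ["s1", "s2"]))
print(satisfies_property_1(["s1", "s2", "s3"], ["s1", "s3"]))
```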
Case of a single failure
Let us consider a single failure situation together with the assumptions that the failures are sudden and total, and that the failure detection mechanism is fully reliable.
The property to be verified corresponds to the correctness of the operating mode switching. Assuming an initial configuration where CPC 1 is the master and CPC 2 the slave, the two following properties may be stated:
Property 2 CPC 1 switches from the master to the isolated operating mode, while CPC 2 switches from the slave to the single master operating mode, in case of an internal failure detected in CPC 1.
Property 3 CPC 1 switches from the master to the single master operating mode, while CPC 2 switches from the slave to the isolated operating mode, in case of an internal failure detected in CPC 2.
Conclusion
The specification of the monitoring system depends on several parameters, among them: the number of application processes to be scheduled on each computer, the number of services available on each computer, the number of sequential threads per application process, and several timing parameters such as the transit delays across the N2 network and the HSD channels, the duration of each application process thread and the duration of each available service.
System formal specification
The purpose of this section is to illustrate the main features of the design method that has been used for translating the informal specification introduced in Section 2 into a RT-LOTOS formal specification.
The design method is essentially based on the LOTOS design methodology developed within the European LotoSphere project [BvdLV95]. The key concept of the approach is the design trajectory. A design trajectory is made up of several design steps. Starting from an initial high-level specification expressed in LOTOS, the execution of each design step leads to refining the specification by using so-called transformations. Two of these transformations, known as the functionality decomposition and the functionality rearrangement [BvdLV95], are particularly useful for building complex specifications step by step. The same design method may be applied to RT-LOTOS since the difference between RT-LOTOS and LOTOS lies essentially at the level of the elementary action offering, but not at the level of the composition operators.
Applying this design approach has been relatively easy since the starting point of the re-engineering process was a functional architecture of the monitoring system (see the Ada tasks functional decomposition and the associated state graphs introduced in Section 2). 
Monitoring System Specification
The High-Level Specification
From the system functional architecture presented in Figure 1, two main entities may be identified, namely the monitoring system and its external environment. The monitoring system is itself composed of three modules corresponding respectively to the two CPC computers and to the bidirectional HSD channels. This high-level functional architecture may be immediately translated into the high-level RT-LOTOS specification architecture presented in Figure 2, where RT-LOTOS process instances (footnote 2) are represented by boxes and where synchronization gates are represented by small circles labeled by their name. Both computers may potentially have the same behavior; they therefore correspond to two instances of the same process definition, namely process CPC_Computer[...](mode:mode_type); the formal parameter mode defines the operating mode of each instance (master, slave, ...) (footnote 3).
Refining the High-level Specification
The next step consists in performing the refinement of process CPC_Computer following the functional decomposition of Figure 1. A RT-LOTOS process has been associated with each functional module, and the processes have been composed in parallel with a mandatory synchronization on their common gates, as illustrated in the architecture depicted in Figure 3.
Stimuli election is performed by synchronizing processes Scheduler [...]
In the informal specification, it is easy to distinguish the behavior and the data parts of the system. The behavior part results in a composition of RT-LOTOS processes using the parallel composition, the choice, ... operators. The data part describes the values (messages) exchanged between processes through the synchronization gates. Every message structure (stimulus, event notification, mode) defined for the monitoring system has been translated into a particular data type.
In standard LOTOS, the description of the data type signatures is completed by the definition of equations, expressed in the Act-One formalism, for providing the type semantics. For many reasons, related to the non-obvious industrial applicability of Act-One, only the data type signature is expressed [...]
Footnote 2: Depending on the context, the term process denotes either a process definition or a process instance.
Footnote 3: We assume an initial configuration where CPC 1 is the master and CPC 2 the slave.
Failure specification
Modeling (internal) failures of the monitoring system consists basically in introducing new behaviors in the formal specification of the system that lead, after some random delay, to the occurrence of an event characterizing a failure detection (remember that the failure detection mechanism is assumed to be fully reliable). Such an event will activate the associated recovery mechanism (i.e. the operating mode switching) described in Section 2.
Conclusion
The resulting specification comprises around fifty processes for a total of approximately one thousand RT-LOTOS lines (without the data type implementation). Each leaf process (defined at the bottom of the process hierarchy) is rather simple and corresponds merely to a state machine with a few symbolic states. This appears to be one of the most visible benefits of LOTOS-based approaches: highly complex concurrent behaviors are described by the stepwise composition of processes, which always become simpler when going down the hierarchy.
The RTL tool environment
The validation techniques implemented within the RTL tool may roughly be classified into two main categories:
Verification techniques for formally proving some property; the purpose here is to analyze a complete (finite) model of the (RT-)LOTOS specification; these techniques are not feasible when such a model cannot be computed (either because it is infinite or simply because it is too big for the available RAM).
Simulation techniques for observing some possible traces of the specification global behavior; the purpose here is to understand the global behavior of the specification better and to gain a certain level of confidence in the validity of some property (i.e. no trace has violated the property during the simulation runs, which should be as many and as long as possible).
RTL Verification Capabilities
The verification method implemented in the RTL software tool consists in translating a RT-LOTOS specification into a timed automaton model, on which reachability analysis is performed. The general way to proceed is not original, but the specific method implemented in RTL presents several advantages: (i) it minimizes the number of clocks in each control state of the timed automaton, thanks to the definition of the DTA (Dynamic Timed Automata) model, and (ii) reachability analysis is performed on the fly when generating the DTA model from the RT-LOTOS specification (see [CdO95] for details).
Both advantages are important from a practical point of view, since the complexity of verification algorithms developed for timed automata depends directly on the number of clocks [YL93]. The DTA model, initially developed for taking into account non-regular RT-LOTOS processes, has proven to be very efficient since it has drastically reduced the number of clocks to be defined in each control state of the model (from more than 20 clocks to between 0 and 5 clocks per control state for the present case study). Furthermore, reachability analysis does not require the underlying (untimed) LOTOS behavior to be finite, since it is performed on the fly.
RTL Simulation Capabilities
Besides its verification capabilities, RTL also provides several simulation capabilities. Although simulation cannot in any case be used for formally proving a property, it appears particularly useful for debugging a complex specification and/or gaining a good level of confidence in the satisfaction of some property. The observer approach (discussed in the next section) may still be used within a simulation framework, and becomes an interesting testing technique.
The trade-off between verification and simulation is very easy to understand. Depending on the available RAM (100 M-bytes in our case) and on the average size of a state representation in memory, one can easily estimate the maximal number of states that can potentially be produced before entering the swap zone. Many enhancements have been performed in RTL for drastically reducing the memory size of the state representations. For the present case study, this has reduced the representation of a state from 28 k-bytes to 2.2 k-bytes. As a consequence, around 45,000 different states can potentially be developed for this specification.
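The estimate quoted above follows from a simple division. The sketch below reproduces it under the assumption of decimal units (1 M-byte = 10^6 bytes, 1 k-byte = 10^3 bytes) and ignoring any other memory consumed by the tool; the function name is illustrative.

```python
def max_states(ram_bytes, state_bytes):
    """Upper bound on the number of states that fit in RAM before swapping."""
    return ram_bytes // state_bytes

RAM = 100 * 1000 * 1000          # 100 M-bytes, decimal units assumed

before = max_states(RAM, 28_000)  # 28 k-bytes per state
after = max_states(RAM, 2_200)    # 2.2 k-bytes per state

print(before, after)  # 3571 45454
```

Under these assumptions the reduced state representation raises the bound from a few thousand states to roughly 45,000, which matches the order of magnitude given in the text.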
The trade-off is therefore between (i) the simulation of the complete specification that has been produced by the re-engineering process, and (ii) the verification of a simplified specification derived from the original specification.
Validation of the scheduling algorithm
This section presents some results related to the simulation and verification of the monitoring system scheduling algorithm. It is organized as follows:
Simulation results are detailed first, with the purpose of illustrating the use of RTL (i) for performing the initial debugging of the complex specification produced by the re-engineering process, and (ii) for gaining a certain level of confidence in the validity of the required properties.
Verification results are presented next, on a simplified formal specification in order to be able to master the size of the induced state space; under these simplification assumptions, the desired properties of the scheduling algorithm have been formally proven.
Simulation of the Scheduling Algorithm
Using Simulation for Debugging the Specification
Simulation has extensively been used for debugging the specification. It has been particularly useful for identifying undesirable deadlock situations due to the incorrect specification of RT-LOTOS process synchronizations.
Debugging has essentially been achieved by the display of simulation event traces. From these event traces, scheduling diagrams (see Figure 4) have been produced. These diagrams display the interleaving of the process execution threads in both computers, and therefore make it possible to analyze the behavior of the scheduling algorithm. In this way, with minimal effort, it becomes possible to observe several parameters such as: the time required for executing a process, the load of the stimuli queues, the number of elected stimuli per period of time, ...
As an illustration, let us consider Figure 5 which shows the execution time of processes 1 and 2 on the master computer (CPC 1). Assuming that the priority of process 1 is higher than the one of process 2, it clearly appears that the execution time of process 1 is less than the execution time of process 2.
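Such measurements can be extracted mechanically from an event trace. The sketch below assumes a hypothetical trace format, a list of (time, process identifier, event) tuples with events "start" and "end"; the actual trace format produced by RTL is not described here.

```python
def execution_times(trace):
    """Return {process_id: [durations]} from (time, process_id, event) tuples.

    Assumed format: every "end" event follows a "start" event of the same process.
    """
    started = {}
    durations = {}
    for time, pid, event in trace:
        if event == "start":
            started[pid] = time
        elif event == "end":
            durations.setdefault(pid, []).append(time - started.pop(pid))
    return durations

# Illustrative trace: process 1 (higher priority) finishes before process 2.
trace = [(0, 1, "start"), (0, 2, "start"), (3, 1, "end"), (7, 2, "end")]
print(execution_times(trace))
```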
Expressing a Property by an Observer
A classical verification technique is related to the so-called observer (or tester) approach [BvdLV95]. Basically, observers are modules synchronizing themselves with the specification on some internal gates, and checking on-line whether some particular condition characterizing the violation of some property arises; in case of such a violation, the observer offers a specific error action. Proving that the property checked by the observer is valid therefore consists in showing that the error action is not reachable. The advantage of the technique is that it can easily be implemented; its main drawback is that it is less powerful than model checking. Assuming the time parameter configuration provided by EdF, no action error has been detected in these simulations.
Then, temporal values characterizing some delay of the slave computer when running the application processes were selected. Since the slave computer elected stimuli queue has by definition a limited capacity, too large a delay with respect to the master computer causes stimuli to be lost, leading to the occurrence of action error. In this way, it has been possible to identify a set of parameter values leading to an incorrect behavior of the scheduling algorithm. These parameter values have been analyzed by EdF and the third party software company in charge of the implementation of the scheduling algorithm. Several changes have been made in the monitoring system in order to overcome the (potential) error situation identified by these simulations.
In a similar way, observer processes have been developed for validating Property 2 and Property 3 of section 2. No error occurrence has been reported, validating consequently the mode switching mechanism.
Formal Verification of the Monitoring System
Validation by simulation can obviously not be considered as a formal proof, since it does not cover the complete specification state space. However, the simulation results have already provided some level of confidence in the specification quality and in the validity of the desired properties.
Using observers, the verification principle is simple, and corresponds to a standard reachability analysis. The same observers have been used for simulation and verification.
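The principle can be illustrated on a deliberately tiny model. The sketch below is not the RTL algorithm and does not handle time or clocks: it is a plain breadth-first exploration that generates successors on demand and stops as soon as the observer's error action is offered. The toy model, its state (the number of stimuli elected by the master but not yet scheduled by the slave) and the parameters capacity and max_lead are all assumptions made for the example.

```python
from collections import deque

def error_reachable(initial, successors):
    """Explore states on the fly; return (error_found, number_of_states_seen)."""
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        for action, nxt in successors(state):
            if action == "error":
                return True, len(seen)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False, len(seen)

def make_toy_model(capacity, max_lead):
    """Toy system composed with its observer.

    capacity: size of the slave's elected queue (assumed parameter).
    max_lead: how far the master may run ahead of the slave (assumed parameter).
    """
    def successors(lag):
        if lag > capacity:
            yield "error", lag           # observer: a stimulus would be lost
            return
        if lag < max_lead:
            yield "elect", lag + 1       # master elects one more stimulus
        if lag > 0:
            yield "schedule", lag - 1    # slave schedules one elected stimulus
    return successors

print(error_reachable(0, make_toy_model(capacity=2, max_lead=2)))  # (False, 3)
print(error_reachable(0, make_toy_model(capacity=2, max_lead=3)))  # (True, 4)
```

The same successor function could be driven by a random walk instead of an exhaustive search, which corresponds to reusing one observer for both simulation and verification.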
Simplification of the system specification
Verification by reachability analysis faces the classical state explosion problem [Hol93]. Several simplifications have been made to the specification:
The number of parallel components has been reduced by decreasing the number of application processes in each computer (from 4 for the simulation to only 2) and the number of services that may be requested by the application processes.
The specification has been simplified by withdrawing any behavior which did not directly affect the property to be verified.
The internal architecture of the specification has been simplified, by replacing a composition of processes by an equivalent unique process.
The value domain of some parameters has been reduced, and parameters that do not directly interfere with the scheduling algorithm have been removed.
Formal verification of Property 1
Various verification-oriented specifications have been derived from the initial formal specification (the one which has been intensively simulated). Each verification-oriented specification includes the observer process used for verifying the relevant property. Results related to the formal proof of Property 1 are summarized below.
Case I: RT-LOTOS processes involved in the mode switching mechanism have been removed, and only two application processes have been considered, without any service. Equal values have been considered for the durations of the execution threads in both computers (i.e. t_exec_min = t_exec_max), and a latency of 250 ms has been specified for the communications between the environment and the computers (N2_max - N2_min = 250); finally, the transmission delay of the HSD channels has been neglected (i.e. dHSD = 0). Under these assumptions, the complete reachability graph has been constructed (see details in Table 1), action error being not reachable in this graph, which formally proves Property 1 for this configuration.
Table 1: Property 1 verification results. Memory used (KB): 34300, 28764, 23544.
Case II: The same specification has been considered, plus the additional assumption that the slave computer (CPC 2) has a significant deterministic processing delay (i.e. 250 ms) with respect to its master computer. This situation leads to the violation of Property 1.
Case III: The same specification has been considered but with new timing parameters. The processing delay of the slave computer has been removed, and a latency of 10 ms has been introduced for characterizing the variability of the thread duration. The transmission delay of the HSD channels has been set to its nominal value, i.e. 10 ms. With these assumptions, the proof of Property 1 has been successful.
Many other verifications have been performed with different parameter sets. Due to the lack of space, they are not reported here.
Conclusion
The specification phase has been much simpler than initially expected; the re-engineering process has greatly been facilitated by the existence of Ada flow charts, state diagrams, ... The use of a LOTOS-based approach has also greatly simplified the specification development. This is largely due to the LOTOS general parallel composition operator with multi-way synchronization, which permits the specification of complex behaviors by the composition of much simpler ones.
The simulation phase has brought many more results than initially expected; many simulations have been conducted for debugging the initial specification, and then for validating the scheduling algorithm behavior with numerous parameter configurations. Error situations have been reported for some parameter values, and have been analyzed in depth by our industrial partners. The use of the observer approach within such a simulation framework has proven to be very simple and efficient.
The verification phase has been as difficult as initially expected; several improvements, not detailed here, have been made to the RTL tool during the project, most of them for reducing the number of bytes required for coding a RT-LOTOS state in memory; although verification results have been obtained only for simplified configurations of the monitoring system (when reducing the number of application processes, the number of threads or services, ...), the proposed approach has been validated on a complex industrial application.
One important lesson from this experience is the successful trade-off achieved between simulation and verification. Both have been carried out consistently and in cooperation, and not in isolation: (i) the verification-oriented specifications have been derived following a strict methodology from the complete formal specification; (ii) the observer approach has been used for both the simulation and the reachability analysis. As a consequence, errors in simulation have been better understood by analyzing reachability graphs, and conversely, reasons preventing the convergence of the reachability graph minimization have been better understood thanks to the simulation.
