In designing high assurance systems, the dependability goals are achieved through the adoption of several fault tolerance techniques. Unfortunately, their combined effect on the system cannot, in the general case, be derived by straightforward composition of the analyses of the stand-alone components, because of the mutual dependence of their controlling parameters. In this paper the assessment of the overall system dependability induced by such integrated fault tolerance organizations is carried out through a stochastic simulation approach. To this purpose, a few fault tolerant multiprocessor architectures, based on the integrated usage of standard error processing structures with a recently proposed diagnostic mechanism, called α-count, are selected and evaluated. The diagnostic mechanism gets its input (error signals) from the error processing mechanism, whose behaviour is in turn influenced by the rapidity and correctness with which α-count identifies permanently/intermittently faulty processors. The choice of the basic fault tolerance mechanisms to adopt, as well as of the reference system architecture, has been driven by the characteristics of the envisaged target applications: mainly, stringent dependability requirements, to be traded with adequate levels of performance and cost. The analysis has been focused on performability, which is an appropriate measure to evaluate whether a certain design is "better" than another from the point of view of both dependability and performance.
Introduction
A common approach to fulfilling high assurance requirements consists in introducing redundancy into the system, managed by fault tolerance techniques, namely error processing and fault treatment [1]. Error processing aims at removing errors from a computational state, possibly preventing failure; fault treatment aims at preventing faults from being activated again, i.e. from resulting in new errors. Dynamic error processing schemes (SCOP [2], stand-by sparing), which employ resources according to the actual fault manifestations, are particularly suited to cope efficiently with errors. On the fault treatment side, it has long been recognised that real systems experience mostly transient faults [3], which rapidly disappear; hence, removing a processor exhibiting a possibly transient faulty behaviour may unduly impair subsequent performance. In a recent paper [4] the same authors have presented a diagnostic mechanism based on a weighted count of the accumulated error scores of each processor, called α-count, designed to discriminate intermittent and permanent faults from low rate, low persistency transient faults. The mechanism, which draws on error signals available in the system, has been extensively analysed in terms of purposely defined performance figures; however, these figures are not directly related to classical dependability attributes.
Extensive studies on the development of a wide number of fault tolerance (FT) mechanisms have appeared in the literature, together with evaluation of the system benefits (in terms of various dependability attributes) deriving from the application of each single mechanism (e.g., [5] [6] [7] [8] ).
However, the pre-set system dependability is generally achieved through the adoption of more than one FT technique; unfortunately, their combined effect on the system cannot, in the general case, be derived by some sort of straightforward composition of the analyses of the stand-alone components. In fact, the mechanisms contributing to fault tolerance usually have strict relationships with one another, to better exploit the synergy derived from their combined usage.
The established relationships have, of course, to be properly accounted for when evaluating the overall system dependability figures; the resulting analysis is far more complex than if the mechanisms were independent from each other.
The contribution of this paper is mainly a comprehensive analysis of a few integrated fault tolerance organizations, to gain insights into the effects on the overall system dependability induced by such combined usage. Specifically, the α-count mechanism has been coupled with a few representatives of dynamic error processing schemes to build complete fault tolerance strategies.
The resulting fault tolerant organizations have been studied via simulation, to evaluate the performability gained by applying each individual strategy in the set. The envisaged target applications are characterized by high assurance requirements; however, they also call for adequate system organizations to trade the demand for high dependability figures against specific requirements on performance and cost. Therefore, the context of a multiprocessor architecture is assumed, since: i) owing to its natural redundancy, it is particularly suitable both to provide large raw computational power and to implement techniques to tolerate operational faults; ii) its regular structure will best benefit from the increasing level of device integration at low cost. Both features fit well the requirements of evolving fault-tolerant embedded systems. Preliminary results of this study have been presented in [9]; here a comprehensive report is given, with ample presentation and discussion of results.
The rest of the paper is organized as follows. Section 2 describes the logical structure of the assumed multiprocessor architecture, identifying the components in charge of fault tolerance activities. Then, Section 3 introduces the fault tolerant organizations, recalling the behaviour of the α-count mechanism for fault discrimination and briefly describing the error processing structures selected to be coupled with α-count. The simulation environment set up for the evaluation of the overall fault tolerant architectures is presented in Section 4, including the fault model and the quantities of interest under analysis. In Section 5, a number of representative system scenarios are defined and analysed, followed by a thorough discussion of the obtained simulation results.
Finally, conclusions are drawn in Section 6.
System Model
The architecture we assume is a symmetrical multiprocessor system (SMP) composed of N processors (or units), u_0, ..., u_{N-1}, which are considered atomic failure units. The architecture also includes components devoted to managing dependability aspects, namely: i) proper allocation of tasks to processors and removal of faulty processors, ii) error processing and judgement on processors' behaviour, and iii) identification of faulty processors to be consequently removed.
The logical architecture of our system is illustrated in Figure 1. Upon service request, the Planner component selects a certain number of processors based on the error processing strategy employed and makes available to them not just the input data but also the application software to run. Then, the task is executed by the selected processors. The correct execution of the same task with the same inputs on several processors, or by the same processor at different times, always yields the same output. The results produced by the processors involved in the execution of the same task are collected by the Error Management (EM) component, which selects the result to be delivered by applying an adjudication function, and either forwards it to the users or stores it in a stable storage to be used in subsequent computations. When dynamic error processing mechanisms are employed [2, 10], redundant execution of an application task might be performed in phases, where the execution of further copies of the application is conditional on the absence of an adjudged result in the current phase, as notified by the EM; this implies information exchange between EM and the Planner. EM also provides information to another component, the Diagnosis Mechanism (DM): for each redundant task execution EM delivers to DM a notification about the processor(s) that originated results disagreeing with the adjudicated output. Since most errors are caused by transient faults, DM is in charge of deciding whether the error frequency exhibited by a processor, and the attendant loss of computing resources to recover from the error's effects, are large enough to offset the performance benefits of keeping that processor on-line. Upon taking such a decision, DM tells the Planner that the processor must be removed from the system. The Planner then stops sending that processor any further service request.
Repair/replacement of faulty processors is not considered in this paper; therefore, the number of active processors in the system progressively shrinks, and we assume that the system ceases to be operative when the number of active processors decreases below a given threshold. Embedded systems that must operate for all their life in circumstances where no repair is admissible (for example, space-exploring vehicles) obviously fit the present system model. It also applies to a single mission of the much more popular mission-oriented systems, characterized by operative periods of preset duration alternating with off-line periods, where preventive maintenance is regularly carried out (e.g., in the case of totally autonomous battle-field vehicles).
Fault Tolerant Organizations
In the assumed architecture, fault tolerance is managed essentially by the EM and DM components. Four different organizations, obtained by coupling in turn four different error processing schemes with the diagnostic mechanism α-count, have been chosen for evaluation in the following Sections.
The Diagnosis Mechanism
The DM component is in charge of identifying processors to be taken off-line. Physical faults are classified as permanent, intermittent or transient [1, 3]. Permanent and intermittent faults are internal, persistent faults, while transient faults are caused by external conditions (e.g. storm-originated EMI), whose effects tend to disappear rapidly. Since real systems experience mostly transient faults, removing a processor exhibiting a transient faulty behaviour may unnecessarily harm the system performance and diminish its residual redundancy. Therefore, processors showing erroneous behaviour have to be cautiously treated, because of the high percentage of transient faults with respect to the overall fault manifestations. The α-count mechanism, used as DM, is a simple count-and-threshold scheme presented in [4, 11], which was designed to discriminate intermittent and permanent faults from low rate, low persistency transient faults.
α-count adds up signals caused by errors in a given processor, as time goes on, weighing down signals as they get older. If the rate of detected errors, filtered according to a tuning parameter, exceeds a tunable threshold, the processor is signalled to the Planner.
The basic formulation of the filtering function is the following:
$$\alpha_i^{(0)} = 0, \qquad \alpha_i^{(L)} = \begin{cases} \alpha_i^{(L-1)} \cdot K & \text{if } J_i^{(L)} = 0 \\ \alpha_i^{(L-1)} + 1 & \text{if } J_i^{(L)} = 1 \end{cases} \qquad (0 \le K \le 1)$$
where $J_i^{(L)}$ is the L-th judgement on processor u_i, with $J_i^{(L)} = 0$ if the processor is judged correct and $J_i^{(L)} = 1$ upon error, and K is a parameter to be set by the system designer.
When the value of $\alpha_i^{(L)}$ exceeds the given pre-set threshold T, u_i is diagnosed as affected by a dangerously frequent intermittent fault, and the consequent signal is issued. The values to be assigned to the parameters K and T so that the strategy works best, i.e., so that it recognizes faulty processors as soon as possible and lowers the probability of identifying healthy processors as faulty, depend on the expected frequency of permanent, intermittent and transient faults and on the probability of correct judgements of the error signalling mechanism used.
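The count-and-threshold behaviour just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, and the values K = 0.5 and T = 2.0 are arbitrary examples, not the settings used in the evaluation.

```python
from dataclasses import dataclass


@dataclass
class AlphaCount:
    """Count-and-threshold filter for one processor (illustrative sketch)."""
    K: float            # decay ratio, 0 <= K <= 1, applied on a correct judgement
    T: float            # threshold above which the processor is signalled
    score: float = 0.0  # current value of the count

    def update(self, error: bool) -> bool:
        """Feed one judgement; return True if the threshold is exceeded."""
        if error:
            self.score += 1.0      # each error adds one unit
        else:
            self.score *= self.K   # older errors are weighed down
        return self.score > self.T


def first_signal(judgements, K, T):
    """Index of the first judgement at which the threshold is crossed, or None."""
    counter = AlphaCount(K=K, T=T)
    for step, error in enumerate(judgements):
        if counter.update(error):
            return step
    return None


# Example inputs (arbitrary): True marks an erroneous judgement.
isolated = [True] + [False] * 10          # a single transient-like error
burst = [True, False, True, True, False]  # errors recurring at short distance

print(first_signal(isolated, K=0.5, T=2.0))           # isolated error decays away
print(first_signal(burst, K=0.5, T=2.0))              # recurring errors cross T
print(first_signal(isolated, K=0.5, T=0.5))           # T < 1: first error suffices
print(first_signal(burst, K=0.5, T=float("inf")))     # infinite T: never signalled
```

With a threshold below the unit increment every error triggers the signal, while an infinite threshold disables it; intermediate values trade prompt recognition of faulty processors against wrong removal of healthy ones.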
The behaviour of α-count has been analysed in [4, 9, 11] as a stand-alone mechanism. Two figures of merit have been evaluated: i) the time, D, between a (permanent or intermittent) fault occurrence in a processor u_i and its recognition by the threshold crossing of the pertinent count α_i; ii) the wasted time, NU, spent by a processor idled after being wrongly signalled as faulty (normalized to the expected processor's life). Note that in the time span measured by D a faulty processor is maintained in use and relied upon, because its condition has not yet been recognized, thus opening a window of vulnerability to catastrophic multiple fault occurrence.
In this paper we take advantage of the already performed analysis of α-count in choosing some parameter values in the ensuing evaluation.
The Error Manager
The EM component deals with processors' run-time errors. Some significant error-processing approaches are hereafter briefly recalled. We consider instances of dynamic error processing schemes which employ a relatively low level of redundancy; these solutions appear to be the most appropriate ones for our target high assurance applications, which require a good balance between dependability and performance (minimum usage of redundancy to favour performance, while assuring an acceptable level of dependability). The schemes are based on redundant execution followed by comparison (or, more generally, by adjudication, whenever hardware diversity is employed). As long as the occurring faults allow the adjudication of the correct result, errors generated by single processors are prevented from corrupting the output. This approach brings the advantage, in our context, of immediate identification of processors whose output disagrees with the adjudged one (i.e., processors to be notified to DM).
Self checking pair (SCP).
In the self checking structure, two replicas of the task are executed by two different processors, possibly at the same time, and their results are compared by the adjudicator. Consistent results are accepted and a "correct" judgement is attributed to both processors. If a disagreement is observed instead, a detected error is notified as the service output.
Then, a diagnostic routine is launched on each processor. If both diagnostics give the same output, whether "faulty" or "good", both processors are signalled as possibly faulty, since it is not possible to discern which one(s) actually failed. If instead only one processor recognizes itself as faulty, the other is not signalled to the DM. The fault coverage c_d of the diagnostic routines plays an important role in the overall efficacy of DM: intuitively, the higher c_d, the higher the probability of correct identification of faulty processors, and consequently the more precise the information transmitted to DM.
In case only one processor is identified as incorrect by the diagnostic routine, the computed output of the other "good" processor could optionally be assumed as the task result. A system configuration with this option enacted will be referred to as "Self-Checking Pair with Error Correction", or SCP-EC.
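As an illustration of the decision rules above, the following sketch adjudicates one self-checking pair execution. The function name, the processor labels "a" and "b", and the use of None to represent a detected error are assumptions made for this example only.

```python
def scp_adjudicate(result_a, result_b, diag_a_faulty, diag_b_faulty,
                   error_correction=False):
    """Adjudicate one self-checking pair execution (illustrative sketch).

    Returns (output, signalled): output is the delivered value, or None when
    a detected error is notified; signalled is the set of processors reported
    to DM. The diagnostic outcomes are consulted only upon a mismatch.
    """
    if result_a == result_b:
        # Consistent results are accepted; both processors are judged correct.
        # (Two identical erroneous results would pass undetected here.)
        return result_a, set()
    if diag_a_faulty == diag_b_faulty:
        # Same diagnostic outcome on both sides: cannot discern the culprit.
        return None, {"a", "b"}
    faulty = "a" if diag_a_faulty else "b"
    good_result = result_b if faulty == "a" else result_a
    # SCP-EC delivers the result of the "good" processor; plain SCP does not.
    output = good_result if error_correction else None
    return output, {faulty}


print(scp_adjudicate(7, 7, False, False))
print(scp_adjudicate(7, 8, False, False))
print(scp_adjudicate(7, 8, True, False))
print(scp_adjudicate(7, 8, True, False, error_correction=True))
```

The last two calls differ only in the error-correction option: the same processor is signalled in both, but only SCP-EC turns the mismatch into a delivered result.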
"2+1". This technique is a simple instance of SCOP (Self-Configuring Optimistic Programming) [2] , a class of error processing strategies based on the idea of using redundancy incrementally over a number of steps. "2+1" employs three replicas; its execution starts with two replicas executed on two different processors and their results are accepted if they prove in agreement; otherwise, the third replica is executed and its result examined by the adjudicator along with the other two. A correct result is then produced, provided that a single fault occurred, and the disagreeing processor is identified and signalled to DM. From its operational behaviour, "2+1" can also be seen as a "dynamic" variation of the traditional TMR ( [3] ) to improve efficiency.
"2+2". The third scheme considered, called "2+2", belongs to the same class as the "2+1" scheme. It is composed of four replicas, two of which are executed in a first phase. When two disagreeing results are observed, replicas are re-executed on two other processors, thus obtaining 4 results to build the judgement on. A result value is considered correct if agreed upon by a majority of at least two units.
"3+1". In this scheme four replicas are also employed, with a different arrangement: three of them are concurrently executed in the first phase; if a majority of at least two agreeing results is not obtained, the second phase is entered, and all the four values concur in the choice of the final output, as in the "2+2".
From the point of view of fault tolerance capability, "2+2" and "3+1", employing more redundancy, obviously are the best, followed by "2+1" and by SCP. In fact, "2+2" and "3+1" are able to provide correct results not only in presence of 0 or 1 units failed (as "2+1"), but also in those cases where two units failed providing different results.
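The phased behaviour of "2+1", "2+2" and "3+1" can be sketched with one generic routine, parameterized by the number of replicas run in the first phase. This is a minimal illustration, not the adjudicator used in the study: the function names are hypothetical, and treating a tie between two equally supported values as a detected error is an assumption.

```python
from collections import Counter


def adjudicate(values, quorum=2):
    """Return the value agreed upon by at least `quorum` replicas, or None.

    Assumption: a tie between two equally supported values is not adjudged.
    """
    ranked = Counter(values).most_common()
    top_value, top_votes = ranked[0]
    if top_votes < quorum:
        return None
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None
    return top_value


def run_phased(replicas, first_phase):
    """Run `first_phase` replicas, then the remaining ones only if needed.

    replicas: zero-argument callables, one per selected processor.
    Returns (result, executed, disagreeing), where disagreeing lists the
    indices of the replicas whose value differs from the adjudged result.
    """
    values = [replica() for replica in replicas[:first_phase]]
    result = adjudicate(values)
    if result is None and len(replicas) > first_phase:
        values += [replica() for replica in replicas[first_phase:]]
        result = adjudicate(values)
    disagreeing = [] if result is None else [
        index for index, value in enumerate(values) if value != result]
    return result, len(values), disagreeing


def const(value):
    """Hypothetical replica that always returns `value`."""
    return lambda: value


one_fault = [const(42), const(41), const(42), const(42)]
two_faults = [const(42), const(41), const(40), const(42)]

print(run_phased(one_fault[:3], first_phase=2))   # "2+1": third replica needed
print(run_phased(one_fault, first_phase=3))       # "3+1": adjudged in phase one
print(run_phased(two_faults, first_phase=2))      # "2+2": two different errors
```

In the last call two units fail with different values, and the four-replica scheme still adjudges the correct result while identifying both disagreeing processors.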
SCP exhibits fault tolerance only in its SCP-EC variant. This structure compares with the "2+1", as the diagnostic routine executed upon mismatch is a second phase of a sort, which offers added flexibility: its overall cost, both in terms of development and of execution time, can be modulated by design, in search of the best trade-off between fault coverage and probability of delivering erroneous results (the lower the catastrophic error probability, the higher the required coverage and cost).
The basic SCP has fault detection capabilities only, with the consequent higher rate of detected errors as compared to the other schemes. It exhibits, however, the lowest probability of undetected error (excluding the "3+1"), limited to the probability of having two coincident faults causing identical results.
Regarding the level of redundancy used, SCP is the most convenient (provided that simple and short diagnostic routines are used). The cost of executing additional replicas is paid by the "2+1" and "2+2" schemes only in the (rare) event that a fault is actually detected (SCOP is in fact an "optimistic" redundancy scheme). The "3+1" scheme pays always the cost of three executions in the first phase, in exchange of avoiding the need of the second phase for all single-error occurrences in the first one, which is instead entered by "2+1" and "2+2". From the point of view of simplicity of operation, SCP is the favourite one, since it does not require to track and synchronize the outputs of replicas from phase to phase. Then, the simpler overall organisation of SCP has to be weighed against the lower offered tolerance. The "3+1" is the most expensive in terms of employed resources; however, it is characterized by a short time consumption.
Moreover, it can deliver better fault-tolerance, provided it is equipped with a more sophisticated adjudicator (assuming that the adjudicator follows the three-out-of-three rule in the first phase, and the two-out-of-four in the second phase, two faults causing coincident errors are detected).
Evaluation
By coupling the diagnostic mechanism α-count with one of the error processing schemes introduced in the previous Section, a complete fault tolerant organization is obtained. Its evaluation is performed hereafter. The intricate relationships existing among the integrated mechanisms EM and DM, which depend, among other things, on their respective parameters, make the analytical evaluation quite hard. In fact, note that: i) a Stochastic Activity Network (SAN) describing a single α-count mechanism expands to several thousands of states; ii) each N-tuple of processors needs to be tracked individually in its behaviour, complete with the pertinent tuple of α-counts; iii) an N-tuple assigned a given task may be composed of different processors in different executions (e.g., because of faulty processors being replaced); a processor's history thus interleaves with that of the N-tuples. An analytical model of such a system would have millions or tens of millions of states; the necessary computational effort would be beyond our resources. Therefore we have resorted to stochastic simulation, which is a valid tool for the class of redundant configurations excluding ultra-dependable systems. We developed our own discrete event, artificially regenerative simulator, which has been designed to deal with the multiprocessor architecture described in Section 2.
Fault Assumptions
The assumptions that define the fault model adopted in the evaluation are the following:
1-Only hardware faults are considered. Faults affecting the links of our architecture, the stable storage, the Planner, EM and DM components are not considered in this study.
4-When two processors provide two erroneous results in executing the same task, their results will have an identical erroneous value with a probability q_d.
5-The diagnostic routines used by the SCP organizations never identify a healthy processor as faulty; therefore the coverage parameter c_d only refers to the likelihood that the diagnostics correctly recognize a faulty processor as faulty. In practice, this assumption holds for the diagnostic routines with which many current systems are equipped.
6-The outcomes of a redundant execution may be: i) success, i.e., the delivery of a correct result, ii) a detected error, detected either by comparison of redundant results or by a (reliable) watchdog timer catching timing errors, or iii) an undetected (or catastrophic) error (delivery of an erroneous result).
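To make the three outcomes concrete for the basic SCP, the sketch below computes their probabilities for a single execution. It assumes, purely for illustration, that each of the two processors independently yields an erroneous result with a hypothetical per-task probability p; neither p nor the independence assumption is taken from the fault model above, while q_d is the probability of identical erroneous values introduced in assumption 4.

```python
def scp_outcome_probabilities(p, q_d):
    """Outcome probabilities of one basic SCP execution (illustrative sketch).

    p:   hypothetical probability that one processor yields an erroneous
         result for the task (assumed independent between the two processors)
    q_d: probability that two erroneous results have an identical value
    """
    success = (1.0 - p) ** 2               # both results correct and equal
    undetected = p * p * q_d               # both wrong with identical values
    detected = 1.0 - success - undetected  # every other case is a mismatch
    return {"success": success, "detected": detected, "undetected": undetected}


# Arbitrary example values, chosen only to show the orders of magnitude.
probabilities = scp_outcome_probabilities(p=0.01, q_d=0.1)
for outcome, value in probabilities.items():
    print(f"{outcome}: {value:.6f}")
```

Under these assumptions the undetected-error probability scales with the square of p, which is why it stays small even when detected errors are comparatively frequent.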
Performability Measure
Our work is directed to real-world high assurance systems supporting critical applications for which reliability or safety is not the only or topmost design concern; performance (in terms of the number of error-free task executions in a given time/mission) and overall system cost are also ruling parameters. For such systems, a performability measure [6, 12, 13] is more appropriate to evaluate whether a certain design is "better" than another. Necessary to performability is the definition of a reward model, which ideally should take into account all factors intervening in the production of valuable results, or in operational costs. To limit the complexity of the analysis we use here, by way of example, two simple additive reward models which fit our mission-oriented systems, but do not capture all the utility factors of the computational schemes, e.g., the organizational simplicity of the SCP and the promptness of the "3+1".
In the first model, M_1, successful executions add one unit to the value of the mission; executions producing detected errors add a cost C_B; an undetected error aborts the mission (mission failure), and sets the reward to a negative value C_C. This reward model is appropriate, for instance, in tool control in manufacturing industry, where each iteration produces a unit of some product, the loss associated with an undetected error may be the stoppage of and/or some damage to the tool, and that of a detected error may be the production of an imperfect item. An alternative application scenario is a somewhat complex transaction-processing or scientific application, where detected errors can be recovered (at some cost); an undetected error destroys the value produced during the day, also implying additional costs (e.g., costs from erroneous services provided to clients, or from a rollback and rerun of the transactions at the end of the day, after some inconsistency has been detected by external means).
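A minimal sketch of how the reward of a single mission could be accumulated under M_1 follows. The function name and the outcome labels are hypothetical, and the sign convention is an assumption: C_B and C_C are passed as positive magnitudes, with C_B subtracted per detected error and the reward set to -C_C upon an undetected error.

```python
def reward_m1(outcomes, C_B, C_C):
    """Reward of one mission under model M_1 (illustrative sketch).

    outcomes: sequence of "success", "detected" or "undetected", in order
    C_B: cost of one detected error (assumed positive magnitude)
    C_C: cost of a mission failure (assumed positive magnitude)
    """
    reward = 0.0
    for outcome in outcomes:
        if outcome == "success":
            reward += 1.0        # one unit of value per correct execution
        elif outcome == "detected":
            reward -= C_B        # a detected error costs C_B
        elif outcome == "undetected":
            return -C_C          # the mission aborts; accumulated value is lost
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")
    return reward


completed = ["success"] * 5 + ["detected"]
aborted = ["success"] * 5 + ["undetected", "success"]
print(reward_m1(completed, C_B=2.0, C_C=100.0))
print(reward_m1(aborted, C_B=2.0, C_C=100.0))
```

The sketch covers only the per-execution outcomes described above; other causes of mission failure are outside its scope.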
Simulation settings
System characteristics adopted in the simulation experiments are here described. The system load has been assumed to be infinite, and the processors are used without any synchronisation or scheduling delays, to reach the highest parallelism permitted by the employed error-processing scheme. Table 1 summarizes the notation and values used for the parameters involved in the simulation.
Parameter description                                           Symbol   Value
Number of processors in the system                              N        10
Minimum number of processors for the system to be operational   N_min    6
Processors fault rate (per hour)                                         5E-04
Table 1. Parameters and values used in the simulation
The processors' fault rate has been chosen to be 5 × 10^-4 per hour; then, two different settings for p, i and t have been analysed, to investigate the sensitivity of the scheme to the occurrence of transient faults. Our multiprocessor system is supposed to run tasks having the same duration, T_t, which has been set to 600 sec. From the analysis of the α-count mechanism's behaviour performed in [4, 11], where the relation between the activation rate of intermittent faults and the decay ratio of the filtering function (K) has been studied, a significant pair of values has been selected.
The values just shown for the simulator settings are not intended to refer to any particular system; rather, some parameters, e.g. the fault rates, are representative of average values exhibited by a broad range of real systems, while others, such as the mission times and confidence level, have been chosen to trade off precision and significance of simulated results against computational resources.
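As a back-of-the-envelope reading of these settings, the short computation below relates the fault rate to the task duration. The variable names are hypothetical, and the probability of at least one fault per task assumes, for illustration only, exponentially distributed times between faults.

```python
import math

FAULT_RATE_PER_HOUR = 5e-4   # processors fault rate given in the text
TASK_SECONDS = 600           # task duration T_t given in the text

task_hours = TASK_SECONDS / 3600
expected_faults_per_task = FAULT_RATE_PER_HOUR * task_hours

# Assumption: exponential inter-arrival times, hence P = 1 - exp(-rate * time).
p_fault_per_task = -math.expm1(-expected_faults_per_task)

mean_hours_between_faults = 1 / FAULT_RATE_PER_HOUR
tasks_between_faults = mean_hours_between_faults / task_hours

print(f"expected faults per task:   {expected_faults_per_task:.3e}")
print(f"P(fault during one task):   {p_fault_per_task:.3e}")
print(f"mean hours between faults:  {mean_hours_between_faults:.0f}")
print(f"mean tasks between faults:  {tasks_between_faults:.0f}")
```

A single processor therefore executes on the order of ten thousand tasks between two consecutive faults, which indicates how rarely the second phase of a dynamic scheme would be needed.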
Simulation results
The results of the simulation experiments are now presented and discussed. In order to gain a deeper understanding of the mutual effects of the error processing and the α-count mechanisms composing each selected fault tolerant organization, three additional quantities other than the performability measure, which is the main focus of this work, have been evaluated. Indicating with FU a processor affected by a permanent/intermittent fault, we have evaluated: i) the average number of processors identified as FUs; ii) the average number of processors among those removed which are really FUs; and iii) the average number of healthy processors wrongly identified as FU, and so removed. Two scenarios have been studied, which differ in: a) the mission duration, b) the conditions which determine the mission failure, c) the reward structure and d) the ratio of transient to intermittent/permanent fault occurrences.
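The three diagnostic quantities i) to iii) above amount to simple set bookkeeping per mission, averaged over the simulated missions. The following sketch shows one possible way to compute them; the function names and the sample data are hypothetical and are not taken from the simulator.

```python
from statistics import mean


def diagnosis_counts(removed, faulty):
    """Per-mission counts (illustrative sketch).

    removed: processors identified as FUs and removed during the mission
    faulty:  processors actually affected by a permanent/intermittent fault
    Returns (identified, really_faulty, wrongly_removed).
    """
    removed, faulty = set(removed), set(faulty)
    return len(removed), len(removed & faulty), len(removed - faulty)


def average_counts(missions):
    """Average the three counts over a list of (removed, faulty) pairs."""
    per_mission = [diagnosis_counts(removed, faulty)
                   for removed, faulty in missions]
    return tuple(mean(column) for column in zip(*per_mission))


# Made-up missions: (set of removed processors, set of truly faulty ones).
missions = [({0, 3}, {3}), ({1}, {1, 2}), (set(), {5})]
identified, really_faulty, wrongly_removed = average_counts(missions)
print(identified, really_faulty, wrongly_removed)
```

By construction the first quantity is the sum of the other two, so reporting all three mainly helps in reading how the removals split between correct and wrong identifications.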
The coverage c_d of the diagnostic routine in the SCP scheme gives a further dimension to the analysis to be carried out, which is not present in the other schemes; examining extensively all the value combinations for the scheme's parameters could hamper the visibility of relevant results.
Therefore we first analysed the c_d parameter's effects on the SCP, with the aim of simplifying the overall study.
Study of the Self-Checking Pair in Terms of c_d
The effects of the diagnostic coverage c_d have been analysed in terms of a few significant figures of the system behaviour, namely the probability of mission success, the average number of healthy processors unduly removed and the performability.
The SCP-EC variant improves on the SCP in delivering a lower number of detected errors: when only one processor is tagged as faulty out of the two diagnostics run on the two processors, the result provided by the "good" one is delivered as a correct output. The downside of this option is that it comes with a higher probability of undetected error: because of the incomplete diagnostic coverage, the possibility arises that, in case of both processors failing with non-identical results, one of the twin processors does not recognize its own fault. In such a case, rare but to be accounted for, its result is wrongly assumed correct. This suggests using SCP-EC with an adequately high value of c_d.
The obtained results are reported in Table 2.
Scenario I: short missions and detected errors do not cause mission failure
In this scenario, T_M = 1500 hours and missions terminate with failure whenever an undetected error occurs or when fewer than N_min processors are left. The two fault tolerant organizations composed of SCP plus α-count and "2+1" plus α-count have been considered. Transient faults account for 80% of the total faults (t = 0.8), intermittent faults for 15% (i = 0.15), and permanent faults for 5% (p = 0.05). In Figure 3.a, the value assigned to C_C is zero, to stress the impact of the higher number of detected errors experienced by SCP. When benign errors have a relatively low associated cost, the SCP structure may be usefully adopted, because of its structural simplicity.
For growing values of C_B, the effect of the higher number of detected errors experienced by SCP with respect to "2+1" shows up: since the average number of detected errors for SCP is a few thousand times that of the "2+1", the performability obtained by the SCP drops sharply, whilst the "2+1" curve remains nearly flat above it. The behaviour of the SCP-EC configuration departs significantly from that of the simple SCP. Not unexpectedly, from the figures obtained for that scheme it appears more similar to the "2+1" than to the SCP. As already discussed in the previous section, the most prominent variation against the SCP is in the number of detected errors, which gets reduced by a factor of ten, all other parameters being the same. In our setting, this strong reduction brings notable advantages even for low values of C_B.
Figure 3. Performability vs. the cost C_B of detected errors
In the curves relative to SCP and "2+1", threshold values T = 1.1, T = 2.0 and T = 3.0 have been considered, which constitute the interesting range to appreciate the effect of the usage of the combined error processing and α-count mechanisms. In the SCP case, the sensitivity to variations of T is appreciable in the whole displayed range, especially for higher costs of detected errors.
Notably, the effect of performance inversion that clearly shows up is caused by the typical α-count behaviour: the curve with the larger value of T starts higher, only to get worse performability at increasing C_B costs, because faulty processors are kept longer in the system, raising the chance of detected errors. The "2+1" is not sensitive to the variation of T in this range.
To examine the effect of the cost C_C on performability, recall that such cost is incurred whenever the mission fails. Although not shown in the figures, the simulation results show that SCP and "2+1" compare well in reliability, as they achieve similar probabilities of mission success. The best figures are practically the same in both schemes, albeit obtained with different parameters: the "2+1" scores 0.9944 with T = 1.1, the SCP gets 0.993 with T = 3.0. Therefore, the effect of C_C on performability is comparable for both schemes, at least in a wide range of values (in our case, up to ~100,000). Of course, for higher values of C_C, even the small difference in the success probability would eventually become significant.
In Figure 3.b the extreme cases are shown, emulating the system behaviours in the absence of α-count. With T = 0.5, i.e. a value smaller than the elementary α-count increment of 1, even a single, solitary error gets the count over the threshold, causing the processor removal.
Conversely, an infinite value for T means that processors are never removed, no matter how many errors they incur. Figure 3.b has been obtained by using C_C = 50,000, to give a more complete view of the extreme situations (reasonably, a catastrophic error has a weight greater than that of a detected error). The first observation is that T = 0.5 always yields negative performability in both schemes, with worse values in the case of SCP. The unforgiving removal of processors, even upon a transient fault, causes most missions to terminate unsuccessfully because of depletion of processors (and SCP, because of the assumed value of diagnostic coverage, is more prone to point out processors as faulty). In the other extreme case, labelled T = INFIN in the figure, SCP gets better performability at low values of C_B. In fact, as in this case faulty processors are never removed, their continued presence in the system is less harmful for SCP than for "2+1" (the latter has more chance of encountering two faults during a task execution than the former because of the two phases, thus resulting in a higher probability of undetected error).
However, although the number of missions completed by SCP is higher than that of "2+1", the number of detected errors in completed missions is much higher for SCP than for "2+1". Figure 4 has been plotted following the same criteria adopted in Figure 3 to allow direct comparisons.
Looking at Figure 4, an improvement shown by both strategies can be immediately observed, as expected. In fact, the lower rate of permanent and intermittent faults increases the probability of successful missions. As shown in Figure 4.a, the trend of the curves remains essentially the same. In fact, as derived from the analysis of α-count, the ability of this mechanism to correctly discriminate faults highly depends on the difference between the rate of transient faults and the manifestation rate of intermittent faults: the higher this difference, the better the discrimination performed by α-count. The value chosen for t in the present case does not significantly affect such difference with respect to the first case. In Figure 4.b, while there is no appreciable difference with Figure 3.b for T = 0.5 (as expected, since the total failure rate has not been changed), a marked improvement can be observed in "2+1" with T = INFIN: the reduced presence of permanent/intermittent faults significantly diminishes the probability of executing the second phase for this strategy. Table 4 shows the behaviour of SCP and "2+1" with respect to the identification of faulty processors in the present case.
Numbers in Table 4 are smaller than the corresponding ones in Table 3 because of the smaller number of faulty processors now present in the system. However, the same comment made for Table 3 applies here.
Two sub-scenarios have been individually studied. In sub-scenario II.1 the fault tolerant organizations based on "2+1" and "2+2" have been considered: the goal is to see the effect on performability induced by the higher fault tolerance of the "2+2". Sub-scenario II.2 has the goal of comparing two schemes, i.e. "2+2" and "3+1", having the same fault tolerance but a different organization in resource usage.
With reference to sub-scenario II.1, Figure 5 illustrates the performability, based on the reward structure M 2 , at varying values of the cost of a mission failure C C , for T = 1.1 and T = 3.0. The behaviour in the extreme cases (as defined in Section 4.3) is illustrated by four curves, plotted for T = INFIN and T = 0.5 (the two curves with T = 0.5 overlap). The best results are obtained by "2+2" with T =1.1; while this is not surprising, since "2+2" has a higher probability of successful missions than "2+1", it should be noted that "2+2" is not always better than "2+1" in terms of performability: in fact, the curve describing "2+2" with T =3 is the lowest in the set.
The worsening of the performability at increasing values of T shown by both strategies makes explicit the conflicting effects of T on a) the (wrong) removal of healthy processors and b) the (risky) longer permanence of faulty processors in the system. In the general case, as discussed in [4, 11], increasing values of T decrease the number of wrongly removed healthy processors (as shown by Table 5), while the probability of erroneous computation tends to increase because faulty processors stay on-line longer. In the present setting, where wrong outputs (even if detected) cause costly mission failures, the second effect dominates. Moreover, since faulty units spend more time in the system as the threshold increases, the second phase, where "2+2" uses two processors instead of one, is entered more often: this contributes to the slightly better results of "2+1" with T =3.
With reference to sub-scenario II.2, Figure 6 illustrates the performability, based on the reward structure M 2 , in terms of the mission failure cost C C , for T = 1.1 and T = 3.0, as well as for the extreme cases T = 0.5 and T = INFIN. It is apparent from Figure 6 that, for low values of the failure cost C C , the "3+1" delivers roughly one-third less performability than the "2+2". This is a consequence of the fact that the "3+1" is handicapped by always having to run at least three replicas, while the reward model M 2 does not take into account the responsiveness, which is better in the "3+1".
Focusing on the effect of T , the figure shows that in the "3+1" the performability grows with T (in the value range examined), while in the "2+2" a degradation is observed when T =3.0 (as already pointed out when discussing scenario II.1). In fact, in this case, the longer permanence of faulty processors leads to a higher probability of mission failure for the "2+2" w.r.t. the "3+1" because of the higher probability of the former to enter the second phase, during which the scheme is further exposed to the occurrence of a second fault. The figure shows that, if the cost of the failure is high enough, the performability of the "2+2" gets even lower than that of the "3+1"; however, in our settings this observation is of no practical use, since this happens in a region where the performability is negative in both structures.
Table 6 completes the analysis of the fault tolerant organizations based on "2+2" and "3+1" w.r.t. the removal of processors. Because both strategies have the same fault tolerance capabilities, and consequently the same level of correctness of the judgements on processors issued to DM, the figures pertaining to the removed faulty processors shown by both strategies are almost the same (and, as expected, very good ones).
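As a rough illustration of why the threshold T matters, the sketch below implements a threshold-based error counter in the spirit of -count. The update rule (add 1 on each error signal, multiply by a decay factor otherwise), the decay value 0.9, and the function name are assumptions made for illustration and are not taken from this section; only the threshold values 0.5, 1.1, 3.0 and INFIN come from the text.

```python
import math


def steps_until_flagged(error_signals, threshold, decay=0.9):
    """Return the 0-based step at which the score first reaches the
    threshold, or None if the processor is never flagged.

    Assumed rule (illustrative only): the score grows by 1 on an error
    signal and is multiplied by `decay` on an error-free step.
    """
    score = 0.0
    for step, erred in enumerate(error_signals):
        if erred:
            score += 1.0
        else:
            score *= decay
        if score >= threshold:
            return step
    return None


# One isolated error (a transient-like pattern) followed by quiet steps.
transient = [False, True] + [False] * 8
# An error on every other step (an intermittent-like pattern).
intermittent = [True, False] * 5

for t in (0.5, 1.1, 3.0, math.inf):
    print(t,
          steps_until_flagged(transient, t),
          steps_until_flagged(intermittent, t))
```

Under these assumptions, a single isolated error causes removal only with T =0.5, whereas the recurring pattern is flagged at its third step with T =1.1 and only at its seventh step with T =3.0, and never with T =INFIN. This mirrors the trade-off discussed above: a higher threshold spares healthy processors hit by transients but keeps faulty ones on-line longer.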
Conclusions
This paper has contributed to the understanding of the effects of integrating fault tolerance mechanisms in a multiprocessor system, targeting high assurance applications, where dependability and performance are of great concern, thereby focusing on the performability measure.
Starting from known error processing and diagnostic mechanisms, a few fault tolerant multiprocessor organizations have been examined. The computational power offered by multiprocessors, and the recognition that the great majority of physical faults affecting processors has a transient nature, led us to couple simple instances of error processing schemes based on redundant execution with the threshold-based fault detector mechanism -count.
The analysis performed has shown how the combined use of error processing and -count mechanisms (well analysed in isolation in previous work) influences a measure which is descriptive of the entire system, namely the performability. A simulation approach has been adopted to overcome the difficulties stemming from the inter-dependencies of the selected fault tolerance mechanisms, which would result in a state number explosion in attempting analytical solutions. The early presentation made in [9] has been extended here with: i) ample reports and discussions of simulation data, ii) the inclusion of diagnostic routines in the SCP scheme (with detailed analysis of the effects of their coverage factor), iii) the addition of the "3+1" scheme, and iv) the analysis of sensitivity to the percentage of transient faults (in Scenario I).
The results of our simulations have given numerical evidence to qualitative behavioural forecasts. The conflicting effects of the threshold T on the (wrong) removal of healthy processors and on the risky longer permanence of faulty processors in the system, combined with the different fault signalling capabilities of the error processing schemes, have been clearly visualized. The utility of equipping the system with the -count mechanism has also been shown, by a direct comparison with extreme values of T that simulate simple fault treatment policies alternative to -count (processor removal at the first error detection, and processors never removed). Finally, an important outcome of our analysis is that the integration of fault tolerance mechanisms has to be carefully planned, as less-than-obvious results may otherwise be obtained. We have shown, in fact, that the intuitively "golden" choice of pairing -count with the best error processing scheme ("2+2") can actually perform worse than using the cheaper "2+1", as observed for T =3.
