This paper proposes two new scheduling strategies for TMR computer-controllers. Both strategies can tolerate correlated faults as well as independent faults. These strategies, TMR-R (TMR with Rotated task group) and TMR-Q (TMR with Quintuple computation), are developed using task grouping and assignment. To evaluate the reliability of these strategies, a discrete-time Markov model for control systems is devised. Reliability equations for TMR-R and TMR-Q are derived from the state transitions over sampling intervals, based on the Markov model. The reliability gain of these strategies is demonstrated by comparing them with a conventional TMR, using numerical analysis. The proposed strategies are anticipated to be useful for control systems operating in harsh environments, such as controllers of airplanes or nuclear power plants.
I. INTRODUCTION
Acronyms¹
CTF: correlated transient faults
EDCC: error detection and correction codes
EMI: electromagnetic interference
ITF: independent transient faults
TMR: triple-modular-redundancy strategy
TMR-C: conventional TMR
TMR-N: TMR with N computations
TMR-Q: TMR with Quintuple computation
TMR-QC: TMR with Quintuple and Concatenated computation
TMR-QS: TMR with Quintuple and Spaced computation
TMR-R: TMR with Rotated task group
TMR-RC: TMR with Rotated and Concatenated task group
TMR-RS: TMR with Rotated and Spaced task group

¹ The singular and plural of an acronym are always spelled the same.

Digital computers have become an essential part of real-time control systems; e.g., computers control spacecraft, nuclear reactors, power-distribution systems, and chemical plants. The role of computers becomes especially important in the control of life-critical systems such as airplanes or nuclear reactors, because a malfunction can lead to an enormous disaster. Computers in these life-critical control applications must be designed to meet stringent reliability specifications, which are often achieved by introducing hardware and software redundancies. The reliability of a control system over a mission is the probability that its entire critical workload executes successfully and on time over the mission period [1], [12].
Faults encountered by control systems are either permanent or transient. Permanent faults are caused mainly by hardware defects (e.g., component aging or breakage) and remain in the system until the component containing the fault is removed. Hardware redundancy is essential to cope with permanent faults. Transient faults are caused mostly by temporary changes in electrical or mechanical conditions and disappear after an active period [13]. As hardware manufacturing technology improves, permanent faults are gradually decreasing, and momentary malfunctions of computers due to transient faults have become the main cause of system failures. More than 90% of field failures are reported as being caused by transient faults [3]. Transient faults occurring independently in each processor are called ITF [2]. They are usually caused by internal factors. In contrast, transient faults affecting several processors (or components) simultaneously are called CTF [1], [11]; in the literature they are also called coincident faults resulting from a common cause [5] or dependent faults [2]. CTF are caused mostly by external factors such as EMI, and are especially important in aerospace [1], where the environment contains significant electromagnetic and elementary-particle radiation. For example, airborne computers can be disrupted by lightning strikes.
Transient faults can be handled by hardware, time, or information redundancies. TMR is one of the most popular hardware fault-tolerance methods: errors generated by any single faulty module are masked using a simple voter [6], [15]. TMR is useful for tolerating a single permanent fault or ITF. Research on TMR systems is documented in many papers [6], [8]-[10], [15]. In [3], [4], [6], retry techniques using time redundancy are introduced in TMR to tolerate ITF more effectively. To tolerate CTF, [2] suggests task staggering and resynchronization in a TMR system. Reference [5] suggests a method to eliminate or alleviate the effects of coincident faults by sequencing tasks on different modules in a TMR system. Reference [1] seeks the optimal configuration of duplexes and triplexes under CTF, ITF, and permanent faults. This paper continues this earlier work [1], [2], [5] directed at enabling controllers to tolerate CTF and ITF. Simple methods are proposed to enable TMR systems to tolerate ITF and CTF, where a fault is assumed to have a finite duration. Specifically, TMR-R and TMR-Q are developed using task grouping and assignment; grouped tasks are allocated for execution on 3 processors at different times. To evaluate the proposed strategies, a reliability model is devised for control systems characterized by transient faults having a finite duration, with no restriction on the number of occurrences (multiple faults can occur within a deadline). The numerical results obtained from the model show that these strategies are more reliable than the conventional TMR.

0018-9529/00$10.00 © 2000 IEEE
Section II introduces TMR-R and TMR-Q and investigates their properties. Section III describes the reliability model of TMR-R and TMR-Q under CTF and ITF. Section IV presents the numerical results.

TMR is widely used in systems that require high reliability and fault tolerance. In a conventional TMR system, 3 processors usually execute the same tasks at any instant and their results are voted on. Thus, conventional TMR can tolerate faults in 1 processor module through voting [5], [6], [8], [14], [15]. When TMR-C is exposed to an environment of CTF, all processors are likely to be disrupted, causing a TMR failure. Executing different tasks at any instant can alleviate the effect of CTF [5]. The task-scheduling strategies in this paper are based on executing the same tasks at different times on the 3 processors.
A. Basic Assumptions

1a. Sampling intervals (or periods) are the same for all tasks.
1b. The deadline of a task is equal to its sampling interval.
2. All tasks can be grouped into 3 independent task groups that have similar execution times.
3. Execution results for tasks are saved and voted on later.
4. The voter is fault free, and the voting overhead is negligible.
5. Faults always cause errors.
6. Correlated faults affect all 3 processors simultaneously.
7. Transient faults arrive according to a Poisson process with a given arrival rate, and recover with a given recovery rate.
8. Processors are fast enough to provide sufficient time redundancy.

Assumptions 1a and 1b mean that a TMR system fails when it cannot produce correct results within one sampling interval, i.e., the deadline. For assumption 2, the existence of independent task groups means that the execution order of the task groups can be changed. Hence, if two tasks are dependent, they should be put into the same task group. It might be difficult to make 3 independent task groups have similar execution times. Task groups with different execution times are nevertheless acceptable; they only need more redundant time, because the maximum execution time is used for every group.
When the 3 processors execute different tasks at any instant, it is not simple to determine when to vote on the execution results. To circumvent this difficulty, assumption 3 [5] is introduced; it is based on the proposition that the saved data can be made immune to faults during the wait for voting, through well-developed memory fault-tolerance schemes such as EDCC. Assumption 4 holds if the voter is simple and reliable [5]. Assumptions 5 and 6 are conservative; faults might not cause errors, and correlated faults might affect only 1 or 2 processors. Assumption 7 is common in many analyses [1]-[3], [5]. Assumption 8 is required for the TMR-Q (67% time redundancy) in this paper.

B. TMR-R and TMR-Q

Fig. 1 shows the schedule of TMR-C over 1 sampling interval with 3 independent task groups. Figs. 2 and 3 illustrate the TMR-R and TMR-Q. For the TMR-C, the 3 task groups are executed sequentially and identically on the 3 processors. For the TMR-R, however, task groups are allocated in a rotated order, as shown in Fig. 2. At any time, the 3 processors perform computations for different task groups. Two types of TMR-R are considered according to the allocation of idle times: TMR-RC and TMR-RS. No idle time is allocated between task groups for the TMR-RC, and equal idle times are allocated for the TMR-RS. Because each task group is executed 3 times in TMR-R, a 2-out-of-3 voter system is needed, just as for the TMR-C. In TMR-Q, the 3 task groups are allocated in 5 time-slots such that no two processors execute the same tasks at any instant, as shown in Fig. 3. Each task group is computed 5 times; thus a 3-out-of-5 voting system is possible using TMR-Q. Two types of TMR-Q, TMR-QC and TMR-QS, are also considered according to the allocation of idle times.
When a CTF occurs as marked in Figs. 1-3 (rounded box region), the TMR-C suffers a failure because all 3 executions of the same task group are corrupted by the fault. For the TMR-R and TMR-Q, however, only 1 execution of each task group is corrupted; thus, a system using either the TMR-R or the TMR-Q survives this CTF. If the duration of the CTF is increased (dotted region in Figs. 2 and 3; e.g., the fault duration is increased to 0.1), the TMR-RC fails, but the TMR-RS does not. Two executions of each task group are corrupted by the CTF when the TMR-RC is used, whereas only 1 execution of each task group is corrupted when the TMR-RS is used. Thus, the idle time between task groups can alleviate the effect of a long fault duration. Even under the increased fault duration, neither the TMR-QC nor the TMR-QS fails, because the TMR-Q guarantees successful execution even when two executions of each task group are corrupted in a sampling interval. As with the TMR-R, the TMR-QS is anticipated to be more reliable than the TMR-QC for a long CTF, because of the idle times associated with the strategy. For ITF, the TMR-Q is anticipated to be more reliable than the TMR-R and TMR-C because of its 3-out-of-5 voting system. The effect of idle times is not obvious under ITF. These effects are investigated in Section IV using numerical results.
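The effect of a correlated fault on the different allocations can be illustrated with a minimal Python sketch. The rotation rule `(p + s) % 3` is an assumption chosen only because it satisfies the stated constraints (no two processors execute the same task group in the same slot, and each group is executed 3 times over 3 slots or 5 times over 5 slots); the exact allocations of Figs. 2 and 3 may differ. The group labels `A`, `B`, `C` and all function names are hypothetical, and idle times are not modeled.

```python
from collections import Counter

GROUPS = ["A", "B", "C"]  # hypothetical labels for the 3 independent task groups


def conventional_schedule(slots=3):
    """TMR-C-like schedule: all 3 processors execute the same group in each slot."""
    return [[GROUPS[s % 3] for s in range(slots)] for _ in range(3)]


def rotated_schedule(slots):
    """Assumed rotation rule: processor p executes group (p + s) mod 3 in slot s.

    slots=3 gives a TMR-R-like schedule, slots=5 a TMR-Q-like schedule.
    """
    return [[GROUPS[(p + s) % 3] for s in range(slots)] for p in range(3)]


def corrupted_counts(schedule, faulty_slots):
    """Corrupted executions per group when a correlated fault hits all
    processors during the given slots."""
    counts = {g: 0 for g in GROUPS}
    for row in schedule:
        for s in faulty_slots:
            counts[row[s]] += 1
    return counts


def survives(schedule, faulty_slots):
    """Majority voting succeeds if fewer than half of the executions of
    every group are corrupted."""
    total = Counter(g for row in schedule for g in row)
    bad = corrupted_counts(schedule, faulty_slots)
    return all(2 * bad[g] < total[g] for g in GROUPS)


tmr_c = conventional_schedule()
tmr_r = rotated_schedule(3)
tmr_q = rotated_schedule(5)

print(survives(tmr_c, [0]))      # short fault covering one slot
print(survives(tmr_r, [0]))
print(survives(tmr_r, [0, 1]))   # longer fault covering two adjacent slots
print(survives(tmr_q, [0, 1]))
```

A fault that covers two adjacent slots stands in here for the concatenated case; inserting idle time between slots, as in the spaced variants, would reduce the number of slots that a fault of the same duration can cover.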
One way of implementing the strategies in this paper is as follows. Each processor has a memory area (safe-memory) to store voting data. The area can be part of general memory, which the central processing unit (CPU) accesses with memory fault-tolerance schemes such as EDCC. At the end of each task execution, each processor stores results from the task execution in its own safe-memory (assumption 3). Then the results are transferred to a voter in a predetermined order to synchronize with other processors when all results are available. For the TMR-C, the results are transferred to a voter at the end of each task.
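The store-then-vote flow described above might be sketched as follows. This is only an illustration under stated assumptions: `SafeMemory` is a hypothetical stand-in for the EDCC-protected memory area (no error coding is modeled), results are assumed to be hashable values, and the voter is a plain strict-majority vote that works for both the 2-out-of-3 and the 3-out-of-5 case.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of executions, or None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if 2 * count > len(results) else None


class SafeMemory:
    """Hypothetical stand-in for the protected memory area of one processor."""

    def __init__(self):
        self._results = {}

    def store(self, group, value):
        """Save the result of one execution of a task group."""
        self._results.setdefault(group, []).append(value)

    def results(self, group):
        return list(self._results.get(group, []))


def vote_all(memories, groups):
    """Collect the saved results of every processor and vote per task group."""
    return {
        g: majority_vote([v for m in memories for v in m.results(g)])
        for g in groups
    }


# Usage: 3 processors, one execution of group "A" corrupted on processor 2.
memories = [SafeMemory() for _ in range(3)]
for index, memory in enumerate(memories):
    memory.store("A", 42 if index != 2 else -1)
    memory.store("B", 7)
print(vote_all(memories, ["A", "B"]))
```

Voting only after all results are saved keeps the voter independent of the order in which the processors execute the task groups, which is the point of assumption 3.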
Some reliability degradation due to the extra complexity required to implement these strategies is inevitable, although it can be reduced considerably by implementing proven techniques for the EDCC and voter system. These strategies can be extended to a general TMR-N, where N is odd for majority voting. For TMR-N, each task group is allocated in N time-slots with the same allocation rule as that used for the TMR-R and TMR-Q. However, the TMR-N might not be useful in practical applications, because it requires more redundant time and a more complex voter, i.e., an (N+1)/2-out-of-N voter. For N ≥ 7, the required redundant time is more than 133% of the original execution time.
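The cost of the general strategy can be checked with a short calculation. The sketch below assumes that N slots replace the 3 slots of the conventional schedule, so the redundant time is (N − 3)/3 of the original execution time; this reproduces the 67% quoted for TMR-Q and the 133% bound for N = 7. The function name is hypothetical.

```python
def tmr_n_parameters(n):
    """Return (votes needed for a majority, redundant time as a fraction of
    the original execution time) for an N-slot strategy with 3 task groups."""
    if n < 3 or n % 2 == 0:
        raise ValueError("N must be odd and at least 3")
    return (n + 1) // 2, (n - 3) / 3


for n in (3, 5, 7):
    votes, extra = tmr_n_parameters(n)
    print(f"N={n}: {votes}-out-of-{n} voter, redundant time {extra:.0%}")
```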
III. RELIABILITY ANALYSIS
A. Reliability Model for Control Systems
Faults are often classified into 2 categories: permanent and transient. This paper focuses on transient faults. Computer control systems read sensor values, compute control inputs using suitable algorithms, and output the results to plants at every fixed sampling interval. A computer control system operating with transient faults can be modeled by a 2-state discrete-time Markov chain that evolves with the sampling times. Fig. 4 shows the Markov model. State 0 denotes healthy (processor up or fault free); state 1 denotes faulty (down or under fault) at a sampling time. Because transient faults occur according to a Poisson process and disappear at a given recovery rate, the state probabilities follow the standard result for a two-state process [7], and the transition probabilities of Fig. 4 are derived from the arrival rate and the recovery rate.
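For reference, the transition probabilities of such a two-state model can be computed as in the sketch below. It uses the standard closed form for a two-state continuous-time Markov process observed at fixed intervals; the symbols `lam` (fault arrival rate), `mu` (recovery rate), and `t` (sampling interval) are notation assumed here, and it is an assumption that this form matches the paper's expressions term by term.

```python
import math


def two_state_transition(lam, mu, t):
    """Return P with P[i][j] = Pr(state j at time t | state i at time 0),
    where state 0 is healthy and state 1 is faulty.

    lam: fault arrival rate (0 -> 1), mu: recovery rate (1 -> 0); lam + mu > 0.
    """
    total = lam + mu
    decay = math.exp(-total * t)
    return [
        [mu / total + lam / total * decay, lam / total * (1.0 - decay)],
        [mu / total * (1.0 - decay), lam / total + mu / total * decay],
    ]


# Hypothetical rates, expressed per sampling interval (t = 1).
for row in two_state_transition(lam=0.01, mu=10.0, t=1.0):
    print(row)
```

As the interval grows, both rows approach the stationary distribution (mu/(lam+mu), lam/(lam+mu)), so a high recovery rate makes the state at one sampling time nearly independent of the state at the previous one.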
B. Reliability under CTF
Because CTF affect all processors simultaneously, a TMR system can be considered as a single-processor system under CTF. Fig. 6 represents the reliability model of a TMR system under CTF.
• State 0' means that all processors are healthy (fault free) at the sampling time and control is successful (the TMR system produces correct results).
• State 1' means that the 3 processors are faulty (under fault) at the sampling time, but control is successful.
• The failure state means that the TMR system has failed due to incorrect results.
To evaluate the state transition probabilities, a sampling interval composed of n frames is considered, where a frame has an execution subinterval and an idle subinterval, as shown in Fig. 7. All execution subintervals have the same length, which is the maximum of the execution times of the 3 task groups. All idle subintervals except the last have the same length; the last idle subinterval can have a different length. In this reliability analysis, if any fault occurs within the execution subinterval of a task group, it is presumed that all the tasks in that group are corrupted, i.e., the worst case of fault occurrence is assumed.
Evaluation of control system reliability under transient faults involves some complex computations. The procedure is represented by:

Fig. 7. A sampling interval with n execution and idle subintervals.
The derivation of , is:
two's complement of
• One kind of term represents the probability that the execution subinterval of a frame is corrupted by transient faults.
• The other kind of term represents the probability that the execution subinterval is fault free.

Since n frames can be thought of as one frame plus the other n − 1 frames, the transition probability over n frames is calculated recursively from a frame transition probability and the transition probability over n − 1 frames. One sampling interval corresponding to the TMR-C, TMR-R, or TMR-Q under CTF can be treated as the sampling interval in Fig. 7 with the parameters in Table I. The transition probabilities in Fig. 6 are defined through the number of 1's in the corruption pattern of the frames, which must not exceed the allowable maximum number of corrupted task groups for successful control: 0 for the TMR-C, 1 for the TMR-R, and 2 for the TMR-Q.
The Markov model of Fig. 6 can then be used to obtain the state transition probabilities over successive sampling intervals. The reliability at a given sampling time is the probability that the system has not entered the failure state by that time. A fault-free state at the initial time has been assumed.
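The frame-by-frame recursion and the resulting reliability under CTF can be sketched as follows. This is a minimal illustration, not the paper's exact equations: the symbols `lam` and `mu`, the function names, and all numeric values are assumptions, and the frame layouts stand in for the parameters of Table I. An execution subinterval is counted as fault free only if the system is healthy at its start and no fault arrives during it; the idle subintervals evolve according to the two-state model.

```python
import math


def transition(lam, mu, t):
    """Two-state transition matrix over time t (0 healthy, 1 faulty)."""
    total = lam + mu
    decay = math.exp(-total * t)
    return [[mu / total + lam / total * decay, lam / total * (1 - decay)],
            [mu / total * (1 - decay), lam / total + mu / total * decay]]


def interval_success_matrix(lam, mu, exec_len, idle_lens, max_corrupt):
    """S[i][j] = Pr(end in state j with at most max_corrupt corrupted execution
    subintervals | start in state i); one frame per entry of idle_lens."""
    exec_p = transition(lam, mu, exec_len)
    clean = math.exp(-lam * exec_len)  # no fault arrives during an execution
    success = []
    for start in (0, 1):
        # dist[state][corrupted] = probability mass still counted as success
        dist = [[0.0] * (max_corrupt + 1) for _ in range(2)]
        dist[start][0] = 1.0
        for idle in idle_lens:
            after_exec = [[0.0] * (max_corrupt + 1) for _ in range(2)]
            for s in (0, 1):
                for c in range(max_corrupt + 1):
                    p = dist[s][c]
                    if s == 0:
                        after_exec[0][c] += p * clean  # fault-free execution
                    if c < max_corrupt:  # corrupted, but still tolerable
                        for j in (0, 1):
                            q = exec_p[s][j] - (clean if (s, j) == (0, 0) else 0.0)
                            after_exec[j][c + 1] += p * max(q, 0.0)
            idle_p = transition(lam, mu, idle)
            dist = [[sum(after_exec[s][c] * idle_p[s][j] for s in (0, 1))
                     for c in range(max_corrupt + 1)] for j in (0, 1)]
        success.append([sum(dist[j]) for j in (0, 1)])
    return success


def reliability(success, k):
    """Probability of no failure over k sampling intervals, starting fault free."""
    v = [1.0, 0.0]
    for _ in range(k):
        v = [v[0] * success[0][j] + v[1] * success[1][j] for j in (0, 1)]
    return sum(v)


# Hypothetical parameters; the sampling interval is normalized to 1.
lam, mu, e = 0.01, 10.0, 0.1
layouts = {
    "TMR-C": ([0.0, 0.0, 0.7], 0),
    "TMR-RC": ([0.0, 0.0, 0.7], 1),
    "TMR-RS": ([0.2, 0.2, 0.3], 1),
    "TMR-QC": ([0.0, 0.0, 0.0, 0.0, 0.5], 2),
    "TMR-QS": ([0.1, 0.1, 0.1, 0.1, 0.1], 2),
}
for name, (idles, m) in layouts.items():
    s = interval_success_matrix(lam, mu, e, idles, m)
    print(name, reliability(s, 1000))
```

The mass that exceeds the allowed number of corrupted subintervals is simply dropped, which plays the role of the absorbing failure state; the reliability is the mass that remains after k intervals.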
C. Reliability under ITF
Because a single controller processor is modeled by a 2-state Markov chain (Fig. 4), the model of a TMR system under ITF should have 2^3 = 8 states and 1 failure state. However, if all 3 processors have the same software/hardware components and configurations, the number of states can be reduced to 4, because all processors have the same fault occurrence and recovery rates. Fig. 8 shows the reduced model.
In Fig. 8, each non-failure state is labeled by the number of faulty processors at the sampling time, and the TMR system produces correct results in these states; the remaining state represents failure of the system due to incorrect results.
The transition probabilities of Fig. 8 are obtained by multiplying the transition probabilities of the 3 single processors. The state of each processor is 0 or 1, as defined in Fig. 4, and the sum of the 3 processor states is the state of the TMR system under ITF. Each state of the TMR system under ITF therefore corresponds to a set of combinations of the 3 processor states.
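The state reduction can be illustrated with a short sketch that multiplies the single-processor transition probabilities and groups the 8 joint states by their number of faulty processors. It covers only this grouping step over a time span `t`; the success criterion on corrupted task groups is not modeled, and the symbols `lam`, `mu` and the function names are assumptions.

```python
import math
from itertools import product


def transition(lam, mu, t):
    """Two-state transition matrix over time t (0 healthy, 1 faulty)."""
    total = lam + mu
    decay = math.exp(-total * t)
    return [[mu / total + lam / total * decay, lam / total * (1 - decay)],
            [mu / total * (1 - decay), lam / total + mu / total * decay]]


def lumped_transition(lam, mu, t):
    """Q[a][b] = Pr(b faulty processors after time t | a faulty at the start)
    for 3 identical processors that fail and recover independently."""
    p = transition(lam, mu, t)
    q = [[0.0] * 4 for _ in range(4)]
    for a in range(4):
        # Any joint state with a faulty processors is equivalent because the
        # processors are identical, so one representative is enough.
        start = (1,) * a + (0,) * (3 - a)
        for end in product((0, 1), repeat=3):
            prob = 1.0
            for before, after in zip(start, end):
                prob *= p[before][after]
            q[a][sum(end)] += prob
    return q


# Hypothetical rates, expressed per sampling interval (t = 1).
for row in lumped_transition(lam=0.01, mu=10.0, t=1.0):
    print(row)
```

Grouping by the number of faulty processors is valid only because all processors share the same rates; with different rates per processor the full set of joint states would be needed.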
In this model, the states of the task groups in each processor are tracked. For a successful transition (successful control), the number of corrupted executions of the same task group must not exceed the allowable maximum. Table II gives the parameters for the TMR-C, TMR-R, and TMR-Q. The reliability equation can be obtained in a similar way to that for CTF.

IV. NUMERICAL RESULTS

Fig. 9 shows how TMR system reliability varies with the fault arrival rate and the recovery rate under CTF. As the duration of the fault increases (i.e., the recovery rate decreases), strategies with long idle times between task groups become more effective than those with shorter idle times, because the probability of 2 or 3 task groups being corrupted by CTF becomes smaller. Thus, the TMR-RS has a higher reliability than the TMR-QS for a small recovery rate. However, as the duration of the fault decreases (i.e., the recovery rate increases), the idle times between task groups become less influential; thus TMR-QC/QS have a higher reliability than TMR-RC/RS. The order of reliability of the TMR scheduling strategies under CTF therefore differs between the two regimes, even though the order cannot be distinguished in parts of Fig. 9(a) owing to only slight differences between strategies. Because the idle time between task groups has no effect on reliability when faults have a negligible duration (instantaneous faults), the reliabilities of TMR-RS and TMR-QS approach those of the TMR-RC and TMR-QC as the recovery rate grows without bound.

Fig. 9(b) shows the reliability of each strategy for a range of fault arrival rates. For a small arrival rate, which is usually the real case, the order of reliability of TMR-C, TMR-RC/RS, and TMR-QC/QS depends largely on the recovery rate; the order does not change if the recovery rate is fixed. However, for very frequent faults, when the probability of 2 or more faults in a sampling interval is large, the order depends on both rates.

For the TMR-Q under ITF, the TMR-QC is superior to the TMR-QS over part of the parameter range in the simulation example of Fig. 10. For the TMR-R, the TMR-RC is superior to the TMR-RS over the whole range. In Fig. 10(b), similarly to the system under CTF, the order of reliability for a small arrival rate depends largely on the recovery rate.
In summary, as anticipated in Section II, the TMR-QC/QS and TMR-RC are more reliable than the TMR-C under CTF and ITF. The TMR-RS has a higher reliability than the TMR-C under CTF, but a lower reliability under ITF.
The strategies in this paper can be compared qualitatively with the sequencing strategy of [5]. The sequencing strategy is similar to TMR-R, but totally different from TMR-Q. The maximum distance between the same tasks (task distance) in the sequencing strategy is greater than that of TMR-RC, but less than that of TMR-RS.
For CTF, the longer the task distance, the more reliable the TMR system is. For ITF, it is anticipated that the reliability of the sequencing strategy is similar to that of the TMR-RC, but higher than the TMR-RS, because there is no idle time between tasks in the sequencing strategy.
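In conclusion, based on the above simulation results and the properties of the sequencing strategy, the qualitative comparisons in terms of reliability are summarized here.
• For CTF: TMR-RS is more reliable than the sequencing strategy, which is more reliable than TMR-RC, which in turn is more reliable than TMR-C; the position of TMR-QC/QS relative to these strategies depends on the fault duration.
• For ITF: the sequencing strategy is similar to TMR-RC, and both are more reliable than TMR-C, which is more reliable than TMR-RS; TMR-QC/QS are also more reliable than TMR-C.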
