I. INTRODUCTION
WITH the advance of very large scale integration (VLSI) technology, large multiprocessor systems are becoming available on the commercial market at reasonable prices. These multiprocessor systems are ideal candidates for providing high-performance, scalable computing services for a broad range of time- and safety-critical applications. Typical examples of these applications include radar signal processing, command and control centers, and on-line optimization of telecommunication and urban transportation networks [2], [3]. In these applications, computational tasks should be completed before their deadlines to avoid catastrophe or any significant degradation of the quality of system service. Faults that occur in the computing system should therefore be quickly detected and isolated in such a way that normal computation can be resumed in a bounded time.
Most failure recovery strategies for distributed systems are static in nature, and thus are not very efficient for multiprocessor systems [4], [5]. Moreover, commercial fault-tolerant multiprocessor systems are usually designed for transaction-oriented applications (e.g., [6], [7], [8], [9]) which do not consider deadline constraints and thus are not well suited for time-critical applications. For its simplicity and high fault coverage, the hardware-based majority voting architecture is widely accepted for the design of real-time, non-stop fault-tolerant systems [10], [11], [12], [13]. In this architecture, a functional unit is replicated N times, forming an N-modular redundant (NMR) unit, and copies of each critical program are executed in lock-step by redundant units. The NMR architecture is very flexible, and it can be integrated with different fault detection/correction mechanisms. For example, in the Hitachi FT-6100 system [13], a single-error-correction/double-error-detection (SEC/DED) code is used in its main memory, but its central processor unit is protected with an NMR architecture.

[Author footnote: ... College Station, TX 77843-3112; e-mail: liu@cs.tamu.edu. ... Computer Science, University of Michigan, Ann Arbor, MI 48109-2122.]
Several NMR-based fault-tolerant multiprocessor systems have been built. Some of the most notable systems include the FTMP [10], the Fault-Tolerant Processor (FTP) [14], and C.vmp [15]. Flexibility (in providing fault tolerance) and performance are the two main issues associated with the implementation of an NMR-based multiprocessor system. The NMR architecture is conceptually simple, but its implementation may have a major impact on system performance. In an NMR system, the clocks of the N redundant units need to be synchronized with one another so as to ensure lock-step execution of copies of a program on the redundant units. Because of its cost-effectiveness, the All Digital Phase-Locked Loop (ADPLL) [16] technique is widely used, e.g., in the FTP and the Transputer [17] architectures; it can synchronize the N modules at a clock rate lower than that of the processors' running clocks. For the clocks of the redundant modules to be phase-locked to each other by an ADPLL, the phase difference of the N clocks is compared at a fixed time interval of every k pulses of the running clocks, called one synchronization cycle. Using the phase-comparison results of the synchronization cycles, the running clocks are adjusted, by addition/deletion of clock pulses, to compensate for the phase difference of the synchronization cycles. This way the phase difference between the N clock sources can be upper-bounded by a physical time margin S. Unlike the conventional analog PLL technology, the ADPLL can be implemented with digital circuitry, and it has good tolerance to the time skew between redundant units.
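The pulse addition/deletion scheme described above can be sketched as a small simulation. This is only an illustrative model, not the circuit used in the paper: the function name `simulate_adpll`, the use of the median edge as the common reference, and the rounding rule are all assumptions. The clock periods (164, 170, and 176 time units) and k = 16 are borrowed from the experimental setup reported later in the paper purely as sample values.

```python
def simulate_adpll(periods, k=16, cycles=200, correct=True):
    """Return the worst skew between synchronization edges over the run.

    periods: pulse period of each redundant clock (arbitrary time units)
    k:       nominal number of running-clock pulses per synchronization cycle
    correct: if False, clocks free-run so the skew grows without bound
    """
    edges = [0.0] * len(periods)      # time of each clock's latest sync edge
    worst = 0.0
    for _ in range(cycles):
        # Assumed reference: the median of the current sync edges.
        ref = sorted(edges)[len(edges) // 2]
        for i, period in enumerate(periods):
            pulses = k
            if correct:
                # A clock ahead of the reference adds pulses (positive term);
                # a clock behind it deletes pulses (negative term).
                pulses += round((ref - edges[i]) / period)
            edges[i] += pulses * period
        worst = max(worst, max(edges) - min(edges))
    return worst


if __name__ == "__main__":
    print("free-running skew:", simulate_adpll([164, 170, 176], correct=False))
    print("corrected skew   :", simulate_adpll([164, 170, 176]))
```

In this model each correction leaves a residual of at most half a pulse per clock, so the skew stays below k times the largest period difference plus roughly one pulse period, whereas the free-running skew grows linearly with the number of cycles.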
Despite its cost-effectiveness, a major problem associated with the ADPLL-based NMR architecture is that the peak voting frequency of redundant units in a conventional voter is constrained by S. To overcome this problem, Parhami proposed a pipelined, multiple-stage cellular voter architecture, called voting networks, to support different majority voting rules [1]. We will call a triple modular redundancy (TMR) voting network a pipelined voter (PV). A PV consists of a majority voter at its output and a set of input buffers, each of which is associated with a status bit to indicate the readiness of data. For their flexibility, ADPLL and PV will be used in our study as a basis for the implementation of an NMR-based multiprocessor system.

0018-9340/95$04.00 © 1995 IEEE. IEEE TRANSACTIONS ON COMPUTERS, VOL. 44, NO. 4, APRIL 1995.

Two main issues, state recovery and reconfiguration, need to be considered for the design of a gracefully degradable multiprocessor system. A large multiprocessor system may consist of hundreds of processors. Therefore, time efficiency is a main concern for fault recovery of large-scale multiprocessor systems. The time required to restore the processor state is relatively small. However, if the size of main memory is very large (as is usually the case), then the time spent on majority voting for memory-state recovery can be substantial. To overcome this problem, a memory paging technique was proposed [18], so that only the memory pages being modified need to be voted on for state realignment. An important question left unanswered in [18] is "What is the optimal page size for state realignment?" To address this question, we develop an efficient algorithm to derive the optimal memory page size.
To deal with the system reconfiguration issue, we use dynamic reconfiguration, which can maintain a maximal number of fault-free redundant modules in the system. To fully utilize the architectural features of contemporary multiprocessor systems, we propose two supporting mechanisms, both of which are considered part of the system's reliability hard core, for the implementation of an NMR-based multiprocessor system. The first mechanism is the monitoring-at-transmission (MAT) bus for cost-effective implementation of the PV. The other is the dynamic reconfiguration network (DRN) for dynamic reconfiguration of the clocking and control signals of redundant units. The MAT bus can be implemented with wired-OR/AND logic or other similar techniques, and the MAT feature can be readily found in some existing commercial products. On a MAT bus, each of the NMR processor modules monitors the bus transaction when it outputs data to the bus. If an inconsistency is detected between the bus value and the output value of a functional unit, then a fault is detected, and the computation is interrupted to recover from the fault. The MAT bus is well suited for the widely used cluster-based multiprocessor architecture. A processor cluster consists of a set of processors interconnected via a broadcast bus, and the processor clusters are interconnected via a different system network. A DRN can be implemented with ADPLLs and multiplexor-demultiplexor circuits. When some processor modules fail, the failed modules can be decoupled from the DRN, and the remaining fault-free units can be regrouped into new NMR units. For simplicity, we will use the TMR model as an illustrative example throughout the paper.
The rest of the paper is organized as follows. The dynamically reconfigurable architecture and the algorithm for optimizing the memory-page size are described in Section II. Experimental results and performance evaluations of the proposed architecture are presented in Section III. Concluding remarks are made in Section IV.
II. SYSTEM ARCHITECTURE

A. System Organization
Processor modules, each of which consists of a processor and its cache memory, and memory modules are the two basic functional units considered in our design, as in most commercial systems [19]. In a processor triad, three processor modules are synchronized with one another, and all the external data writes are voted on by the PV to mask faults in the processor modules. As in the C.vmp architecture [15], the PV is placed between the cache and main memory to form a processor triad (see Fig. 1) so that both the write-back and write-through cache coherence protocols can be incorporated. Coordination between the PV and cache memory is described by the flowchart in Fig. 2.
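As a rough illustration of how a PV masks a faulty write, the following sketch models three FIFO input buffers whose "ready bit" is simply whether the buffer holds a word, followed by a 2-out-of-3 majority vote. The class name `PipelinedVoter`, its methods, and the default depth of 8 words (taken from the simulated PV in the experiments) are assumptions made for this example, not the paper's hardware design.

```python
from collections import Counter, deque


class PipelinedVoter:
    """Toy TMR pipelined voter: one FIFO per processor module."""

    def __init__(self, depth=8):
        self.depth = depth
        self.buffers = [deque() for _ in range(3)]

    def push(self, module, word):
        """Queue a write from one module; False means the buffer is full."""
        if len(self.buffers[module]) >= self.depth:
            return False              # caller would have to stall
        self.buffers[module].append(word)
        return True

    def ready(self):
        """All three ready bits set, i.e. every buffer holds a word."""
        return all(self.buffers)

    def vote(self):
        """Vote on the oldest word of each buffer.

        Returns None if some copy is not ready yet, otherwise a pair
        (majority value, list of modules that disagreed).
        """
        if not self.ready():
            return None
        words = [buf.popleft() for buf in self.buffers]
        value, count = Counter(words).most_common(1)[0]
        if count < 2:
            raise ValueError("no majority among the three copies")
        suspects = [m for m, w in enumerate(words) if w != value]
        return value, suspects


if __name__ == "__main__":
    pv = PipelinedVoter()
    pv.push(0, 0x5A)
    pv.push(1, 0x5A)
    print(pv.vote())                  # one copy still missing
    pv.push(2, 0x5B)                  # module 2 delivers a corrupted word
    print(pv.vote())
```

Because the buffers decouple the arrival of the three copies, a module that is slightly ahead can keep computing while the voter waits for the slowest copy.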
The main difference between the MAT-bus voter and a PV is that, on the MAT-bus voter, an inconsistency between redundant processors is detected, not masked, in each data vote. For processor modules in a processor triad to take a vote on their outputs, they first place data into their buffers and then make a transmission request to the bus arbiter. The bus arbiter can grant the bus transmission after all the ready bits in the same triad are asserted, and all processor modules will then monitor the bus transmission. If any processor detects an inconsistency between data on the bus and its own output value, the processor invalidates the transaction, and the data voting will be retried. If the inconsistency remains even after several retries, a permanent fault is assumed detected and the processor reconfiguration procedure will be invoked.
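The detect-and-retry behavior of the MAT bus described above can be sketched as follows. The sketch assumes a wired-OR bus (the bus value is the bitwise OR of all driven outputs), models each processor module as a hypothetical callable that returns its output word, and uses an arbitrary retry limit; none of these details are specified in the paper.

```python
def mat_bus_write(modules, max_retries=3):
    """Attempt one voted write on a wired-OR MAT bus.

    modules: callables, each returning the word its module drives
    Returns ("ok", bus_value, retries_used) on a consistent transfer, or
    ("reconfigure", None, max_retries) if the mismatch persists.
    """
    for attempt in range(max_retries + 1):
        outputs = [drive() for drive in modules]
        bus = 0
        for word in outputs:
            bus |= word               # wired-OR of everything on the bus
        # Each module compares the bus value with its own output.
        if all(word == bus for word in outputs):
            return ("ok", bus, attempt)
        # Some module saw a mismatch: invalidate the transaction and retry.
    return ("reconfigure", None, max_retries)


if __name__ == "__main__":
    good = lambda: 0x2A
    glitch = iter([0x2B, 0x2A])       # wrong once, then correct
    print(mat_bus_write([good, good, good]))
    print(mat_bus_write([good, good, lambda: next(glitch)]))
    print(mat_bus_write([good, good, lambda: 0x2B]))
```

Note that in this model a mismatch only reveals that the redundant outputs disagree; it neither masks the error nor identifies the faulty module, which is why persistent mismatches hand control to the reconfiguration procedure.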
In the regrouping process of fault-free processors on the MAT bus, fault-free processors may need to be first decoupled from their own processor triads, so that they can later be coupled with other fault-free processors to form new processor triads. After fault-free processors are dynamically grouped into triads, their control and clock signals need to be phase-locked into each other. The dynamic reconfiguration network (DRN) is designed to serve this purpose. A DRN consists of a set of DRN modules (DRNM) connected to one another forming a logical ring. Each DRNM has five access ports, three of which are directly connected to three processor modules, and the other two are connected to other DRNMs on the ring. There is a three-input ADPLL circuit in each DRNM for synchronization of the clocking signals from any three of the five access ports. When a processor cluster is free of any faults, processor modules directly connected to a DRNM are synchronized with one another on the DRNM to form a processor triad. However, once a fault is detected in a processor triad, the failed processor module will be disconnected from the DRNM associated with the processor triad. Then, the DRNM will coordinate with other DRNMs to form new processor triads after the clock and control signals of the non-faulty processor modules are properly grouped with each other by the reconfiguration algorithm to be discussed shortly. Fig. 3 
B. System Reconfiguration
The DRN architecture is managed by the neighborhood-grouping reconfiguration algorithm, which is guaranteed to find a maximal number of processor triads on a processor cluster. Let $f_1, f_2, f_3, \ldots, f_n \in \{0, 1, 2, 3\}$ denote the numbers of fault-free processor modules on the n DRNMs of a cluster; then we can get $\left\lfloor \sum_{i=1}^{n} f_i / 3 \right\rfloor$ processor triads using this algorithm.
For convenience, a DRNM with i fault-free processor modules connected to its access ports is called an i-DRNM. Essentially, the neighborhood-grouping algorithm is based on a first-fit principle for processors on 2-DRNMs and 1-DRNMs to be grouped with each other along a ring direction. An error signal connected to every DRNM through a wired-OR broadcast line is used to inform every DRNM about the initiation and termination of a reconfiguration process. A reconfiguration process is initiated by the DRNM which detects the latest failed processor module, by asserting the error signal. The reconfiguration process is terminated when the error signal is reset. Once the error signal is asserted by a DRNM, all the 2-DRNMs and 1-DRNMs connected on the ring will participate in the reconfiguration process. Since a DRNM can accept at most two external clock sources, no triad can be formed on 0-DRNMs, and thus 0-DRNMs do not participate in the reconfiguration process. Moreover, including 3-DRNMs in the reconfiguration process will add extra overhead without gaining any flexibility, because if any processor in a 3-DRNM is to be grouped with other fault-free processors, then we will need a new fault-free processor from another processor triad to make up the loss.

Fig. 2. The cache coherence protocols with PV (panel: Write Back Cache Operation).
For convenience, the DRNMs participating in the reconfiguration process are relabeled along the ring direction as $D_1, D_2, D_3, \ldots, D_h$, $h \ge 1$, where $D_1$ is the coordinator, the outgoing ring port of $D_i$ is connected to the incoming ring port of $D_{i+1}$, and the outgoing ring port of $D_h$ is connected to the incoming ring port of $D_1$. Any DRNM between $D_i$ and $D_{i+1}$ is in the bypass mode. Three different types of messages, invite, join, and done, are passed between DRNMs along the ring direction: $D_i$ sends its messages to $D_{i+1}$ and receives messages from $D_{i-1}$ through these ring ports. The invite-message is used by a message sender to indicate that it needs one more processor module to form a new triad. The join-message is used by a message sender to indicate that it has only one processor module, which can be combined with other processor modules to form a new triad. Finally, the done-message is used to indicate that all the processor modules between the initiator and the message sender have been combined into triads.

$D_i$, $i \ge 1$, responds to the three types of messages based on the following rules.

• invite-message: $D_i$ grants the request and routes the clock-control signal of a non-faulty processor module to $D_{i-1}$ through its ring port. If $D_i$ does not have any more non-faulty processor modules, then it passes a done-message to $D_{i+1}$.

• join-message: 1) If $D_i$ is a 1-DRNM, then it sends an invite-message to $D_{i+1}$ and waits for the response from $D_{i+1}$. If the request is granted, a triad can be formed on $D_{i+1}$. Otherwise, if the error signal is reset, then the reconfiguration process is aborted without forming any triad. 2) If $D_i$ is a 2-DRNM, then it accepts the join-request from $D_{i-1}$ to form a new triad, and it will send a done-message to $D_{i+1}$.

• done-message: If $D_i$ is a 2-DRNM, then it sends an invite-message to $D_{i+1}$. Otherwise, it sends a join-message to $D_{i+1}$.

When the reconfiguration process is initiated, $D_1$ sends an invite-message to $D_2$ if it is a 2-DRNM; otherwise, it sends a join-message to $D_2$ if it is a 1-DRNM. A new triad can be formed on $D_1$ if the invite-message is granted, and the clock-control signals (of a non-faulty processor module) arriving at the ring port of $D_1$ will be synchronized with those of the two non-faulty processor modules on $D_1$. Similarly, if the join-message of $D_1$ is granted, then the only non-faulty processor on $D_1$ can be grouped with other processor modules to form a new triad on some other DRNM. After $D_i$, $i \ge 1$, sends out a request, it waits for a response from $D_{i+1}$. New triads can thus be formed either on the requesting DRNM (when its invite-message is granted) or on another DRNM (when its join-message is granted).

We now show that the neighborhood-grouping algorithm is deadlock-free based on the following simple argument. As mentioned earlier, both the MAT bus and the DRN are reliability hard cores, implying that DRNMs will not fail during the reconfiguration process. In the neighborhood-grouping algorithm, each of the DRNMs participating in the reconfiguration process receives a message from, and generates a message to, its neighbors along the same ring direction. Thus, the reconfiguration initiator will receive a message and will terminate the reconfiguration process in at most n steps, and all processors will proceed with their subsequent computational steps.
LIU AND SHIN: EFFICIENT IMPLEMENTATION TECHNIQUES FOR GRACEFULLY DEGRADABLE MULTIPROCESSOR SYSTEMS
The process for reconfiguring processor modules is illustrated with an example plotted in Fig. 4. Initially, the processor modules in a cluster form five processor triads. Fig. 4(b) is a snapshot of the system state taken after several processor modules have failed, immediately after the latest failure occurred in a processor module on DRNM1. After detecting the faulty processor, DRNM1 orders the other processors to begin the reconfiguration process. After their state information is properly saved, all the existing triads on the 1- and 2-DRNMs are decoupled from each other, and the neighborhood-grouping algorithm is initiated. DRNM3 and DRNM4 will not be involved in the reconfiguration process since they have three and zero faulty processor modules, respectively. After an invite-message is sent from DRNM1 to DRNM2, the request is granted by DRNM2, since it has two non-faulty processor modules. The new processor triad is formed on DRNM1, and DRNM2 will then send a join-message to DRNM5 since it has only one non-faulty processor module available. DRNM5 does not have enough processor modules to form a new triad. Thus, it sends an invite-message to its next neighbor, DRNM1. Since a processor triad is already formed on DRNM1, DRNM1 terminates the reconfiguration process by resetting the error signal. As a result, no new processor triad can be formed on DRNM5. The final configuration is plotted in Fig. 4(c).
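The outcome of the first-fit grouping in this example can be reproduced with a simple counting sketch. It deliberately abstracts away the invite/join/done message exchange and only tracks how many fault-free modules are carried along the ring; the function name `neighborhood_grouping_count` and the "pool" bookkeeping are assumptions of this illustration, not the paper's protocol.

```python
def neighborhood_grouping_count(fault_free):
    """Count triads after first-fit grouping along the ring.

    fault_free: list of f_i in {0, 1, 2, 3}, one entry per DRNM in ring
                order, starting from the coordinator.
    Returns (number of triads, fault-free modules left ungrouped).
    """
    if any(f not in (0, 1, 2, 3) for f in fault_free):
        raise ValueError("each DRNM hosts between 0 and 3 fault-free modules")
    # 3-DRNMs keep their existing triads and do not participate.
    triads = sum(1 for f in fault_free if f == 3)
    pool = 0                          # modules still waiting for partners
    for f in fault_free:
        if f in (1, 2):               # only 1- and 2-DRNMs participate
            pool += f
            if pool >= 3:             # enough neighbors to close a triad
                triads += 1
                pool -= 3
    return triads, pool


if __name__ == "__main__":
    # Fig. 4 example as described in the text: DRNM1 and DRNM2 have two
    # fault-free modules, DRNM3 has none, DRNM4 has three, DRNM5 has one.
    print(neighborhood_grouping_count([2, 2, 0, 3, 1]))
```

For the configuration above the sketch reports two triads (the untouched one on DRNM4 and the new one formed by DRNM1 and DRNM2) and two leftover modules, matching the narrative that no triad can be formed on DRNM5.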
When an NMR system is used for hard real-time applications, it is important to know the worst-case timing behavior for the scheduling of fault recovery routines. For this purpose, we briefly analyze the time complexity of the processor-cluster reconfiguration process as follows. Let $N_f$ denote the total number of 2- and 1-DRNMs on a processor cluster. After a faulty processor module is detected in a processor triad, the state of the processor triads on the 2- and 1-DRNMs needs to be saved into the main memory before the reconfiguration process can begin, and this will take $N_f T_{\text{save-state}}$ time units to complete. It will then take $N_f T_{\text{neighbor}}$ time units for the DRNMs to be grouped with each other based on the neighborhood-grouping algorithm, where $T_{\text{neighbor}}$ is the worst-case time for one 2- or 1-DRNM to make its reconfiguration decision. After the reconfiguration decision is made, the clocks of the processors in the same processor triad need to be phase-locked to each other on the ADPLL of a 2-(1-)DRNM before normal computation can be resumed on the processors. The different clock sources can be phase-locked by resetting the clock sources of the processors and waiting until the phase difference between the different sources falls within a pre-specified time skew. We denote the phase-locking time of one DRNM as $T_{\text{phase-lock}}$; the total phase-locking time is upper-bounded by $N_f T_{\text{phase-lock}}$. Finally, the state of the processor triads needs to be reloaded, and the time complexity of this operation is $N_f T_{\text{state-restore}}$. Summarizing the above discussion, we can express the total time complexity of processor reconfiguration as $N_f (T_{\text{save-state}} + T_{\text{neighbor}} + T_{\text{phase-lock}} + T_{\text{state-restore}})$.
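For scheduling purposes, the bound above is easy to evaluate numerically. The following helper is a minimal sketch; the function and parameter names are chosen for this example, and the sample timings are made up rather than measured.

```python
def reconfiguration_time_bound(n_f, t_save, t_neighbor, t_phase_lock, t_restore):
    """Worst-case reconfiguration time N_f * (sum of the four per-DRNM terms).

    n_f: number of 1- and 2-DRNMs taking part in the reconfiguration
    The remaining arguments are per-DRNM worst-case times in a common unit.
    """
    if n_f < 0:
        raise ValueError("n_f must be non-negative")
    return n_f * (t_save + t_neighbor + t_phase_lock + t_restore)


def fits_recovery_budget(bound, budget):
    """True if the worst-case reconfiguration fits in the recovery budget."""
    return bound <= budget


if __name__ == "__main__":
    # Hypothetical numbers: 4 participating DRNMs, times in microseconds.
    bound = reconfiguration_time_bound(4, 10, 2, 5, 10)
    print(bound, fits_recovery_budget(bound, 150))
```

Since the bound grows linearly in $N_f$, a scheduler can reserve recovery time per cluster from the worst-case number of participating DRNMs.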
C. System State Alignment
The DRN is designed to dynamically phase-lock the clocks and control signals of redundant processor modules during reconfiguration. However, phase-locking the clocks is necessary, but not sufficient, to ensure consistency of processor states. Note that even under normal operation, redundant processors in a triad may have inconsistent states if they do not handle external events at the same time, since the physical time skew between the redundant processors may not be negligible. Hence, the redundant processors need to handle external events at identical logical steps to ensure their mutual consistency. For instance, when an I/O interrupt signal is asserted, processors need to wait for each other in order to enter an identical state for processing the interrupt event. This can be achieved by two design approaches: 1) a precise interrupt architecture for processors, so that they will flush their pipelines before handling the interrupt, and 2) processing the interrupt requests in each processor at the beginning (end) of each synchronization cycle. Similarly, when processors need to read external data, they have to write the input data into main memory so that it will be voted on by PVs. A watchdog timer is also needed to detect redundant processors that are stalled due to failed ready bits of the PV.
Realignment of processor state is relatively easy with a small performance penalty. When a fault is detected (masked) by the PV, the system can either ignore the failed processor or generate an interrupt signal to the processors for realignment of their internal states. That is, the cache memory needs to be flushed, and the registers' contents would be written back to the main memory. Since all the data must pass through the PV during the flushing of cache and registers, faults in any one of the redundant units will be masked. After the masked data is written into the main memory, processors can read back the registers and resume their computation. This way all the transient faults in processors and cache memories can be masked quickly. However, if faults continue to be detected in a module, the faulty module should be retired. Before the reconfiguration process begins, the old state information needs to be saved first, and then, after completing the reconfiguration, the newly-grouped functional units need to read in their state information altogether to resume lock-step execution of the program.
In a system with large main memory, realigning the memory state based on majority voting may become very time-consuming. The inefficiency of majority voting can be alleviated by SEC/DED codes within each memory module, and each memory module can have a backup copy to cope with permanent failures. After a permanent fault occurs and memory modules are regrouped, the state of the new backup module can be made identical to that of the original module in a word-by-word read/write manner. It should be noted, however, that transient faults are the predominant causes of memory failures [20], [21]. The memory can be periodically scanned to recover from single-bit transient faults with the SEC/DED codes. However, SEC/DED codes are not perfect; for example, they usually do not handle faults in the control/address and other decoding circuitry. Therefore, the voting technique is still useful for realignment of memory state in recovering from transient faults.
To reduce the memory realignment time, a memory system can be partitioned into pages, each of which is associated with an update tag bit to indicate its status [18]. Let the main memory of size W be partitioned into K pages. Only those pages that have been updated need to be realigned. That is, the memory realignment time can be expressed as

$$t_r = \left(K + F\,\frac{W}{K}\right) t_v,$$

where $t_v$ is the time to take a vote, and F is a random variable denoting the number of pages to be realigned, $0 \le F \le K$. We assume that W is an integral multiple of K, and the inaccuracy resulting from such an approximation is found to be negligible.
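The trade-off between tag scanning and page realignment can be explored numerically. The sketch below assumes the realignment time has the form (K + F·W/K)·t_v, i.e., K tag scans plus F pages of W/K words each, and treats F as a fixed number rather than a random variable. It is therefore only a deterministic illustration and not the optimization algorithm developed in the paper; the function names are hypothetical.

```python
def realignment_time(W, K, F, t_v=1.0):
    """Realignment time for W words split into K pages with F dirty pages."""
    if K <= 0 or W % K != 0:
        raise ValueError("K must be a positive divisor of W")
    if not 0 <= F <= K:
        raise ValueError("F must satisfy 0 <= F <= K")
    return (K + F * W / K) * t_v


def best_page_count(W, F, t_v=1.0):
    """Exhaustively pick the divisor K of W that minimizes the time."""
    candidates = [K for K in range(1, W + 1) if W % K == 0 and K >= F]
    return min(candidates, key=lambda K: realignment_time(W, K, F, t_v))


if __name__ == "__main__":
    W, F = 4096, 4                    # made-up memory size and dirty pages
    K = best_page_count(W, F)
    print("pages:", K, "page size:", W // K,
          "time:", realignment_time(W, K, F))
```

Under these assumptions the minimum lies near K = sqrt(F·W): with fewer pages, each dirty page is costly to vote on, while with more pages the tag scan dominates.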
Although the memory partitioning technique has been successfully implemented in [18], an important question remains unanswered: "What is the optimal page size to balance the page realignment time against the tag scanning time of the memory pages?" Since only the faulty pages need to be realigned, if the page size is too small, the time overhead of page scanning dominates the actual page realignment time. On the other hand, if the page size is too large, then the page-realignment time may become excessive. It is impossible to deterministically guarantee an upper bound on the realignment time. So, we propose to guarantee that, for the mission period t, the probability of memory realignment requiring longer than a time period T is less than a given $\epsilon$. That is, depending on the memory failure rate, one can make a tradeoff between the time for scanning the page tags and the time to realign the faulty pages by minimizing the realignment cost. This is formally stated as:
III. EXPERIMENTS AND PERFORMANCE RESULTS
In this section, we present an experimental implementation of the PV and compare the performance of the PV under two cache replacement policies. To validate our key design concepts, we developed prototypes of a 65C816-based processor triad and an ADPLL circuit, and ran a logic simulation of the PV using the Galaxy logic simulation tool [22].
The 65C816-based processor triad was built with ADPLL circuitry, and data could be voted on every synchronization cycle. Upon detection of an error, an interrupt signal is generated for each processor to handle the error-detection event and to force the faulty unit to be retired. The ADPLL was implemented with three programmable generic … synchronization clock, by a divide-by-16 counter, to be broadcast for mutual synchronization. Each ADPLL adjusted its rate when the skew between the synchronization clocks exceeded the duration of two high-speed clock pulses. Over the several months of our experiments, the ADPLL circuits showed remarkably stable behavior. Not a single loss-of-synchronization event was registered, and the time skew between different sources was maintained within two high-speed clock pulses. The only constraint on the performance of the ADPLL was the delay of the logic circuits. It was found in a similar logic simulation that much higher speed clocks can be synchronized using high-speed logic devices.

We then examined the performance of the PV through logic simulation with the Galaxy CAD tool [22]. In our simulation, each processor has four registers and an ALU, and three processors were grouped into a processor triad. The instruction set included load/store of data between main memory and registers, and simple ALU operations between the registers. Memory write frequencies ranging from 10% to 50% were randomly generated in the test programs of our experiments. The three clock cycles of the processors were set at 164, 170, and 176 simulation time units, a 7% frequency difference between the fastest and slowest clocks, where the simulation time unit can be scaled to different physical time units as needed. The time skew between the three clock sources is plotted in Fig. 6(a), in which the fast and slow clocks refer to the clock sources of 164 and 176 time units, respectively. The simulated PV had an 8-word FIFO buffer and a majority voter. The PV was controlled by a simulated control unit so that the effective propagation delay in the majority voter could be varied from half to twice one instruction cycle.
First, we studied the impact of clock skew on the clocking effect in the data buffer. For clock skew of less than two clock cycles, we found no significant impact on the queue length. Two separate experiments were run to examine the effect of different voting latencies on the queue length. In the first experiment, the PV allowed a vote to take place in one half of the execution time of a load/store instruction. As shown in Fig. 6(b), which represents a random snapshot of queue lengths of the PV with the write ratio set to 50%, the queue length was at most one even under very high write ratios. In the second experiment, each vote took one and a half times the instruction execution time. On some rare occasions where write operations occurred consecutively, up to five outstanding data items were recorded in the PV. It should be noted that, when a cache memory is added to each processor, the main memory read/write ratio is expected to be reduced, and thus the queue lengths are expected to be further reduced.
We examined the fault masking capability of the PV using a simple fault injection circuit. In our experiment, only one voter was used for data voting, and the voter is driven by the fast clock source. After a fault was injected, the fault detection (masking) latency was recorded as part of the simulation output, and then manually calculated due to the lack of an automatic event trace registration facility.
Different faults were injected to write-control signals, address lines, data lines, and the clock signals of the target system. The processor was set to execute a load or store instruction in four cycles, and the data manipulation instructions in three cycles. In most of the simulation runs the voter cycle time was set to be less than one instruction cycle to simulate fast cache memories. Therefore, the main part of the voting latency was contributed by the time required to write the voted data to memory. For slower memories, the voting latency was simulated to take twice as long as one instruction cycle. The data collected in these experimental runs assumed that the voter took six clock cycles to complete the voting process.
Transient, intermittent, and permanent faults were tested. Injection of permanent faults was relatively straightforward. Permanent faults were created by the fault injector after a random period, with the target signal line set to either stuck-at-0 or stuck-at-1, and then the number of cycles after the fault occurrence was monitored and recorded. Although transient faults were also injected, only a small fraction of them were detected, namely when they caused data faults. Almost all transient faults on the address bus, data bus, or control signals did not create any error in the computational results, and thus were not detected in the voting process. Hence, the few instances of detected transient faults are not reported, as they lack statistical significance. Intermittent faults were created by forcing a stuck-at-1 (-0) after a few clock cycles. Clock faults were created by forcing the faulty module's clock line to be stuck-at-1 or stuck-at-0. A similar phenomenon existed initially in the intermittent-fault simulation runs, but eventually many more fault detections were registered, and these are reported here. The detection times of different faults are plotted in Fig. 7.
In general, the fault detection time first decreased with the write ratio, and then increased after a certain threshold. The reason for this trend is that, when the write ratio was very low, it took a long time before the faulty data could be voted on by the PV. With a further increase of the write ratio, the injected faults might be overwritten by new write commands, thus increasing the fault detection time. In all cases, the variances between different runs of experiments are fairly large, indicating the unpredictable nature of fault detection/masking. It is also observed that the watchdog timer was triggered in many cases before the data voting could take place, because of the clock skew. Therefore, the watchdog timer needs to be properly adjusted to avoid excessive false alarms.

We now compare the performance of the PV under the write-back and write-through cache replacement policies using the average memory access time $T_M$ as the performance parameter. $T_M$ is determined by the voter architecture, the clock skew, and the speed difference between the main and cache memories. The main memory is assumed to be k-way interleaved, and no performance loss is assumed in case of a cache read-hit. The processor is blocked in case of a cache miss. Under the write-through protocol, data items to be written into the cache and main memory are first voted on by the PV, and the voted result will be stored into the main memory if no error is detected. Since the main memory is assumed to be interleaved, there is a random waiting time before the data item can be stored into the main memory, and all subsequent outputs from the PV will be blocked.

For performance analysis, a PV can be modeled as a queueing system, where the write buffer of the PV and the main memory are represented by the queue and the server, respectively. The waiting time for redundant data to become ready in the PV is determined by the clock skew between the redundant processors: with S denoting the maximum time skew between redundant units and the clocks adjusted once per fixed interval, the average waiting time due to the clock skew can be expressed in terms of S and that interval. The average waiting time before the voted data can be stored into the main memory is determined by $t_c$, the main memory cycle time. The average memory write time $E(T_w)$ is thus the sum of these two average waiting times. Since we know the average service time of the server and the arrival rate of the customers (write requests), we can get the average number of data items waiting in the PV as $Q = \lambda E(T_w)$, where $\lambda$ is the arrival rate of memory writes.

We note that under the write-through protocol, the processor can proceed with its computation without performance loss in case of a cache hit, as long as the PV is not full, so that the data item can be directly loaded into the PV. Processor performance loss occurs only in case of a cache miss, since the PV must be flushed before a new cache block can be read in; thus, the performance loss is affected by the number of outstanding data items to be voted on in the PV. Assuming that CPU blocking has a negligible effect, the average memory access time can be expressed as a function of h, b, and P, where h denotes the cache-hit ratio, b denotes the cache block size, and P denotes the fraction of memory-write instructions.
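The queueing relation used in this analysis, with the average number of items waiting in the PV equal to the write arrival rate times the average memory write time, can be evaluated with a few lines of code. This is a minimal sketch: the function names are made up for the example, the numeric inputs are arbitrary, and the default buffer depth of 8 words is taken from the simulated PV.

```python
def average_pv_queue_length(arrival_rate, mean_write_time):
    """Average number of data items waiting in the PV (rate x time)."""
    if arrival_rate < 0 or mean_write_time < 0:
        raise ValueError("arguments must be non-negative")
    return arrival_rate * mean_write_time


def average_fits_in_buffer(arrival_rate, mean_write_time, depth=8):
    """True if the average queue length stays below the buffer depth."""
    return average_pv_queue_length(arrival_rate, mean_write_time) < depth


if __name__ == "__main__":
    # Hypothetical load: 0.5 writes per instruction cycle, and an average
    # write time of 1.5 instruction cycles.
    print(average_pv_queue_length(0.5, 1.5))
    print(average_fits_in_buffer(0.5, 1.5))
```

An average below the buffer depth does not rule out occasional overflow during bursts of consecutive writes; it only indicates that the buffer is not saturated in the long run.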
The three terms multiplied by the factor $(1 - h)$ denote the average delays for flushing the PV, reloading the cache, and accessing the needed data item, respectively.

We now derive the average memory access time under the write-back protocol. In the write-back protocol, a dirty block is written into the main memory when a cache miss occurs. Assuming that cache blocks are randomly replaced, the average memory access time under the write-back protocol can be expressed as
where $P_{\mathrm{dirty}}$ is the probability that the cache block to be replaced is dirty. $P_{\mathrm{dirty}}$ is derived as follows. Assume that the cache consists of $n$ blocks and that all blocks have an equal probability of being accessed in each cycle. The probability of a cache miss at the $i$th cycle since the last cache miss is $h^{i-1}(1 - h)$. Thus, the probability of a dirty cache block being replaced follows for each $i$, and the total probability of a dirty block being replaced is the sum of these terms over all $i$,
where $P_{\mathrm{dirty}}$ reduces to a simpler expression when $n \gg 1$. As depicted in Fig. 8, the average memory access time of the write-back protocol is not sensitive to the time skew between the redundant units, nor to the memory-write frequency. On the other hand, the average memory access time of the write-through protocol increases sharply with the clock skew and the memory-write frequency. Thus, the write-back protocol is better suited for NMR multiprocessor systems.
$\dfrac{1-h}{1-h\,(1+\cdots)}$
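The summation over miss cycles described above can be sketched numerically. The sketch assumes, beyond what the text states, that in each cycle between two misses the block that is later replaced gets written with probability $P_w / n$, independently of other cycles. The function names are hypothetical, and the closed form below follows from this assumption by a geometric series; it is not necessarily the expression used in this paper.

```python
def p_dirty_series(h, p_w, n, terms=5000):
    """Sum over i of Pr(miss at cycle i) * Pr(victim block dirtied before cycle i)."""
    stay_clean = 1.0 - p_w / n          # per-cycle chance the victim block stays clean
    total = 0.0
    for i in range(1, terms + 1):
        miss_at_i = h ** (i - 1) * (1.0 - h)      # h^(i-1) (1 - h), as in the text
        dirtied = 1.0 - stay_clean ** (i - 1)     # written at least once in i-1 cycles
        total += miss_at_i * dirtied
    return total


def p_dirty_closed(h, p_w, n):
    """Geometric-series closed form of the same sum (assumed model, see lead-in)."""
    stay_clean = 1.0 - p_w / n
    return 1.0 - (1.0 - h) / (1.0 - h * stay_clean)


if __name__ == "__main__":
    # Hypothetical parameters: 95% hit ratio, 20% writes, 256 cache blocks.
    print(p_dirty_series(0.95, 0.2, 256), p_dirty_closed(0.95, 0.2, 256))
```

Under this assumed model the dirty probability grows with both the hit ratio and the write fraction, and shrinks as the number of cache blocks increases.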
The performance difference between the two protocols can be expressed as
IV. CONCLUSION
In this paper, we have presented the MAT bus and the DRN as two architectural support mechanisms for dynamic reconfiguration in an NMR-based multiprocessor system. We have shown that the MAT bus can implement the fault-detection function of the PV, as opposed to its fault-masking function. The MAT bus and DRN together provide a flexible platform for the dynamic grouping of processors. The MAT bus-based architecture is justified by the fact that faults occur much less frequently than data reads/writes. Thus, to improve the system throughput, redundant computing units can put their data into the PV buffer and proceed with their normal computation. Experimental results showed that the proposed scheme is feasible and introduces only a very small performance overhead. Our model also showed that combining the PV with the write-back protocol can virtually eliminate the performance loss resulting from data voting.
We prove Theorem 2.1 in four steps. We first examine the basic attributes of the probability function on the number of faulty pages in the system, in Lemma A.1. Then, in Lemma A.2, we find the maximum number of faulty pages that can be realigned without violating the timing constraint. In the third step, we show in Lemma A.3 that if the system has a large number of memory pages, the probability of $f$ pages being faulty is insensitive to the change of page size. As the last step, we conclude that for an arbitrarily large integer $K$, $f_K$ is close to $f_{K^*}$, where $K^*$ is the optimal page size being calculated. Therefore, even though $K^*$ is unknown, we can still estimate $f_{K^*}$ as $f_K$. By plugging $f_K$ into the objective function and then taking its derivative, we get the approximate optimal value of $K^*$, i.e., Theorem 2.1. Details of these steps are explained as follows.
The probability that $f$ pages become faulty in the system by time $t$ is $\Pr(F(t) = f) = \binom{K}{f} R_p^{K-f}(t)\,(1 - R_p(t))^{f}$, where $1 - R_p(t)$ is the probability that one or more of the redundant memory pages are faulty. Let $\lambda$ and $q$ denote the failure rate of a memory word and the number of redundant modules, respectively; then $R_p(t) = e^{-q \lambda W t / K}$. Let $\psi = W q \lambda t$, and let $K$ be a fixed constant; then the conditional probability that $f$ recovery pages need to be realigned can be written in terms of $\psi$ and $K$.

$K = k$ is a feasible solution if and only if $\Pr(Z(t) > \tau) < \varepsilon$. If $T \ge W t_v$, the recovery page design is trivial, because the memory can be easily realigned by voting on every word. If $T < W t_v$, only a limited number of pages should be realigned. If $K^* \le 1$, then an exhaustive search for $K^*$ suffices. On the other hand, if $K^* > 1$, as is the case in most large systems, an approximate value of $K^*$ can be found through the following simple optimization technique.

Proof: From Lemma A.3, we get $P_{K_1}(f) = P_{K_2}(f), \forall f$.
applying Lemma A.2 to an arbitrary $K$ such that $P(f > f_K) < \varepsilon$.
Clearly, for a given $\varepsilon$, $f_K$ equals some constant for all $K > 1$. The cost function $Z(t)$ to be minimized is the sum of $K$ and a term proportional to $f$. Since the objective function is convex when $K$ is continuous, the optimal real-valued solution $K'$ is found by setting its derivative to zero. Then, $K^*$ can be found by an exhaustive search in $[K' - \delta, K' + \delta]$, where $\delta$ is very small compared to
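The insensitivity of $f_K$ to $K$ claimed in Lemma A.3 can be checked numerically with a minimal sketch. It assumes that the number of faulty pages is binomially distributed over $K$ pages and that each page is fault-free with probability $e^{-\psi/K}$; both the model details and the function names are assumptions made for illustration, and the values of $\psi$ and $\varepsilon$ are hypothetical.

```python
import math


def prob_faulty(f, K, psi):
    """Binomial probability that exactly f of K pages are faulty (assumed model)."""
    p_fault = -math.expm1(-psi / K)     # 1 - exp(-psi/K), computed stably
    return math.comb(K, f) * (1.0 - p_fault) ** (K - f) * p_fault ** f


def smallest_bound(K, psi, eps):
    """Smallest f_K such that Pr(f > f_K) < eps."""
    cdf = 0.0
    for f in range(K + 1):
        cdf += prob_faulty(f, K, psi)
        if 1.0 - cdf < eps:
            return f
    return K


if __name__ == "__main__":
    # Hypothetical values: psi = 2.0, eps = 0.01.
    for K in (100, 1000, 10000):
        print(K, smallest_bound(K, 2.0, 0.01))
```

With these assumed values the bound is the same for all three page counts, which is the behavior the proof relies on when it replaces $f_{K^*}$ by $f_K$.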
