Abstract-Conventional fault-tolerant modulo arithmetic processors rely on the properties of a residue number system with L redundant moduli to detect up to L/2 errors. In this paper, we propose a new scheme that combines r-out-of-s residue codes with Berger codes to concurrently detect any number of module errors without any redundant moduli. In addition, this scheme can tolerate L faults if L redundant moduli are used, and has the property of graceful degradation when the number of faulty moduli exceeds L. Finally, it is shown that the added cost for fault tolerance is much lower than the costs reported earlier in the literature.
I. INTRODUCTION
As in other computations, reliable computing is critical in arithmetic processors as well, in particular when such processors are used in critical applications such as flight navigation, medical analysis, and real-time monitoring.
In general, two approaches can be used to detect errors in a digital system. The first is off-line testing, which requires interrupting the normal operation of the system to diagnose it for faults; the second is on-line error detection, which can be carried out during the normal operation of the system. Off-line testing is effective for detecting hard (i.e., permanent) or long-duration circuit faults only, while on-line testing (usually called concurrent error detection, or CED for short) can also detect transient (or soft) faults, which are predominant in modern digital systems.
Fault detection and correction techniques fall into three major categories. The first category is based on the redundant residue number system (RRNS), which has a number of redundant moduli [19], [32]. The second category exploits arithmetic codes [28]. The third category uses a discarding policy.
Most efforts to date in the first category rely on the fact that if the proper redundant moduli are included in the residue number system (RNS) code, then the special algebraic structure of the RNS allows computational errors to be detected and corrected [2], [7], [11], [19]. However, the hardware required for detecting and correcting the errors is very complicated. In addition, these schemes are implemented with ROM lookup tables. This is ineffective when the underlying modulo arithmetic unit is not built from ROM lookup tables, as in the units proposed by these authors [15], [16].
The second category for error detection and correction is based on arithmetic codes, which come in two varieties: AN codes [3], [19], where A is the generator of the code and N is the information represented, and residue codes [28]. Several totally self-checking error detection circuits for low-cost AN codes with generator $A = 2^a - 1$ have been suggested [8], [22], [23]. However, no effective error detection and correction procedure has been reported.
The residue code with check base A is the code that attaches to an arithmetic value X a check value X', where X' = X mod A; that is, it forms the pair (X, X'). The essence of residue codes is that they preserve their properties under the arithmetic operations of addition, subtraction, and multiplication.
That is, let $N_1$ and $N_2$ be two positive integers and $R_1$ and $R_2$ be their respective residues; then $(N_1 \pm N_2) \bmod A = (R_1 \pm R_2) \bmod A$ and $(N_1 N_2) \bmod A = (R_1 R_2) \bmod A$. A necessary and sufficient condition for using residue codes to detect and correct errors in a digital system has been given in [10]. Several residue generators have been reported in the literature [5], [24].

The third category of error detection and correction is based on discarding the faulty modulus [27]. The rationale behind this is that if the faulty modulus is known, then after the residue digit associated with the faulty modulus is removed, the remaining residue digits still represent a legitimate number. Therefore, the redundancy can be reduced considerably. In general, one can correct up to L digits in a system with L redundant moduli. Using this idea, Taheri et al. [30] designed an RNS processor with the capability of distributed fault detection. However, their design requires a complicated and expensive combination of residue decoders, and their error detection mechanism is based on the parity code, which is evidently insufficient for VLSI implementations [1], [26].
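The closure property of residue codes with check base A, stated earlier in this passage, can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `encode` and `check_ok` and the check base A = 7 are illustrative assumptions.

```python
import operator

A = 7  # check base; an illustrative choice, not a value from the paper


def encode(n, a=A):
    """Attach the check value n mod a to n, forming the pair (n, n')."""
    return n, n % a


def check_ok(result, r1, r2, op, a=A):
    """Return True if `result` agrees with the residues r1, r2 under `op` mod a."""
    return result % a == op(r1, r2) % a


n1, r1 = encode(1234)
n2, r2 = encode(567)

for op in (operator.add, operator.sub, operator.mul):
    correct = op(n1, n2)
    # The check is computed from the residues alone, never from n1 and n2.
    print(op.__name__, check_ok(correct, r1, r2, op), check_ok(correct + 1, r1, r2, op))
```

An error whose magnitude is a multiple of A leaves the residue unchanged and therefore escapes this check, which is why the choice of check base matters.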
The error detection and correction schemes mentioned above, except for the arithmetic code method, are founded on the redundant residue number system, and they use ROM lookup tables throughout the entire system. The error detection and correction circuits for these schemes are very complicated and costly in both time and hardware [3], [7], [11], [19]. Once an error is located, the correction proceeds unchecked, although self-checking is used during the course of error detection [11], [12]. To reduce the hardware cost, an algorithm that combines the operations of scaling and single residue error correction into one circuit was proposed in [29]. Although this reduces the hardware cost of the scaling and error correction circuit by using the same mixed radix conversion circuit, the entire circuit, which is called an error correction circuit with scaling (EECS), is not self-checking. Therefore, the reliability of the circuit is in question. In most modulo arithmetic processors, the checkers and error correction circuits can no longer be assumed to be error free, because checker circuits are constructed from the same components as the circuits that perform the arithmetic operations and hence are subject to the same types of failures [28]. Thus, the checkers and the error correction circuits must be able to detect and indicate errors that occur inside these circuits themselves.
In this paper, we improve the aforementioned results by introducing self-checking Berger code checkers into an arithmetic processor which was described in [15]. This new approach exploits the r-out-of-s code representation of the operands of an arithmetic processor to distribute the error detection into each modulus $m_i$. As such, it can be viewed as a "discard the faulty modulus" approach. The result is a simple modulo arithmetic processor that can detect any number of module errors without any redundant moduli. Furthermore, it can tolerate up to L faults if L redundant moduli are used.

The remainder of this paper is organized as follows. Section II describes a cyclic permutation network and establishes its fault detection property. Section III shows how to design a CED arithmetic processor using cyclic permutation networks. Section IV extends this CED arithmetic processor to obtain a fault-tolerant arithmetic processor, and the paper is concluded in Section V.
II. CYCLIC PERMUTATION NETWORK
In this section, we examine the effects of the most common faults in NMOS circuits [1], [18] on cyclic permutation networks, which will be used in Sections III and IV to construct a CED arithmetic processor and a fault-tolerant arithmetic processor.
We first recall the definition of an r-out-of-s residue code: an r-out-of-s residue codeword is a concatenation of r 1-out-of-$m_i$ codewords for $1 \le i \le r$.
From this definition, it is easy to see that the 1-out-of-$m_i$ code is a special case of the r-out-of-s residue code with $r = 1$ and $s = m_i$.
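As a hedged illustration of this definition, the following Python sketch builds an r-out-of-s residue codeword by concatenating one-hot fields and checks its validity. The function names and the bit ordering (position k of a field is set for residue value k) are assumptions made for the example; the paper's figures may order the lines differently.

```python
def one_hot(value, m):
    """1-out-of-m codeword for a residue value in 0..m-1 (assumed bit order)."""
    return [1 if k == value else 0 for k in range(m)]


def encode_r_out_of_s(x, moduli):
    """Concatenate the 1-out-of-m_i codewords of x mod m_i for each modulus."""
    word = []
    for m in moduli:
        word += one_hot(x % m, m)
    return word


def is_valid(word, moduli):
    """A word is a codeword iff every 1-out-of-m_i field holds exactly one 1."""
    pos = 0
    for m in moduli:
        if sum(word[pos:pos + m]) != 1:
            return False
        pos += m
    return pos == len(word)


# Moduli (2, 3) give a 2-out-of-5 code: r = 2 fields, s = 2 + 3 = 5 bits.
word = encode_r_out_of_s(7, (2, 3))
print(word, is_valid(word, (2, 3)))
```

With moduli (2, 3) every codeword has exactly two 1's among five bits, which matches the 2-out-of-5 residue code used in the paper's encoder example.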
In general, a cyclic permutation network consists of three function blocks: a switching network, an input encoder, and some output buffers. The input encoder converts the binary input operand Y into its equivalent two-rail binary form to control the switches in the switching network. The switching network performs the required permutation operation on its r-out-of-s residue coded operand. The output buffers provide the interface between two cyclic permutation networks. These buffers are not necessary for the operation of the cyclic permutation network but are included to restore the signal strength and hence to reduce the delay [21]. The cyclic permutation networks in our arithmetic processors are implemented using NMOS switches or CMOS transmission gates. Fig. 1 shows how a binary to 2-out-of-5 residue encoder is implemented on a cyclic permutation network using NMOS switches. We consider only transistor failures in our MOS circuits. Two common faults of MOS transistors are stuck-open and stuck-on faults [1].

Fig. 1. A binary to 2-out-of-5 residue encoder on a cyclic permutation network.
Definition 2: (uniform fault) The faults in a circuit are called uniform if they all cause the transistors to be stuck-open or stuck-on, but not both.
Definition 3: (unidirectional error) The errors in a circuit are called unidirectional if all the errors in any codeword are either from 0 to 1 or from 1 to 0, but not both.
To illustrate how a unidirectional error can occur on the network in Fig. 1, suppose that the network receives 01100 on its left inputs, and $b_2 = 1$, $b_1 = 1$, and $b_0 = 0$. The network will produce the codeword 01100 if all transistors are fault-free and no two lines are short-circuited. On the other hand, if both transistors A and B are stuck-open, then the codeword changes to 01111. This is an example of a unidirectional error.
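The classification used in this example can be written down as a small Python sketch. The function name `error_direction` and the string labels it returns are illustrative assumptions; only the example words 01100 and 01111 come from the text.

```python
def error_direction(sent, received):
    """Classify the bit errors between two equal-length bit strings."""
    if len(sent) != len(received):
        raise ValueError("words must have the same length")
    ups = any(s == "0" and r == "1" for s, r in zip(sent, received))
    downs = any(s == "1" and r == "0" for s, r in zip(sent, received))
    if ups and downs:
        return "bidirectional"
    if ups:
        return "0->1"   # unidirectional, all errors from 0 to 1
    if downs:
        return "1->0"   # unidirectional, all errors from 1 to 0
    return "none"


# Stuck-open transistors A and B turn 01100 into 01111 in the example above.
print(error_direction("01100", "01111"))
```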
Definition 4: A path from an input to an output is said to be active if it is a signal propagation path under a given switching state.
Theorem 1: Assume that all output lines of the input encoder of the cyclic permutation network under consideration are error free. The errors caused by any uniform faults in the switching stages of a cyclic permutation network are unidirectional.
Proof: Assume that all pull-up transistors are fault free. Any transistor with a stuck-open fault will disconnect the active path containing the faulty transistor from its input to its output. Therefore, the affected outputs will always be 1 because of the pull-up transistors. This causes a unidirectional error at the output of the cyclic permutation network. Similarly, any transistor with a stuck-on fault might create multiple paths, some of which may combine a "0" signal and a "1" signal. If such a case occurs, the resulting signal is justifiably assumed to be "0" [21]. Therefore, the stuck-on transistor will absorb the "1" output and cause a unidirectional error at the output of the cyclic permutation network. ∎
The following theorem is a corollary to Theorem 1.
Theorem 2: Any uniform faults in the switching stages of a cyclic permutation network can be detected by an inspection of the codewords at its outputs.
Proof: By the previous theorem, the errors caused by any uniform faults are unidirectional. This means that at least one 1-out-of-$m_i$ field of the codeword at the output of the cyclic permutation network in question will contain either no 1's or multiple 1's. This invalidates the codeword as an r-out-of-s residue codeword, and hence the error can be detected. ∎
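The detection argument of Theorem 2 can be checked exhaustively for a small code. The following Python sketch, which assumes moduli (2, 3) and hypothetical helper names, enumerates every nonzero unidirectional error on every codeword of the 2-out-of-5 residue code and collects the corrupted words that would still pass the validity check.

```python
from itertools import product

MODULI = (2, 3)  # assumed example: a 2-out-of-5 residue code


def is_valid(word, moduli=MODULI):
    """True iff every 1-out-of-m_i field of `word` holds exactly one 1."""
    pos = 0
    for m in moduli:
        if sum(word[pos:pos + m]) != 1:
            return False
        pos += m
    return True


def codewords(moduli=MODULI):
    """All r-out-of-s residue codewords for the given moduli."""
    words = []
    for residues in product(*(range(m) for m in moduli)):
        w = []
        for value, m in zip(residues, moduli):
            w += [1 if k == value else 0 for k in range(m)]
        words.append(tuple(w))
    return words


def unidirectional_errors(word):
    """Yield every word obtained by flipping only 0s to 1, or only 1s to 0."""
    for source in (0, 1):
        positions = [i for i, bit in enumerate(word) if bit == source]
        for mask in range(1, 2 ** len(positions)):
            w = list(word)
            for k, p in enumerate(positions):
                if (mask >> k) & 1:
                    w[p] = 1 - source
            yield tuple(w)


undetected = [(w, e) for w in codewords()
              for e in unidirectional_errors(w) if is_valid(e)]
print(len(codewords()), len(undetected))
```

An empty `undetected` list means that no unidirectional error maps one codeword onto another valid word, which is the property the theorem relies on.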
The following theorem extends this result to uniform faults at the output lines of the input encoder of a cyclic permutation network.
Theorem 3: Any uniform faults at the input encoder of a cyclic permutation network can be detected by an inspection of the codewords at its outputs.
Proof: Any stuck-on or stuck-open fault inside the input encoder of a cyclic permutation network will cause its output to be stuck-at-1, stuck-at-0, or floating. This will in turn cause the transistors in the switching stages to behave as if they were stuck-on or stuck-open. If the result is a stuck-on fault, multiple copies of 1's will propagate to the outputs for at least one code input. On the other hand, if the result is a stuck-open fault, it will block the transmission of any 1's to the outputs for at least one code input. Both kinds of faults, when they do not occur simultaneously, invalidate the r-out-of-s residue code at the output end of the cyclic permutation network, and therefore can be detected. ∎
Theorem 4: The errors caused by any single stuck-at or line-open fault at the input lines of the input encoder of a cyclic permutation network cannot be detected by an inspection of the codewords at its outputs.
Proof: Since no stuck-at fault or line-open fault at the input lines of the input encoder of a cyclic permutation network can set both output lines of the input encoder to the same value (1 or 0), it can never invalidate an r-out-of-s residue codeword, and hence cannot be detected by an inspection of the codewords at the outputs of the network. ∎

One way to detect the faults at the input lines of the input encoder of a cyclic permutation network is to use a two-rail code. However, it is not easy to design a self-checking r-out-of-s residue code to two-rail code conversion circuit. A better alternative is to encode the inputs of the input encoder by a Berger code [4]. In such a decoder, each input line that is 1 forces the outputs to which its transistors are connected to be grounded. For example, if the third input is 1, then $b_2$, $b_0$, and $g_0$ are grounded (0) and $b_1$ and $g_1$ remain connected to $V_{dd}$ (1). The Berger code encoded binary residue decoder can detect not only the errors caused by faults inside the decoder but also the errors at the input lines of the decoder [18].
It follows that with the addition of a Berger code encoded 1-out-of-$m_i$ binary residue decoder, our cyclic permutation networks can be made to detect any uniform faults, including those at the inputs of the input encoder.
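A minimal Python sketch of the Berger encoding used here follows. It assumes the standard Berger code in which the check symbol is the binary count of 0's in the information bits, and it assumes that the decoder inputs are numbered from zero so that the "third input" carries the value 2; the function names are hypothetical.

```python
def berger_check(info_bits):
    """Berger check symbol: the number of 0s in info_bits, MSB first."""
    k = len(info_bits)
    width = k.bit_length()          # enough bits to count 0..k zeros
    zeros = info_bits.count(0)
    return [(zeros >> i) & 1 for i in reversed(range(width))]


def decode_one_hot_input(index, k=3):
    """Binary value (b_{k-1}..b_0) and Berger check (g..g_0) for one-hot input `index`."""
    b = [(index >> i) & 1 for i in reversed(range(k))]
    return b, berger_check(b)


def berger_ok(b, g):
    """Checker: recompute the zero count and compare it with the received check."""
    return berger_check(b) == g


b, g = decode_one_hot_input(2)      # the "third input" of the example
print(b, g)                         # (b2, b1, b0) and (g1, g0)
print(berger_ok([1, 1, 0], g))      # a 0 -> 1 error in b2
```

For the third input this gives (b2, b1, b0) = (0, 1, 0) and (g1, g0) = (1, 0), so b2, b0, and g0 are the grounded lines, in agreement with the example in the text. A 0-to-1 error can only lower the zero count while it can only raise the check value, which is why unidirectional errors are always caught.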
III. A CED MODULO ARITHMETIC PROCESSOR
The general structure of the cyclic permutation network modulo arithmetic processor is shown in Fig. 3; its four parts include an output r-out-of-s residue decoder. The operation of this processor was described in detail in [15].
All four parts are constructed from cyclic permutation networks. The outputs of the binary residue encoder for operand Y are encoded by a Berger code and then checked by self-checking Berger code checkers [17], [20]. Each self-checking Berger code checker receives two inputs, the binary residue code and the Berger check, and generates one pair of complementary outputs, i.e., a 1-out-of-2 code. When the circuit is fault-free, it outputs (0, 1) or (1, 0); otherwise it outputs (1, 1) or (0, 0).

The operation of the r-out-of-s residue decoder is described as follows. First, we carry out the residue decoding by using an extension of Garner's algorithm. Let $(r_1, r_2, \ldots, r_r)$ and $(x_1, x_2, \ldots, x_r)$ be the mixed radix and RNS representations of X with moduli $m_1, m_2, \ldots, m_r$, respectively. Garner's algorithm receives $x_1, x_2, \ldots, x_r$ as input and produces a single output. The algorithm given below is a modified version of Garner's algorithm and computes t bits of the binary output at a time, where t is some positive number between 1 and $\lg M$ and $M = m_1 m_2 \cdots m_r$.
Algorithm (r-out-of-s residue decoder)
Begin
Step 1: Determine constants $c_{i,j}$ satisfying $m_i c_{i,j} \equiv 1 \pmod{m_j}$ for $1 \le i < j \le r$, and constants $c_{2^t,i}$ satisfying $2^t c_{2^t,i} \equiv 1 \pmod{m_i}$ for $1 \le i \le r$, where t is a positive integer $\ge \max_i \{\lceil \lg m_i \rceil\}$. This step is carried out off line and is not part of the residue decoder.
Step 2: Let $y_i = x_i$ for all i, $1 \le i \le r$.
Step 3: Base extension to mod $2^t$. Compute [Steps 3.1 and 3.2: equations not recoverable from the source].
Step 4: Compute [equation not recoverable from the source].
Step 5: Repeat Steps 3 and 4 for [range not recoverable from the source].
End

Fig. 4 depicts a network implementation of the residue decoder for $r = 3$, $m_1 = 3$, $m_2 = 5$, $m_3 = 7$, and $t = 3$.
The number X (take (0, 2, 3) = 87 as an example) enters the circuit on the left in RNS and exits it in binary on the right in three iterations: the first iteration computes the least significant three bits (= 111), the second iteration computes the next three significant bits (= 010), and the last iteration computes the most significant bit (= 001).

Because we use r-out-of-s residue codes to encode each cyclic permutation network inside the modulo arithmetic processor, and since the cyclic permutation networks with the same modulus are cascaded directly, without any interface between them that uses a code other than the r-out-of-s residue code, it suffices to use self-checking Berger code checkers in the last stage of each modulus or where the outputs are converted to binary.

[Figure: modulo arithmetic processor with three moduli, $m_1 = 3$, $m_2 = 5$, $m_3 = 7$.]
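The decoding example with residues (0, 2, 3) for X = 87 can be reproduced with the following Python sketch. It is one possible reading of a decoder that produces t bits per iteration (mixed radix conversion, base extension to mod $2^t$, then subtraction and scaling by $2^t$ in the RNS), and it is not claimed to match the paper's Steps 3 to 5 line by line. All function names are hypothetical, and the moduli are assumed to be odd and pairwise relatively prime.

```python
from math import prod


def mixed_radix_digits(residues, moduli):
    """Garner's algorithm: digits d_i with X = d_1 + d_2*m_1 + d_3*m_1*m_2 + ..."""
    y = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        digits.append(y[i])
        for j in range(i + 1, len(moduli)):
            c_ij = pow(m_i, -1, moduli[j])      # m_i * c_ij = 1 (mod m_j)
            y[j] = ((y[j] - y[i]) * c_ij) % moduli[j]
    return digits


def low_bits(residues, moduli, t):
    """Base extension to mod 2^t: evaluate the mixed radix form modulo 2^t."""
    total, weight = 0, 1
    for digit, m in zip(mixed_radix_digits(residues, moduli), moduli):
        total = (total + digit * weight) % (1 << t)
        weight = (weight * m) % (1 << t)
    return total


def decode_chunks(residues, moduli, t):
    """Return the t-bit chunks of X, least significant chunk first."""
    bits_needed = (prod(moduli) - 1).bit_length()
    iterations = -(-bits_needed // t)           # ceiling division
    x = list(residues)
    chunks = []
    for _ in range(iterations):
        s = low_bits(x, moduli, t)
        chunks.append(s)
        # X <- (X - s) / 2^t, carried out independently in every modulus.
        x = [((xi - s) * pow(1 << t, -1, m)) % m for xi, m in zip(x, moduli)]
    return chunks


chunks = decode_chunks((0, 2, 3), (3, 5, 7), t=3)
print([format(c, "03b") for c in chunks])
print(sum(c << (3 * k) for k, c in enumerate(chunks)))
```

The three chunks come out as 111, 010, and 001, which reassemble to 87. The three-argument `pow` with a negative exponent requires Python 3.8 or later.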
To collect the faults detected by the self-checking Berger code checkers and produce a final error indication signal, the two-rail outputs of the self-checking Berger code checkers are fed into a two-rail code checker tree [28]. This tree maps m input pairs, $\{(a_0, b_0), (a_1, b_1), \ldots, (a_{m-1}, b_{m-1})\}$, to an output pair, $(z_0, z_1)$. The output pair is complementary if and only if each and every input pair is complementary [31]. In general, to reduce the hardware cost, the tree is designed by a modular approach; that is, it is divided into h levels. A complete tree with h levels of cells, where each cell is an m-bit two-rail comparator, can be used to compare $m^h$ bits and contains $(m^h - 1)/(m - 1)$ cells. The value of m determines the number of code inputs required for testing the tree, that is, $2^m$, where m is the size of the bit vectors compared by each cell [13], [14]. An example can be found in [28].
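The behavior of such a checker tree can be sketched in Python. The cell below uses the commonly cited two-pair two-rail checker equations, $z_0 = a_0 a_1 + b_0 b_1$ and $z_1 = a_0 b_1 + b_0 a_1$; whether the paper's cells use exactly this realization is an assumption, and the function names are hypothetical.

```python
def two_rail_cell(p, q):
    """Two-pair two-rail checker cell (assumed standard realization)."""
    (a0, b0), (a1, b1) = p, q
    z0 = (a0 & a1) | (b0 & b1)
    z1 = (a0 & b1) | (b0 & a1)
    return z0, z1


def two_rail_tree(pairs):
    """Reduce any number of two-rail pairs to one pair with a tree of cells."""
    level = list(pairs)
    while len(level) > 1:
        nxt = [two_rail_cell(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd pair passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


def cells_in_complete_tree(m, h):
    """Number of m-input cells in a complete tree with h levels."""
    return (m ** h - 1) // (m - 1)


# Three fault-free checkers and one checker reporting (1, 1).
print(two_rail_tree([(0, 1), (1, 0), (1, 0), (0, 1)]))
print(two_rail_tree([(0, 1), (1, 1), (1, 0), (0, 1)]))
print(cells_in_complete_tree(2, 3))
```

The output pair is complementary only when every input pair is complementary, so a single non-complementary checker output anywhere in the tree is visible at the root.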
IV. A FAULT-TOLERANT PROCESSOR
Fault-tolerance is the ability of a system to continue to perform its functions after the occurrence of faults. In a modulo arithmetic processor, since each module works independently except in the final residue decoder stage where they are combined together, any correction operation must be done in the residue decoder stage.
Most of the earlier efforts on fault-tolerant processor design have focused on the stages before the final residue decoder. These efforts assume that the residue decoder is more reliable than the other circuits and that the correction circuits are error-free [6], [7], [12], [29]. However, this is not the case when the required output is a binary number, which requires a residue decoder to combine the values from all moduli into one result, and the correction circuit is built in the same technology as all the other circuits in the processor. Another shortcoming of these efforts is that they assume that overflows and errors cannot occur simultaneously [11], [12], [29].
Two approaches are generally used to achieve fault tolerance in digital systems. One is a fault masking technique that uses redundant circuits which work in parallel and vote on the outputs of the system. In this technique, we do not need to know exactly which module is faulty. However, we must maintain multiple copies of the system, which incurs an excessive hardware cost.
The second approach is based on reconfiguring the system so as to circumvent the faults. In this approach, three steps are needed: fault detection, fault location, and fault recovery. Most of the published literature on fault-tolerant modulo arithmetic processors belongs to this class [12], [29]. The bulk of the effort in this technique is spent on fault detection and fault location. Therefore, if the faulty modulus can be located easily, then the reconfiguration process becomes simple. Taheri et al. [30] proposed a bit-sliced ROM-based modulo arithmetic processor which can detect parity errors on a per-modulus basis. Once detected, faulty moduli are discarded. The basis for this is the following theorem.
A redundant residue number system (RRNS) is defined to be a residue number system with L additional moduli. All $L + r$ moduli must be pairwise relatively prime to ensure a unique representation for each number in the system, and each redundant modulus must be greater than any modulus $m_i$, $1 \le i \le r$.
Theorem 5 ([30]): An RRNS with one redundant modulus allows correction of one error if the erroneous modulus is discarded.

In the following, we combine the features of cyclic permutation networks, unidirectional error detecting codes (the 1-out-of-$m_i$ code and the Berger code), and self-checking Berger code checkers to propose an approach which can tolerate L faults if L redundant moduli are used. This approach is also based on the principle of discarding the faulty modulus.
The general structure of our fault-tolerant modulo arithmetic processor is the same as that of the CED modulo arithmetic processor described in the previous section. However, some modifications must be made to the residue decoder so that a faulty modulus can be removed from the system easily.
As shown in Fig. 6, two 2 × 1 multiplexers are added to the residue decoder. In general, the residue decoder requires $(r - 1)$ 2 × 1 multiplexers. In addition, each modulo arithmetic network is confined to a single operation stage with two states: state 0 is the identity permutation and state 1 is the normal operation. These two states are controlled by the error indication signals $E_i$.
In order to generate an appropriate set of error indication signals to control the operations of the cyclic permutation networks or the 2 × 1 multiplexers, the cyclic permutation networks that operate on the same modulus have an individual self-checking two-rail checker tree. These cyclic permutation networks constitute a residue digit module with modulus $m_i$.

[Fig. 6: arithmetic processor with $r = 2$, $L = 1$, $m_1 = 3$, $m_2 = 5$, and $m_3 = 7$.]
Once a faulty module within a residue digit module with modulus $m_i$ is detected, all output from this residue digit module is suppressed (i.e., the faulty modulus is effectively removed) to prevent data from reaching succeeding stages. This can be done by setting all corresponding arithmetic network stages to their identity permutations. To see this, let us recall the algorithm described in the previous section. In order to remove the effect of residue digit module $m_i$, it is necessary to set $r_i$ to 0 and $c_{i,j}$ to 1 for $1 \le j \le r$ in Step 3.1 and Step 3.2, and to set $S_i$ to 0 and $m_i$ to 1 in Step 3.2 of the algorithm. For example, in a modulo arithmetic processor with $r = 3$, $m_1 = 3$, $m_2 = 5$, and $m_3 = 7$, if any module within the residue digit module with modulus 3 is faulty, then we discard this residue digit module by setting $r_1 = 0$, $c_{1,2} = 1$, and $c_{1,3} = 1$ in Step 3.1 and Step 3.2, and $S_1 = 0$ and $m_1 = 1$ in Step 3.2. Therefore, if the residue code is $(x_1, x_2, x_3) = (2, 3, 1)$, then the result in binary representation is 8. Note that if modulus 7 is redundant, then 8 is within the range spanned by moduli 3 and 5.
The settings $r_i = 0$, $c_{i,j} = 1$, and $m_i = 1$ are easily effected by the cyclic permutation networks. Since the value of $r_i$ enters through a subtraction, and $c_{i,j}$ and $m_i$ enter through multiplications, these settings correspond to the identity permutations. As for $S_i$, since $S_i$ must be sent to residue digit module $i + 2$ if residue digit module $i + 1$ is faulty, a 2 × 1 multiplexer is used to select either $S_i$ or $S_{i+1}$ and send it to the residue digit module $i + 2$ circuit to obtain a correct result.
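The effect of these settings can be illustrated with a Python sketch of mixed radix decoding in which a faulty residue digit module is neutralized exactly as described: its digit is forced to 0, its constants $c_{i,j}$ to 1, and its modulus to 1. This is a software model under stated assumptions, not the paper's circuit; the function name `decode_with_discard` and the representation of the error signals as a set of indices are hypothetical.

```python
def decode_with_discard(residues, moduli, faulty=frozenset()):
    """Mixed radix decoding that ignores the residue digit modules in `faulty`.

    For a faulty index i the settings r_i = 0, c_ij = 1 and m_i = 1 are
    applied, so that stage acts as an identity on all later stages.
    """
    y = list(residues)
    value, weight = 0, 1
    for i, m_i in enumerate(moduli):
        bad = i in faulty
        digit = 0 if bad else y[i]              # r_i = 0 when discarded
        value += digit * weight
        weight *= 1 if bad else m_i             # m_i = 1 when discarded
        for j in range(i + 1, len(moduli)):
            c_ij = 1 if bad else pow(m_i, -1, moduli[j])   # c_ij = 1 when discarded
            y[j] = ((y[j] - digit) * c_ij) % moduli[j]
    return value


moduli = (3, 5, 7)
print(decode_with_discard((2, 3, 1), moduli))                 # all modules healthy
print(decode_with_discard((0, 3, 1), moduli, faulty={0}))     # x1 corrupted, modulus 3 discarded
print(decode_with_discard((0, 4, 1), moduli, faulty={0, 1}))  # only modulus 7 left
```

With modulus 3 discarded, the residues (3, 1) for moduli 5 and 7 still decode to 8 regardless of the corrupted first digit. With two modules discarded, the result is only known modulo 7, which is the graceful degradation of the operating range.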
An example of a fault-tolerant cyclic permutation network modulo arithmetic processor with $r = 2$, $m_1 = 3$, $m_2 = 5$, $L = 1$, and $m_3 = 7$ is shown in Fig. 6. Each control signal $E_i$, $1 \le i \le 3$, is derived through an XNOR gate from the output of the two-rail checker tree of residue digit module $m_i$ and is used to control the states of the cyclic permutation networks and to select the sources of the multiplexers. When $E_i = 1$ (indicating that residue digit module $m_i$ is faulty), the $E_i$ signal forces all cyclic permutation networks to which it is connected into their identity permutations. This effectively removes the effects of residue digit module $m_i$ from the system. As indicated in the figure, the number of faulty residue digit modules that the system can tolerate is not limited to 1. In this figure, if another residue digit module, $m_2 = 5$, is also faulty, then after it is removed the system still works, but with the range spanned by the one remaining modulus only. That is to say, the operating range of the system degrades gracefully from three moduli to one modulus. In general, if L redundant residue digit modules are used, then up to L faults can be tolerated, and beyond that the system degrades gracefully until all residue digit modules are faulty.
V. CONCLUSION
In this paper, we introduced a cyclic permutation network modulo arithmetic processor which combines the features of cyclic permutation networks and unidirectional error detecting codes: 1-out-of-$m_i$ codes and Berger codes. The resulting processor can concurrently detect any number of faulty moduli without any redundant moduli and also has fault-tolerant capability. It can tolerate L faulty moduli without any performance degradation if L redundant moduli are used in the system. In addition, it has the characteristic of graceful degradation when the number of faulty moduli exceeds the number of redundant moduli.
Compared with the work of Taheri et al. [30], our design has the following advantages. First, in contrast to their single error assumption, our design can detect all unidirectional errors. Second, to tolerate L faults, the design of Taheri et al. needs $\binom{r+L}{L}$ copies of the residue decoder. In our design, only one residue decoder with $r + L$ residue digit inputs is required. Thus the hardware cost is minimal. Third, to degrade the system gracefully, they need much more hardware support, whereas in our design no extra hardware is required. That is to say, we use the same hardware for both fault tolerance and graceful degradation.
Compared with other similar efforts [12], [29], our design also has several advantages. First, we do not need to assume that overflow and error cannot occur simultaneously, because we do not use the properties of the RRNS. Instead, we build the features of cyclic permutation networks and unidirectional error detecting codes (1-out-of-m codes and Berger codes) into our circuits. Second, the designs in [12], [29] need at least L redundant moduli for detecting L faults, whereas our design does not need any redundant moduli if the goal of the system is only to detect errors. Third, [12], [29] only consider errors which occur before the residue decoder. However, this assumption does not hold in most practical applications: since the residue decoder is also a complex circuit, it is in general as likely to fail as the other circuits in the system. In our design, error detection covers the entire system. Finally, they all rely on the MRX algorithm [9], which requires excessive hardware to do the error detection. In our design, we use self-checking Berger code checkers [17], which have a lower hardware cost than the MRX algorithm.
As for future research, a major task that remains is an effective MOS VLSI implementation of the self-checking Berger code checkers used in our CED and fault-tolerant arithmetic processors.
