Abstract: Modulo $2^n - 1$ adders as fast as $n$-bit 2's complement adders have recently been proposed in the open literature. This makes a Residue Number System (RNS) adder with channels based on the moduli $2^n$, $2^n - 1$, and any other of the form $2^k - 1$, with $k < n$, faster than RNS adders based on other moduli. In this paper, we formally derive a test set, parametric with respect to the adder size, for parallel testing of the channels of an RNS adder based on moduli of the form $2^n, 2^n - 1, 2^k - 1, 2^l - 1, \ldots$, with $l < k < n$. The derived test set is reusable: it can be used for any value of $n, k, l, \ldots$, regardless of the implementation library used, and is composed of $n^2 + 2$ test vectors. A test-per-clock BIST scheme is also proposed that applies the derived test vectors within $n^2 + 2n$ cycles. Static CMOS implementations reveal that the proposed BIST offers 100 percent postcompaction fault coverage and an attractive combination of test time and implementation area compared to ROM and FSM-based deterministic BIST or LFSR-based pseudorandom BIST.
Index Terms: Residue Number System, Built-In Self-Test, deterministic and pseudorandom tests, formal test sets.
Built-In Self-Test (BIST) is an effective approach for testing contemporary integrated circuits (ICs) that reduces the need for external test: since the circuit and its tester are implemented on the same chip, the circuit can test itself. To meet the ever-shrinking time-to-market requirements of custom ICs, rapid test-pattern generation and test set embedding are essential. Test sets derived by deterministic test pattern generation, however, require large implementation areas if embedded as a finite state machine or stored in a ROM, while easily implementable pseudorandom test pattern generators require long test sequences. Besides these specific disadvantages, the above schemes, except the ROM-based one, suffer from a lack of reusability. In the case of regular circuits, reusable formal test sets may be derived, which are parameterized with respect to the number of inputs of the circuit. If such a test set is also independent of the implementation library, migration to new libraries can be done rapidly. In this paper, we derive a formal test set and propose an efficient BIST scheme for Residue Number System (RNS) adders.
The RNS has been widely investigated and used in Digital Signal Processing (DSP) applications [1], [2]. A set of $L$ moduli $(m_1, m_2, \ldots, m_L)$ that are pairwise relatively prime is used to define an RNS. Any integer $X$, with $0 \le X < M$, where $M = m_1 \times m_2 \times \cdots \times m_L$, has a unique representation in the RNS given by the $L$-tuple of residues $X = (x_1, x_2, \ldots, x_L)$, where $x_i = X \bmod m_i$. A two-operand RNS operation $\ast$ is defined as $(z_1, z_2, \ldots, z_L) = (x_1, x_2, \ldots, x_L) \ast (y_1, y_2, \ldots, y_L)$, where $z_i = (x_i \ast y_i) \bmod m_i$. In most RNS applications, $\ast$ is either addition, subtraction, or multiplication. Consequently, each residue can be computed independently of the others, allowing fast data processing in $L$ parallel independent channels.
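The channel-wise definition above can be illustrated with a minimal Python sketch. The moduli set (16, 15, 7) and the function names `to_rns`, `rns_add`, and `from_rns` are illustrative assumptions; the conversion back to binary uses the textbook Chinese Remainder Theorem, not the converters of [9], [10].

```python
from math import prod

# Illustrative moduli of the form {2^n, 2^n - 1, 2^(n-1) - 1} with n = 4 (assumption).
MODULI = (16, 15, 7)

def to_rns(x, moduli=MODULI):
    """Return the tuple of residues x_i = X mod m_i."""
    return tuple(x % m for m in moduli)

def rns_add(xs, ys, moduli=MODULI):
    """Add two RNS numbers; every channel works independently of the others."""
    return tuple((x + y) % m for x, y, m in zip(xs, ys, moduli))

def from_rns(residues, moduli=MODULI):
    """Recover X in [0, M) with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        total += r * partial * pow(partial, -1, m)
    return total % big_m
```

The result is only meaningful while the true sum stays below $M = 16 \times 15 \times 7 = 1680$; larger sums wrap around modulo $M$.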
The latency of an RNS operation depends on the latency of the slowest among the channels. The delay of an adder modulo $m$ is greater when $m \ne 2^n$. Efficient designs for the RNS channels have been recently proposed in [3], [4], [5] for $m$ of the form $2^n - 1$. The authors of [3], [4], respectively, propose Carry Look-Ahead (CLA) and parallel-prefix design methodologies for modulo $2^n - 1$ adders that lead to implementations that can operate as fast as modulo $2^n$ adders. Modulo $2^n - 1$ modified Booth multipliers that can operate as fast as the corresponding integer multipliers were introduced in [5]. Consequently, RNS adders based on moduli of the form $2^n, 2^n - 1, 2^k - 1, 2^l - 1, \ldots$, with $l < k < n$, lead to the fastest implementations. For this reason, we focus on RNS with moduli of this form. In most practical cases, the choice $L = 3$ and $k = n - 1$ is preferred. Examples of such systems using earlier design methodologies for the modulo $2^n - 1$ channel have been presented in [6], [7], [8]. Efficient residue-to-binary converters for the $L = 3$, $k = n - 1$ case have been proposed in [9], [10].
Formal test sets for CLA integer adders have been presented in [11], [12]. In [11], the authors show that $2 \times (n + 1)$ vectors are sufficient for testing a simple $n$-bit CLA inclusive-OR adder. The author of [12] derives test sets for exclusive-OR CLA and block-CLA adders. Inclusive-OR CLA adders are faster than exclusive-OR CLA adders.
In this paper, we derive a formal test set for modulo $2^n$ inclusive-OR CLA and parallel-prefix adders, consisting of $2 \times n$ vectors, and, for the first time in the open literature, a formal test set for modulo $2^n - 1$ full CLA and parallel-prefix inclusive-OR adders. We show that extracting from the latter test set a subset of $k$ bit positions ($k < n$), in order, forms a reduced-width test set for testing a modulo $2^k - 1$ or a modulo $2^k$ adder. Therefore, the test set of a modulo $2^n - 1$ adder is a superset of the test set of any modulo $2^k - 1, 2^l - 1, \ldots$ adder with $l < k < n$. By merging the test sets derived for modulo $2^n$ and modulo $2^n - 1$ adders, we get a formal test set for an RNS adder based on the moduli $2^n, 2^n - 1, 2^k - 1, 2^l - 1, \ldots$, consisting of $n^2 + 2$ vectors. Our test sets, apart from being composed of a small ($O(n^2)$) number of vectors, also have the advantage that they are derived from the adder's equations and are parameterized with respect to the size of the adder; therefore, they are reusable. A designer can use them irrespective of the targeted implementation library or the size of the operands.
We also present a novel Built-In Self-Test (BIST) scheme that can apply the formal test set in $O(n^2)$ cycles.
To speed up the addition operation, the carry computation time should be minimized. To this end, carry look-ahead (CLA) adders [13], [14] are used. Since the circuit required by the carry look-ahead logic grows rapidly with the operand width, it is quite profitable to divide the carry computation unit into smaller units. The carry outputs of these smaller units can then either ripple between them or be driven to a CLA unit of a subsequent level, leading to CLA adders of two or more levels. If carry computation is treated as a prefix problem, a special form of multilevel CLA adders can be derived, which are well known as parallel-prefix adders.
When dealing with CLA adders, two functions are commonly used for describing the equations for the carries:
• $g_i = a_i \cdot b_i$, the carry generate function, and
• $p_i$, the carry propagate function.
The propagate function can be defined in two alternative ways. Its definition affects the adder's testability [11], [12]. In inclusive-OR CLA adders, $p_i$ is defined as $p_i = a_i + b_i$, and another function, $h_i = a_i \oplus b_i$, is commonly used to denote the half-sum at bit position $i$. Obviously,
Since, in current CMOS technology, an OR gate is faster than an XOR gate, inclusive-OR CLA adders are faster than the corresponding exclusive-OR ones. Therefore, in this paper, we consider inclusive-OR CLA adders. In an integer adder, the carry at bit position i is given by:
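For reference, the textbook carry look-ahead recurrence under the definitions of $g_i$, $p_i$, and $h_i$ can be written as below. The name $c_{-1}$ for the adder's carry input is an assumption made here for illustration, and the layout is a standard form rather than necessarily the authors' exact typesetting.

```latex
\begin{align*}
c_i &= g_i + p_i \cdot c_{i-1}, \\
c_i &= g_i + p_i \cdot g_{i-1} + p_i \cdot p_{i-1} \cdot g_{i-2} + \cdots
       + p_i \cdot p_{i-1} \cdots p_0 \cdot c_{-1}, \\
s_i &= h_i \oplus c_{i-1}.
\end{align*}
```

The second line is the first one unrolled down to the carry input; it holds for the inclusive-OR propagate because $g_i = 1$ already forces $c_i = 1$.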
Several fault models have been proposed for representing the actual faults of CMOS integrated circuits. The most common, however, is still the single stuck-at fault model because of its effectiveness and its simplicity. In this paper, for every cell, we exercise the logical paths from its inputs to its outputs and propagate the fault effects to the primary outputs of the circuit, which is equivalent to stuck-at fault test generation [12]. A library containing just the most primitive gates (inverters and 2-input AND, NAND, OR, and NOR gates) is assumed. More complex gates are replaced by equivalent circuits built from the library elements. For example, although three tests are adequate for covering a 2-input XOR gate, under the above assumptions the exhaustive four-vector test is required. Note also that, under the above assumptions, no changes are required to the extracted test patterns if a library that includes multiple-input versions of the primitive elements is assumed [11].
For describing our test patterns, we use the notation introduced in [12] which utilizes carry propagation and generation across multiple adjacent bit positions. The notation is restated here for the sake of completeness:
$G_l$ is a test that generates a carry out of $l$ adjacent pairs of bits.
$P_l$ is a test that permits carry propagation across $l$ adjacent pairs of bits without generating a carry out.
$O_l$ denotes a test that permits neither carry propagation across nor carry generation out of $l$ adjacent pairs of bits.
A complete test vector is written as a product list with the most significant bits at the left. For example, $(a_3, b_3, a_2, b_2, a_1, b_1, a_0, b_0) = (1, 0, 1, 1, 0, 1, 0, 1)$ is denoted as
The notation $P^a$ and $P^b$ is used for distinguishing between the cases $(a, b) = (1, 0)$ and $(0, 1)$, respectively. The existence of carry-ins and carry-outs in test vectors, wherever they are needed, is handled by the $+$ and $-$ symbols as follows:
$T^+$ is a test that requires a carry-in.
$T^-$ is a test that requires no carry-in.
${}^+T$ is a test that produces a carry-out.
${}^-T$ is a test that does not produce a carry-out.
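The notation can be made concrete with a small Python sketch that labels each operand bit pair and reports whether a vector produces a carry-out. The string labels `"G"`, `"Pa"`, `"Pb"`, `"O"` and the function names are illustrative choices, not part of the notation of [12].

```python
# Label for each (a_i, b_i) pair: generate, propagate (a=1 or b=1), or neither.
SYMBOLS = {(1, 1): "G", (1, 0): "Pa", (0, 1): "Pb", (0, 0): "O"}

def classify(a_bits, b_bits):
    """Label every bit pair; both lists are given most significant bit first."""
    return [SYMBOLS[(a, b)] for a, b in zip(a_bits, b_bits)]

def carry_out(a_bits, b_bits, carry_in=0):
    """Carry leaving the most significant pair, rippling up from the LSB."""
    carry = carry_in
    for a, b in zip(reversed(a_bits), reversed(b_bits)):
        # Inclusive-OR form: generate, or propagate an incoming carry.
        carry = (a & b) | ((a | b) & carry)
    return carry
```

For the example vector of the text, $a = 1100$ and $b = 0111$, the pairs are labeled propagate, generate, propagate, propagate from left to right, and the generated carry is propagated out of the most significant pair.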
A test set consisting of k vectors is denoted by:
We consider single-level CLA or parallel-prefix adder implementations [13], [14]. A modulo $2^n$ adder differs from an integer adder in the following respects:
1. A carry input signal is not present, hence $p_0$ is not required.
2. A carry output signal and the logic required for its production are not required, hence $p_{n-1}$ and $g_{n-1}$ are not required.
Since a modulo $2^n$ adder does not have a carry-in signal, the test sets derived for inclusive-OR integer adders in [11], [12] cannot be straightforwardly applied. However, we can follow a similar procedure in order to derive a new test set for modulo $2^n$ adders. In a modulo $2^n$ adder, the carry at bit position $i$ is given by
$$c_i = g_i + p_i \cdot g_{i-1} + \cdots + p_i \cdot p_{i-1} \cdots p_1 \cdot g_0$$
for $0 \le i \le n - 2$.
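A minimal Python sketch of a modulo $2^n$ inclusive-OR CLA adder follows, with each carry expanded as $c_i = g_i + p_i g_{i-1} + \cdots + p_i \cdots p_1 g_0$. It is a behavioral model for checking the equations, not a gate-level netlist; the function name `add_mod_2n` is an assumption.

```python
def add_mod_2n(a, b, n):
    """Behavioral model of an n-bit modulo 2^n inclusive-OR CLA adder."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]    # generate
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]  # inclusive-OR propagate
    h = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # half-sum

    # Carries exist only for 0 <= i <= n-2: there is no carry output,
    # so g[n-1] and p[n-1] are never used; p[0] is unused as well
    # because there is no carry input.
    c = []
    for i in range(n - 1):
        ci = 0
        for j in range(i, -1, -1):
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            ci |= term
        c.append(ci)

    s = h[0]
    for i in range(1, n):
        s |= (h[i] ^ c[i - 1]) << i
    return s
```

An exhaustive comparison against `(a + b) % 2**n` for small `n` is a quick sanity check of the carry expansion.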
For testing the half sum functions, since they consist of XOR gates, we need to apply all four input combinations for every pair of bits. The propagation of possible fault effects on them, to the adder's outputs, is not a problem since they are connected to the XOR gates at the adder's outputs and the output of an XOR gate changes whenever either of its inputs change. According to this, the test set below should be applied for every pair of bits:
The generate functions are implemented by AND gates, which require that, at each pair of input bits, the following tests are applied:
Fault effects from these tests must be propagated through the carry computation functions. One way to achieve this is to propagate a fault effect on $g_i$ to $c_i$ by forcing the other product terms in the function of $c_i$ to zero. This, however, cannot be achieved by controlling $p_i$, but only by ensuring that there is no carry-in at the $i$th bit position. For example, the equation for $c_2$ is given by
$$c_2 = g_2 + p_2 \cdot g_1 + p_2 \cdot p_1 \cdot g_0.$$
For propagating fault effects of $g_2$ to $c_2$ (equivalently to $s_3$), the terms $p_2 \cdot g_1$ and $p_2 \cdot p_1 \cdot g_0$ should be at zero. Since $p_2$ cannot be set to zero independently of $g_2$, we require that both $g_1$ and $p_1 \cdot g_0$ are at zero. Note that, since $g_0 = c_0$, fault effects at $g_0$ do not cause any propagation problems. In our notation, the above means that each bit position $i$ ($1 \le i \le n - 2$) needs the tests (the tests for bit position 0 have already been considered in (1)):
In a similar way, the $p_i$ functions require the tests $[O]$, $[P^a]$, and $[P^b]$, but, for propagating fault effects on them to $c_i$, a carry-in is required. So, the following tests for each bit position $i$ ($1 \le i \le n - 2$) are required:
For showing how the required tests for the carry functions can be derived, we assume a modulo 32 adder. The most significant carry function in this adder is given by
$$c_3 = g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0.$$
We need to test each product term and propagate possible faults on them. We present below the required tests for this in terms of the inputs at the four least significant bit pairs of the adder:
In order to propagate the fault effects to $c_3$, the other terms should be set to 0 by the vectors
Merging both the sensitization and the propagation conditions yields the complete test for the $g_3$ term:
• The term $p_3 \cdot g_2$ requires the application of
In order to propagate the fault effects, the other terms should be 0. Setting
So, a complete test for the term $p_3 \cdot g_2$ is:
• Working the same way, a complete test for the term $p_3 \cdot p_2 \cdot g_1$ is:
• A complete test for the last term is:
For finding a complete test set for $c_3$, one needs to combine
The pattern shown in (4) can be extended for deriving the test set required for testing the carry functions in a modulo $2^n$ adder, which will have the form:
A complete test set for a modulo $2^n$ adder can be derived by merging the set (5) with (1), (2), and (3). One possible test set with $2 \times n$ vectors is:
As has been shown in [3] for CLA implementations and in [4] for parallel-prefix implementations, a modulo $2^n - 1$ CLA adder follows a structure similar to that of an integer adder, but with modified carry functions. More specifically, the carry function at bit position $i$,
In this paper, we consider inclusive-OR modulo $2^n - 1$ adders designed either with a single CLA level [3] or as parallel-prefix [4]. A modulo $2^n - 1$ adder has neither a carry input nor a carry output; however, the carry computed at the most significant bit position is XOR-ed with the half-sum at bit 0 to form the least significant sum output. The test sets given in (1), (2), and (3) are required at each pair of input bits both for sensitizing faults on, respectively, the half-sum, the generate, and the propagate functions, and for propagating possible fault effects through the carry computation circuitry. Consider now the case $n = 4$. In this case, the carries in the modulo 15 adder are given by:
We can observe that:
• The equations are derived in a cyclic manner; therefore, if one knows a test set for $c_k$, a test set for any of the other carry functions can be devised by left/right rotations of every vector in the test set of $c_k$.
• In a modulo $2^n - 1$ adder, the carry function at each bit $i$ has a form similar to that of the most significant carry function of a modulo $2^{n+1}$ adder. In the above example, this means that each carry of the modulo 15 adder has a form similar to that of $c_3$ of the modulo 32 adder.
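The cyclic structure can be sketched in Python as follows. The carry expansion used here, $c_i = g_i + p_i g_{i-1} + \cdots + p_i p_{i-1} \cdots p_{i-n+2}\, g_{i-n+1}$ with all indices taken modulo $n$, is an assumed end-around-carry form consistent with the two observations above; the function names are illustrative. As in such adders generally, zero may also appear as the all-ones word.

```python
def add_mod_2n_minus_1(a, b, n):
    """Behavioral model of a modulo 2^n - 1 inclusive-OR CLA adder (sketch)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    h = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    c = []
    for i in range(n):
        ci = 0
        for d in range(n):            # d: cyclic distance down to the generating bit
            term = g[(i - d) % n]
            for t in range(d):        # propagate through bits i, i-1, ..., i-d+1
                term &= p[(i - t) % n]
            ci |= term
        c.append(ci)

    # s_0 uses the carry of the most significant position (index -1 mod n).
    s = 0
    for i in range(n):
        s |= (h[i] ^ c[(i - 1) % n]) << i
    return s

def rotl(x, n):
    """Rotate an n-bit word left by one position."""
    return ((x << 1) | (x >> (n - 1))) & ((1 << n) - 1)
```

Because every equation is a rotated copy of its neighbor, rotating both operands rotates the sum by the same amount, which is the property that lets one carry function's test set be reused for the others.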
By the above, utilizing test set (4), we can derive the test sets required for each carry function:
The above four test sets can be merged into a single test set for all the carry functions as follows:
• The vector $[P \cdot P \cdot P \cdot P]$ is used to replace all first vectors.
• In the second group of vectors, all "don't cares" are assumed as $P$ and therefore only the vectors
According to the above, a test set for all carry functions of a modulo 15 adder is:
(6)
By extrapolating the above pattern, it is possible to derive the test set required for testing the carry functions of any modulo $2^n - 1$ adder, which will have the form:
(7)
The latter shows that $n^2 + 1$ vectors are required for testing the carry computation unit of a modulo $2^n - 1$ adder. The cardinality of the derived test set shows that a modulo $2^n - 1$ adder is far harder to test than a modulo $2^n$ adder or an integer adder. By the way that the above vectors were derived, it is also clear that (7) is a superset of the test set required for testing the carry function at the most significant bit of a modulo 32 adder. In general, a test set for the carry functions of a modulo $2^n - 1$ adder is also a test set for the carry functions of a modulo $2^{n+1}$ adder which, according to the discussion of the previous subsection, is a superset of the test set for the carry functions of a modulo $2^n$ adder. Moreover, by observing the form of the vectors of (7), we can see that extracting any subset of $k$ bit positions in order will lead to a reduced-width test for the carry functions of a modulo $2^k - 1$ adder, with $k < n$.
Formal Test Set for an RNS Adder
A formal test set for testing an RNS adder for the moduli set $\{2^n, 2^n - 1, 2^k - 1, 2^l - 1, \ldots\}$, with $l < k < n$, according to the discussion in the previous subsection, can be obtained by merging the test sets indicated in (1), (2), (3), and (7). One possibility that leads to $n^2 + 2$ vectors is:
Example 1. Consider an RNS adder based on the moduli $\{2^4, 2^4 - 1, 2^3 - 1\}$. Suppose that $x_3x_2x_1x_0$ and $y_3y_2y_1y_0$, $w_3w_2w_1w_0$ and $z_3z_2z_1z_0$, and $g_2g_1g_0$ and $f_2f_1f_0$ are the operands of the modulo $2^4$, the modulo $2^4 - 1$, and the modulo $2^3 - 1$ addition channels, respectively. Then, a common test for parallel testing of the above addition channels is given in Table 1. The gray shaded vectors of Table 1 are not required for testing the modulo $2^3 - 1$ adder.
PROPOSED BIST SCHEME
In this section, we introduce a test-per-clock BIST scheme that applies the test set derived earlier for testing all the channels of the RNS adder. Scalable test generators for the case of inclusive-OR CLA integer adders have been presented in [15]. However, the generators of [15] cannot be used in our case since they are capable of producing $O(n)$ vectors, whereas we need $O(n^2)$ vectors for testing the modulo $2^n - 1$ adder. Fig. 1 presents a block diagram of the proposed test-per-clock BIST scheme.
The proposed BIST circuitry is composed of the following:
1. Modifications of the input buffers of the adders in order to transform them, in test mode, into shift registers. During test mode, these registers only perform a one-bit right rotation. $n - k$ flip-flops are added at each input buffer of each modulo $2^k - 1$ adder, with $k < n$, in order to form an $n$-bit shift register. In the most common case, an RNS adder consists of only three channels, a modulo $2^n$ adder, a modulo $2^n - 1$ adder, and a modulo $2^{n-1} - 1$ adder, implying that only two flip-flops need to be inserted. When test mode is entered, the left register of each channel is initialized to 111...111 and the right register to 000...001.
2. A control module that generates three signals, $t_1$, $t_2$, and Test_Complete. The first two signals are used to conditionally complement the value of the bit that is shifted back into each register: $t_1$ occasionally toggles the bit that is shifted into the left register of each channel, whereas $t_2$ does the same for the right registers. The control module is composed of a test vector counter and combinational logic that decodes the states of the test vector counter and produces $t_1$, $t_2$, and Test_Complete. Signal $t_1$ should be 1 at the fourth, fifth, ..., $(n + 2)$th cycles, at the $(2n + 2)$th and the $(2n + 3)$th cycles, and at the $k(n + 1)$th cycle with $3 \le k \le n$. Signal $t_2$ should be 1 at the second, third, ..., $(n + 2)$th cycles, at the $(2n + 3)$th cycle, and at the $k(n + 1)$th and $(k(n + 1) + 1)$th cycles with $3 \le k \le n$. We observe that $t_1$ is at 1 mostly in test cycles in which $t_2$ is also at 1, so some of the logic that decodes the test vector counter states can be shared between the implementations of $t_1$ and $t_2$. Test_Complete is asserted at the end of the $(n^2 + 2n)$th cycle. The control module can be easily described in HDL and synthesized for different values of $n$.
3. Modifications of the adders' output buffers so that, in test mode, they behave as Multiple Input Signature Analyzers (MISRs).
The vectors produced by the proposed BIST are summarized in Table 2. Comparing Table 2 and (8), it is obvious that the test vectors of (8) are included in the vectors produced by the proposed test pattern generator. The shaded vectors of Table 2 are vectors produced by the proposed BIST scheme that do not belong to the set defined by (8). The required $n^2 + 2$ vectors are applied by the proposed BIST scheme in $n^2 + 2n$ cycles. A counter wide enough to count these $n^2 + 2n$ cycles is therefore required in the BIST control module of Fig. 1.
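The decoding performed by the control module can be sketched in Python directly from the cycle lists stated for $t_1$ and $t_2$. Cycles are numbered from 1; the function names and the counter-width estimate (the smallest width able to count $n^2 + 2n$ cycles) are assumptions made for illustration, not the authors' HDL.

```python
def t1(cycle, n):
    """t1 = 1 at cycles 4..n+2, 2n+2, 2n+3, and k(n+1) for 3 <= k <= n."""
    return (4 <= cycle <= n + 2
            or cycle in (2 * n + 2, 2 * n + 3)
            or any(cycle == k * (n + 1) for k in range(3, n + 1)))

def t2(cycle, n):
    """t2 = 1 at cycles 2..n+2, 2n+3, and k(n+1), k(n+1)+1 for 3 <= k <= n."""
    return (2 <= cycle <= n + 2
            or cycle == 2 * n + 3
            or any(cycle in (k * (n + 1), k * (n + 1) + 1)
                   for k in range(3, n + 1)))

def total_cycles(n):
    """Test_Complete is asserted at the end of cycle n^2 + 2n."""
    return n * n + 2 * n

def counter_bits(n):
    """Assumed counter width: enough bits for total_cycles(n) distinct states."""
    return (total_cycles(n) - 1).bit_length()

def active_cycles(signal, n):
    """List the cycles of one test session in which `signal` is 1."""
    return [c for c in range(1, total_cycles(n) + 1) if signal(c, n)]
```

For $n = 4$ the session lasts 24 cycles, and six of the seven cycles in which $t_1$ is active are also cycles in which $t_2$ is active, which illustrates why part of the decoding logic can be shared.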
Example 2. Consider the proposed BIST for a three-channel RNS adder based on the moduli $2^4$, $2^4 - 1$, and $2^3 - 1$. Table 3 presents the input buffers' contents along with the values of the $t_1$ and $t_2$ signals. Comparing Table 3 and Table 1, one can easily verify that the vectors required for testing the modulo 16, 15, and 7 adders are well within the test vectors provided by the BIST scheme. The shaded vectors given in Table 3 are those vectors produced by the proposed BIST that do not belong to the formal test set indicated by (8).
BIST EVALUATION AND COMPARISONS
In this section, we compare the proposed BIST against ROM and FSM-based BIST schemes as well as against pseudorandom LFSR-based BIST schemes.
For embedding the test set provided by an ATPG tool, a designer may either design a TPG as a finite state machine or store the test patterns in an embedded ROM and successively retrieve them using a ROM address counter. We will denote these alternatives as FSM_BIST and ROM_BIST, respectively.
In the pseudorandom LFSR-based BIST schemes, we consider that the input buffers of each addition channel are modified to function, in test mode, as a single distinct LFSR. Two distinct cases are investigated regarding the test completion signal:
1. The test is complete when the channel that requires the largest number of LFSR states for achieving 100 percent precompaction fault coverage has received all required test vectors. We will hereafter refer to this scenario as Single_Check.
2. Each channel's LFSR and MISR freeze when all states required for 100 percent fault coverage before compaction have appeared. We will hereafter refer to this scenario as Distinct_Checks. This scheme consumes less energy than the first one.
In order to find a seed for each LFSR capable of ensuring a short pseudorandom sequence for achieving 100 percent fault coverage, we used the optimization procedure described in [16]. One hundred percent precompaction fault coverage was targeted.
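For readers unfamiliar with LFSR-based pattern generation, a minimal Python sketch of a 4-bit Galois LFSR is given below. The feedback mask `0b1100` (polynomial $x^4 + x^3 + 1$), the width, and the seed are illustrative assumptions; they are not the polynomials or the optimized seeds obtained with the procedure of [16].

```python
def lfsr_states(seed, taps=0b1100):
    """Return the states visited by a right-shifting Galois LFSR until it
    returns to `seed`. `taps` is an assumed feedback mask (x^4 + x^3 + 1)."""
    state = seed
    visited = []
    while True:
        visited.append(state)
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps      # feed the shifted-out bit back into the taps
        if state == seed:
            return visited
```

Starting from any nonzero seed, this particular polynomial cycles through all 15 nonzero 4-bit states, so the seed only decides where in the cycle the test sequence starts, which is why seed selection affects how soon full fault coverage is reached.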
For our comparisons, we use as metrics:
1. the area overhead imposed by each BIST scheme,
2. the postcompaction fault coverage (PCFC) attained, and
3. the test application time in number of test vectors.
Since an RNS adder is usually embedded in a larger circuit, its inputs and outputs are not accessible by the primary inputs and outputs of the chip. For applying a test set in such embedded circuits, one usually relies on scan paths. Therefore, for evaluating the hardware overheads imposed by the different BIST schemes, we use as a basis a scheme in which the flip-flops of the input and output registers of the RNS adder are chained together in a single scan path.
We examined three different RNS adders. To get realistic measures of the area overhead, we described the examined RNS adders in HDL and used the Synopsys tools driven by the UMC-VST 25 implementation technology (0.25 μm, up to 5 metal layers, 1.8/3.3 V) for our implementations. Our targeted operating frequency was set to 200 MHz, for typical process parameters and irrespective of whether any Design-For-Testability (DFT) hardware was inserted. Table 4 presents the area overhead of an RNS adder supported by the examined BIST schemes as a percentage of the implementation area of an RNS adder with a single scan path. In the case of ROM_BIST, we can implement the ROM either as a memory array (with one transistor per bit) or as a combinational circuit. Because of the small ROM sizes required in the examined adders, the latter approach gives the best area results and is the one indicated in Table 4. As we can see from Table 4, both FSM_BIST and ROM_BIST require excessive implementation areas. In all examined cases, the hardware required for implementing FSM_BIST is larger than the RNS adder itself. Although smaller, the implementation area required by ROM_BIST is also very large. The required area for implementing ROM_BIST in the $\{2^{32}, 2^{32} - 1, 2^{31} - 1\}$ adder case rises to 78 percent of an RNS adder with a single scan path, compared to 15.4 percent for the proposed BIST. The values in Table 4 reveal that the proposed BIST scheme requires an implementation area similar to that of the LFSR-based approaches for all examined RNS adders.
In order to measure the PCFC, we used a custom-developed fault simulator. Since, in the $\{2^8, 2^8 - 1, 2^7 - 1\}$ adder case, none of the considered schemes leads to complete 100 percent PCFC, we assume that, in this case, the three output registers of the RNS adder are converted into a single MISR, increasing in this way the degree of the primitive polynomial of the MISR. The results obtained for the PCFC and test length are presented in Table 5. Each entry in Table 5, excluding the PCFC percentage of the $\{2^8, 2^8 - 1, 2^7 - 1\}$ adder, is an ordered triplet whose elements refer to the modulo $2^n$, the modulo $2^n - 1$, and the modulo $2^{n-1} - 1$ channel, respectively. The results of Table 5 show that all schemes, except the FSM_BIST/ROM_BIST schemes in the case of the $\{2^8, 2^8 - 1, 2^7 - 1\}$ adder, achieve 100 percent PCFC in all examined cases.
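Postcompaction coverage depends on how responses are folded into a signature. A minimal Python sketch of a MISR is shown below, assuming a 4-bit register and the feedback mask `0b1100` ($x^4 + x^3 + 1$); both are illustrative and are not the polynomials used in the reported experiments.

```python
def misr_signature(responses, width=4, taps=0b1100, seed=0):
    """Compact a sequence of `width`-bit responses into a single signature.
    `taps` is an assumed feedback mask; `seed` is the initial register value."""
    mask = (1 << width) - 1
    state = seed & mask
    for response in responses:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps          # LFSR feedback step
        state ^= response & mask   # fold the parallel response into the register
    return state
```

A fault is detected after compaction only if the faulty response sequence ends in a different signature. A single erroneous response bit always changes the signature of this register, whereas multiple errors can cancel out (aliasing); a wider register makes such cancellation less likely, which is the motivation for merging the three output registers into one MISR.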
CONCLUSIONS
Rapid test-pattern generation helps meet ever-shrinking time-to-market requirements. To this end, in this paper, we derive a formal test set for RNS adders consisting of $n^2 + 2$ vectors. This set can be used for parallel testing of addition channels that use the moduli $2^n, 2^n - 1, 2^k - 1, 2^l - 1, \ldots$, with $l < k < n$, as their base. The test set was derived from the adder's equations and can therefore be applied equally well to full CLA and parallel-prefix implementations. Moreover, it is parameterized and independent of a specific implementation library; therefore, it is fully reusable. A test-per-clock BIST scheme has also been proposed that applies the derived test set in $n^2 + 2n$ cycles. Experimental evidence on three benchmark RNS adders shows that the proposed BIST scheme requires an implementation area close to that of LFSR-based schemes, whereas it is far more efficient than the latter in terms of test application time. BIST solutions based on embedding the test set provided by Automatic Test Pattern Generation (ATPG) tools, either by using a ROM or by designing a TPG as an FSM, require too much area for their implementation.
