Abstract-In this paper, we present new design methods for modulo $2^n \pm 1$ adders. We use the same select-prefix addition block for both modulo $2^n - 1$ and diminished-one modulo $2^n + 1$ adder design. VLSI implementations of the proposed adders in static CMOS show that they achieve an attractive combination of speed and area costs.
INTRODUCTION
Modulo $2^n \pm 1$ arithmetic has been used in a variety of applications for many years. A first application of modulo $2^n \pm 1$ arithmetic is in Residue Number Systems (RNS) [1], [2], [3], [4]. In an RNS, every operand $X$ is represented by a sequence of residues $(X_1, X_2, \ldots, X_M)$, where $X_i = X \bmod p_i$. The $p_i$, $1 \le i \le M$, comprise the base of the RNS and are pairwise relatively prime integers. Every RNS operation on two operands, say $\diamond$, is defined as $(Z_1, Z_2, \ldots, Z_M) = (X_1, X_2, \ldots, X_M) \diamond (Y_1, Y_2, \ldots, Y_M)$, where $Z_i = (X_i \diamond Y_i) \bmod p_i$. For most RNS applications, $\diamond$ is either addition, subtraction, or multiplication. A significant speedup over the corresponding binary operations can therefore be achieved, because each $Z_i$ depends only on $X_i$, $Y_i$, and $p_i$ and is thus computed in parallel in a separate arithmetic unit (channel). One of the most popular three-moduli sets is $\{2^n - 1, 2^n, 2^n + 1\}$ [5] because it offers very efficient implementations. Addition in such systems is performed using three channels, namely a modulo $2^n - 1$, a modulo $2^n$, and a modulo $2^n + 1$ adder [1], [4].

Modulo $2^n - 1$ (equivalently, one's complement) adders also find great applicability in fault-tolerant computer systems. They are commonly used for implementing residue, inverse residue, product (AN), and checksum arithmetic codes. For low-cost implementations of such codes, modulo $2^n - 1$ adders are used both for the encoding process and for implementing the various arithmetic operations on the encoded operands [6], [7], [8]. Such codes are also used extensively in checksum computation and error detection in TCP/IP networks [9].
Given that the state of the art in transmission technology over a single channel is currently 40 Gbps, and given the global deployment of TCP/IP-based internetworking, the need for hardware-based engines that assist in checksum computation and in the detection of transmission errors is easy to appreciate.
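The channel-wise RNS addition described earlier in this section can be sketched as follows. This is only an illustration under stated assumptions: the function names (`rns_base`, `to_rns`, `rns_add`, `from_rns`) are hypothetical, and the reverse conversion uses the Chinese Remainder Theorem purely to check the result; it is not part of the adder designs discussed in this paper.

```python
from math import prod

def rns_base(n):
    """Three-moduli base {2^n - 1, 2^n, 2^n + 1}; the moduli are pairwise coprime."""
    return (2**n - 1, 2**n, 2**n + 1)

def to_rns(x, base):
    """Residue representation (X_1, ..., X_M) with X_i = X mod p_i."""
    return tuple(x % p for p in base)

def rns_add(xs, ys, base):
    """Channel-wise addition: Z_i = (X_i + Y_i) mod p_i; no carries cross channels."""
    return tuple((x + y) % p for x, y, p in zip(xs, ys, base))

def from_rns(residues, base):
    """Reverse conversion by the Chinese Remainder Theorem (for checking only)."""
    big_m = prod(base)
    total = 0
    for r, p in zip(residues, base):
        m_i = big_m // p
        total += r * m_i * pow(m_i, -1, p)  # modular inverse, Python 3.8+
    return total % big_m

base = rns_base(4)  # (15, 16, 17), dynamic range 15 * 16 * 17 = 4080
z = rns_add(to_rns(1234, base), to_rns(2345, base), base)
print(z, from_rns(z, base))
```

Each channel only ever sees values smaller than its own modulus, which is why the three adders can work in parallel on short operands.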
Modulo $2^n + 1$ adders are commonly utilized as the last stage adder of modulo $2^n + 1$ multipliers. Modulo $2^n + 1$ multipliers find applicability in:
- Pseudorandom number generation: Special cases of the linear congruential sequence [10] use modulo $2^n + 1$ multiplication to obtain reasonably long sequences of pseudorandom numbers.
- Cryptography: Modulo $2^n + 1$ multiplication is used for attaining the desirable statistical independence between ciphertext and plaintext [11], [12], [13]. Cryptography plays an increasingly important role in today's wireless networks and smartcard applications.
- The Fermat number transform: This is an effective way to compute convolutions because of its easy VLSI implementation and its lack of round-off errors [14].

Leibowitz proposed the diminished-one number system [15] for attaining efficient implementations of modulo $2^n + 1$ arithmetic circuits. In the diminished-one number system, each number $X$ is represented by $X^* = X - 1$ and the representation of 0 is treated in a special way. Therefore, diminished-one modulo $2^n + 1$ circuits require only $n$ bits for their number representations. The efficiency of the resulting circuits was shown in many residue number system implementations [16], [17].
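A minimal sketch of the diminished-one encoding follows. It assumes that zero is signalled by a separate flag bit, which is one possible way of treating zero specially; the function names are hypothetical.

```python
def to_diminished_one(x, n):
    """Encode x in [0, 2^n] as (zero_flag, n-bit code X* = X - 1)."""
    if not 0 <= x <= 2**n:
        raise ValueError("operand out of range for modulo 2^n + 1")
    if x == 0:
        return (1, 0)      # zero is carried by the flag; the code bits are unused
    return (0, x - 1)

def from_diminished_one(zero_flag, code):
    """Decode (zero_flag, X*) back to X."""
    return 0 if zero_flag else code + 1

n = 3
for x in range(2**n + 1):
    flag, code = to_diminished_one(x, n)
    print(x, flag, format(code, "03b"))
```

The largest residue, $2^n$, maps to the all-1s code, so every nonzero residue fits in $n$ bits.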
In this paper, we derive new architectures for modulo $2^n - 1$ and diminished-one modulo $2^n + 1$ adders. Both architectures are constructed by properly interconnecting select-prefix addition blocks. The resulting modulo adders offer a very attractive combination of short execution times and small implementation area. Therefore, they are especially useful in applications for which, apart from speed, area is also a critical parameter. Static CMOS implementations are utilized to show the efficiency of the proposed adders in the area × time² ($A \times T^2$) sense.

The rest of this paper is organized as follows: The foundations of speeding up the addition operation, as well as the application of these research efforts to modulo $2^n \pm 1$ addition, are reviewed in the next section. In Section 3, we present our new architectures. Comparative results against the previously proposed architectures for modulo $2^n \pm 1$ adders are given in Section 4. We draw our conclusions in the last section.
ADDITION SPEEDUP ISSUES
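For the bits $s_{n-1} s_{n-2} \ldots s_1 s_0$ of the sum $S$ of $A$ and $B$, with the usual bit-level generate, propagate, and half-sum terms $g_i = a_i b_i$, $p_i = a_i + b_i$, and $h_i = a_i \oplus b_i$, we then have

$$s_i = h_i \oplus c_{i-1},$$

where $c_{i-1}$ is the carry into bit position $i$.

Carry computation using the propagate and generate terms can be transformed into a prefix problem [18] if the associative operator $\circ$ defined by

$$(g, p) \circ (g', p') = (g + p\,g',\; p\,p')$$

is introduced. Then, the carries are given by $c_i = G_i$, where $G_i$ is the first member of the group relation (assuming that the carry input, $c_{in}$, is 0):

$$(G_i, P_i) = (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \cdots \circ (g_1, p_1) \circ (g_0, p_0).$$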
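A minimal sketch of carry computation as a prefix problem is given below. It assumes the standard definitions $g_i = a_i b_i$, $p_i = a_i + b_i$, $h_i = a_i \oplus b_i$ and the operator $(g, p) \circ (g', p') = (g + p g', p p')$. The function names are hypothetical, and the group terms are accumulated serially for brevity; because the operator is associative, a hardware implementation can evaluate the same terms in a logarithmic-depth tree.

```python
def prefix_op(hi, lo):
    """(g, p) o (g', p') = (g + p g', p p'); `hi` covers the more significant bits."""
    g, p = hi
    g2, p2 = lo
    return (g | (p & g2), p & p2)

def group_terms(a, b, n):
    """Return [(G_i, P_i)] for i = 0..n-1, each covering bit positions i..0."""
    terms = []
    acc = None
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        bit = (ai & bi, ai | bi)                      # (g_i, p_i)
        acc = bit if acc is None else prefix_op(bit, acc)
        terms.append(acc)
    return terms

def prefix_add(a, b, n):
    """n-bit addition with c_in = 0: s_i = h_i xor c_{i-1}, with c_i = G_i."""
    terms = group_terms(a, b, n)
    s = 0
    for i in range(n):
        h = ((a ^ b) >> i) & 1                        # half-sum bit h_i
        c_prev = terms[i - 1][0] if i > 0 else 0      # carry into position i
        s |= (h ^ c_prev) << i
    return s, terms[n - 1][0]                         # (sum bits, carry output)

print(prefix_add(0b1011, 0b0110, 4))
```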
Considering carry computation in binary addition as a prefix problem, several algorithms have been devised. These algorithms lead to efficient implementations called parallel-prefix adders. Tree structures have traditionally been used for graphically representing the various parallel-prefix algorithms. As an example, Figs. 1 and 2 present the tree structures of the fastest unbounded [18] and bounded [19] fan-out algorithms, respectively. Fig. 3 presents the general structure of an adder with a parallel-prefix carry computation unit and Fig. 4 gives the gate-level implementation of the operators used in Figs. 1, 2, and 3. Knowles, in [20], examines combinations of the algorithms presented in [18], [19] and presents possible trade-offs between fan-out and implementation area. According to [20], the Ladner-Fischer and Kogge-Stone prefix structures are the end cases of minimum implementation area and maximum speed, respectively, of a large family of addition structures, all of which offer the minimum logical depth property.
Parallel-prefix adders with a carry input signal $c_{in} = c_{-1}$ can be easily designed by inserting an extra stage of prefix operators (in fact, only the AND-OR gate is required) between the carry computation and the summation unit, as shown in Fig. 5 [21]. We will hereafter refer to this extra stage as the carry increment stage. The carry output in a parallel-prefix adder with a carry increment stage is given by:

$$c_{out} = G_{n-1} + P_{n-1}\, c_{-1}.$$
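The carry increment stage can be sketched as follows, assuming that each carry is formed as $c_i = G_i + P_i\,c_{in}$ from group terms that were computed without knowledge of $c_{in}$. The function names are hypothetical and the group terms are accumulated serially for brevity.

```python
def group_terms(a, b, n):
    """Return lists G, P with (G[i], P[i]) the group generate/propagate of bits i..0."""
    G, P = [], []
    g_acc, p_acc = 0, 1
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g_acc = (ai & bi) | ((ai | bi) & g_acc)   # G_i = g_i + p_i G_{i-1}
        p_acc = (ai | bi) & p_acc                 # P_i = p_i P_{i-1}
        G.append(g_acc)
        P.append(p_acc)
    return G, P

def add_with_carry_in(a, b, c_in, n):
    """n-bit addition with a late-arriving carry input."""
    G, P = group_terms(a, b, n)                   # independent of c_in
    c = [G[i] | (P[i] & c_in) for i in range(n)]  # carry increment stage (AND-OR)
    s = 0
    for i in range(n):
        h = ((a ^ b) >> i) & 1
        s |= (h ^ (c[i - 1] if i > 0 else c_in)) << i
    return s, c[n - 1]                            # c_out = G_{n-1} + P_{n-1} c_in

print(add_with_carry_in(0b0111, 0b1000, 1, 4))
```

Because `G` and `P` never depend on `c_in`, the carry input only has to pass through a single AND-OR level before the final XOR stage.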
Several architectures for modulo $2^n \pm 1$ addition have appeared in recent years. In terms of execution latency, the most efficient architectures for modulo $2^n - 1$ addition are the carry look-ahead (CLA) and parallel-prefix modulo $2^n - 1$ adders proposed in [22] and [23], respectively, which can operate as fast as the corresponding 2's complement adders. Time-efficient CLA and parallel-prefix diminished-one modulo $2^n + 1$ adders have been proposed in [24], [25]. Considering those architectures, we note the following:
1. The CLA architectures for modulo $2^n - 1$ adders which have appeared in [22], as well as the CLA modulo $2^n + 1$ adders for diminished-one operands of [25], utilize the idea of the reentering carry simplification. CLA modulo $2^n \pm 1$ architectures lead to implementations with small area requirements that are also very fast for small operand widths. For wide operands, however, a single level of CLA is insufficient; two or more levels of CLA are required. In this case, a large number of possible combinations of the number of CLA levels and the implementation of each CLA level have to be investigated for each different implementation technology in order to reach a fast design.
2. The parallel-prefix algorithms for modulo $2^n - 1$ and diminished-one modulo $2^n + 1$ adders of [23] and [25], respectively, utilize the idea of carry recirculation at each prefix level. The resulting architectures offer extremely fast implementations for sufficiently large operands at the cost of increased implementation area. When $n \ne 2^k$, with $k = 2, 3, \ldots$, several equally fast solutions can be derived. Therefore, a designer has to explore the design space in order to find the most area-efficient solution.
3. Parallel-prefix adders based on the algorithms presented in [18], [19], slightly modified for achieving modulo $2^n \pm 1$ addition, have been proposed in [24]. For small and medium operand widths, the resulting adders form a good compromise in both complexity and delay terms. As a result, their area × time² product, $A \times T^2$, is the best among the existing architectures. However, for wide operands, the reentering carry input has very large fan-out requirements, leading to considerably slower designs. Moreover, for wide operands, the area grows considerably because of the reentering carry's buffering requirements.

Motivated by the above, in the following section, we introduce new architectures for modulo $2^n \pm 1$ adders that are capable of offering a performance close to that of item 2 above, with an implementation area close to that of item 3 above, for all examined operand sizes.
3 SELECT-PREFIX MODULO $2^n \pm 1$ ADDITION
Select-Prefix Addition Block
We can consider the adder of Fig. 5 as a building block. $BG_i$ and $BP_i$ are the generate and the propagate output lines of block $i$, respectively. We can easily see that the values of $BG_i$ and $BP_i$ do not depend on $c_{in,i}$. A 2's complement adder can be designed using $m$ blocks of the type of Fig. 5, as shown in Fig. 6, where a single AND-OR complex gate is utilized for forming the carry output of each block [26]. Due to their similarity to carry-select adders, these adders are called select-prefix adders [26]. The blocks used do not need to be of equal size; several combinations can be made. The delay and area of the resulting adder obviously depend on the choices made for the sizes of the blocks used. An algorithm for selecting the block sizes so as to minimize the delay of the adder has been given in [26].
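A behavioral sketch of a select-prefix adder built from blocks of arbitrary sizes is given below. It assumes that each block exposes $BG_i$ (its carry output for a zero carry input) and $BP_i$ (all of its bit positions propagate), and that the carry passed to the next block is $BG_i + BP_i\,c_{in,i}$. The block internals are modeled with integer arithmetic rather than prefix trees, and the function names are hypothetical.

```python
def block_signals(a_blk, b_blk, w):
    """Return (BG, BP) of a w-bit block; neither depends on the block's carry input."""
    bg = (a_blk + b_blk) >> w                       # carry out when c_in = 0
    bp = int((a_blk | b_blk) == (1 << w) - 1)       # every bit position propagates
    return bg, bp

def select_prefix_add(a, b, sizes):
    """Add a and b using blocks of the given widths, least significant block first."""
    c_in, result, shift = 0, 0, 0
    for w in sizes:
        mask = (1 << w) - 1
        a_blk, b_blk = (a >> shift) & mask, (b >> shift) & mask
        bg, bp = block_signals(a_blk, b_blk, w)
        result |= ((a_blk + b_blk + c_in) & mask) << shift   # block sum with its carry input
        c_in = bg | (bp & c_in)                              # AND-OR gate between blocks
        shift += w
    return result, c_in                                      # (sum bits, carry output)

print(select_prefix_add(0b10110101, 0b01101110, (2, 3, 3)))
```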
In the following, we first show how an adder with the structure of Fig. 6 can be transformed into a modulo $2^n - 1$ adder. Then, the same structure is transformed into a modulo $2^n + 1$ adder for diminished-one operands.
Proposed Modulo $2^n - 1$ Adder Architecture
The modulo $2^n - 1$ sum $S$ of $A$ and $B$ is given by

$$S = (A + B + c_{out}) \bmod 2^n,$$

where $c_{out}$ is the carry output of the binary addition of $A$, $B$. Therefore, a modulo $2^n - 1$ adder can be derived from the adder of Fig. 6 by connecting the final carry output back to the carry input (end-around-carry adder). The direct connection of the carry output to the carry input, however, turns the adder into an asynchronous sequential circuit whose state depends on its previous state and the relative propagation delays in the adder [27], [28]. The sequential behavior of the end-around-carry adder is the cause of a practical timing problem. Due to the potential race between two stable states, end-around-carry adders may exhibit long delays [28]. The works in [27], [28] are the first reported solutions to the problem of end-around-carry adder oscillations. Efficient adders that do not suffer from oscillations were proposed in [22], [23], [24].
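We can view the modulo $2^n - 1$ addition as a two-cycle operation. During the first cycle, $(A + B)$ is computed with $c_{in} = 0$. During the second cycle, $(A + B + c_{in}) \bmod 2^n$ is computed, where $c_{in}$ is the carry output of the first cycle. Therefore, the adder of Fig. 6 can be used to perform, in two cycles, the modulo $2^n - 1$ addition. In the following analysis, we denote by $BG_i$ and $BP_i$ the generate and propagate outputs of the $i$th building block. During the first cycle for the adder in Fig. 6, we have:

$$c_{in,0} = 0, \qquad c_{in,i} = BG_{i-1} + BP_{i-1} BG_{i-2} + \cdots + BP_{i-1} BP_{i-2} \cdots BP_1 BG_0, \quad 1 \le i \le m-1,$$

$$c_{out} = BG_{m-1} + BP_{m-1} BG_{m-2} + \cdots + BP_{m-1} BP_{m-2} \cdots BP_1 BG_0.$$

During the second cycle, we get:

$$c_{in,0} = c_{out}, \qquad c_{in,i} = BG_{i-1} + BP_{i-1} BG_{i-2} + \cdots + BP_{i-1} \cdots BP_1 BG_0 + BP_{i-1} \cdots BP_1 BP_0\, c_{out}, \quad 1 \le i \le m-1.$$

Substituting $c_{out}$ and dropping the product terms that are absorbed by shorter ones, the above equations can be expressed by the following general formula:

$$c_{in,i} = \sum_{j=0}^{m-1} \left( \prod_{k=1}^{j} BP_{|i-k|_m} \right) BG_{|i-1-j|_m}, \quad 0 \le i \le m-1, \tag{1}$$

where $|x|_m$ denotes $x \bmod m$, the sum and the product stand for logical OR and AND, respectively, and the empty product (for $j = 0$) equals 1.

The above relation shows that a block-based modulo $2^n - 1$ adder can be designed by proper interconnections of the propagate and generate signals of the blocks. Consider, for example, the design of a modulo $2^{d+f+g} - 1$ adder utilizing three blocks ($m = 3$) of the type of Fig. 5 of size $d$, $f$, and $g$, respectively. Then, from (1), we get:

$$c_{in,0} = BG_2 + BP_2 BG_1 + BP_2 BP_1 BG_0,$$

$$c_{in,1} = BG_0 + BP_0 BG_2 + BP_0 BP_2 BG_1,$$

$$c_{in,2} = BG_1 + BP_1 BG_0 + BP_1 BP_0 BG_2.$$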
The new architecture is given in Fig. 7. Oscillations do not occur in the proposed architecture since the computation of $BG_0$, $BP_0$, $BG_1$, $BP_1$, $BG_2$, and $BP_2$ is completely independent of the carry increment stages.
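The block-based modulo $2^n - 1$ addition can be checked with a short behavioral sketch. It assumes that the carry input of block $i$ is formed cyclically from the other blocks' outputs as $BG_{i-1} + BP_{i-1} BG_{i-2} + \cdots$, with block indices taken modulo $m$. Block internals are modeled with integer arithmetic, and all function names are hypothetical.

```python
def split_blocks(a, b, sizes):
    """Cut a and b into blocks, least significant first: (a_blk, b_blk, width, shift)."""
    blocks, shift = [], 0
    for w in sizes:
        mask = (1 << w) - 1
        blocks.append(((a >> shift) & mask, (b >> shift) & mask, w, shift))
        shift += w
    return blocks

def block_signals(a_blk, b_blk, w):
    """BG: carry out of the block for c_in = 0; BP: every bit position propagates."""
    return (a_blk + b_blk) >> w, int((a_blk | b_blk) == (1 << w) - 1)

def cyclic_carry_in(i, BG, BP):
    """OR of BG[i-1], BP[i-1] BG[i-2], ... taken cyclically over all m blocks."""
    m = len(BG)
    c, prod = 0, 1
    for j in range(m):
        k = (i - 1 - j) % m
        c |= prod & BG[k]
        prod &= BP[k]
    return c

def mod_2n_minus_1_add(a, b, sizes):
    """Modulo 2^n - 1 sum (two representations of zero: all 0s and all 1s)."""
    blocks = split_blocks(a, b, sizes)
    signals = [block_signals(x, y, w) for x, y, w, _ in blocks]
    BG = [s[0] for s in signals]
    BP = [s[1] for s in signals]
    result = 0
    for i, (x, y, w, shift) in enumerate(blocks):
        c_in = cyclic_carry_in(i, BG, BP)   # uses only BG/BP, so there is no feedback loop
        result |= ((x + y + c_in) & ((1 << w) - 1)) << shift
    return result

print(mod_2n_minus_1_add(45, 38, (2, 3, 1)))
```

Since every carry input is a function of the block generate and propagate outputs alone, the computation is purely combinational, which mirrors the argument that oscillations cannot occur.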
Note that the modulo $2^n - 1$ adders designed according to the proposed procedure produce two representations of zero (all 0s and all 1s). For their application in an environment with a single representation of zero, minor modifications are required [22], [23]. In such an environment, the proposed adders produce the all-1s representation of zero only when the two input operands are complementary. Since, in this case, $h_i = 1$ for $i = 0, 1, \ldots, n - 1$, and all the produced modulo $2^n - 1$ addition carries are at 0, a modification that avoids the all-1s output is to produce
Proposed Modulo $2^n + 1$ Adder Architecture
The diminished-one number representation is commonly used for modulo $2^n + 1$ operations. In this system, the value 0 is treated separately (for example, using an additional zero-indication bit). We will hereafter denote with $X^*$ the representation of $X$ in the diminished-one number system, that is, $X^* = X - 1$. Let $S$ denote the modulo $2^n + 1$ sum of $A$ and $B$. It has been shown [24] that

$$S^* = (A^* + B^* + \overline{c_{out}}) \bmod 2^n,$$

where $c_{out}$ is the carry output of $A^* + B^*$. Therefore, a modulo $2^n + 1$ adder can be derived from the adder of Fig. 6 by connecting the inverted final carry output back to the carry input.
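However, such an architecture suffers from oscillations. In this case too, we can view the diminished-one modulo $2^n + 1$ addition of $A$, $B$ as a two-cycle operation. During the first cycle, $(A^* + B^*)$ is computed. In the second cycle, $(A^* + B^* + c_{in}) \bmod 2^n$ is computed, where $c_{in}$ is the complement of the carry output of the first cycle, that is, $c_{in} = \overline{c_{out}}$. For the adder of Fig. 6, for the first cycle, we have:

$$c_{in,0} = 0, \qquad c_{in,i} = BG_{i-1} + BP_{i-1} BG_{i-2} + \cdots + BP_{i-1} BP_{i-2} \cdots BP_1 BG_0, \quad 1 \le i \le m-1,$$

$$c_{out} = BG_{m-1} + BP_{m-1} BG_{m-2} + \cdots + BP_{m-1} BP_{m-2} \cdots BP_1 BG_0.$$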
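For the second cycle,

$$c_{in,0} = \overline{c_{out}}, \qquad c_{in,i} = BG_{i-1} + BP_{i-1} BG_{i-2} + \cdots + BP_{i-1} \cdots BP_1 BG_0 + BP_{i-1} \cdots BP_1 BP_0\, \overline{c_{out}}, \quad 1 \le i \le m-1.$$

The above relations can be expressed by the following general formula:

$$c_{in,i} = BG_{i-1:0} + BP_{i-1:0}\, \overline{c_{out}}, \quad 0 \le i \le m-1, \tag{2}$$

where $c_{out}$ is the carry output of the first cycle, $BG_{i-1:0} = BG_{i-1} + BP_{i-1} BG_{i-2} + \cdots + BP_{i-1} \cdots BP_1 BG_0$ and $BP_{i-1:0} = BP_{i-1} BP_{i-2} \cdots BP_0$ for $i \ne 0$, and $BG_{-1:0} = 0$, $BP_{-1:0} = 1$.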
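Relation (2) indicates that a block-based modulo $2^n + 1$ adder for diminished-one operands can be designed by proper interconnections of the propagate and generate signals of the blocks. The new architecture is given in Fig. 8. Note that, in this case too, no oscillations can occur since the computation of $BG_0$, $BP_0$, $BG_1$, and $BP_1$ is completely independent of the carry increment stages.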
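A behavioral sketch of the block-based diminished-one modulo $2^n + 1$ addition follows. It assumes that each block receives its first-cycle carry ORed with the complemented first-cycle carry output gated by the propagate signals of all less significant blocks. Zero operands are assumed to be handled separately, block internals are modeled with integer arithmetic, and the function name is hypothetical.

```python
def diminished_one_add(a_star, b_star, sizes):
    """Return S* for nonzero operands A, B, given A* = A - 1 and B* = B - 1."""
    blocks, shift = [], 0
    for w in sizes:                                   # least significant block first
        mask = (1 << w) - 1
        blocks.append(((a_star >> shift) & mask, (b_star >> shift) & mask, w, shift))
        shift += w

    # First cycle (c_in = 0): ripple the block carries through the AND-OR gates.
    chain, group_p = [], []
    c, p = 0, 1
    for x, y, w, _ in blocks:
        bg = (x + y) >> w                             # block generate
        bp = int((x | y) == (1 << w) - 1)             # block propagate
        chain.append(c)                               # carry into this block for c_in = 0
        group_p.append(p)                             # AND of BP of all lower blocks
        c = bg | (bp & c)
        p &= bp
    inv_c_out = 1 - c                                 # complemented first-cycle carry output

    # Second cycle: each block adds with its corrected carry input.
    result = 0
    for (x, y, w, shift), c0, p0 in zip(blocks, chain, group_p):
        c_in = c0 | (p0 & inv_c_out)
        result |= ((x + y + c_in) & ((1 << w) - 1)) << shift
    return result

print(diminished_one_add(5, 3, (1, 2)), diminished_one_add(2, 1, (1, 2)))
```

For two blocks, the carry inputs computed this way reduce to $c_{in,0} = \overline{BG_1 + BP_1 BG_0}$ and $c_{in,1} = BG_0 + BP_0\,\overline{BG_1}$.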
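All diminished-one adders suffer from the problem of the correct interpretation of the all-zero output, since it may represent either a valid zero output (that is, an addition with a result of 1) or a real zero output (that is, an addition with a result of 0). As an example, consider the diminished-one modulo 9 additions of $A = 6$ and $C = 5$ with $B = 4$. We then have:

$$(A + B) \bmod 9 = 10 \bmod 9 = 1, \qquad (C + B) \bmod 9 = 9 \bmod 9 = 0,$$

with $A^* = 101_2$, $B^* = 011_2$, and $C^* = 100_2$.

The diminished-one modulo additions are presented below:

$A^* + B^*$: $101 + 011 = 1000$, so $c_{out} = 1$ and $(000 + \overline{c_{out}}) \bmod 2^3 = 000$. Result indicating a valid zero (sum of 1) = 000

$C^* + B^*$: $100 + 011 = 0111$, so $c_{out} = 0$ and $(111 + \overline{c_{out}}) \bmod 2^3 = 000$. Result indicating real zero = 000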
The real zero output results only when the two inputs are complementary. Therefore, the circuit of Fig. 9 can be used in parallel with the adder to indicate the real zero result. Note that the Exclusive-OR gates required for the detector circuit are already present in the $\square$ operators. Moreover, if $p_i$ is defined as $p_i = h_i = a_i \oplus b_i$, then only a logic AND operation of the block propagate signals (signal $BP_i$ of Fig. 5) is required.
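The modulo 9 example and the real-zero test can be reproduced with the sketch below. It assumes the relation $S^* = (A^* + B^* + \overline{c_{out}}) \bmod 2^n$ and detects a real zero by checking that all half-sum bits $h_i = a_i \oplus b_i$ are 1. The adder is modeled with integer arithmetic and the function names are hypothetical.

```python
def diminished_one_add(a_star, b_star, n):
    """S* = (A* + B* + not c_out) mod 2^n for nonzero operands."""
    total = a_star + b_star
    c_out = total >> n
    return (total + (1 - c_out)) & ((1 << n) - 1)

def is_real_zero(a_star, b_star, n):
    """True when every h_i = a_i xor b_i is 1, i.e. the operands are complementary."""
    return (a_star ^ b_star) == (1 << n) - 1

n, B = 3, 4
for name, x in (("A", 6), ("C", 5)):
    s_star = diminished_one_add(x - 1, B - 1, n)
    print(name, format(s_star, "03b"), is_real_zero(x - 1, B - 1, n))
```

Both additions produce the all-zero output, but only the pair $C^*$, $B^*$ is flagged as a real zero, so the detector resolves the ambiguity without inspecting the sum bits.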
COMPARISONS
For a realistic evaluation of the proposed architectures against those proposed in [22], [23], [24], [25], we modeled these architectures in HDL for $n$ = 8, 16, 32, and 64. The results of [23] indicate that, for small to medium operand lengths, the unbounded prefix algorithm of Fig. 1 leads to considerably more area-efficient designs than the bounded one of Fig. 2 without a sacrifice in performance. We therefore used a Ladner-Fischer [18] prefix structure for the building block of the proposed adders. We mapped our designs to the VST Diplomat technology (0.25 μm, up to five metal interconnection layers, 1.8/3.3 V) using Synopsys' Design Compiler. Each design was then recursively optimized for speed until the tool was unable to provide a faster design. As a last stage of each recursive run, the tool was instructed to recover as much area as possible. All the delay results are based on the assumption of worst-case process parameters and a wire model of up to 20K gates and five metal layers of interconnect and are given in ns, whereas all area results are expressed in μm².

Tables 1 and 2 list the results gathered for modulo $2^n - 1$ and diminished-one modulo $2^n + 1$ designs, respectively. For the proposed architectures, we modeled more than one option when applicable. Each option is described as A × B, where A denotes the number of blocks and B the number of bits in each block. The shaded cells in the last three columns indicate the best achieved result.
As we can observe from Tables 1 and 2, the CLA architectures [22], [25] are efficient only in terms of area. The parallel-prefix architectures of [23], [25] lead in nearly all cases to the fastest implementations. However, this is achieved at the cost of increased complexity. For the adders of [24], only the results for the fastest parallel-prefix structure are listed in Tables 1 and 2, although both bounded and unbounded fan-out algorithms were modeled. We can see from Tables 1 and 2 that the proposed architecture for $n > 8$ is capable of offering, on average, a delay close to that of the fastest architectures [23], [25] with an area complexity close to the best. As a result, using the $A \times T^2$ product as a metric, we can see that, for $n > 8$, the proposed architecture outperforms all previously reported ones.
Tables 1 and 2 also reveal that the proposed select-prefix modulo adders are, in all cases, slower than the parallel-prefix ones, irrespective of the operand length and the number of blocks that the operands are divided into. This does not contradict the findings of [26] since, in the case of modulo adders, the interconnection logic between blocks grows with the number of blocks that the operands are divided into. Tyagi in [26] has provided an algorithm for finding the sizes of the blocks that an adder should be divided into in order to attain the optimal delay. In a 2's complement adder, the carry output of a block only feeds the next more significant block. In our case of modulo $2^n \pm 1$ addition, the propagate and generate signals of each block drive the carry input of every other block in a cyclic manner. Therefore, to achieve an implementation with the best delay following the proposed architectures, one has to consider only select-prefix blocks whose sizes are as equal as possible.
CONCLUSIONS
We have, in this paper, introduced select-prefix architectures for modulo $2^n \pm 1$ adders. The adders designed following the proposed method achieve operating frequencies comparable to the fastest reported modulo $2^n \pm 1$ adders, while their area requirements are low. Therefore, they achieve the lowest $A \times T^2$ product reported in the open literature. The latter, together with their regular structure and hierarchical nature, makes them very attractive for all applications of modulo $2^n \pm 1$ adders.
