Introduction
An efficient residue number system (RNS) consists of k pairwise relatively prime moduli . A number X is represented in as , where denotes X mod m. The cardinality or dynamic range of equals the product of the moduli [1]. Implementation of addition, subtraction, or multiplication in such an RNS is distributed on k computing channels corresponding to the k moduli, thus replacing a relatively slow operation on a wide operand by k independent and concurrent operations of the same kind with smaller operands. Therefore, balanced-latency channels are naturally desirable. Improved RNS addition [2], [3], [4], multiplication [5], [6], [7], [8], and conversion [9], [10], [11], [12] algorithms, sometimes based on new encodings [13], appear regularly in the literature. Applications of RNS include certain signal processing domains where the division operation is rarely used and the forward and reverse conversion overheads are more than offset by the attendant gain in speed. This compensation is more likely to materialize with certain special moduli. For example, the special 3-modulus number system has been of longstanding interest due to the affinity of mod-( ) addition and multiplication with ordinary n-bit binary arithmetic [14], [15], [16]. Extensive research on adders has improved them to the point that only one n-bit carry-propagate addition takes place in the course of adding two mod-( ) residues [14], [15]. In order to increase the dynamic range with minimal negative impact on arithmetic speed, one can add extra equally wide coprime moduli (e.g., ) to , where a special modulo adder with only one n-bit adder in the critical path is offered in [17]. However, substantial effort may be needed for designing multiple arithmetic units for different values of (e.g., 3, 5, 7, for n = 4) from scratch, including the labor-intensive and error-prone optimization process for high speed and power economy in each case.
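The channel-wise arithmetic described above can be illustrated with a minimal sketch. The moduli (15, 16, 17) and all function names are assumptions chosen for illustration only; the reverse conversion uses the textbook Chinese remainder theorem, not any of the cited converter designs.

```python
from math import prod

# Assumed example moduli: pairwise coprime, each residue fits in at most 5 bits.
MODULI = (15, 16, 17)

def to_rns(x, moduli=MODULI):
    """Forward conversion: one residue per channel."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    """Each channel adds independently of the others."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    """Each channel multiplies independently of the others."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(residues, moduli=MODULI):
    """Reverse conversion via the Chinese remainder theorem."""
    big_m = prod(moduli)  # dynamic range = product of the moduli
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    return total % big_m
```

Results are correct as long as the true sum or product stays below the dynamic range, which is 15 x 16 x 17 = 4080 for the assumed moduli.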
A study by [2] divides the values into two classes, depending on the 1s in the binary representation of .
We introduce an excess-representation for mod-( ) residues that leads to unified modular adders for all , where our designs use standard arithmetic components, such as carry-save and carry-propagate adders that have been extensively optimized for area, power, and a host of other composite figures of merit. Further advantages of a unified treatment, as opposed to a multiplicity of specialized schemes, include the applicability of a single strategy for fault tolerance. Both gate-level analyses and VLSI synthesis point to advantages in latency, area, and/or power compared with other proposed implementations in the literature.
The rest of this paper is organized as follows. After discussing mod-( ) adders in Section 2, we define our excess-representation in Section 3. Section 4 deals with the use of this representation for implementing modular adders. In Section 5, cost-performance comparisons are offered via both gate-level analysis and circuit synthesis. Conclusions and ideas for future research appear in Section 6.
Modulo-( ) Adders
General mod-m addition of n-bit mod-m residues and ( , and ) can be described as in Eqn. 1, where Z = A + B - m, and σ = 0 (1) if Z ≤ 0 (> 0).
(1)
Realization of Eqn. 1 entails three O(log n) operations, at best, for computing Z, σ, and Z + m. Hiasat [17] proposes an alternative formulation, similar to Eqn. 2, where and no zero detection (i.e., computation of σ) is required. Hiasat cites three prior works that also implement Eqn. 2: [18] uses an n-bit adder to compute , followed by another one to form ; [19] uses the same adder twice through a latch; and [20] uses one binary adder to evaluate and another one, preceded by a carry-save adder (CSA), to compute concurrently. Hiasat's scheme forms two different sets of propagate and generate signals in positions where the binary representation of has a 1. One of these sets is later selected via the carry-out of , which is computed by carry-lookahead logic dedicated exclusively to computing the required carry-out, thus saving some area and delay. A recent realization of Eqn. 2 [2] is similar to that of [17], but postpones the selection operation until the actual sum results are available, thus saving some area and time. The aforementioned designs all entail the undesirable fan-out of n due to one selection signal controlling n multiplexors.
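As a rough behavioral illustration of the two-adder, select-based style of modular addition surveyed above, the following sketch forms both candidate sums and lets a carry-out pick between them. It is a generic formulation under the assumption that m is at most 2^n and both operands are valid residues; it is not claimed to match Eqn. 2 or any cited circuit term by term, and the function name is hypothetical.

```python
def mod_add_select(a, b, m, n):
    """Add residues a, b in [0, m), assuming m <= 2**n (illustrative only)."""
    mask = (1 << n) - 1
    plain = a + b                         # first adder: A + B
    corrected = plain + ((1 << n) - m)    # second adder: A + B + (2**n - m)
    select = corrected >> n               # carry-out is 1 exactly when A + B >= m
    # One select signal steers all n result bits (the fan-out-of-n issue noted above).
    return (corrected & mask) if select else plain
```

For example, with m = 13 and n = 4, adding 9 and 7 gives corrected = 16 + 3 = 19, whose carry-out is 1, so the low four bits (3) are selected.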
Eqn. 2 can be transformed into Eqn. 3, where , and is the logical complement of (i.e., ).
Direct implementation of Eqn. 3 requires a CSA, followed by a carry-propagate addition stage and a conditional subtraction.
In what follows, we propose a new method and representation that obviates the need for the final subtraction. In the mod-( ) adders of [15] and [14], corresponding to and , respectively, two O(log n) additions (similar to those in Eqn. 2) have been fused in one end-around-carry look-ahead addition. However, this two-in-one technique works only for carry-lookahead schemes and cannot be applied to other addition schemes (e.g., those with lower area/energy burden, such as ripple-carry addition). Neither can it be generalized to other values of in ( ). To remedy the latter problem, we endeavor to postpone the conditional subtraction of in Eqn. 3, fusing it with the next modular operation (or with the final conversion to binary). This would intuitively require flagging the interim sum with , as a reminder of the necessity of the compensating subtraction operation.
Excess-Representation
A natural encoding of mod-( ) residues uses n-bit words representing integers in . Implementation of Eqn. 3, based on this natural encoding entails a -weighted inverted end-around carry subtraction, which calls for a second n-bit carry-propagate operation. To avoid this undesirable time overhead, we propose a special representation.
Definition 1 (Excess-representation): A mod-( ) residue is encoded as a (flag, magnitude) pair ( ), such that , , and . The representations ( ) and ( ) denote the same residue . The overlap in this representation corresponds to the shaded interval in Fig. 1.
Eqn. 4 represents a modified version of Eqn. 3, based on the new representation, where , , and the interim sum is computed based on Eqn. 5. 
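The idea of deferring the correction and carrying a flag along with the magnitude can be sketched behaviorally as follows. Because the symbols in the definition are not legible here, the sketch rests on explicit assumptions that may differ from the paper's exact formulation: the modulus has the form 2^n - delta, a set flag means that delta is still to be subtracted from the magnitude, and a set flag implies a magnitude of at least delta. All names are hypothetical, and the code is not a transcription of Eqns. 4 and 5.

```python
def encode(x, delta, flag=0):
    """Encode residue x as (flag, magnitude); assumed meaning: x = magnitude - flag * delta."""
    return (flag, x + flag * delta)

def decode(pair, n, delta):
    """Recover the residue modulo the assumed modulus 2**n - delta."""
    flag, mag = pair
    return (mag - flag * delta) % ((1 << n) - delta)

def flagged_add(a, b, n, delta):
    """Modular addition with the correction deferred into the output flag."""
    (fa, ma), (fb, mb) = a, b
    third_term = (1 - fa - fb) * delta   # one of -delta, 0, +delta
    w = ma + mb + third_term             # a single (n+1)-bit interim sum
    carry = w >> n
    # Carry-out set: dropping 2**n already removed the modulus and the extra delta.
    # Carry-out clear: the extra delta is still present, so flag it for later.
    return (1 - carry, w & ((1 << n) - 1))
```

Under these assumptions no second carry-propagate step is needed: the complement of the carry-out simply becomes the flag of the result, which the next operation absorbs through its third term.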
Overlap interval
A Reconfigurable Modular Adder
Let represent the third term of in Eqn. 5, which has a value in . We can reasonably restrict as and thus represent it as an ( )-bit unsigned number . Consequently, can be encoded as the 1's-complement number . Eqn. set 6, which describes the bits of in terms of those of , and the input flags and , can be easily derived from the truth relations of Table II.
Eqn. 7 shows how is computed in terms of the bits of , of , and of the sign-extended .
We convert the term of Eqn. 7 to its carry-save representation by means of an n-bit CSA. The result is described by Eqn. 8, where and represent the sum and carry signals of a full adder that computes , for , and .
(8)
The term of Eqn. 8 can be computed by any n-bit adder, in a generic manner, including fast CLAs. Eqn. 9 describes the resultant interim sum, where represents the carry-out of the generic adder.
(9)
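The carry-save step followed by a single carry-propagate addition, as described above, can be illustrated with a generic three-operand sketch on nonnegative integers. It shows only the general CSA principle; the operand names are assumptions and do not correspond to the specific terms of Eqns. 7 to 9.

```python
def carry_save(x, y, z):
    """3:2 compression: every bit position acts as an independent full adder."""
    sum_bits = x ^ y ^ z                              # full-adder sum outputs
    carry_bits = ((x & y) | (x & z) | (y & z)) << 1   # carry outputs, one position up
    return sum_bits, carry_bits

def three_operand_add(x, y, z):
    """CSA stage followed by one ordinary carry-propagate addition."""
    s, c = carry_save(x, y, z)
    return s + c   # any generic adder can perform this last step
```

The CSA stage has constant depth regardless of word width, so only the final addition contributes a carry chain to the critical path.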
To compute , we note that because , the parenthesized expression in Eqn. 9 should have a value in {0, 1}. Therefore, . Eqn. 10 is the logical equation for based on the latter arithmetic equation and can be derived via a simple truth table.
(10)
Fig. 2 depicts the proposed unified mod-( ) adder architecture that implements the computations described in Eqns. 6-10. Note that in a parallel-prefix realization of the n-bit adder, the two most significant carry signals are available at the same time. Therefore, the most significant sum bit and the sum flag are also available at the same time.
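The remark about carry availability in a parallel-prefix adder can be seen in a word-level behavioral model of a Kogge-Stone style prefix computation. This is a sketch for illustration only: it uses XOR as the propagate signal and hypothetical names, and it does not model the gate-level structure or the flag logic of Fig. 2.

```python
def kogge_stone_add(a, b, n):
    """Behavioral n-bit Kogge-Stone addition; returns (sum, carry_out, carries)."""
    mask = (1 << n) - 1
    h = (a ^ b) & mask          # half-sum bits
    g = (a & b) & mask          # generate bits
    p = h                       # propagate bits (XOR form, an assumed choice)
    d = 1
    while d < n:                # about log2(n) prefix levels
        g = (g | (p & (g << d))) & mask   # uses the previous-level p and g
        p = (p & (p << d)) & mask
        d <<= 1
    carries = g                 # bit i is the carry out of position i
    total = (h ^ (carries << 1)) & mask
    carry_out = (carries >> (n - 1)) & 1
    return total, carry_out, carries
```

All bits of `carries` are produced by the same final prefix level, which is why the carry into the most significant position and the carry-out become available together.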
Our scheme has six advantages compared with other methods published in the literature: (1) The use of only standard off-the-shelf components (e.g., full and half adders, carry-lookahead logic, parallel-prefix nodes) facilitates upgrading of the arithmetic circuits when new and superior versions become available; (2) The generic nature of the main n-bit adder expands the design space, allowing the best adder architecture to be chosen; (3) Latency is reduced by virtue of using only one normal n-bit adder on the critical path; (4) The unified design approach for all values of leads to identical circuitry for all RNS channels, reducing the effort for design, verification, and testing, and enhancing regularity; (5) Reconfiguration is facilitated by placing the value in a writable register, allowing low-cost fault tolerance, as in [13]; (6) The dynamic range can be adjusted with high resolution via selection of a suitable set of values.
To elaborate on point 6 above, Table III provides maximum possible dynamic ranges from 10 to 38 bits corresponding to two to eight 5-bit RNS channels.
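The kind of figure tabulated for point 6 can be explored with a small search for the largest product of k pairwise coprime moduli. The sketch assumes that every modulus is at most 2^n, so that each residue fits in n bits; the function names are hypothetical, and the search is a plain backtracking procedure rather than the method used to build Table III.

```python
from math import gcd

def max_coprime_product(k, n_bits):
    """Largest product of k pairwise coprime moduli, each at most 2**n_bits."""
    best = 0

    def extend(start, chosen, product):
        nonlocal best
        if len(chosen) == k:
            best = max(best, product)
            return
        remaining = k - len(chosen)
        for cand in range(start, 1, -1):
            # Every later modulus is at most cand, so this bounds the final product.
            if product * cand ** remaining <= best:
                break
            if all(gcd(cand, c) == 1 for c in chosen):
                extend(cand - 1, chosen + [cand], product * cand)

    extend(1 << n_bits, [], 1)
    return best

def dynamic_range_bits(k, n_bits):
    """Number of bits needed to cover the dynamic range of the best k-channel set."""
    return (max_coprime_product(k, n_bits) - 1).bit_length()
```

For two 5-bit channels the best pair is 32 and 31, whose product 992 corresponds to a 10-bit dynamic range, in line with the lower end quoted above.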
The advantages just listed notwithstanding, there is a drawback for RNS applications requiring very wide dynamic range that would necessitate a fairly large number of relatively small moduli: RNS to binary conversion is more difficult in such cases [21], [22], [23], [24] than for comparable RNSs with a few moduli of the forms similar to those of .
Synthesis and Comparisons
To compare the area and delay of the proposed adder with the latest relevant work [2], and with the designs in [18] and [17], we commence with the same gate-level evaluation of area and delay used in the references: we count each basic gate as having unit area and delay, with XOR gates counting double. The results are reported in Table IV, assuming that, as in the best design of [2], the main n-bit adder is of the Kogge-Stone [25] parallel prefix (KSPP) variety. Each full adder in the CSA is assumed to imply 7 units of area and 4 units of delay [26], with the first XOR delay overlapping with the computation in the box. Table V provides some detail on how the area and delay of the n-bit KSPP adder are evaluated, where pgh stands for the propagate ( ), generate ( ), and half-sum ( ) signals. There are n of the 3-gate pgh boxes, n XORs for the final sum bits, at most 0.5n bottom parallel-prefix nodes and leftmost nodes, each with 2 gates, and 1.5n + 1 other nodes, each with 3 gates. Therefore, the unit gate area of the KSPP adder adds up to +1.5n + 3. The gate-level evaluation of Table IV indicates the same delay for our adder and the one in [2]. The area difference between the latter and our proposed design is expressed by Eqn. 11. The first term is negative for (e.g., 3, 5, 9, 17).
It is not difficult to verify that for . However, the case n = 3 lacks practical significance. Also, synthesis results (Table VI) show that the difference for the more practical data paths with is only 0.7% in added area for the proposed design, while the power dissipation is 2.4% less.
For a more reliable evaluation, we described the designs in [18], [17], and [2], along with our proposed adder, in VHDL code, validated them for , and used the codes as inputs to a 0.13 µm CMOS technology synthesis process via the Synopsys Design Compiler. Table VI shows the synthesis results for the four chosen design points.
In the first design point of Table VI, all four bits in the binary representation of are 1s. In the second and fourth design points, has two 1s in its 4-bit and 7-bit encodings ( and ), respectively. Finally, in the third instance, six of the seven bits in the binary representation of are 1s. For the first and third design points, our proposed approach fares better in all three measures compared with the design of [2]. In particular, our design outperforms the latter in terms of the power-delay product figure of merit (1635 versus 2035, or nearly 20% less). Figs. 3-5 also contrast the balance of our proposed design and that of various weights to the imbalance designs. Fig. 6 depicts the performance of the designs, using the composite power-delay product figure of merit. We see that, though not always superior to other designs, our proposed design performs well in most cases and may even be the preferred method in other cases when design uniformity, ease of verification/testing, and fault tolerance are taken into account.
Conclusion
We have proposed a unified method for designing modular adders that leads to highly competitive designs, even when compared against schemes that are restricted to special moduli. Our method also offers other advantages in design regularity, use of standard arithmetic building blocks, and ease and efficiency of incorporating fault tolerance schemes. The regularity introduced should ease VLSI layout and also render FPGA-based implementations more attractive. The standard building blocks used include carry-save and carry-propagate adders that have been extensively optimized for area, power, and a host of other composite figures of merit over the years, and continue to undergo even greater improvements, as new designs appear in the literature (e.g., [27]). Our scheme was made possible by a special representation of arbitrary residues, in which a flag bit indicates a possible postponed subtraction to the next RNS operation.
We presented quantitative comparisons of our designs with those previously published in the literature, using both detailed gate-level analyses and practical VLSI synthesis. The results pointed to advantages in latency, area, and/or power consumption, compared with other implementations appearing in the literature. We are now examining various possibilities for improving our designs, with particular attention to forward and reverse conversion processes.
