Introduction
The three-moduli residue number system (RNS) based on the set $\{2^k-1, 2^k, 2^k+1\}$ possesses many advantages over other moduli sets. The residues are easily obtained, and the system can be efficiently scaled by any one of the chosen moduli [1]-[7]. As with all RNS arithmetic systems, the conversion from residue to binary is the most computationally intensive task.
Among all published algorithms for conversion from residue to binary, the algorithm given by Andraos and Ahmed [3] is generally acknowledged to be the most efficient.
In this paper we make substantial improvements to the algorithm proposed by Andraos and Ahmed [3]. As a result, a hardware saving in excess of 40% and a reduction of 45% in the critical path are obtained.
Andraos and Ahmed's algorithm
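To convert a number, $X$, from its residue representation into its binary counterpart, the Chinese remainder theorem of eqn. (1) should be applied:
$$X = \left| \sum_{i=1}^{3} \hat{M}_i \left| \hat{M}_i^{-1} \right|_{m_i} x_i \right|_M \qquad (1)$$
where $x_i$ is the residue of $X$ modulo $m_i$, $\hat{M}_i = M/m_i$, $M = m_1 m_2 m_3$ is the dynamic range, $\left| \hat{M}_i^{-1} \right|_{m_i}$ is the multiplicative inverse of $\hat{M}_i$ modulo $m_i$, and $|\cdot|_M$ is the modulo $M$ operation. For the moduli $2^k-1$, $2^k$, $2^k+1$, one has $\hat{M}_i = 2^k(2^k+1)$, $2^{2k}-1$, $2^k(2^k-1)$, respectively. Andraos and Ahmed show that the corresponding multiplicative inverses are given by $2^{k-1}$, $2^k-1$ and $2^{k-1}+1$, resulting in a simplified expression for the conversion as given in eqn. (2).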
(2)
Andraos and Ahmed [3] proposed to use four 2k-bit binary adders, two of them in parallel, to evaluate this summation; this results in an efficient implementation.
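As a software reference point for the conversion problem, the Chinese remainder theorem can be evaluated directly. The following is a minimal sketch, assuming the moduli set {2^k - 1, 2^k, 2^k + 1}; the function name `crt_to_binary` is hypothetical, and the code is a plain CRT evaluation rather than the simplified adder-based expression of [3].

```python
def crt_to_binary(residues, k):
    """Recover X from its residues modulo (2^k - 1, 2^k, 2^k + 1).

    Direct Chinese remainder theorem evaluation; illustrative only,
    not the hardware-oriented expression of Andraos and Ahmed.
    Assumes k >= 2 and Python 3.8+ (for pow with a negative exponent).
    """
    moduli = (2**k - 1, 2**k, 2**k + 1)
    M = moduli[0] * moduli[1] * moduli[2]      # dynamic range
    total = 0
    for x_i, m_i in zip(residues, moduli):
        M_hat = M // m_i                        # product of the other two moduli
        inv = pow(M_hat, -1, m_i)               # multiplicative inverse of M_hat mod m_i
        total += M_hat * inv * x_i
    return total % M


# Round trip for k = 3, i.e. moduli (7, 8, 9) and dynamic range 504.
k = 3
x = 100
residues = (x % 7, x % 8, x % 9)
assert crt_to_binary(residues, k) == x
```

The final reduction modulo M is the expensive step in hardware, which is why adder-based reformulations of this sum are of interest.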
Now we discuss improvements that can be made to this algorithm.
Improved Conversion Algorithm
First we show that the four numbers in the summation of eqn. (3) can be reduced to three numbers.
Let us consider the summation of the first two numbers
From the fact that: Assuming that is expressed in 2k bits, where the k-1 most significant bits are zeros. .
and (10) where the negative of a number modulo $2^{2k}-1$ is the ones' complement of that number.
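The two arithmetic facts relied on in this derivation can be checked numerically. The sketch below assumes the modulus in question is 2^(2k) - 1 and that numbers are held in 2k bits; the helper names `ones_complement` and `rotate_left` are hypothetical. It verifies that negation corresponds to bitwise complement and that multiplication by a power of two corresponds to a cyclic left shift.

```python
def ones_complement(a, k):
    """Bitwise complement of a 2k-bit number."""
    mask = (1 << (2 * k)) - 1
    return a ^ mask


def rotate_left(a, j, k):
    """Cyclic left shift of a 2k-bit number by j positions."""
    n = 2 * k
    mask = (1 << n) - 1
    j %= n
    return ((a << j) | (a >> (n - j))) & mask


k = 3
mod = (1 << (2 * k)) - 1            # 2^(2k) - 1 = 63 for k = 3
for a in range(mod + 1):
    # a + (ones' complement of a) is the all-ones word, which is 0 mod 2^(2k) - 1
    assert (a + ones_complement(a, k)) % mod == 0
    for j in range(2 * k):
        # multiplying by 2^j is a j-bit cyclic left shift, modulo 2^(2k) - 1
        assert rotate_left(a, j, k) % mod == (a * 2**j) % mod
```

Both comparisons are made modulo 2^(2k) - 1 because the all-ones word and the all-zeros word represent the same value.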
We can rewrite the summation as shown in eqn. (11).
Assuming that the 2k bit expression of is given by:
then:
Thus
Let us first consider the summation of the second and third numbers on the right hand side.
Except for the k-th least significant bit position, for every bit position of these two numbers, there is only one non-zero bit. The summation of the k-th least significant bit is the summation of 1 and ; that yields a sum of and a carry of to the k+1-th bit position.
Since that bit of the second number is 1, the carry, , will propagate to the most significant bit position where, for the second number, this is . The summation of and is 1 and no carry is produced. The result of the summation of the second and third numbers is therefore , and so the k bits starting from the k-th least significant bit to the next most significant bit are all . Since and , i=0,1,...,k-1, can never be 1 at the same time, they can be combined with the k bits in the same range of the first number on the right hand side of eqn. (15). Now we can write:
where , i=0,1,...,k-1, ( denotes logic OR). Note that the k-th least significant bits of the second and the third numbers of the right hand side of eqn. (16) are zeros, and that, except for the most significant bit position, there is at least one zero bit in these two numbers for any bit position. Using the cyclic shifting property again, and proceeding as before, we can add these two numbers to obtain a single 2k bit number. Thus,
An example
To demonstrate the validity of the above reduction of three 2k bit numbers to two, we provide an example.
Let k=3, , ,
Using the new algorithm, , and
. Adding these two numbers together we obtain the correct result of 18.
Hardware implementation
To implement the modulo addition of three 2k-bit numbers efficiently, we may use 2k full adders as carry save adders (CSA) to convert the three 2k bit numbers into two.
From eqn. (8), the carry-out from the most significant bit is fed to the least significant bit position. Then a fast 2k-bit carry propagate adder (CPA), with its carry-out connected to its carry-in, is used to perform the modulo addition of the two numbers to yield the final result. The structure is shown in Fig. 1. The same weighted bits from each number are the inputs to a full adder.
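The carry-save stage followed by an end-around-carry adder can be modelled bit-accurately in software. The following is a minimal sketch, assuming the modulus is 2^(2k) - 1; the function names `csa_eac`, `cpa_eac` and `modular_add3` are hypothetical and the code models behaviour only, not gate-level structure or timing.

```python
def csa_eac(a, b, c, k):
    """One level of 2k full adders used as a carry-save adder.

    Returns (sum_word, carry_word). The carry word is shifted left by one
    position, and the carry out of the most significant bit is fed back to
    the least significant bit position.
    """
    n = 2 * k
    mask = (1 << n) - 1
    sum_word = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)                  # per-bit majority
    carry_word = ((carry << 1) | (carry >> (n - 1))) & mask
    return sum_word, carry_word


def cpa_eac(a, b, k):
    """2k-bit carry propagate adder with carry-out connected to carry-in."""
    n = 2 * k
    mask = (1 << n) - 1
    t = a + b
    # Adding the carry-out back in cannot overflow a second time.
    return (t & mask) + (t >> n)


def modular_add3(a, b, c, k):
    """Add three 2k-bit numbers modulo 2^(2k) - 1 (zero may appear as all ones)."""
    s, cy = csa_eac(a, b, c, k)
    return cpa_eac(s, cy, k)


print(modular_add3(50, 40, 30, 3))   # 57, i.e. 120 mod 63
```

The result is congruent to a + b + c modulo 2^(2k) - 1, but a zero result may be produced as the all-ones word, so a separate step is still needed if a unique representation of zero is required.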
Removing redundancy
Using 2k bits to represent a number modulo $2^{2k}-1$, there are two representations for zero; i.e. $00\ldots0$ and $11\ldots1$. This redundancy can be removed by eliminating the latter representation. The approach is shown in the following:
Performance evaluation and comparison
In order to evaluate the cost of the converter, and to compare our technique with other approaches, we need to establish the CPA adder technology to be used. We have performed the following analysis assuming the availability of a fast carry look-ahead adder, such as discussed in reference [8].
For a k-bit CPA adder, we evaluate the delay, , and the silicon area, , by:
and (19)
where and are the delay and the silicon area for a full adder, respectively.
The elimination of the redundant representation of zero is a common issue for all approaches we are going to compare. Therefore we have not included this step in the following estimation.
The delay of a modulo adder includes the delay of a 2k-bit CPA adder, , together with the delay of 2k-bit carry propagation, which we assume to be proportional to . Thus, the delay of a modulo adder is given by:
where the constant, , is less than 1. The silicon area for a modulo adder, as implemented by our technique, is identical to that for a 2k-bit CPA adder.
With the above analysis, we may readily obtain the delay and silicon area for the proposed converter as:
and (22)
In the following, we evaluate the delay and silicon area for the converters given in [3] and [7] for comparison.
The converter described in [3] requires three 2k-bit CPA adders, two of which operate in parallel, one modulo adder, and 2k-1 AND gates, which are cascaded in stages to provide the input to the modulo adder. The delay is estimated by eqn. (23):
where is the delay of an AND gate.
We assume that the modulo adder in [3] is implemented with our technique, as shown in Fig. 1. Then the silicon area for the converter in [3] can be estimated by eqn. (24):
where is the area of an AND gate.
The converter in [7] requires two modulo subtractors, one modulo subtractor, and one modulo adder. Implementing a modulo adder or subtractor requires, as suggested in [7], a k-bit adder followed by a k-bit carry corrector, as shown in Fig. 1 of [7]. This only requires the same hardware and delay as a k-bit adder. To improve the speed we assume that fast CPA adders are employed. Thus the delay and silicon area for the converter in [7] can be estimated by eqn. (25) and eqn. (26):
Table 1 compares the proposed converter with those of [3] and [7] for , where the delay and area are normalized to the delay and area of a full adder, respectively. The delay and the area for the AND gates in eqn. (23) and eqn. (24) are ignored and we assume that . The area-delay products in Table 1 give an overall performance comparison between the three converters. We can clearly observe a hardware saving of over 40% while almost halving the delay.
Conclusions
An efficient residue to binary converter for the moduli set $\{2^k-1, 2^k, 2^k+1\}$ is proposed. The efficiency is obtained by an improvement on an existing algorithm, and the introduction of a more efficient hardware implementation. Compared with two converters that are recognized as among the more efficient of the previously published art, our new converter architecture provides at least a 40% hardware saving and a 45% reduction in the delay.
Acknowledgments
The authors acknowledge support from the Natural Science and Engineering Research 
