In this paper we introduce a new algorithm for division in the residue number system, which can be applied to any moduli set. Simulation results indicate that the algorithm is faster than the most competitive published work. To further improve this speed, we customize the algorithm to serve two specific moduli sets: $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$. The customization eliminates memory devices (ROMs), thus increasing the speed of operation. A semi-custom VLSI design of this algorithm for the moduli set $(2^k + 1, 2^k, 2^k - 1)$ has been implemented, fabricated and tested.
INTRODUCTION
The residue number system (RNS) has the advantage of carry-free arithmetic operations. Thus, using residue arithmetic would in principle increase the computer processing speed.
In particular, addition, subtraction and multiplication can be performed on each residue digit concurrently and independently. However, there are drawbacks associated with RNS. These drawbacks include the difficulty of residue operations like division, sign and overflow detection.
Generally speaking, all reported algorithms on division in RNS [3, 4, 5, 6, 7, 8, 9, 10, 11] have the disadvantage of lengthy arithmetic operations, large execution time and complex hardware requirements. The complexity of these algorithms is due to mixed-radix conversion (MRC) and performing difficult residue operations.
The moduli sets $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$ are particularly important in applications which require a high degree of precision [12, 13, 14]. Some recursive digital filters require a high degree of precision in their computations in order to accurately control the frequency characteristics and to eliminate the occurrence of instabilities. Moduli sets like these are among the very few systems that can deal with such critical situations [14]. The properties of these sets become more apparent in hardware considerations because most moduli are diminished or augmented powers of two. Residue addition for diminished powers of two is of the carry-add type, and multiplication by a power of two is equivalent to left rotation. Similarly, residue addition for augmented powers of two is of the carry-subtract type. Therefore, they can play an increased role in implementing an RNS arithmetic unit for computers.

The advances in VLSI technology have suggested novel approaches to the implementation of arithmetic units over finite rings. RNS supports the main VLSI design properties and features such as simple connections, concurrency and modularity. Independence of residue digits eliminates complex interconnection patterns among different logic components. This independence leads to concurrency, where an arithmetic operation can be carried out on all residue digits concurrently. The similarity in processing architecture for each modulus offers functional and layout modularity.

* Part of this paper is based on [1]. Another part of this paper is based on [2].
In this paper, we present a new division algorithm. The main idea is based on selecting an approximate quotient that is guaranteed to produce a non-negative remainder in every iteration until the division procedure is completed. The main features of this algorithm, as compared with others [3, 4, 5, 6, 7, 8, 9, 10, 11], are that no sign determination, overflow detection, scaling or MRC is needed. Moreover, there is no need for base extension or for auxiliary or redundant moduli. The speed of the algorithm does not depend on the number of moduli, but on the dynamic range.
Nevertheless, this new algorithm is still based upon converting the residue representation of the dividend and the divisor to a weighted code in order to derive some information regarding the position of the most-significant non-zero bit contained in a residue number. The algorithm is then customized to serve the above-mentioned moduli sets. This customization reduces the hardware and time requirements associated with converting the residue representation to a weighted code. The customized structure has been designed and implemented using VLSI design tools. The layout has then been fabricated and tested to verify the integrity and functionality of the design.
In order to evaluate the performance of this new design, it has to be compared with other RNS division algorithms. Chren's algorithm [5] has a slightly better performance compared to another algorithm [4] in terms of the mean of residue operations needed for each division problem. Nevertheless, it requires MRC and residue scaling. If lookup tables are used, then MRC and scaling would require (N − 1) and (2N − 1) memory cycles respectively [12] . Chren's algorithm also requires a redundant modulus which represents another drawback. Gamberger [6] presented an algorithm which does not use MRC. The number of iterations for each division problem is proportional to the magnitude of the divisor.
The mean of the number of iterations is, thus, very high. Moreover, the hardware implementation suggested by Gamberger is very complicated and expensive due to the utilization of auxiliary RNS. Hung and Parhami [11] introduced two RNS division algorithms based on the approximate-sign detection technique. The faster among the two [11] requires much more hardware than the other slower one. Hung and Parhami [11] indicated that intermediate to these algorithms are a number of choices that offer speed/cost tradeoffs. Although both the faster algorithm of Hung and Parhami [11] and the algorithm of Lu and Chiang [3] have the same time complexity, the latter one has a better hardware complexity.
The most competitive work, introduced by Lu and Chiang [3], does not use MRC; however, it utilizes the fractional representation of X/M to detect the parity of a residue number and hence to check whether an overflow has taken place. Lu and Chiang's algorithm requires $2\log_2 Q$ steps, where Q is the quotient. Each step consists of several residue additions and subtractions, one residue multiplication, two memory access cycles and one multi-operand addition (in fact, in parts II and IV of Lu and Chiang's algorithm, more than one multi-operand addition might be needed). Realization II of the new algorithm requires $\log_2 Q$ steps, where each step consists of one residue multiplication ($Q_i Y$), one residue subtraction ($X - Q_i Y$) that is performed in parallel with one residue addition ($Q = Q + Q_i$), one multi-operand addition and two memory access cycles: one to get the fractional representations of the residue digits and the other to obtain $Q_i$.

NOTE. Following the prevalent method in the literature for measuring the execution time of residue arithmetic algorithms [4, 5, 12, 13], the mean of the basic residue arithmetic operations needed by each algorithm is computed. The basic residue operations are addition, subtraction and multiplication.
The following notational convention has been adopted for this paper:
• $\{m_1, m_2, \ldots, m_N\}$, moduli set of N pairwise relatively prime positive integers.
• $\lceil \cdot \rceil$, the ceiling value of (.); that is, the next integer greater than or equal to (.).
• $\lfloor \cdot \rfloor$, the floor value of (.); that is, the preceding integer less than or equal to (.).
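• Define a function h(I) such that: $h(I) = j$, where $2^j \le I < 2^{j+1}$; that is, h(I) is the position of the most-significant non-zero bit of I.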
DIVISION ALGORITHM
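Assume that X, Y and Q are non-negative integers such that $Q = \lfloor X/Y \rfloor$, $Y \neq 0$. Then the following steps introduce the basic idea for division in RNS:
1. Set the quotient Q to zero; Q = 0.
2. Find the position of the most-significant non-zero bit in the divisor Y, say k, that is k = h(Y).
3. Find the position of the most-significant non-zero bit in the dividend X, say j, that is j = h(X).
4. If j > k, then set $Q_i = 2^{j-k-1}$, $X = X - Q_i Y$ and $Q = Q + Q_i$, and go to step 3.
5. If j = k, then determine whether X ≥ Y (Lemma 1). If so, Q = Q + 1. Otherwise, Q is unchanged. In either case, end procedure.
6. If j < k, then Q is unchanged. End procedure.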
This basic algorithm can be used effectively with RNS division arithmetic. An important feature of this algorithm is the selection of the partial quotient to be $2^{j-k-1}$; hence the quantity $X - (2^{j-k-1} \cdot Y)$ is guaranteed to be non-negative as long as X > Y. It should be emphasized that X and Y in the above algorithm are binary representations of non-negative integers and that the algorithm is still correct when the fractional representation is adopted. The proofs of both cases, integer and fractional representations, are introduced in the next two subsections.
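As an illustration only, the basic procedure can be sketched in Python on ordinary integers. The names `h` and `divide` are chosen here for illustration, `h` is taken to be the index of the most-significant non-zero bit, and the j = k case uses a direct comparison in place of the Lemma 1 test.

```python
def h(i: int) -> int:
    """Index of the most-significant non-zero bit of a positive integer."""
    if i <= 0:
        raise ValueError("h is only defined for positive integers")
    return i.bit_length() - 1


def divide(x: int, y: int) -> int:
    """Return floor(x / y) using partial quotients of the form 2^(j-k-1)."""
    if y == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    q = 0
    k = h(y)
    while x > 0:
        j = h(x)
        if j > k:
            qi = 1 << (j - k - 1)   # partial quotient 2^(j-k-1)
            x -= qi * y             # stays positive: qi * y < 2^j <= x
            q += qi
        elif j == k:
            if x >= y:              # plain comparison stands in for Lemma 1
                q += 1
            break
        else:                       # j < k, so x < y
            break
    return q
```

Because each partial quotient is a power of two, the multiplication `qi * y` is a shift in a weighted number system; in RNS it becomes a residue multiplication.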
Proof of correctness of the algorithm: integer representation
Before proving the algorithm, the following lemma has to be introduced.
LEMMA 1. For any residue integers X, Y ∈ [0, M), for which j
Proof. Since j = k, then X and Y can be expressed as:
Based on the above lemma, the proof of the algorithm is as follows:
• For the case j > k, since $Y < 2^{k+1}$ and $X \ge 2^j$, then $X/Y > 2^{j-k-1} = Q_i$ (where $Q_i$ is the i-th partial quotient). Hence the estimate of the quotient in each iteration is guaranteed to produce a positive remainder. Assume there are v iterations which satisfy the condition j > k; then the sum of the partial quotients resulting from this case is Q, where:
possibilities are expected:
The procedure is then stopped.
X ≥ Y . This case is detected according to Lemma 1 by evaluating
• For the case j < k, it is obvious that X < Y , hence the procedure has to be stopped.
Therefore, the quotient would be:
, where:
Proof of correctness of the algorithm: fractional representation

LEMMA 2. In RNS, for any fractional representations X/M, Y/M where X, Y
Proof. Since X/M and Y/M are of the same order (i.e. j = k) that is:
Since the highest value of j is −1, then: j ≤ −2.
Note that for the special case,
Since X, Y ≤ M, then the maximum value of j is −1.
The proof of the algorithm for the case when the dividend and divisor are fractional quantities uses Lemma 2 and follows the same approach given in the proof of Realization I. The proof leads to the result that the quotient Q can be expressed as:
REALIZATION OF THE ALGORITHM IN RNS
Realization I
This realization is based on the integer representation outlined in the previous section. It is quite useful for small and medium dynamic ranges where all the bits of the residue digits can be applied to a single RAM in order to evaluate h(I). The proposed structure for Realization I, shown in Figure 1, can be described as follows: by applying Y to RAM 1, k = h(Y) is evaluated, where k is expressed in r bits, $r = \lceil \log_2(\log_2 M) \rceil$. Similarly, by applying X to RAM 1, j = h(X) is also evaluated, where j is also expressed in r bits. The partial quotient $Q_i$ is computed by applying j and k to RAM 2. A residue multiplier then multiplies the partial quotient by Y. The output of the multiplier, i.e. $Q_i Y$, is subtracted from X to produce a new remainder. The procedure is repeated and the residue adder accumulates partial quotients, until j < k.

RAM 1 accepts $\sum_{i=1}^{N} n_i$ address lines (i.e. the bits of the residue representation of X or Y), where $n_i = \lceil \log_2 m_i \rceil$; thus it has a size of $(2^{\sum_{i=1}^{N} n_i} \times r)$ bits. RAM 2 accepts the bits of j and k (a total of 2r bits); thus its size would be $(2^{2r} \times \sum_{i=1}^{N} n_i)$ bits. If j − k is evaluated first before being applied to RAM 2, then RAM 2 would have the size $(2^{r+1} \times \sum_{i=1}^{N} n_i)$ bits. This realization requires $\log_2 Q$ iterations. Each iteration consists of two consecutive memory cycles followed by two consecutive residue operations. The realization is very attractive for many digital-signal processing applications which utilize small and medium dynamic ranges.
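The data flow of Realization I can be mimicked in software. The following is a minimal sketch under stated assumptions: the moduli set (3, 5, 7) is hypothetical, a Python dictionary stands in for RAM 1, the partial quotient is computed directly instead of being read from RAM 2, and the j = k case is decided by checking whether h of the residue difference X − Y is smaller than j, which is one way the comparison can be made without leaving the residue domain.

```python
from math import prod

MODULI = (3, 5, 7)                 # hypothetical small moduli set, M = 105
M = prod(MODULI)


def to_rns(x):
    """Residue representation of an integer x."""
    return tuple(x % m for m in MODULI)


# Stand-in for RAM 1: residue digits -> h(value); zero is mapped to -1.
RAM1 = {to_rns(v): v.bit_length() - 1 for v in range(M)}


def rns_add(a, b):
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))


def rns_sub(a, b):
    return tuple((ai - bi) % m for ai, bi, m in zip(a, b, MODULI))


def rns_mul(a, b):
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))


def rns_divide(x, y):
    """Residue digits of floor(X / Y), given residue digits of X and Y."""
    k = RAM1[y]
    if k < 0:
        raise ZeroDivisionError("divisor must be non-zero")
    q = to_rns(0)
    while True:
        j = RAM1[x]
        if j > k:
            qi = to_rns(1 << (j - k - 1))     # stand-in for RAM 2
            x = rns_sub(x, rns_mul(qi, y))    # new remainder X - Qi*Y
            q = rns_add(q, qi)                # accumulate partial quotient
        else:
            # j == k: X >= Y exactly when X - Y has its leading bit below j.
            if j == k and RAM1[rns_sub(x, y)] < j:
                q = rns_add(q, to_rns(1))
            return q
```

For example, `rns_divide(to_rns(100), to_rns(7))` returns the residue digits of 14. The dictionary has M entries, which is why this arrangement is only practical for small and medium dynamic ranges.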
Realization II
This realization is based on the fractional representation outlined in the previous section. It is quite useful for large dynamic ranges where bits of residue digits cannot be applied to a single RAM in order to evaluate h(I ).
Van Vu [15] developed a conversion technique based on the CRT. This technique uses fractional representation of weighted numbers. This technique is given by
where p is a non-negative integer. Hence, the value of X/M can be obtained by evaluating the right-hand side of (1). This is basically done by letting each $x_i$ address a lookup table. The outputs of these N tables can be added using a multi-operand adder. Any integer overflow resulting from this adder is disregarded since it represents the integer part of the summation result. The fractional values stored in the tables should be expressed using t bits, where $t \ge \lceil \log_2 (MN) \rceil$ if M is odd and $t \ge \lceil \log_2 (MN) \rceil - 1$ otherwise [15].

The proposed structure for Realization II, shown in Figure 2, can be described as follows: apply the residue digits of the divisor Y to the fractional representation circuit to obtain Y/M. This requires a memory cycle followed by a multi-operand addition. Y/M is applied to a priority encoder to evaluate k = h(Y/M). Next, using the same approach, j = h(X/M) is evaluated. The bits of j and k (a total of 2r bits), or their difference, are applied to a RAM to produce the partial quotient $Q_i$. This $Q_i$ is applied to a residue multiplier to compute $Q_i Y$. The output of the multiplier ($Q_i Y$) is subtracted from X using a residue subtractor. The output of the residue subtractor is applied again to the fractional representation circuit as long as j ≥ k.
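The table-and-adder evaluation of X/M can be illustrated with exact rational arithmetic. This sketch assumes the standard CRT form of the fractional representation, X/M = FRAC(Σ |x_i · M_i^{-1}|_{m_i} / m_i) with M_i = M/m_i, and uses exact fractions instead of the t-bit truncated table entries of the hardware; the function names are illustrative.

```python
from fractions import Fraction
from math import prod


def fractional_representation(residues, moduli):
    """Return X/M as an exact fraction from the residue digits of X."""
    M = prod(moduli)
    total = Fraction(0)
    for x_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        inv = pow(M_i, -1, m_i)                    # inverse of M_i modulo m_i
        total += Fraction((x_i * inv) % m_i, m_i)  # one table look-up
    return total % 1                               # discard the integer part


def h_frac(f):
    """Return j such that 2^j <= f < 2^(j+1) for a positive fraction f."""
    j = f.numerator.bit_length() - f.denominator.bit_length()
    if Fraction(2) ** j > f:
        j -= 1
    return j
```

With the moduli set (17, 16, 15) and X = 827, `fractional_representation((11, 11, 2), (17, 16, 15))` equals 827/4080, and `h_frac` of that value is −3. In hardware the same leading-bit position would be read by a priority encoder from the truncated binary sum.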
EVALUATION
The complexity of the proposed residue-based divider is highly dependent on the complexity of the individual components being used in the design (e.g. residue multiplier, adder, etc.). Many residue-based multipliers can be found in the literature [16, 17, 18, 19]. These differ in their structures and in their area and time complexities (i.e. gate count, silicon area, time delay, etc.). The same statement can be made about other components in the proposed divider [12]. For example, the area and time complexities of the priority encoder presented in [20] are given by O(n) and O(log n), respectively. The proposed design in Figure 2 consists of different devices, namely RAMs, a multi-operand binary adder, a priority encoder followed by a RAM, a residue-based adder, a subtractor and a multiplier. For the first N RAMs, the size of the i-th RAM is $(2^{n_i} \times n)$, where $n_i = \lceil \log_2 m_i \rceil$. The multi-operand adder accepts N operands of n bits each. The least significant (LS) n output bits of this adder constitute X/M. The encoder accepts these LS n bits of the adder and produces h(I), expressed in r bits. The RAM following the encoder outputs the residue representation of the estimated quotient. This estimated quotient is then applied to different residue-based arithmetic components. Each of these components accepts N residue digits from each operand, where each residue digit is expressed in $n_i$ bits.
To compare the performances of this algorithm and Lu and Chiang's algorithm, computer programs simulating both algorithms have been developed to calculate the mean of residue operations (MORO). The mean of multi-operand additions (MOMA) has been calculated and compared, where it applies. For simulation purposes, three moduli sets were selected to serve different dynamic ranges. These sets are: M1 = (7, 11, 13, 15 Table 1 . For moduli set M1, the results are exact. All possible combinations of dividends and divisors were simulated. However, for M2 and M3, a sample of 200 million randomly generated numbers within the dynamic range defined by each moduli set was simulated. The simulation of the new algorithm is based on Figure 1 for M1, and on Figure 2 for M2 and M3. Simulation of Lu and Chiang's algorithm is based on the flowchart given in [3] .
For the moduli set M1, Table 1 indicates that the new algorithm is four times faster than Lu and Chiang's algorithm.
This conclusion applies to every moduli set where all the bits of residue digits can be applied simultaneously to a single RAM.
For other moduli sets like M2 and M3 which have very large dynamic ranges, the new algorithm is still four times faster regarding the number of basic residue operations. Moreover, the average number of multi-operand additions needed is almost half that needed by the other algorithm.
CUSTOMIZED DIVISION ALGORITHM
In Realization II, determining the position of the highest power of two contained in any residue number I, which we referred to as h(I), is an important time-delay element in the operation of the proposed residue divider. In this section, we customize the same algorithm to serve two specific moduli sets: $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$. This customization eliminates the need for ROMs and thus reduces the delay contributed by evaluating h(I).
Evaluating h(I) for the moduli set $(2^k, 2^k - 1, 2^{k-1} - 1)$
where FRAC(...) denotes the fractional part of the operand. The circular shift property [13] states that modulo $(2^p - 1)$ multiplication of an integer by $2^n$, where p and n are positive integers, is equivalent to an n-bit circular left shift. To evaluate (3), we proceed as follows:
Assuming that the binary form of r 1 is given by:
Recalling that for mod2 k , only the LS k bits are sig- 
Assuming that the binary form of r 2 is given by:
Based on the circular left-shift property:
where $\bar{b}$ denotes the complement of the bit b.
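The circular left-shift property used above can be checked numerically. The following sketch, with the illustrative helper name `rotl`, compares a p-bit rotation against multiplication by $2^n$ modulo $(2^p - 1)$ for every residue in the range 0 to $2^p - 2$.

```python
def rotl(value: int, n: int, p: int) -> int:
    """Rotate the p-bit word `value` left by n positions."""
    n %= p
    mask = (1 << p) - 1
    return ((value << n) | (value >> (p - n))) & mask


p = 4                      # modulus 2^4 - 1 = 15
modulus = (1 << p) - 1
for value in range(modulus):          # valid residues 0 .. 2^p - 2
    for n in range(1, 2 * p + 1):
        # Multiplying by 2^n modulo 2^p - 1 equals an n-bit rotation.
        assert rotl(value, n, p) == (value << n) % modulus
```

The all-ones word $2^p - 1$ is excluded because it is a second representation of zero modulo $(2^p - 1)$; a rotation leaves it unchanged.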
On the other hand, the term $1/(2^k - 1)$ can be written as $2^{-k}/(1 - 2^{-k})$. Recall that any fraction of the form $q/(1 - q)$, where |q| < 1, can be expanded in a power series as:

$$\frac{q}{1 - q} = q + q^2 + q^3 + \cdots = \sum_{i=1}^{\infty} q^i$$
Based on error analysis introduced in [15] , then the MS (3k + 1) bits are the only significant bits in our computations. Let R 2 represent the binary form of
where * implies that terms are concatenated.
Assuming that the binary form of r 3 is given in (k − 1) bits by:
On the other hand, the term $1/(2^{k-1} - 1)$ can be written as $2^{-(k-1)}/(1 - 2^{-(k-1)})$. Thus, it can be expanded in a power series as:

$$\frac{1}{2^{k-1} - 1} = \sum_{i=1}^{\infty} 2^{-i(k-1)}$$
Therefore, (3) can be rewritten as:

In order to implement (7), one three-operand binary adder is needed. A carry-save adder (CSA) followed by a carry-propagate adder (CPA) can realize the addition of the three operands.
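A bit-level model may help to see why a CSA followed by a CPA adds three operands. The sketch below is generic and not specific to the operands of (7); the function names and the `width` parameter are illustrative, and carries out of the top bit are discarded in the same way that the integer part of a fractional sum is disregarded.

```python
def carry_save_add(a: int, b: int, c: int):
    """Compress three operands into a sum word and a carry word."""
    sum_word = a ^ b ^ c                              # bitwise sum without carries
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # carries, moved one bit left
    return sum_word, carry_word


def three_operand_add(a: int, b: int, c: int, width: int) -> int:
    """CSA followed by a carry-propagate addition, truncated to `width` bits."""
    sum_word, carry_word = carry_save_add(a, b, c)
    return (sum_word + carry_word) & ((1 << width) - 1)
```

The carry-save stage has no carry chain, so its delay is that of a single full adder regardless of the word length; only the final addition propagates carries.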
The residue decoder introduced by Sweidan and Hiasat [22] has the advantages of reduced hardware requirements and extremely wide fixed-point dynamic ranges, since its upper bound is not limited by a memory size. Moreover, it requires only a total of four 2k-bit binary adders, which makes it very attractive compared to other published decoders [23, 24]. In this paper, we propose a hardware layout that can decode the residue digits of the moduli set $(2^k + 1, 2^k, 2^k - 1)$ into their binary equivalent. The new layout is an improvement of that presented in [21]. In this new contribution, we reduce the number of adders needed for the decoding operation from four 2k-bit binary adders to one 2k-bit three-operand binary adder. It has been proved in [21] that
where:
Assuming that $r_1$, $r_2$ and $r_3$ have the following binary format:

FIGURE 3. Proposed implementation of the division algorithm customized for moduli sets: $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$.

expressed as [21]:
where: $b_{1x} = b_{10}$ OR $b_{1k}$. By redefining $R_1 = A$, $R_2 = B - r_1$ and $R_3 = C$, then (8) can be rewritten as
• Case I: Since $R_1 = A$, the binary representation is the same:
• Case II: Since $R_2 = B - r_1$, then for the case $r_1 < 2^k$, and using the 2's complement notation, $R_2 = B$ + (1's complement of $r_1$) + 1. Noting that the LS k bits of B are all ones, the LS k bits of the result of the subtraction are simply the 1's complement of $r_1$, with an overflow of 1 at the (k + 1)th bit. Based on 2's complement, this overflow indicates that the result of the subtraction is positive; hence it can be disregarded. However, when $|r_1|_{2^k} = r_2 = 0$, then $R_2 = |2^{2k} - 1|_{2^{2k} - 1} = 0$. Therefore, $R_2$ can be expressed in binary format as
• Case III: $r_1 = 2^k$. In this case, the (k + 1)th bit of $r_1$ is 1; thus the values $R_2$ and B are the same because in the computation of $R_2$ we used the LS (k − 1) bits of $r_1$, which are all zeros in this case. Therefore, the format of $R_2$ is not changed. However, to take care of this non-zero (k + 1)th bit of $r_1$, it has to be subtracted from $R_3$. Therefore
Equation (9) is simply accomplished by adding $R_1$, $R_2$ and $R_3$. The output should then be incremented by any output carry. Nevertheless, a carry resulting from this adder can be neglected as long as the output does not have the value $(2^n - 1)$, where 0 ≤ n ≤ 2k. This can be justified by the fact that h(I) = h(I + 1) if $I \neq 2^n - 1$. However, if $I = 2^n - 1$, then the output carry would be significant and the priority encoder proposed in implementing the residue divider can take care of this special case.
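The claim that an increment only changes h(I) when I is of the form $2^n - 1$ is easy to confirm exhaustively for small words. In this sketch the helper names are illustrative and h(0) is taken to be −1 by convention, which is an assumption not stated in the text.

```python
def h(i: int) -> int:
    """Index of the most-significant non-zero bit; -1 for zero (assumed convention)."""
    return i.bit_length() - 1


def is_all_ones(i: int) -> bool:
    """True when i = 2^n - 1 for some n >= 0 (including i = 0)."""
    return (i & (i + 1)) == 0


for i in range(1 << 12):
    # Adding the neglected carry changes h(I) only when I = 2^n - 1.
    assert (h(i + 1) != h(i)) == is_all_ones(i)
```

In other words, only an all-ones adder output needs special handling when the end carry is left unadded.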
A single carry-save adder can add these three operands. A few logic gates are also needed to detect the (k + 1)th bit of $r_1$ and to select the proper format of $R_3$.
Using the formula $X = \lfloor X/2^k \rfloor 2^k + r_2$, the value of X can be obtained by concatenating the k bits of $r_2$ to the 2k output bits of the three-operand adder. These 3k bits and the carry are applied to the priority encoder.
EXAMPLE. Consider the moduli set {17, 16, 15}. To find h(X) where X = (11, 11, 2) (i.e. X = 827), then:

Figure 3 shows the proposed implementation of the division algorithm customized for the moduli sets $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$. The operation of this divider is self-explanatory. The propagation delay in Figure 3, as compared with that in Figure 2, has been reduced by one memory access cycle per iteration. Since the memory access cycle is very significant compared with the delay of other components and division is an iterative procedure, this reduction becomes increasingly significant as the number of iterations per division problem increases. This implies that the new proposed realization is much faster for these particular moduli sets. Moreover, the reduction in hardware requirements is another substantial improvement.
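Returning to the example with the moduli set {17, 16, 15}, the values involved can be checked with a short script. It uses a generic CRT decoder rather than the adder-based decoder described above, so it only confirms the numbers; `crt_decode` is an illustrative name.

```python
from math import prod

K = 4
MODULI = (2**K + 1, 2**K, 2**K - 1)        # (17, 16, 15)


def crt_decode(residues, moduli):
    """Generic CRT reconstruction of X from its residue digits."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        M_i = M // m
        total += r * M_i * pow(M_i, -1, m)  # pow(..., -1, m) is the modular inverse
    return total % M


X = crt_decode((11, 11, 2), MODULI)
upper = X >> K                  # floor(X / 2^k), the 2k-bit adder output
r2 = X & (2**K - 1)             # the k bits of r_2, concatenated as LS bits
print(X, X.bit_length() - 1, upper, r2)    # prints: 827 9 51 11
```

The last line shows that X = 51 · 16 + 11, so concatenating the bits of $r_2$ to ⌊X/2^k⌋ reproduces X, and that h(X) = 9.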
VLSI IMPLEMENTATION OF A RESIDUE-BASED ARITHMETIC DIVIDER
A pipelined design for a residue-based arithmetic divider for the moduli set $(2^k + 1, 2^k, 2^k - 1)$ has been implemented, fabricated and tested. The detailed design of the implemented circuit is shown in Figure 4. Data path sizes are also shown. The implementation was accomplished using Octtools-5.2 with a standard cell MSU2.3 library. For prototype purposes, k was selected to be four. Thus, the total number of input pins is 13 for each operand. The clock, an X/Y selector and reset are another three inputs. Similarly, the output quotient is expressed in 13 bits. Division-completed (DIVC) is a one-bit output that goes high to validate the output quotient and to flag that the division process is completed. Division-by-zero (DBZ) is another output bit which sets a flag if the divisor is zero. The design has an integrated circuit area of (1.792 × 1.675) mm². The tiny padframe (40PC22 × 22) was used to accommodate this design. Test results showed that the design can run at a clock speed of 15 MHz. The number of clock cycles required for each division problem depends on both the dividend and the divisor. However, the average number over different division problems is eight clock cycles.
CONCLUSIONS
This paper has presented a new general division algorithm for RNS, which is faster than previously proposed algorithms. The algorithm was then customized to serve two specific moduli sets: $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$. An RNS divider would then require only a binary adder, a priority encoder, a ROM, a residue adder, a residue subtractor and a residue multiplier. These reduced hardware requirements and processing time make the new realization very practical for many computing applications and therefore enable RNS to play an increased role in designing arithmetic logic units for general-purpose computers. The proposed customized hardware has been implemented on silicon and test results have been presented.
