This paper presents a general theory for developing new Svoboda-Tung (or simply NST) division algorithms that do not suffer the drawbacks of the "classical" Svoboda-Tung (or simply ST) method. NST avoids the drawbacks of ST by means of a proper recoding of the two most significant digits of the residual before selecting the most significant digit of this recoded residual as the quotient-digit. NST relies on the divisor being in the range $[1, 1 + \delta)$, where $\delta$ is a positive fraction depending upon: (1) the radix, (2) the signed-digit set used to represent the residual, and (3) the recoding conditions of the two most significant digits of the residual. If the operands belong to the IEEE-Std range $[1, 2)$, they have to be conveniently pre-scaled. In that case, NST produces the correct quotient but the final residual is scaled by the same factor as the operands; therefore NST is not useful in applications where the unscaled residual is necessary. An analysis of NST shows that previously published algorithms can be derived from the general theory proposed in this paper. Moreover, NST reveals a spectrum of new possibilities for the design of alternative division units. For a given radix $b$, the number of different algorithms of this kind is $b^2/4$.
Introduction
Digit-recurrence division algorithms obtain the quotient one digit at a time, most significant digit first. In the well-known higher-radix SRT division [1], the quotient-digit is selected by inspecting a few of the most significant digits of the residual and the divisor. In 1963, Svoboda published an algorithm where the quotient-digit is estimated to be the same as the most significant digit of the residual (the divisor is ignored), and if such an estimate is not accurate a compensation is carried out. Tung [2, 3] applied Svoboda's technique to a signed-digit number representation [4] of the operands. In doing this, Tung exploited both the simplicity of the quotient-digit selection, characteristic of Svoboda's algorithm, and the carry-propagation-free property of the signed-digit number system [4]. We call this algorithm Svoboda-Tung (or simply ST) division. In spite of these nice features, ST is not suitable for VLSI implementation for two main reasons. First, the quotient-digit is actually selected from an over-redundant signed-digit set [5], which is difficult to realize in VLSI. Second, it is not valid for radices 2 and 4. This paper, based on [6], presents a general theory for developing a new class of ST (or simply NST) algorithms for division that do not suffer the drawbacks of the "classical" ST methodology. NST avoids the drawbacks of ST by means of a proper recoding of the two most significant digits of the residual. This recoding covers a part of the role of the quotient-digit selection function performed by "classical" digit-recurrence algorithms. The rest of the selection function is very simple because the divisor is pre-scaled so that the new divisor is in a reduced range of the IEEE-Std range $[1, 2)$. In other words, the proposed NST algorithm eliminates the drawbacks of the ST method by introducing a slightly more complex selection function, i.e., by moving closer to the "classical" digit-recurrence algorithms.
NST relies on the divisor being in the range $[1, 1 + \delta)$, where $\delta$ is a positive fraction depending upon: (1) the radix, (2) the signed-digit set used to represent the residual, and (3) the recoding conditions of the two most significant digits of the residual. If the operands belong to the IEEE-Std range $[1, 2)$, they have to be conveniently pre-scaled. In that case, NST produces the correct quotient but the final residual is scaled by the same factor as the operands; therefore NST is not useful in applications where the unscaled residual is necessary (e.g., residual arithmetic). A detailed analysis of NST reveals that previously published algorithms [7, 8, 9] can be derived from the general theory proposed in this paper. Moreover, NST reveals a spectrum of new possibilities for the design of alternative division units. The paper is structured as follows. Section 2 reviews very briefly the principles of ST division. In section 3, NST is presented and analyzed in detail. Section 4 explores the whole spectrum of radix 4 NST algorithms and compares the "optimum" radix 4 NST divider [9, 10] with other divider architectures [8]. Finally, section 5 draws our conclusions.
Svoboda-Tung division (ST)
Digit-recurrence algorithms obtain the quotient $Q$ digit by digit, based on the recurrence:

$$R_{j+1} = b R_j - q_{j+1} Y \qquad (1)$$

$R_j$ is the residual after the $j$th iteration, $R_0$ is the dividend $X$, $R_{n-1}$ is the final residual $R$ ($n$ is the number of digits of the quotient $Q$), $b$ is the radix, and $q_{j+1}$ is the quotient-digit selected at step $j + 1$ [1]. In this equation, $j = 0, 1, \ldots, n - 1$ is the recursion index, $i = 0, 1, \ldots, n - 1$ (2)

c) The quotient-digit selection function is independent of the divisor $Y$ and can be mathematically defined by $q_{j+1} = b\, r^j_0 + r^j_1$, where $r^j_0 \in \{0, \pm 1\}$.

d) The range of the divisor $Y$ for which the algorithm is valid is [2, 11]:
Analysis of the Svoboda-Tung division
ST division looks very attractive for VLSI implementation because of the several nice features it offers. However, ST also presents some difficulties.

Nice features:

a) The subtraction of $q_{j+1} Y$ from $b R_j$ is performed in constant time, independent of the length of the operands. This is possible due to the use of a redundant signed-digit set to represent the quotient and the residual. Therefore, the VLSI implementation of a fast divider based on the ST technique is feasible. This feature is not exclusive to ST: there are variations of the SRT division which also use carry-save or redundant signed-digit notation for the residual.
b) The quotient-digit selection function is conceptually simple. The most significant digit $r^j_1$ of the residual is selected as a "good" estimate of the quotient-digit $q_{j+1}$; if the estimate is not accurate, an overflow occurs and a compensation is carried out. This is possible due to the pre-scaling of the divisor $Y$. Higher-radix SRT division requires comparisons of the multiples of the divisor $Y$ and the residual $R_j$ for choosing the quotient-digit $q_{j+1}$. This process is certainly more complex than taking the most significant digit of the residual as the quotient-digit and compensating in case of overflow, as is done in ST. ST is not the only type of division that scales the divisor to simplify the quotient-digit selection; other algorithms have been proposed which use pre-scaling for this purpose [12, 7, 13, 14, 5].

Drawbacks:

a) Since the residual $R_j$ and the divisor $Y$ are represented using a signed digit-set $D_{<b:\alpha>}$, subtraction of $q_{j+1} Y$ from $b R_j$ in constant time with a carry propagation limited to one position to the left is only possible for radices $b \geq 4$ [4]. Moreover, if radix 4 were used, according to Avizienis' definition $\alpha$ would be 3 and from (3) we would get $1 < Y < 1$, which is impossible to achieve. In summary, ST is valid for radices $b > 4$.

b) Since a compensation has to be carried out when an overflow occurs due to a wrong estimate of the quotient-digit, $q_{j+1}$ is finally selected from an over-redundant digit-set [5] $\{0, \pm 1, \ldots, \pm(b + \alpha - 1)\}$. For example, suppose drawback (a) is overcome so that ST is valid for radices $b \geq 2$. If the radix 4 signed digit-set $D_{<4:2>} = \{0, \pm 1, \pm 2\}$ were used to represent the residual, the quotient-digit $q_{j+1}$ would actually have to be selected from the over-redundant digit-set $D_{<4:5>} = \{0, \pm 1, \ldots, \pm 5\}$. It is obvious that the multiples 5 and 3 of the divisor $Y$ cannot be generated by simple shift and/or invert operations.

c) The range of the divisor for which the algorithm is valid is smaller than the range of normalized significands specified by the IEEE-Std [15]. Therefore, the normalized divisor $Y_{std}$ and dividend $X_{std}$ both have to be previously scaled by the same factor $K$ so that the scaled divisor ($Y = K\, Y_{std}$) fits into the reduced range. SRT division does not need this pre-processing step [16, 1].

The first two drawbacks, not being valid for radices 2 and 4 and choosing the quotient-digit $q_{j+1}$ from an over-redundant digit-set, make ST difficult to implement in VLSI. Notice that the arithmetic condition of NST remains the same as that of the ST algorithm (2).
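Recurrence (1) can be exercised numerically. The following is a minimal sketch in exact rational arithmetic; it assumes a simple round-to-nearest quotient-digit selection, which is neither the ST nor the NST selection function, and the names `digit_recurrence` and `quotient_value` are hypothetical. It only illustrates that, whatever digits are chosen, the recurrence preserves $X = Q\,Y + R_n b^{-n}$.

```python
from fractions import Fraction


def digit_recurrence(x, y, b, n):
    """Apply R_{j+1} = b*R_j - q_{j+1}*Y for n steps, starting from R_0 = X."""
    r = Fraction(x)
    y = Fraction(y)
    digits = []
    for _ in range(n):
        # Assumed selection rule (not the ST/NST rule): nearest integer to b*R_j / Y.
        q = round(b * r / y)
        r = b * r - q * y
        digits.append(q)
    return digits, r


def quotient_value(digits, b):
    """Value of the (possibly signed) digit string 0.q1 q2 ... qn in radix b."""
    return sum(Fraction(q, b ** (j + 1)) for j, q in enumerate(digits))


x, y, b, n = Fraction(3, 4), Fraction(9, 8), 4, 6
digits, r_n = digit_recurrence(x, y, b, n)
q = quotient_value(digits, b)

# The recurrence keeps the division identity exact at every step.
assert x == q * y + r_n / b ** n
```

With this selection rule the residual stays bounded by half the divisor, so the digit string converges to the quotient one radix-$b$ digit per step.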
Range of the divisor and correctness of the NST division
In this section, the range of the divisor $Y$ for which NST is valid is deduced. The new quotient-digit selection function ($q_{j+1} = r^j_{1a}$) turns the digit-recurrence (1) into $R_{j+1} = b R_j - r^j_{1a} Y$.
Substituting this new digit-recurrence in the arithmetic condition (2), and considering that the arithmetic condition (2) must be satisfied for all $R_j$ independent of the divisor $Y$, we obtain:
where R j min and R j max stand for the minimum and maximum of the residual R j . Let us consider the selection r j 1a = t. The sign of t is not relevant in the following analysis, since both cases lead to the same result, therefore let us consider t > 0. In order to allow the selection r j 1a = t, it is necessary that R j R j max = 0 : t
Since, as it will be shown later in section 3.2, the two most signi cant digits of R j will be recoded, the minimum R j which could imply the selection r j 1a = t is R j min = 0 : t , where is a positive integer such that . Therefore, the following must hold,
Replacing the values of b R j max (6) and b R j min (7) in (5) we nd, 1 Y < 1 + ?
Clearly the worst case occurs for t = , thus the range of the divisor $Y$ for which NST is valid is $1 \leq Y < 1 + \delta$, where $\delta$ is a positive fraction (i.e., < ). Notice that there are no special constraints on the radix $b$ ($b \geq 2$). Therefore, the first drawback of the ST division has been avoided.
Recoding conditions
We showed in the previous section that NST is valid only if < . This condition means that, if the signs of the two most significant digits r j 1 b ? ? 1 < : (8)

Selecting the quotient-digit as $q_{j+1} = r^j_{1a}$, subject to the constraint of the range of the divisor $Y$ to $[1, 1 + \delta)$, guarantees that NST converges to the correct infinite precision quotient, i.e., the arithmetic condition (2) is always satisfied. However, there is one more aspect that needs to be taken into account in the selection of the quotient-digit. When the two most significant digits of the residual are r j 1 r j 2 = 0 , the quotient-digit is $q_{j+1} = 0$, and the designer might think that performing an addition or a subtraction of 0 in recursion (1) should be equivalent. This is not so in hybrid arithmetic: the hybrid addition of this residual $R_j$ and zero produces an apparent overflow (not a subtraction) of the residual $R_j$ and zero should take place for avoiding the apparent overflow.
Pre-scaling of the operands
NST requires the divisor to be pre-scaled from the IEEE-Std [15] normalized range $[1, 2)$ into the range $[1, 1 + \delta)$, where $\delta$ is a fraction. Moreover, the dividend must be pre-scaled by the same factor, so that the value of the quotient is preserved. In other words, the original division defined by $X_{std} = Q\, Y_{std} + R$, where $X_{std}$ and $Y_{std}$ are IEEE normalized significands, is replaced by the equivalent division defined by $K X_{std} = Q\, (K Y_{std}) + K R$, where $K$ is the scaling factor, $K Y_{std}$ is the scaled divisor $Y$ and $K X_{std}$ is the scaled dividend $X$. Notice that the quotient $Q$ is preserved but the resulting residual is $K$ times the actual residual $R$. Pre-scaling of the operands has been analyzed in detail in [19, 12, 20, 21] and the techniques developed there can be used to pre-scale the operands as required by NST. Since NST requires the divisor to be non-redundant, the pre-scaling unit must compute $Y$ in non-redundant form (i.e., selection of $K$, computation of $K Y_{std}$, and assimilation of the result).
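The effect of pre-scaling on the quotient and the residual can be checked with exact arithmetic. The sketch below is only an illustration: the helper `truncated_division`, the operand values, and the scaling factor `k` are hypothetical choices, and the quotient is simply truncated to a fixed number of fractional bits rather than produced by NST.

```python
from fractions import Fraction
from math import floor


def truncated_division(x, y, frac_bits):
    """Return (Q, R) with Q truncated to frac_bits fractional bits and X = Q*Y + R."""
    scale = 2 ** frac_bits
    q = Fraction(floor(x / y * scale), scale)
    return q, x - q * y


# Hypothetical IEEE-style normalized operands in [1, 2).
x_std, y_std = Fraction(13, 8), Fraction(7, 4)
# Hypothetical scaling factor: K * Y_std = 35/32, close to 1.
k = Fraction(5, 8)

q, r = truncated_division(x_std, y_std, 8)
q_scaled, r_scaled = truncated_division(k * x_std, k * y_std, 8)

assert q_scaled == q      # the quotient is preserved
assert r_scaled == k * r  # the residual comes out scaled by K
```

Recovering the unscaled residual would require dividing by `k` afterwards, which is why the text notes that NST is not useful when the unscaled residual is needed.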
Analysis of the NST division
NST can generate a family of division algorithms whose number is a function of the radix $b$. For a given radix $b$, there exists a different NST algorithm for every pair of valid values of $\alpha$ and the recoding parameter.
Remember that $\alpha$ is a function of $b$, and the recoding parameter is a function of $\alpha$ and $b$. The radix $b$ defines the number of iterations required to obtain a $W$-bit quotient, $\alpha$ defines the digit-set $D_{<b:\alpha>}$ used to represent the residual, and the recoding parameter defines the recoding conditions. The higher the radix $b$, the higher the number of possible NST algorithms; therefore, the larger the set of solutions to be explored to find the optimum NST division. Fig. 1 shows the spectrum of possible radix $b$ NST algorithms organized in a tree. Table 1 summarizes the values of the three parameters ($\alpha$, the recoding parameter, and $\delta$) for these special algorithms, for power of two We should mention that all the foregoing references demonstrate the convergence of the respective division algorithms based on specific analysis for the particular radix $b$ and digit-set used.
However, no matter what the analysis is, the quotient-digit selection function turns out to be the same as that of the general development presented in this paper. An algorithm not listed in Table 1 . Assuming the operands are normalized according to the IEEE-Std [15], the following observations can be made. In general, the greater the $\delta$, the simpler the scaling unit, because fewer bits have to be considered in order to decide which constant $K$ the operands should be multiplied by. For a given radix $b$, the smaller the $\alpha$, the smaller the redundancy factor of the digit-set $D_{<b:\alpha>}$ and the simpler the carry-free adder/subtractor that computes recursion (1), because fewer multiples of the divisor have to be selected. It is also clear that the quotient-digit selection unit is simpler if fewer bits of the residual have to be examined. In light of these remarks the following conclusions can be drawn:
a) The "MRmr" algorithm is not suitable for VLSI implementation since it has the smallest $\delta$ and its quotient-digit selection function requires the examination of two radix $b$ signed-digits.
b) The "MRMR" algorithm is also not convenient because its quotient-digit selection function requires the examination of two full radix $b$ digits, although it has the greatest $\delta$.
c) The "mr" and the "MROR" algorithms are the most cost effective for radices $b \leq 8$.
d) The "MROR" algorithm is the most suitable for radices $b \geq 16$, but intermediate solutions such as the digit-set $D_{<16:10>}$ could be more efficient.
Comparison
A true quantitative comparison of the speed and area of the whole spectrum of possible NST and other division algorithms is not feasible, since it demands layout implementations and circuit level simulations [20, 8]. Therefore, we first illustrate the process of choosing the optimum NST algorithm for a given radix $b$, with a detailed analysis of the whole spectrum of radix 4 NST algorithms.
Next, we quantitatively compare the "optimum" radix 4 NST algorithm with other radix 2 and radix 4 dividers, based on the simple comparison model used in [8].
Choosing the optimum NST algorithm for radix 4
From (10) we know that the number of radix 4 NST algorithms is 4. The possible values of $\alpha$ are 2 and 3 (see relation (4)). These two values define the possible radix 4 signed-digit-sets, i.e., the minimally redundant $D_{<4:2>}$ and the maximally redundant $D_{<4:3>}$. From (9) we find that there are one radix 4 minimally redundant and three radix 4 maximally redundant NST algorithms, which make the total of four radix 4 NST algorithms. For the radix 4 minimally redundant digit-set ($\alpha = 2$), the only possible value of the recoding parameter is 1 ("mr"). For the radix 4 maximally redundant digit-set ($\alpha = 3$), its possible values are 0 ("MRMR"), 1 ("MROR"), and 2 ("MRmr") (see relation (8)). Table 2 summarizes the main characteristics of all four radix 4 NST algorithms, which we explain in the following.
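The count of radix 4 algorithms given above can be reproduced by a small enumeration. Relations (4) and (8) are not legible in this text, so the bounds used below are an assumption: the digit-set bound `alpha` runs from $b/2$ to $b - 1$, and the recoding parameter, here given the hypothetical name `rho`, runs from $b - \alpha - 1$ up to but not including $\alpha$. These bounds were chosen because they reproduce both the radix 4 values listed above and the $b^2/4$ total stated in the abstract.

```python
def nst_parameter_pairs(b):
    """Enumerate (alpha, rho) pairs under the assumed bounds for an even radix b.

    Assumed bounds (not taken verbatim from the paper):
      b/2 <= alpha <= b - 1   and   b - alpha - 1 <= rho < alpha
    """
    pairs = []
    for alpha in range(b // 2, b):
        for rho in range(b - alpha - 1, alpha):
            pairs.append((alpha, rho))
    return pairs


radix4 = nst_parameter_pairs(4)
# One minimally redundant pair (alpha = 2) and three maximally redundant pairs (alpha = 3).
assert radix4 == [(2, 1), (3, 0), (3, 1), (3, 2)]

# Under these bounds the number of pairs equals b^2 / 4 for the radices tried here.
for b in (2, 4, 8, 16):
    assert len(nst_parameter_pairs(b)) == b * b // 4
```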
In general, the total delay for a division algorithm based on pre-scaling is given by

$$T_{div} = (\mathrm{Scaling} + \mathrm{Recursion} + \mathrm{Conversion}) \cdot T_{clk}.$$

$T_{clk}$ is the clock cycle delay, Scaling is the number of clock cycles for pre-scaling of the operands, Recursion is the number of times recurrence (1) is executed, and Conversion is the number of extra clock cycles needed for finishing the on-line conversion [25] of the redundant quotient to binary form. For illustration purposes, Fig. 2 shows the timing diagram of the radix 4 "MRxx" NST dividers. The three parameters ($\alpha$, the recoding parameter, and $\delta$) provide the designer with important information for choosing the optimum radix 4 NST algorithm. All "MRxx" algorithms ($\alpha = 3$) need extra hardware and one extra cycle, compared to "mr" ($\alpha = 2$), to compute the multiple $3Y$ of the pre-scaled divisor (see Fig. 2), which is not a power of two. The simplest way to encode the digits of the digit-set $D_{<4:3>}$ is to use two regular SBDs such that the digit value is given by $z_{<4:3>} = 2 d_1 + d_0$, where $z_{<4:3>} \in \{0, \pm 1, \pm 2, \pm 3\}$ and $d_1, d_0$ are SBDs. Each SBD is encoded with two bits such that its value is given by $d = d^+ - d^-$; therefore a $z_{<4:3>}$ digit is encoded with four bits [8]. The number of bits of the residual that have to be observed by the quotient-digit selection logic of a "MRxx" NST divider depends on the recoding conditions: the maximum is 8 bits (two $z_{<4:3>}$ digits) for the "MRMR" and "MRmr" algorithms and the minimum is 6 bits (one $z_{<4:3>}$ digit plus one SBD) for the "MROR" algorithm. The digits of the digit-set $D_{<4:2>}$ can be encoded with three bits such that the digit value is given by $z_{<4:2>} = -2 d_2 + d_1 + d_0$, where $z_{<4:2>} \in \{0, \pm 1, \pm 2\}$ [9, 17]. Therefore, the number of bits of the residual that has to be observed by the quotient-digit selection logic of the "mr" NST divider is 6 (two $z_{<4:2>}$ digits).
The complexity of the pre-scaling unit is a function of two related issues: the number of bits of the divisor $Y_{std}$ that have to be observed to choose the proper scaling constant $K$, and the powers-of-two decomposition of the constant $K$. Both issues are a function of $\delta$, which in turn is a function of $\alpha$ and the recoding parameter. They are listed at the bottom of Table 2.
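The two digit encodings described above can be checked by brute force. This is a sketch under stated assumptions: the bit names (`plus`, `minus`, and the three-bit weights $-2, +1, +1$) are inferred from the formulas in the text, whose subscripts and superscripts are not fully legible here, and the function names are hypothetical.

```python
from itertools import product


def sbd_value(plus, minus):
    """Signed binary digit encoded on two bits: d = d_plus - d_minus, in {-1, 0, 1}."""
    return plus - minus


def maximally_redundant_values():
    """Values of z = 2*d1 + d0 with d1, d0 SBDs (four bits per radix 4 digit)."""
    return {
        2 * sbd_value(p1, m1) + sbd_value(p0, m0)
        for p1, m1, p0, m0 in product((0, 1), repeat=4)
    }


def minimally_redundant_values():
    """Values of z = -2*d2 + d1 + d0 with three plain bits (assumed weights)."""
    return {-2 * d2 + d1 + d0 for d2, d1, d0 in product((0, 1), repeat=3)}


# Four bits cover the maximally redundant set {-3, ..., 3}.
assert maximally_redundant_values() == set(range(-3, 4))
# Three bits cover the minimally redundant set {-2, ..., 2}.
assert minimally_redundant_values() == set(range(-2, 3))
```

Both encodings are redundant at the bit level (several bit patterns map to the same digit), which is consistent with the bit counts quoted for the selection logic: 8 or 6 bits for "MRxx" and 6 bits for "mr".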
The most important issues to be considered in the selection of an NST division algorithm are the number of bits of the residual that have to be observed by the quotient-digit selection logic and the set of multiples of the divisor that have to be computed. Both issues have a major impact on the cycle time and area of the divider. Looking at the data in Table 2, we can definitely conclude that qualitatively the radix 4 minimally redundant NST algorithm "mr" is the "optimum" of all radix 4 NST algorithms. The only point where "MROR" is superior to "mr" is the number of bits of the divisor $Y_{std}$ that have to be observed to choose the proper scaling constant $K$: it is 4 for "MROR" and 5 for "mr". However, the impact of this superiority on the clock cycle and area of the divider is negligible.
In summary, we are able to qualitatively characterize the spectrum of NST algorithms for a given radix $b$ (e.g., radix 4 in Table 2) by mapping the three parameters ($\alpha$, the recoding parameter, and $\delta$) into quantitative design criteria, without going into the details of the implementation. In this way, we have determined that "mr" is the "best" radix 4 NST algorithm. However, a quantitative comparison of the latency and area of "mr" and other algorithms does require the designer to go into the details of the implementation.
Quantitative comparison of "mr" with other dividers
In [8], a detailed comparison of "MROR" (Divider 2 in Table 3) with dividers (3) to (9), listed in Table 3, was presented and it was shown that "MROR" has a speedup over all the other divider designs but it has an area disadvantage with regard to architectures which do not require pre-scaling. In this section, we compare the logic implementation of the "mr" division presented in section 8 of reference [9]
Computation time
The comparison model considers all $x$-input gates (such as and/or, nand/nor) to have a delay of $x/2$ units, $x$-to-1 multiplexers and $x$-input xor/xnor gates to have $x$ unit delays, inverters to have 0.5 unit delay, and $x$-way fanout to introduce $x/4$ unit delay [8]. Table 3). The Speedup of "mr" over "MROR", which was shown in [8] to be the best choice in terms of speed, is 1.19.
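The unit-delay rules of this comparison model are easy to apply mechanically. The sketch below encodes them as stated; the function names and the example path are hypothetical and do not correspond to any circuit from the compared dividers.

```python
def element_delay(kind, x=1):
    """Delay of one element in model units, following the rules quoted from [8]."""
    if kind == "gate":            # x-input and/or/nand/nor
        return x / 2
    if kind in ("mux", "xor"):    # x-to-1 multiplexer, x-input xor/xnor
        return float(x)
    if kind == "inv":             # inverter
        return 0.5
    if kind == "fanout":          # x-way fanout
        return x / 4
    raise ValueError(f"unknown element kind: {kind}")


def path_delay(path):
    """Sum the delays of the elements along a path given as (kind, x) pairs."""
    return sum(element_delay(kind, x) for kind, x in path)


# Hypothetical path: 2-input NAND, 4-way fanout, 2-input XOR, 2-to-1 mux, inverter.
example_path = [("gate", 2), ("fanout", 4), ("xor", 2), ("mux", 2), ("inv", 1)]
assert path_delay(example_path) == 6.5
```

A clock period estimate in this model is the delay of the longest such path through one recursion step, so two designs can be compared without layout data.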
Area requirement
"MROR" and "mr" both require $W$ carry-free adder (CFA) cells. A CFA cell basically contains a full adder (FA) and a 3-input multiplexer (a 2-input multiplexer in the case of "mr"). For comparison purposes we assume that all radix 4 divider designs use the same CFA cells. In "MROR", each radix 4 digit of the residual is encoded with two SBDs (4 bits); therefore, two registers of length $W$ bits ($2W$ latches) are required to hold the residual after each recursion. In "mr", each radix 4 digit of the residual is encoded with three bits [9, 17]; hence three registers of length $W/2$ bits ($1.5W$ latches) are required. (Footnote 5: $T_{clk}$ has been computed based on the implementation presented in [9] for the model used in this paper.) Table 3 includes the comparison of the area requirement of "mr" with the other dividers, ignoring the latches and the area required by the quotient-digit selection logic, which is independent of the word length $W$, as these are basically the same for all the designs. Excluding the radix 2 implementations (Dividers 7, 8 and 9 in Table 3), the "mr" divider is smaller than all other dividers (with and without pre-scaling). Notably, "mr" is smaller than "MROR" by $W$ CFA cells and $0.5W$ latches. It should be noted that the main motivation for designing higher-radix ($b > 2$) implementations is higher speed with a reasonable increase in area.
In [9], a comparison of the computation time and area of actual combinational VLSI implementations of the "mr" and "MROR" algorithms, for wordlengths $W \leq 32$ bits, was presented. The target technology was CMOS (1 µm, 2 metal) and the layouts were automatically synthesized by GenOptim [29] based on a standard cell approach. The results published in [9] and reproduced in Table 4, although not corresponding to IEEE-Std precisions, are valuable since they were obtained from actual implementations under the same conditions, i.e., same technology, same design style, same tool, and no human designer ability involved. These results confirm that "mr" is superior to "MROR" in terms of speed and area.
Conclusions
A new Svoboda-Tung division algorithm (NST) has been developed and analyzed in detail. The proposed NST algorithm avoids the drawbacks of the Svoboda-Tung division by proper recoding of the two most significant digits of the residual. This recoding covers a part of the role of the quotient-digit selection function performed in the "classical" algorithms.
